The Actual Procedure and Process of Computer Adaptive Testing (CAT)

CAT

2026-05-05

irufano

Overview

Computerized Adaptive Testing (CAT) is a form of computer-based testing that adapts in real time to each test-taker's ability level. According to Weiss (IACAT), its objective is "to select, for each examinee, the set of test questions from a pre-calibrated item bank that simultaneously most effectively and efficiently measures that person on the trait" [1].

Unlike conventional fixed-form tests where all examinees answer the same predetermined questions, CAT dynamically selects items based on the test-taker's estimated ability as they progress through the test [2]. The result is a test that is smarter, shorter, fairer, and more precise [3].

Part 1: The Five Implementation Steps

According to Seo (2017), there are five steps for implementing a CAT system [4]:

Step 1 — Determine Feasibility

Before building a CAT, the test program must evaluate whether CAT is appropriate:

Is there a sufficient volume of test-takers to calibrate items? (Rule of thumb: at least 200–300 per item)
Is the construct unidimensional (measuring one trait)? IRT-based CAT assumes this
Is the item bank large enough? (Recommended: at least 3× the intended test length) [3]
Are there sufficient resources to build and maintain the platform?

If the test volume is very small (e.g., fewer than 100 test-takers per year), it may be impossible to build a usable item bank. It is not unusual for a testing program to start with a fixed-form test and later transition to CAT once the item bank is well established [5].

Step 2 — Establish an Item Bank

The item bank is the foundation of CAT. It must:

Contain a large pool of pre-calibrated items with known IRT parameters
Cover the full range of the ability scale (very easy to very hard)
Be organized with content categories for content balancing
Include only items that were pilot-tested on a representative sample before entering the live bank

"You can't just write a few items and subjectively rank them as Easy, Medium, or Hard. Instead, you need to write a large number of items and then pilot them on a representative sample of examinees." [3]

Step 3 — Pretest, Calibrate, and Link Item Parameters

Each item must be statistically calibrated through:

Pilot testing on real examinees
IRT calibration — estimating the $a$ (discrimination), $b$ (difficulty), and $c$ (guessing) parameters
Linking — placing all items on the same IRT scale so they are comparable

This is done through statistical analysis using actual candidate response data [4].
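
The linking step can be illustrated with the mean/sigma method, one common way to put two separate calibrations on a single scale through shared anchor items. This is only a sketch: the source does not say which linking method is used, and the function names and anchor values below are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_anchor_new, b_anchor_base):
    """Return (A, B) such that theta_base = A * theta_new + B."""
    A = pstdev(b_anchor_base) / pstdev(b_anchor_new)
    B = mean(b_anchor_base) - A * mean(b_anchor_new)
    return A, B

def link_item(a, b, c, A, B):
    """Move one item's 3PL parameters onto the base scale.

    Difficulty shifts and stretches with the scale, discrimination
    shrinks by the same factor, and the guessing parameter is unchanged.
    """
    return a / A, A * b + B, c

# Hypothetical difficulties of the same anchor items in two calibrations.
b_new = [-1.0, 0.0, 1.0]
b_base = [-1.5, 0.5, 2.5]

A, B = mean_sigma_constants(b_new, b_base)
a_linked, b_linked, c_linked = link_item(1.2, 0.0, 0.2, A, B)
```

With these made-up anchors the base scale is twice as spread out and shifted by half a unit, so an item with $a = 1.2$, $b = 0$ on the new scale lands at $a = 0.6$, $b = 0.5$ on the base scale.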

Step 4 — Specify the Five CAT Algorithm Components

The test developer must define all five components of the CAT algorithm (see Part 2 below) [4]:

Item bank structure
Starting item/point
Item selection rule
Scoring procedure
Termination criterion

Step 5 — Deploy the CAT

After specifying all components, the CAT is deployed. Ongoing management includes:

Content balancing
Item analysis and refreshment
Standard setting
Item exposure monitoring
Item bank updates [4]

Part 2: The Five Core Algorithm Components

Component 1: Item Bank

The item bank is a pool of pre-calibrated items, each with known IRT parameters [4][5].

Key requirements:

Size: Typically at least 5–10× the test length to allow adequate item selection and exposure control
Coverage: Items spanning the full difficulty range ( $- 3 \leq θ \leq 3$ )
Content balance: Items organized by content domain/category
Parameters: Each item has at minimum a difficulty ( $b$ ) parameter; ideally also discrimination ( $a$ ) and guessing ( $c$ ) for the 3PL model
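
One way to picture these requirements is a small record per item plus a report over the whole bank. The sketch below is an assumption about how such a bank might be represented; `Item`, `bank_report`, and the 5× default ratio are illustrative names and values, not part of any cited system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    item_id: str
    a: float       # discrimination
    b: float       # difficulty
    c: float       # pseudo-guessing
    content: str   # content domain label

def bank_report(bank, test_length, min_ratio=5):
    """Summarise bank size, difficulty coverage, and content counts."""
    counts = {}
    for item in bank:
        counts[item.content] = counts.get(item.content, 0) + 1
    return {
        "size_ok": len(bank) >= min_ratio * test_length,
        "b_min": min(item.b for item in bank),
        "b_max": max(item.b for item in bank),
        "content_counts": counts,
    }

# Hypothetical 11-item bank with difficulties spread evenly over [-3, 3].
bank = [
    Item(f"Q{k}", a=1.0, b=-3.0 + 0.6 * k, c=0.2,
         content="algebra" if k % 2 == 0 else "geometry")
    for k in range(11)
]
report = bank_report(bank, test_length=2)
```

A real bank would be far larger; the point is only that size, coverage, and content balance can each be checked mechanically before the bank goes live.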

Component 2: Starting Point

The starting point determines where the test begins on the ability scale. There are three common options [1][3]:

Option	Description	Use Case
Fixed value	Everyone starts at $\hat{θ}_{0} = 0$ (population mean)	Most common for general tests
Randomized	Start within a narrow range, e.g., $\hat{θ}_{0} \in [- 0.5, 0.5]$	Improves test security and item exposure
Predicted value	Based on prior data or external information	When prior ability estimate is available
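
The three options in the table, together with the step rule that Weiss describes for the first item or two [1], might be sketched as follows. The function names and the `mode` strings are hypothetical; the ±0.5 range and the 0.50 step are the example values given in this section.

```python
import random

def starting_theta(mode="fixed", prior_estimate=None, rng=None):
    """Return the initial ability estimate for one test-taker."""
    if mode == "fixed":
        return 0.0                         # population mean
    if mode == "randomized":
        rng = rng or random.Random()
        return rng.uniform(-0.5, 0.5)      # narrow range around the mean
    if mode == "predicted":
        if prior_estimate is None:
            raise ValueError("predicted start needs a prior_estimate")
        return prior_estimate
    raise ValueError(f"unknown mode: {mode}")

def step_rule(theta, correct, step=0.5):
    """Move the provisional estimate up after a correct answer, down otherwise."""
    return theta + step if correct else theta - step

theta0 = starting_theta("fixed")
theta_after_correct = step_rule(theta0, correct=True)
theta_after_wrong = step_rule(theta0, correct=False)
```

Passing a seeded `random.Random` makes the randomized start reproducible, which helps when auditing a delivered test.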

"At the initial stages of a CAT, when only a single item or two has been administered, the next item is usually selected by a step rule — if the first item was answered correctly, the examinee's original theta estimate is increased by some amount (e.g., 0.50); if the first item was answered incorrectly, the original theta estimate is decreased by the same amount." [1]

Component 3: Item Selection Rule (Algorithm)

This is the heart of the CAT. After each response, the algorithm selects the next best item from the unused items in the bank [5].

The item selection process involves three sub-components [5]:

3a. Item Selection Criterion

The most common methods are Maximum Fisher Information and b-Matching [5][6]:

Maximum Fisher Information (most common)

Selects the item $i^{*}$ providing the highest statistical information at the current ability estimate $\hat{θ}$ :

i^{*} = \arg\max_{i \notin U} I_{i} (\hat{θ})

Where $I_{i} (\hat{θ})$ is the Item Information Function (IIF) evaluated at $\hat{θ}$ , and $U$ is the set of already-used items.

For the 3-Parameter Logistic (3PL) model, the IIF is:

I_{i} (θ) = \frac{a _{i}^{2} [ P _{i} ( θ ) - c _{i} ] ^{2}}{( 1 - c _{i} ) ^{2}} \cdot \frac{1 - P _{i} ( θ )}{P _{i} ( θ )}

Where $P_{i} (θ)$ is the probability of a correct response given by the 3PL model:

P_{i} (θ) = c_{i} + \frac{1 - c _{i}}{1 + e ^{- a_{i} (θ - b_{i})}}
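
The two formulas above translate almost directly into code. The sketch below is a minimal illustration, assuming the bank is a plain dictionary from item id to an `(a, b, c)` tuple; that representation and the example parameter values are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information function evaluated at theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def select_max_info(bank, used, theta_hat):
    """Return the id of the unused item with the highest information."""
    candidates = [i for i in bank if i not in used]
    return max(candidates, key=lambda i: info_3pl(theta_hat, *bank[i]))

# Hypothetical three-item bank: id -> (a, b, c).
bank = {1: (1.0, -1.0, 0.2), 2: (1.5, 0.0, 0.2), 3: (0.8, 1.0, 0.2)}

first = select_max_info(bank, used=set(), theta_hat=0.0)   # -> 2
second = select_max_info(bank, used={2}, theta_hat=0.0)    # -> 1
```

Item 2 wins at $\hat{θ} = 0$ because it is both the most discriminating item and the one located closest to the current estimate; once it is used, the next-best remaining item is chosen.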

b-Matching (difficulty matching)

Selects the item whose difficulty $b_{i}$ is closest to the current theta estimate:

i^{*} = \arg\min_{i \notin U} \left| b_{i} - \hat{θ} \right|

Simple but effective; does not require full IIF calculation.
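
As a rough illustration, b-matching needs nothing more than the difficulty values. The dictionary of difficulties below is hypothetical.

```python
def select_b_match(difficulties, used, theta_hat):
    """Return the unused item whose difficulty is closest to theta_hat."""
    candidates = [i for i in difficulties if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta_hat))

# Hypothetical bank: item id -> difficulty b.
difficulties = {1: -1.0, 2: 0.1, 3: 1.2}

pick_at_zero = select_b_match(difficulties, used=set(), theta_hat=0.0)   # -> 2
pick_after_2 = select_b_match(difficulties, used={2}, theta_hat=0.0)     # -> 1
pick_high = select_b_match(difficulties, used=set(), theta_hat=0.9)      # -> 3
```

Under the 1PL model, where every item has the same discrimination, this picks the same item as Maximum Fisher Information, since information then peaks where $b_{i} = θ$.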

Other criteria: $a$ -stratification, weighted likelihood, Kullback-Leibler information [5].

3b. Content Balancing

Ensures the test covers required content domains proportionally — not just statistically optimal items. This addresses the concern of educators and subject matter experts who require balanced content coverage [2][5].
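
A simple way to combine content balancing with statistical selection is to first pick the content area that is furthest below its target share and then apply the selection criterion only within that area. The sketch below shows the first half under that assumption; the "largest deficit" rule, the function name, and the target proportions are illustrative, not a method attributed to the cited sources.

```python
def next_content_area(targets, administered):
    """Return the area whose observed share lags its target share the most.

    targets: area -> target proportion of the test (should sum to 1).
    administered: area -> number of items given so far.
    """
    total = sum(administered.values())

    def deficit(area):
        observed = administered.get(area, 0) / total if total else 0.0
        return targets[area] - observed

    return max(targets, key=deficit)

# Hypothetical blueprint and two test states.
targets = {"algebra": 0.5, "geometry": 0.3, "statistics": 0.2}

at_start = next_content_area(targets, {})
mid_test = next_content_area(
    targets, {"algebra": 3, "geometry": 1, "statistics": 0}
)
```

At the start the area with the largest target goes first; midway through, an area that has not yet been sampled takes priority even though its target share is the smallest.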

3c. Item Exposure Control

Prevents certain items from being administered too frequently, which would compromise test security [5]. Common methods:

Randomesque method — randomly selects from the top- $k$ most informative items
Sympson-Hetter method — probabilistically suppresses overexposed items
Fade-away method — gradually reduces exposure of frequently used items
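
Of the methods listed, the randomesque method is the easiest to sketch: rank the unused items by information and draw one at random from the top $k$. The bank layout, parameter values, and default `k` below are hypothetical.

```python
import math
import random

def info_3pl(theta, a, b, c):
    """3PL item information function evaluated at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a**2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def select_randomesque(bank, used, theta_hat, k=3, rng=None):
    """Draw one item at random from the k most informative unused items."""
    rng = rng or random.Random()
    candidates = [i for i in bank if i not in used]
    ranked = sorted(
        candidates,
        key=lambda i: info_3pl(theta_hat, *bank[i]),
        reverse=True,
    )
    return rng.choice(ranked[:k])

# Hypothetical bank: id -> (a, b, c).
bank = {
    1: (1.0, -1.0, 0.2),
    2: (1.5, 0.0, 0.2),
    3: (0.8, 1.0, 0.2),
    4: (1.2, 0.2, 0.2),
}

pick = select_randomesque(bank, used=set(), theta_hat=0.0, k=2,
                          rng=random.Random(0))
greedy = select_randomesque(bank, used=set(), theta_hat=0.0, k=1)
```

Setting `k=1` recovers plain Maximum Fisher Information; larger `k` trades a little measurement efficiency for lower exposure of the best items.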

"Selecting the right method for each of the 3 components of the item selection process — content balancing, the item selection criterion, and item exposure control — is not straightforward and cannot be considered separately for each of these 3 components because of the unique interactions among them." [5]

Component 4: Scoring Procedure (Theta Estimation)

After each response, the test-taker's ability estimate $\hat{θ}$ is updated [2][6].

The scoring algorithm takes all previous responses into account — not just the most recent one. Let $u = (u_{1}, u_{2}, \dots, u_{n})$ denote the vector of responses where $u_{j} = 1$ if correct and $u_{j} = 0$ if incorrect.

Maximum Likelihood Estimation (MLE)

Finds the $\hat{θ}$ that maximizes the likelihood of the observed response pattern:

\hat{θ}_{MLE} = \arg\max_{θ} L (θ ∣ u)

The likelihood function is:

L (θ ∣ u) = \prod_{j = 1}^{n} P_{j} (θ)^{u_{j}} [1 - P_{j} (θ)]^{1 - u_{j}}

In practice, the log-likelihood is maximized:

ℓ (θ) = \sum_{j = 1}^{n} [u_{j} \ln P_{j} (θ) + (1 - u_{j}) \ln (1 - P_{j} (θ))]

Pro: Asymptotically unbiased; requires no prior distribution
Con: Undefined when all responses are correct or all incorrect [1]
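
The log-likelihood can be maximized numerically. Operational systems typically use Newton-Raphson or a similar iterative method; the sketch below swaps in a plain grid search over $[-4, 4]$ because it is short and cannot diverge. The grid bounds, step count, and example items are assumptions.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        total += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total

def mle_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search MLE; returns None when no finite maximum exists."""
    if len(set(responses)) < 2:
        return None   # all correct or all incorrect
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical Rasch-like items (a = 1, c = 0) at b = -1, 0, 1.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

theta_mixed = mle_theta(items, [1, 1, 0])
theta_perfect = mle_theta(items, [1, 1, 1])   # None: likelihood keeps rising
```

Returning `None` for uniform response patterns makes the undefined case explicit, so the caller can fall back to a step rule or a Bayesian estimate.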

Bayesian EAP (Expected A Posteriori)

Combines the likelihood with a prior distribution $π (θ)$ (typically standard normal $N (0, 1)$ ):

\hat{θ}_{E A P} = \frac{\int _{- \infty}^{\infty} θ \cdot L ( θ ∣ u ) \cdot π ( θ ) d θ}{\int _{- \infty}^{\infty} L ( θ ∣ u ) \cdot π ( θ ) d θ}

Pro: Works even with all-correct or all-incorrect response patterns
Pro: Provides a natural standard error estimate
Con: Slightly biased toward the prior mean in early items [6]
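
The two integrals are usually approximated by a sum over a grid of quadrature points. A minimal sketch, assuming a standard normal prior, evenly spaced points on $[-4, 4]$, and hypothetical example items:

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(items, responses, lo=-4.0, hi=4.0, n_points=81):
    """Return (EAP estimate, posterior SD) under an N(0, 1) prior."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    weights = []
    for t in nodes:
        like = 1.0
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(t, a, b, c)
            like *= p if u == 1 else 1.0 - p
        # The prior's normalising constant cancels in the ratio.
        weights.append(like * math.exp(-0.5 * t * t))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(nodes, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical Rasch-like items (a = 1, c = 0) at b = -1, 0, 1.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

prior_mean, prior_sd = eap_theta([], [])
theta_all_correct, sd_all_correct = eap_theta(items, [1, 1, 1])
```

The all-correct pattern that breaks MLE yields a finite positive estimate here, and the posterior standard deviation serves as the standard error mentioned above.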

Bayesian MAP (Maximum A Posteriori)

Takes the mode of the posterior distribution:

\hat{θ}_{MAP} = \arg\max_{θ} [L (θ ∣ u) \cdot π (θ)]

Intermediate between MLE and EAP in terms of bias and variance.
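
MAP can be sketched the same way as MLE by adding the log-prior to the log-likelihood before maximizing. The grid search, its bounds, and the example items are assumptions; the prior is the standard normal.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def map_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search posterior mode under an N(0, 1) prior."""
    def log_posterior(t):
        total = -0.5 * t * t   # log prior, up to an additive constant
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(t, a, b, c)
            total += math.log(p if u == 1 else 1.0 - p)
        return total

    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=log_posterior)

# Hypothetical Rasch-like items (a = 1, c = 0) at b = -1, 0, 1.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

theta_mixed = map_theta(items, [1, 1, 0])
theta_perfect = map_theta(items, [1, 1, 1])
```

For the mixed pattern the posterior mode sits noticeably closer to zero than the maximum-likelihood value for the same responses, which is the shrinkage toward the prior mean that Bayesian estimators show on short tests.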

"The algorithm then selects the most informative item from the calibrated item bank based on this estimate. After the examinee responds, the ability estimate is updated using maximum likelihood or Bayesian methods, and the cycle continues until a stopping criterion is met." [6]

Standard Error of Measurement (SEM)

The precision of the theta estimate is tracked via the Test Information Function:

TIF (θ) = \sum_{j = 1}^{n} I_{j} (θ)

The Standard Error of Measurement is:

SE (\hat{θ}) = \frac{1}{\sqrt{TIF (\hat{θ})}} = \frac{1}{\sqrt{\sum_{j = 1}^{n} I_{j} (\hat{θ})}}

The SEM decreases as more items are administered. Under a precision-based stopping rule, the CAT continues until $S E (\hat{θ})$ falls below a specified threshold.
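
Because the standard error is the reciprocal of the square root of the test information, quadrupling the information halves the error. The short sketch below checks that with hypothetical identical items ($a = 1$, $b = 0$, $c = 0$), each of which contributes 0.25 units of information at $θ = 0$.

```python
import math

def info_3pl(theta, a, b, c):
    """3PL item information function evaluated at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a**2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def standard_error(theta_hat, administered):
    """SE of theta_hat from the information of the administered items."""
    tif = sum(info_3pl(theta_hat, a, b, c) for a, b, c in administered)
    return 1.0 / math.sqrt(tif)

item = (1.0, 0.0, 0.0)                       # hypothetical item
se_4 = standard_error(0.0, [item] * 4)       # TIF = 1.0
se_16 = standard_error(0.0, [item] * 16)     # TIF = 4.0
```

Here four items give a standard error of 1.0 and sixteen give 0.5, which shows why precision gains slow down as a test gets longer.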

Component 5: Termination Criterion

The stopping rule decides when to end the test. Common criteria [2][4][6]:

Criterion	Condition	Advantage
Fixed length	Stop after $N$ items	Simple, equal testing time
Fixed precision	Stop when $S E (\hat{θ}) < ϵ$	Precision-based; adaptive length
Combined	Stop when $S E (\hat{θ}) < ϵ$ OR $n \geq N_{\max}$	Balances precision and efficiency
Classification	Stop when ability is clearly above/below a cut score $θ_{c}$	Used in pass/fail exams (e.g., NCLEX)
Time limit	Stop after maximum time $T$	Practical constraint

For classification-based stopping (e.g., pass/fail), the decision rule is:

\text{Decision} = \begin{cases} \text{Pass} & \text{if } \hat{θ} - z_{α /2} \cdot S E (\hat{θ}) > θ_{c} \\ \text{Fail} & \text{if } \hat{θ} + z_{α /2} \cdot S E (\hat{θ}) < θ_{c} \\ \text{Continue} & \text{otherwise} \end{cases}

Where $θ_{c}$ is the passing cut score and $z_{α /2}$ is the critical value at significance level $α$ .
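
The decision rule amounts to checking whether a confidence interval around $\hat{θ}$ lies entirely above or entirely below the cut score. A minimal sketch, with a hypothetical function name and a default of $α = 0.05$:

```python
from statistics import NormalDist

def classify(theta_hat, se, cut_score, alpha=0.05):
    """Return 'pass', 'fail', or 'continue' for a pass/fail CAT."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for alpha = 0.05
    if theta_hat - z * se > cut_score:
        return "pass"       # whole interval is above the cut score
    if theta_hat + z * se < cut_score:
        return "fail"       # whole interval is below the cut score
    return "continue"       # interval still straddles the cut score

clear_pass = classify(theta_hat=1.0, se=0.3, cut_score=0.0)
clear_fail = classify(theta_hat=-1.0, se=0.3, cut_score=0.0)
undecided = classify(theta_hat=0.2, se=0.3, cut_score=0.0)
```

Test-takers whose ability is far from the cut score are classified after few items, while those near it need more items before the interval clears the cut.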

"Fixed-length tests administer a predetermined number of items, while precision-based stopping continues until the standard error of measurement falls below a threshold." [6]

For high-stakes exams like the NCLEX:
"This pattern continues until you run out of time or until the computer identifies your competency level as above or below the passing standard." [7]

Part 3: The Full CAT Runtime Process

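Once deployed, the same loop runs for each test-taker: set the starting estimate $\hat{θ}_{0}$, select the most informative unused item, administer it and record the response, update $\hat{θ}$ and $S E (\hat{θ})$, then check the stopping rule and either repeat or stop [2][33].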
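
The runtime loop can be sketched end to end in a few dozen lines. This is an illustration under stated assumptions, not a production design: it starts at $\hat{θ}_{0} = 0$, selects by Maximum Fisher Information, scores with grid-based EAP, and stops on a combined precision and length rule. The simulated examinee, the randomly generated bank, and all names and thresholds are hypothetical.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information at theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def eap(items, responses):
    """EAP estimate on a fixed grid over [-4, 4] with an N(0, 1) prior."""
    nodes = [-4.0 + 0.1 * k for k in range(81)]
    weights = []
    for t in nodes:
        w = math.exp(-0.5 * t * t)
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(t, a, b, c)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(nodes, weights)) / sum(weights)

def run_cat(bank, answer, se_target=0.4, max_items=20):
    """Run one adaptive session; answer(item) must return 1 or 0."""
    theta_hat, se = 0.0, float("inf")        # fixed starting point
    used, responses = [], []
    while len(used) < min(max_items, len(bank)):
        # Select the most informative item not yet administered.
        remaining = [i for i in range(len(bank)) if i not in used]
        nxt = max(remaining, key=lambda i: info_3pl(theta_hat, *bank[i]))
        # Administer it and record the response.
        used.append(nxt)
        responses.append(answer(bank[nxt]))
        # Re-score using every response so far, then update precision.
        items = [bank[i] for i in used]
        theta_hat = eap(items, responses)
        se = 1.0 / math.sqrt(sum(info_3pl(theta_hat, *it) for it in items))
        # Combined stopping rule: precision reached or length cap hit.
        if se < se_target:
            break
    return theta_hat, se, used

# Hypothetical 60-item bank and a simulated examinee with true theta = 1.0.
rng = random.Random(42)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), 0.2) for _ in range(60)]
true_theta = 1.0

def simulated_answer(item):
    return 1 if rng.random() < p_3pl(true_theta, *item) else 0

theta_hat, se, used = run_cat(bank, simulated_answer)
```

Content balancing and exposure control are left out to keep the loop readable; in a fuller version they would constrain the `remaining` list before the `max` call.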

Part 4: IRT Models Used in CAT

CAT is built on Item Response Theory. The three most common models are [10]:

1PL — Rasch Model

Only difficulty $b$ varies between items; discrimination is fixed at $a = 1$ and guessing $c = 0$ :

P_{i} (θ) = \frac{1}{1 + e ^{- (θ - b_{i})}}

2PL Model

Both difficulty $b$ and discrimination $a$ vary; no guessing ( $c = 0$ ):

P_{i} (θ) = \frac{1}{1 + e ^{- a_{i} (θ - b_{i})}}

3PL Model (most common in high-stakes CAT)

All three parameters vary — discrimination $a_{i}$ , difficulty $b_{i}$ , and guessing $c_{i}$ :

P_{i} (θ) = c_{i} + \frac{1 - c _{i}}{1 + e ^{- a_{i} (θ - b_{i})}}

Where:

$θ \in (- \infty, + \infty)$ , typically $[- 3, 3]$ — test-taker ability
$a_{i} > 0$ — item discrimination (steepness of the curve)
$b_{i} \in [- 3, 3]$ — item difficulty (location of the curve)
$c_{i} \in [0, 1]$ — pseudo-guessing parameter (lower asymptote)
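
The three models are nested: fixing $c = 0$ in the 3PL gives the 2PL, and additionally fixing $a = 1$ gives the 1PL. A short sketch with hypothetical function names makes that explicit.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL: discrimination a, difficulty b, lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_2pl(theta, a, b):
    """2PL is the 3PL with no guessing."""
    return p_3pl(theta, a, b, 0.0)

def p_1pl(theta, b):
    """1PL (Rasch) is the 2PL with discrimination fixed at 1."""
    return p_2pl(theta, 1.0, b)

at_difficulty_1pl = p_1pl(0.0, 0.0)             # 0.5 when theta equals b
at_difficulty_3pl = p_3pl(0.0, 1.0, 0.0, 0.2)   # c + (1 - c) / 2
floor_3pl = p_3pl(-50.0, 1.0, 0.0, 0.2)         # approaches c for very low theta
```

When $θ = b$, the 1PL and 2PL give a probability of exactly 0.5, while the 3PL gives $c + (1 - c)/2$, so the guessing parameter raises the whole curve rather than only its lower tail.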

Part 5: Historical Origins

The adaptive testing concept is not new. Its origins can be traced to Alfred Binet's IQ test (1905), which used an adaptive procedure: items were organized by age-difficulty level, and the examiner would probe upward or downward based on each child's responses [1].

The key historical milestones are [1][8]:

Year	Event
1905	Binet's adaptive IQ test — first adaptive testing procedure
1952	Lord observes that ability scores are test-independent (unlike observed scores)
1960	Rasch describes the one-parameter logistic IRT model
1973	Weiss proposes the "stradaptive" computer-delivered test
1980	Lord publishes the foundational IRT textbook
1994	NCLEX (nursing licensure exam) adopts CAT — first large-scale operational use
2007	National Registry of Emergency Medical Technicians adopts CAT

Part 6: Practical Example — How CAT Selects Items

The following example illustrates the item selection process step by step [3]:

Suppose we have five items in the bank and the starting theta is $\hat{θ}_{0} = 0.0$ .

Round 1:

Compute $I_{i} (0.0)$ for all items → Item 4 has highest information
Test-taker answers incorrectly ( $u_{1} = 0$ )
Run MLE/EAP → new estimate: $\hat{θ}_{1} = - 2.0$
Check termination: $S E (\hat{θ}_{1})$ too large, continue

Round 2:

Compute $I_{i} (- 2.0)$ for remaining items → Item 2 has highest information
Test-taker answers correctly ( $u_{2} = 1$ )
Update theta → $\hat{θ}_{2} = - 0.8$
Check termination: not done yet

Round 3:

Item 2 and Item 4 already used ( $U = \{2, 4\}$ )
Next best available at $\hat{θ}_{2} = - 0.8$ → Item 1
Test-taker answers correctly ( $u_{3} = 1$ , item is easy)
Update theta → $\hat{θ}_{3} = - 0.2$
Continue...

This demonstrates how CAT homes in on the test-taker's true ability through successive approximation.

Part 7: Advantages and Challenges

Advantages

Advantage	Detail
Efficiency	Typically 50% fewer items needed for same precision
Equal precision	$S E (\hat{θ})$ is controlled uniformly across all ability levels
Fairness	Each test-taker gets items appropriate to their level
Immediate results	Scoring is done in real time
Security	Unique item sets make sharing answers less useful
Adaptive length	Test ends when $S E (\hat{θ}) < ϵ$

[2][4][6]

Challenges

Challenge	Detail
Item bank development	Requires large pilot studies to calibrate $a$ , $b$ , $c$ parameters
Cost	Expensive to build and maintain the platform
Test-taker experience	Test-takers may feel discouraged if items seem consistently hard
Content balance	Purely statistical selection (argmax $I_{i} (\hat{θ})$ ) may neglect content coverage
Item exposure	Items with high $I_{i} (θ)$ at common $θ$ values are overused
Unidimensionality	IRT assumes a single latent trait $θ$ ; multidimensional constructs are harder

[2][4][5]

"A trade-off must be made between the psychological experiences of test-takers and measurement efficiency." [2]

Part 8: Real-World Applications

CAT is used in many high-stakes assessments worldwide [1][4]:

Exam	Domain	CAT Since
NCLEX-RN / NCLEX-PN	Nursing licensure (USA)	1994
GRE General Test	Graduate school admissions	1994
GMAT	Business school admissions	1997
NREMT	Emergency Medical Technicians	2007
TOEFL iBT	English language proficiency	adaptive sections
ASVAB-CAT	US Military enlistment	adaptive version

Summary

The CAT process can be summarized in three phases:

Before the test (Development):

Determine feasibility
Write and pilot items
Calibrate IRT parameters ( $a_{i}$ , $b_{i}$ , $c_{i}$ )
Build and validate the item bank
Define all five algorithm components

During the test (Runtime loop):

Set starting point $\hat{θ}_{0}$
Select $i^{*} = \arg\max I_{i} (\hat{θ})$ from unused items
Administer item, record response $u \in \{0, 1\}$
Update $\hat{θ}$ using MLE or EAP; compute $S E (\hat{θ})$
Check stopping rule → repeat or stop

After the test (Reporting):

Report final $\hat{θ}$ and $S E (\hat{θ})$
Apply score transformation if needed
Log responses for item bank maintenance

References

[1] Weiss, D. J. (n.d.). Introduction to CAT. International Association for Computerized Adaptive Testing (IACAT). https://iacat.org/introduction-to-cat/

[2] Kim, J., & Chung, H. (2017). The impacts of computer adaptive testing from a variety of perspectives. PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC5549015/

[3] Assessment Systems Corporation. (2025). Computerized Adaptive Testing (CAT): Software, Meaning, Example. https://assess.com/computerized-adaptive-testing/

[4] Seo, D. G. (2017). Overview and current management of computerized adaptive testing in licensing/certification examinations. Journal of Educational Evaluation for Health Professions, 14, 17. https://doi.org/10.3352/jeehp.2017.14.17

[5] Kim, D., & Chung, H. (2018). Components of the item selection algorithm in computerized adaptive testing. Journal of Educational Evaluation for Health Professions. https://pmc.ncbi.nlm.nih.gov/articles/PMC5968224/

[6] Cogn-IQ. (2026). Adaptive Testing in Psychometrics — Definition & Examples. https://www.cogn-iq.org/learn/theory/adaptive-testing/

[7] UWorld Nursing. (2025). What Is NCLEX Computerized Adaptive Testing (CAT)? https://nursing.uworld.com/blog/nclex-computer-adaptive-test/

[8] Janssen, R., & De Boeck, P. (2010). Computerized adaptive testing: implementation issues. arXiv. https://arxiv.org/pdf/1012.0042

[9] Wikipedia contributors. (2026). Computerized adaptive testing. Wikipedia. https://en.wikipedia.org/wiki/Computerized_adaptive_testing

[10] Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates.
