Reliability is the cornerstone of measurement quality. In classical test theory (CTT), reliability is defined as the ratio of true-score variance to observed-score variance. Because any observed score X is the sum of a true score T and random error E, the reliability coefficient ρ_XX′ tells us what proportion of the variability in observed scores reflects genuine individual differences rather than measurement noise. A reliability of 1.0 means all variation is true variation; a reliability of 0.0 means all variation is error.
The Classical Decomposition
X = T + E, with E(E) = 0 and Cov(T, E) = 0
σ²_X = σ²_T + σ²_E
ρ_XX′ = σ²_T / σ²_X = 1 − σ²_E / σ²_X
The model assumes that errors are uncorrelated with true scores and have an expected value of zero. These assumptions are minimal — they are essentially definitions — which gives CTT its remarkable generality. However, this generality comes at a cost: true scores and error variances are not directly observable and must be estimated through specific study designs.
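Because true scores are never observed in practice, it can help to see the decomposition in a setting where they are known by construction. The minimal simulation sketch below (the sample size, variances, and variable names are illustrative assumptions, not from the text) generates true scores and errors satisfying the assumptions above and recovers the reliability coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population values (assumed for the demonstration):
n = 100_000
var_T, var_E = 9.0, 4.0                 # true-score and error variance

T = rng.normal(50, np.sqrt(var_T), n)   # true scores
E = rng.normal(0, np.sqrt(var_E), n)    # errors: E(E) = 0, Cov(T, E) = 0
X = T + E                               # observed scores

# Both forms of the reliability coefficient, ~ 9/13 ≈ 0.69:
print(T.var() / X.var())                # σ²_T / σ²_X
print(1 - E.var() / X.var())            # 1 − σ²_E / σ²_X
```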
Types of Reliability
Test-retest reliability is estimated by administering the same instrument on two occasions and correlating the scores. It captures stability over time but confounds measurement error with genuine change. Parallel-forms reliability correlates scores from two equivalent test forms, removing memory effects but requiring the construction of truly parallel forms. Internal consistency estimates reliability from a single administration by examining inter-item relationships.
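As a concrete illustration of the test-retest approach, the toy example below (the scores are fabricated for illustration) correlates the same examinees' scores across two occasions:

```python
import numpy as np

# Hypothetical scores for eight examinees on two occasions.
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# Test-retest reliability is the Pearson correlation between occasions.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r_tt:.3f}")
```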
Spearman-Brown: ρ_kk′ = kρ / (1 + (k − 1)ρ)
where ρ is the reliability of the original test and k is the factor by which the test is lengthened (equivalently, k is the number of items when ρ is the reliability of a single item)
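A direct translation of the prophecy formula into code, using the symbols above (the function and argument names are my own):

```python
def spearman_brown(rho: float, k: float) -> float:
    """Prophesied reliability ρ_kk′ = kρ / (1 + (k − 1)ρ) when a test
    with reliability rho is lengthened by a factor of k."""
    return k * rho / (1 + (k - 1) * rho)

# Doubling a test with reliability .70:
print(spearman_brown(0.70, 2))          # ≈ 0.824
# Halving it instead (k = 0.5):
print(spearman_brown(0.70, 0.5))        # ≈ 0.538
```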
Cronbach's alpha is the most widely reported reliability coefficient. It equals the mean of all possible split-half coefficients (computed with the Flanagan-Rulon formula) and provides a lower bound to reliability when items are not τ-equivalent. When items are congeneric, differing in their relationships to the true score, coefficient omega (ω) provides a more accurate estimate by using factor-analytic weights.
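A sketch of both coefficients, assuming complete item-level data for alpha and, for omega, loadings and uniquenesses taken from a one-factor model the user has already fitted; the function names are mine:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k − 1) × (1 − Σ item variances / total-score variance),
    for an (n_persons, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def omega_total(loadings: np.ndarray, uniquenesses: np.ndarray) -> float:
    """omega = (Σλ)² / ((Σλ)² + Σψ) for a unidimensional congeneric model,
    with loadings λ and uniquenesses ψ supplied from a fitted factor model."""
    common = loadings.sum() ** 2
    return common / (common + uniquenesses.sum())
```

Because omega weights items by their loadings rather than treating them as interchangeable, the two coefficients diverge exactly when the congeneric condition in the text holds.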
Reliability translates directly into the precision of individual scores through the standard error of measurement: SEM = σ_X × √(1 − ρ_XX′). A 95% confidence interval around an observed score is approximately X ± 1.96 × SEM. This quantity is arguably more useful than the reliability coefficient itself, because it is on the score metric and directly informs decisions about individuals. A test may have high reliability in one population (where true-score variance is large) and low reliability in another (where true-score variance is small), even though the SEM is identical.
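A small numeric sketch of the SEM and the resulting interval (the standard deviation, reliability, and observed score are assumed values chosen for round numbers):

```python
import numpy as np

def sem(sd_x: float, reliability: float) -> float:
    """Standard error of measurement: σ_X × √(1 − ρ_XX′)."""
    return sd_x * np.sqrt(1 - reliability)

s = sem(15, 0.91)                       # 15 × √0.09 = 4.5
x = 108                                 # an observed score
print(f"SEM = {s:.1f}")
print(f"95% CI ≈ [{x - 1.96 * s:.1f}, {x + 1.96 * s:.1f}]")  # [99.2, 116.8]
```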
Beyond Classical Reliability
Modern psychometrics has extended these ideas considerably. Item response theory provides item-level and ability-level information functions that generalize the concept of reliability to specific points on the score continuum. Generalizability theory partitions error variance into multiple facets. Yet the core insight of reliability theory — that all measurement is imperfect, and the degree of imperfection can be quantified — remains as central today as when Spearman first articulated it in 1904.
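To make the IRT generalization concrete, here is a sketch of the item information function for the two-parameter logistic model, I(θ) = a²P(θ)(1 − P(θ)); the item parameters are illustrative:

```python
import numpy as np

def p_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item: a² × P × (1 − P).
    Information peaks at theta = b, so precision varies across the
    ability continuum rather than being a single coefficient."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

# Illustrative item (a = 1.5, b = 0.0): most informative near theta = 0.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(item_information(theta, 1.5, 0.0), 3))
```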
The choice among reliability methods depends on the purpose of measurement. Test-retest reliability is essential when stability is the question. Internal consistency is appropriate for homogeneous scales. Generalizability coefficients are preferred when multiple error sources must be disentangled. In every case, the reliability coefficient provides the critical link between observed scores and the theoretical constructs those scores are intended to represent.