Mathematical Psychology

Differential Item Functioning

Differential item functioning (DIF) occurs when examinees of equal ability but from different groups have different probabilities of endorsing an item, indicating potential item bias or multidimensionality.

P(X = 1 | θ, group = R) ≠ P(X = 1 | θ, group = F)

Differential item functioning (DIF) exists when the probability of a correct (or endorsed) response to an item differs between two groups of examinees who have the same level of the latent trait being measured. DIF does not necessarily imply bias — the item may be measuring a relevant secondary dimension — but it signals that the item functions differently across groups and warrants careful scrutiny. DIF detection is a critical step in developing fair and equitable assessments.

Statistical Methods for DIF Detection

Mantel-Haenszel DIF Statistic

α_MH = [Σ_k A_k / T_k] / [Σ_k B_k / T_k]

where at matched total-score level k:
A_k = n_Rk1 × n_Fk0 (reference correct × focal incorrect)
B_k = n_Rk0 × n_Fk1 (reference incorrect × focal correct)
T_k = total number of examinees at level k

MH D-DIF = −2.35 × ln(α_MH)

The Mantel-Haenszel (MH) procedure is the most widely used method for detecting uniform DIF. It conditions on the total test score (as a proxy for ability) and computes a common odds ratio across score levels. The MH statistic tests whether this odds ratio differs significantly from 1.0. The D-DIF metric transforms the log odds ratio to an ETS delta scale. Items are classified as negligible (A: |D-DIF| < 1.0), moderate (B: 1.0 ≤ |D-DIF| < 1.5), or large (C: |D-DIF| ≥ 1.5) DIF.
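The computation above can be sketched in a few lines of Python, assuming the per-level counts have already been tabulated from the matched 2 × 2 × K tables (the function names `mantel_haenszel_dif` and `classify` are illustrative, not from any standard library; the significance test that accompanies the ETS classification rules is omitted):

```python
import numpy as np

def mantel_haenszel_dif(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Common odds ratio alpha_MH and ETS D-DIF from counts at each
    matched score level k. Each argument is a length-K array of counts."""
    ref_correct = np.asarray(ref_correct, dtype=float)
    ref_wrong = np.asarray(ref_wrong, dtype=float)
    foc_correct = np.asarray(foc_correct, dtype=float)
    foc_wrong = np.asarray(foc_wrong, dtype=float)
    T = ref_correct + ref_wrong + foc_correct + foc_wrong  # total at level k
    A = ref_correct * foc_wrong   # n_Rk1 * n_Fk0
    B = ref_wrong * foc_correct   # n_Rk0 * n_Fk1
    alpha = np.sum(A / T) / np.sum(B / T)
    d_dif = -2.35 * np.log(alpha)
    return alpha, d_dif

def classify(d_dif):
    """ETS A/B/C classification by |D-DIF| magnitude alone."""
    m = abs(d_dif)
    return "A" if m < 1.0 else ("B" if m < 1.5 else "C")
```

When the two groups' response patterns are identical at every score level, α_MH = 1, D-DIF = 0, and the item is classified A (negligible).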

Logistic Regression and IRT-Based Methods

Logistic regression extends the MH approach by testing for both uniform and nonuniform DIF. The model predicts item response from ability (total score), group membership, and their interaction. A significant main effect of group indicates uniform DIF (constant across ability); a significant interaction indicates nonuniform DIF (the DIF effect varies by ability level).

Logistic Regression DIF Model

logit P(X = 1) = β₀ + β₁θ + β₂G + β₃(θ × G)

where G = group indicator (0 = reference, 1 = focal)
β₂ significant → uniform DIF
β₃ significant → nonuniform DIF
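The nested-model tests can be sketched as follows, using a small Newton-Raphson logistic fit rather than any particular statistics package (all names here are illustrative). Each G² statistic compares two nested models and is referred to a chi-square distribution with 1 degree of freedom: adding the group term tests uniform DIF, adding the interaction tests nonuniform DIF:

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Returns the fitted log-likelihood (all we need for G^2 tests)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])                # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def lr_dif_tests(score, group, y):
    """G^2 statistics for uniform and nonuniform DIF, each df = 1."""
    ones = np.ones_like(score)
    ll0 = fit_logistic(np.column_stack([ones, score]), y)           # ability only
    ll1 = fit_logistic(np.column_stack([ones, score, group]), y)    # + group
    ll2 = fit_logistic(np.column_stack([ones, score, group,
                                        score * group]), y)         # + interaction
    return 2.0 * (ll1 - ll0), 2.0 * (ll2 - ll1)
```

In simulated data with a genuine group main effect but no interaction, the first statistic should be large and the second should behave like a central chi-square(1) draw.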
IRT-Based DIF

IRT-based methods compare item parameter estimates across groups. In Lord's chi-square test, item parameters are estimated separately in each group and the difference is tested using the asymptotic covariance matrix. In the likelihood-ratio approach (Thissen, Steinberg, & Wainer, 1993), a compact model (parameters constrained equal across groups) is compared to an augmented model (parameters free to vary) for each item. IRT methods have the advantage of separating difficulty and discrimination differences and are particularly powerful for detecting nonuniform DIF.
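Given parameter estimates and asymptotic covariance matrices from the two separate calibrations, Lord's statistic is a quadratic form in the parameter-difference vector. A minimal sketch, assuming the calibration itself has been done elsewhere (e.g. by an IRT package):

```python
import numpy as np

def lord_chi_square(params_ref, params_foc, cov_ref, cov_foc):
    """Lord's Wald-type chi-square for cross-group differences in IRT
    item parameters.

    params_*: parameter estimates for one item (e.g. [a, b]) from
              separate calibrations in each group;
    cov_*:    their asymptotic covariance matrices.
    Returns (chi2, df); compare chi2 against a chi-square distribution
    with df = number of parameters (for df = 2, p = exp(-chi2 / 2)).
    """
    v = np.asarray(params_ref, float) - np.asarray(params_foc, float)
    S = np.asarray(cov_ref, float) + np.asarray(cov_foc, float)
    chi2 = float(v @ np.linalg.solve(S, v))
    return chi2, len(v)
```

For example, equal discriminations but a difficulty difference of 1.0 with covariance 0.125·I in each group gives χ² = 1.0 / 0.25 = 4.0 on 2 df.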

Practical Considerations

DIF analysis requires careful attention to the matching criterion. Using total test score as the matching variable is convenient but can be contaminated if DIF items are included in the score. Purification procedures iteratively remove identified DIF items from the matching variable and re-run the analysis. The choice of reference and focal groups, sample sizes, and effect size criteria all affect DIF detection rates.
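The purification loop described above can be sketched generically. Here `flag_item` is a hypothetical callback standing in for whichever DIF test is used (e.g. an MH flag at some significance level); it is not a library function, and the loop simply re-scores and re-tests until the set of flagged items stabilizes:

```python
import numpy as np

def purify(responses, flag_item, max_rounds=10):
    """Iterative purification of the matching variable.

    responses: (n_examinees, n_items) 0/1 response matrix.
    flag_item: callable(item_column, matching_score) -> bool; a stand-in
               for any per-item DIF test (hypothetical).
    Returns a boolean array marking items flagged for DIF.
    """
    X = np.asarray(responses)
    flagged = np.zeros(X.shape[1], dtype=bool)
    for _ in range(max_rounds):
        # Matching score uses only items not currently flagged for DIF
        matching = X[:, ~flagged].sum(axis=1)
        new_flags = np.array([flag_item(X[:, j], matching)
                              for j in range(X.shape[1])])
        if np.array_equal(new_flags, flagged):
            break  # flags stabilized; purification complete
        flagged = new_flags
    return flagged
```

The same skeleton works for any of the detection methods above; only the body of `flag_item` changes.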

Interpreting DIF requires distinguishing between statistical DIF and substantive bias. An item may show DIF because it measures a secondary construct that differs across groups — for example, a word problem in mathematics that requires advanced vocabulary. If this secondary construct is relevant to the intended measurement, the DIF may be acceptable. Expert review of flagged items, combined with statistical evidence, is essential for determining whether DIF items should be removed, revised, or retained. Modern fairness guidelines require routine DIF analysis as part of any high-stakes test development process.

References

  1. Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Erlbaum.
  2. Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning. National Defense Headquarters.
  3. Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Erlbaum.
  4. Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 125–167). Elsevier. doi:10.1016/S0169-7161(06)26005-X
