A Review of Test Equating Methods with a Special Focus on IRT-Based Approaches

Valentina Sansivieri, Marie Wiberg, Mariagiulia Matteucci


The aim of this work is to review test equating methods, with a particularly detailed description of item response theory (IRT) equating. Test score equating is used to make scores from different test forms comparable. Several methods have been developed to conduct equating: traditional methods, the kernel method, and IRT equating. We briefly explain the traditional equating methods, which include mean equating, linear equating, and equipercentile equating, and which have been developed under all possible data collection designs. We also briefly describe the idea behind the kernel method, a unified approach to test equating for which interesting developments have recently been proposed. We then focus on IRT equating, describing both old and new methods: in particular, we define IRT observed-score kernel equating and IRT observed-score equating using covariates, as well as other recent proposals in this field. We conclude the review by describing the strengths and weaknesses of the approaches discussed and by identifying topics for future research.


Test equating; IRT test equating; Item response theory



DOI: 10.6092/issn.1973-2201/7066