Regression analysis with linked data: problems and possible solutions


  • Andrea Tancredi Università di Roma “La Sapienza”
  • Brunero Liseo Università di Roma “La Sapienza”



Bayesian regression, Hit-miss model, Metropolis-Hastings algorithm, Record linkage


In this paper we describe and extend some recent proposals for a general Bayesian methodology for performing record linkage and making inference using the resulting matched units. In particular, we frame the record linkage process within a formal statistical model that comprises both the matching variables and the other variables included at the inferential stage. In this way, the researcher can account for the uncertainty of the matching process in inferential procedures based on probabilistically linked data and, at the same time, can propagate information back and forth between the working statistical model and the record linkage stage.

We argue that this feedback effect is both essential to eliminate the potential biases that would otherwise characterize inference based on linked data, and able to improve record linkage performance. The practical implementation of the procedure relies on standard Bayesian computational techniques, such as Markov chain Monte Carlo algorithms. Although the methodology is quite general, we restrict our analysis to the popular and important case of multiple linear regression for expository convenience.
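The bias mentioned above can be illustrated with a small simulation. The sketch below is not the Bayesian procedure of the paper; it only shows, under assumed settings, how a naive least-squares fit behaves when a share of the links is wrong. The function names, the true slope of 2.0, the 30% mismatch rate, and the noise level are all hypothetical choices made for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=5000, beta=2.0, mismatch_rate=0.3, seed=0):
    """Compare the slope under correct links and under partly wrong links.

    All settings are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # File A holds the covariate x, file B holds the response y.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]

    # Simulate linkage errors: a random subset of records has its
    # responses permuted, so x is paired with another unit's y.
    wrong = [i for i in range(n) if rng.random() < mismatch_rate]
    donors = wrong[:]
    rng.shuffle(donors)
    y_linked = y[:]
    for i, j in zip(wrong, donors):
        y_linked[i] = y[j]

    return ols_slope(x, y), ols_slope(x, y_linked)


slope_correct, slope_mislinked = simulate()
print(f"slope with correct links:     {slope_correct:.2f}")
print(f"slope with some wrong links:  {slope_mislinked:.2f}")
```

In this setup the wrongly linked pairs carry no information about the slope, so the naive estimate is pulled toward zero roughly in proportion to the mismatch rate. A model that treats the links as uncertain, as the paper proposes, aims to remove this attenuation.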






How to Cite

Tancredi, A., & Liseo, B. (2015). Regression analysis with linked data: problems and possible solutions. Statistica, 75(1), 19–35.