
Handling identifier error rate variation in data linkage of large administrative data sources

Presented by: Katie Harron (London School of Hygiene and Tropical Medicine)
Tuesday 13th September 2016 - 13:30 to 14:00
INI Seminar Room 1
Co-authors: Gareth Hagger-Johnson (Administrative Data Research Centre for England, University College London), Ruth Gilbert (Institute of Child Health, University College London), Harvey Goldstein (University of Bristol and University College London)

Background: Linkage of administrative data with no unique identifier often relies on probabilistic linkage. Variation in data quality at the individual or organisational level can adversely affect match weight estimation, and can introduce selection bias into the linked data if some subgroups of individuals are more likely to link than others. We quantified individual and organisational variation in identifier error in a large administrative dataset (Hospital Episode Statistics; HES) and incorporated this information within a match probability estimation model.

Methods: A stratified sample of 10,000 admission records was extracted from 2011/2012 HES for three cohorts aged 0-1, 5-6 and 18-19 years. A reference standard was created by linking via NHS number with the Personal Demographics Service to obtain up-to-date identifiers. Based on aggregate tables, we calculated identifier error rates for sex, date of birth and postcode, investigated whether these errors depended on individual characteristics, and evaluated variation between organisations. We used a log-linear model to estimate match probabilities, and used a simulation study to compare readmission rates based on traditional match weights.

Results: Match probabilities differed significantly according to age (p<0.0001), ethnicity (p=0.0005) and sex (p<0.0001). Match probabilities were lower for males compared with females (OR 0.84; 95% CI 0.81-0.86); lower for the two older cohorts compared with infants (OR 0.39; 95% CI 0.37-0.40 and OR 0.37; 95% CI 0.36-0.39, respectively); and lowest for Asian ethnicity (OR 0.89; 95% CI 0.84-0.94 compared with White ethnicity). Results from the simulation study will be presented.

Discussion: We provide empirical evidence on identifier error variation in a widely used administrative dataset. We propose that modelling identifier error rates and covariates, and incorporating these characteristics into match probability estimation, can improve the quality of linkage.
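To illustrate why subgroup variation in identifier error matters for match weights, the following is a minimal sketch using conventional Fellegi-Sunter log-ratio weights. It is not the log-linear model described in the abstract. The group labels, error rates and u-probabilities are all hypothetical values chosen for illustration, not HES estimates.

```python
import math

# Hypothetical u-probabilities: chance that an identifier agrees
# between two records belonging to different people.
U_PROBS = {"sex": 0.5, "dob": 0.001, "postcode": 0.0005}

# Hypothetical identifier error rates among true matches, pooled and
# for two illustrative subgroups (not estimates from HES).
ERROR_RATES = {
    "pooled": {"sex": 0.01, "dob": 0.02, "postcode": 0.10},
    "group_a": {"sex": 0.005, "dob": 0.01, "postcode": 0.05},
    "group_b": {"sex": 0.02, "dob": 0.04, "postcode": 0.25},
}


def m_probs_for(group):
    """m-probability = P(identifier agrees | true match) = 1 - error rate."""
    return {field: 1.0 - err for field, err in ERROR_RATES[group].items()}


def match_weight(agrees, m, u):
    """Fellegi-Sunter log2 weight for one identifier."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def record_weight(agreement, m_probs, u_probs):
    """Sum per-identifier weights, assuming conditional independence."""
    return sum(
        match_weight(agreement[field], m_probs[field], u_probs[field])
        for field in agreement
    )


if __name__ == "__main__":
    # A candidate pair that agrees on sex and date of birth but not postcode.
    agreement = {"sex": True, "dob": True, "postcode": False}
    for group in ERROR_RATES:
        weight = record_weight(agreement, m_probs_for(group), U_PROBS)
        print(f"{group}: total weight = {weight:.2f}")
```

Under these assumed values, the same agreement pattern is penalised less in the subgroup with the higher postcode error rate, because a disagreement there is weaker evidence against a match. Using pooled error rates for everyone would therefore under-weight true matches in that subgroup, which is one way differential linkage can arise.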