Timetable (DLAW02)

Data linkage: techniques, challenges and applications

Monday 12th September 2016 to Friday 16th September 2016

Monday 12th September 2016
09:00 to 09:50 Registration
09:50 to 10:00 Welcome from Christie Marr (INI Deputy Director) INI 1
10:00 to 11:00 Rainer Schnell (City University, London; Universität Duisburg-Essen)
Hardening Bloom Filter PPRL by modifying identifier encodings
Co-author: Christian Borgs (University of Duisburg-Essen)

Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. By the application of suitable blocking strategies, linking can be done in reasonable time. However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance in terms of precision and recall on large databases.
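
As a rough illustration of the Bloom filter encoding that this line of work builds on (not the speakers' hardening algorithms themselves), the following Python sketch encodes a name field as a bit set of hashed bigrams, using a keyed hash and an optional record-specific salt as one simple hardening step; all parameters and the key are invented.

    import hashlib
    import hmac

    BITS = 1000   # Bloom filter length (illustrative)
    K = 20        # hash functions per q-gram (illustrative)
    SECRET = b"shared-linkage-key"  # known only to the linking parties

    def qgrams(s, q=2):
        s = "_" + s.lower() + "_"   # pad so word boundaries contribute
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def bloom_encode(value, salt=b""):
        """Encode a string as a set of Bloom filter bit positions (double hashing)."""
        bits = set()
        for gram in qgrams(value):
            msg = gram.encode() + salt   # a stable salt hampers dictionary attacks
            h1 = int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:8], "big")
            h2 = int.from_bytes(hmac.new(SECRET, msg, hashlib.md5).digest()[:8], "big")
            for i in range(K):
                bits.add((h1 + i * h2) % BITS)
        return bits

    def dice(a, b):
        """Dice similarity of two encodings; similar names stay similar."""
        return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

    print(dice(bloom_encode("smith"), bloom_encode("smyth")))  # high
    print(dice(bloom_encode("smith"), bloom_encode("jones")))  # low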
INI 1
11:00 to 11:30 Morning Coffee
11:30 to 12:00 Gerrit Bloothooft (Universiteit Utrecht)
Historical life cycle reconstruction by indexing
Co-authors: Jelte van Boheemen (Utrecht University), Marijn Schraagen (Utrecht University)

Historical information about individuals is usually scattered across many sources. An integrated use of all available information is then needed to reconstruct their life cycles. Rather than comparing records between pairs of sources, it will be shown to be computationally effective to combine all data in a single table. In such a table, each record summarizes the information that can be deduced for a person who shows up in a source event. The idea is that this table should be ordered in such a way that consecutive records describe the life cycle events of a unique individual, for one individual after another, where each individual has its own ID. To arrive at this situation, it is necessary to filter and index the table in two ways, depending on the possible roles of an individual: the first as ego in focus (at birth, marriage and decease), the second as parent at the same life events of children. The results of both indexes (in terms of preliminary record clusters and IDs) should be combined, while resulting clusters should be tested for validity of the life cycle.

The success of such a procedure strongly depends on the available data and their quality. The Dutch civil registration, introduced by the French in 1811 and now largely digitized, provides near-optimal conditions. Remaining problems of data fuzziness can be circumvented by name standardization (to various levels of name reduction) and by testing different sequences of the available information in records for indexing. Both approaches are only effective when more information is available than is needed to identify an individual uniquely, a condition that often holds for the Dutch civil registration. An example of the procedure will be given for data from the province of Zeeland, while options for applying the method to older data of (much) lower quality and completeness will be discussed. The latter touches upon the limits of historical life cycle reconstruction.
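
As a minimal sketch of the single-table idea, under invented field names and a toy standardization rule, the following Python groups all source events by a standardized composite key so that consecutive records receive one person ID; real life cycle reconstruction adds the second (parent-role) index and cluster validation.

    import itertools

    def standardize(name):
        # Stand-in for real name standardization (phonetic reduction etc.).
        return name.lower().replace("ij", "y").strip()

    def assign_ids(records):
        """records: dicts with 'first', 'last', 'birth_year' (illustrative fields)."""
        key = lambda r: (standardize(r["last"]), standardize(r["first"]), r["birth_year"])
        records.sort(key=key)
        next_id = itertools.count(1)
        current_key, current_id = None, None
        for r in records:
            k = key(r)
            if k != current_key:          # a new candidate individual starts here
                current_key, current_id = k, next(next_id)
            r["person_id"] = current_id   # consecutive rows share one life cycle
        return records

    events = [
        {"first": "Jan", "last": "de Vrijs", "birth_year": 1820, "event": "birth"},
        {"first": "jan", "last": "De Vrys", "birth_year": 1820, "event": "marriage"},
        {"first": "Piet", "last": "Bakker", "birth_year": 1822, "event": "birth"},
    ]
    for r in assign_ids(events):
        print(r["person_id"], r["event"])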
INI 1
12:00 to 12:30 Patrick Ball (Human Rights Data Analysis Group)
Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering
Violent inter-state and civil wars are documented with lists of the casualties, each of which constitutes a partial, non-probability sample of the universe of deaths. There are often several lists, with duplicate entries within each list and among the lists, requiring record linkage to deduplicate the lists and create a unique enumeration of the known dead.

This talk will explore how we do record linkage, including: new advances in generating and learning from training data; an adaptive blocking approach; pairwise classification with string, date, and integer features and several classifiers; and a hybrid clustering method. Assessment metrics will be proposed for each stage, with real-world results from deduplicating more than 420,000 records of Syrian people killed since 2011.
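The pipeline stages named above can be sketched generically; the following Python is a stand-in (blocking on a name prefix, difflib string similarity, a thresholded rule instead of a trained classifier, and single-link clustering via union-find), not HRDAG's actual system.

    from collections import defaultdict
    from difflib import SequenceMatcher
    from itertools import combinations

    def classify(a, b):
        # Stand-in for a trained pairwise classifier over string/date features.
        name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
        return name_sim > 0.85 and abs(a["year"] - b["year"]) <= 1

    def link(records):
        blocks = defaultdict(list)                  # 1. (adaptive) blocking
        for r in records:
            blocks[r["name"][:3].lower()].append(r)
        parent = {id(r): id(r) for r in records}    # union-find for clustering
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for block in blocks.values():               # 2. pairwise classification
            for a, b in combinations(block, 2):
                if classify(a, b):
                    parent[find(id(a))] = find(id(b))   # 3. clustering
        clusters = defaultdict(list)
        for r in records:
            clusters[find(id(r))].append(r)
        return list(clusters.values())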

INI 1
12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 14:30 Bill Winkler (U.S. Census Bureau)
Computational Methods for Linking Sets of National Files
A combination of faster hardware and new computational algorithms makes it possible to link two or more national files having suitable quasi-identifying information such as name, address, date-of-birth and other non-uniquely identifying information far faster than methods of a decade earlier. The methods (Winkler, Yancey, and Porter 2010) were used for matching 10^17 pairs (300 million x 300 million) using 40 CPUs of an SGI machine (with 2006 Itanium chips) in less than 30 hours during the 2010 U.S. Decennial Census. The methods are 50 times as fast as the PSwoosh parallel software (Kawai et al. 2006) from Stanford University, and roughly 10 times as fast as recent parallel software that applies new methods of load balancing (Rahm and Kolb 2013, Yan et al. 2013, Karapiperis and Verykios 2014). This talk will describe how this software bypasses the need for system sorts and provides highly optimized search-retrieval-comparison for a narrow range of situations needed for record linkage.
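
The general idea of avoiding global sorts can be illustrated with direct hash retrieval (a schematic analogy only, not the Census Bureau software): index one file by blocking key in a single pass, then stream the other file and probe the index.

    from collections import defaultdict

    def blocking_key(rec):
        # Illustrative key: first letter of surname plus year of birth.
        return (rec["surname"][:1].upper(), rec["birth_year"])

    def match_stream(file_a, file_b, compare):
        """Yield candidate matches without sorting either file globally."""
        index = defaultdict(list)
        for rec in file_b:                      # one pass to build the index
            index[blocking_key(rec)].append(rec)
        for rec in file_a:                      # constant-time probe per record
            for cand in index[blocking_key(rec)]:
                if compare(rec, cand):
                    yield rec, cand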

INI 1
14:30 to 15:30 Amy O'Hara (U.S. Census Bureau)
The U.S. Census Bureau’s Linkage Infrastructure: Overview and New Challenges
The U.S. Census Bureau makes extensive use of administrative records and other third-party data to produce statistics on our population and economy.  We use these data to evaluate survey quality, to produce new statistics about the population and economy, and to support evaluation of federal and state programs. Our success hinges on our ability to link external data with data already held at Census, usually at the individual person or housing unit level. We carry out this linkage using a standardized set of practices that has been in place since the early 2000s. This presentation focuses on the lifecycle of the production data linkage carried out at the U.S. Census Bureau, including authorities to obtain identified data, the types of data acquired, ingest and initial processing, linkage practices, evaluations of linkage quality, and the documentation, governance, and uses of linked data files. The presentation will conclude with a discussion of new demands on Census’s linked data infrastructure, and the need to modernize and further streamline governance and processes.
INI 1
15:30 to 16:00 Afternoon Tea
16:00 to 17:00 Intra-disciplinary “speed-dating” sessions INI 1
17:00 to 18:00 Welcome Wine Reception & Poster Session at INI
Tuesday 13th September 2016
09:30 to 10:00 Hye-Chung Kum (Texas A&M University; UNC Chapel Hill and RIMS)
Privacy Preserving Interactive Record Linkage (PPIRL)
Record linkage to integrate uncoordinated databases is critical to population informatics research. Balancing privacy protection against the need for high-quality record linkage requires a human-machine hybrid system to safely manage uncertainty in the ever-changing streams of chaotic big data. We review the literature in record linkage and privacy. In the computer science literature, private record linkage, which investigates how to apply a known linkage function safely, is the most published area. However, in practice, the linkage function is rarely known. Thus, there are many data linkage centers whose main role is to be the trusted third party that determines the linkage function manually and links data for research via a master population list for a designated region. Most recently, a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on (1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake data) and universe manipulation, and (3) minimum incremental information disclosure via recoding. Based on these findings, we formalize a new framework for privacy preserving interactive record linkage (PPIRL) with tractable privacy and utility properties. Human-based third-party linkage centers for privacy preserving record linkage are the accepted norm internationally. We find that a computer-based third-party platform that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process is an effective human-machine hybrid system that significantly improves on the linkage center model in terms of both privacy and utility.

INI 1
10:00 to 10:30 James Boyd
Technical Challenges associated with record linkage
Background: The task of record linkage is increasingly being undertaken by dedicated record linkage units with secure environments and specialised linkage personnel. In addition to the complexity of undertaking record linkage, these units face additional technical challenges in providing record linkage ‘as a service’. The extent of this functionality, and approaches to solving these issues, have received little attention in the record linkage literature.

Methods: This session identifies and discusses the range of functions that are required or undertaken in the provision of record linkage services. These include managing routine, ongoing linkage; storing and handling changing data; handling different linkage scenarios; and accommodating ever-increasing datasets. Current linkage methods also include analysis of data attributes such as data completeness, consistency, constancy and field discriminating power. This information is used to develop appropriate linkage strategies.

Results: In order to maximise matching quality and efficiency, linkage systems must address real-world operational requirements to manage linked data over time. By maintaining a full history of links and storing pairwise information, many of the challenges around handling ‘open’ records and providing automated managed extractions are solved. Automation of linkage processes (including clerical processes) is another way of ensuring consistency of results and scalability of service. Several of these solutions have been implemented as part of developments by the PHRN Centre for Data Linkage in Australia.

Conclusions: Increasing demand for, and complexity of, linkage services present challenges to linkage units as they seek to offer accurate and efficient services to government and the research community. Linkage units need to be both flexible and scalable to meet this demand. It is hoped that the solutions presented will help overcome these difficulties.
INI 1
10:30 to 11:00 Evan Roberts (University of Minnesota)
Record linkage with complete-count historical census data
Many areas of social science research benefit from being able to follow individuals and families across time, to observe changes in social behavior across at least part of the life course. Since the 1920s, and particularly since World War II, longitudinal social surveys have become a common tool of social scientists. Despite their many benefits, these surveys only allow us to study a limited number of birth cohorts, and few of these cohorts are entirely deceased. Comparison across multiple cohorts, and across long periods of the life course, is not always possible, as social scientists must follow their cohorts in real time.

Historical data on past populations allow us to reconstruct life-course panels for past cohorts. In the past few years, complete transcriptions of data from sequential censuses have become available in several countries, including Britain, Canada, Iceland, Norway, Sweden, and the United States. The Minnesota Population Center is developing tools to create large datasets of people linked between at least two censuses. There are multiple challenges in creating this form of historical data, centred on the lack of unique identifiers. People must be identified by a combination of characteristics recorded with error, including names, birthplaces, date of birth, and ethnic background. Although linkage rates are low by comparison with modern longitudinal surveys, it has proved possible to create samples that are reasonably representative of the origin or terminal population. This paper describes the sources used in creating linked census datasets, the domain-specific issues in data linkage, and demonstrates some of the applications of historical longitudinal data in studying social mobility and mortality in the past.

INI 1
11:00 to 11:30 Morning Coffee
11:30 to 12:00 Bradley Malin (Vanderbilt University)
A LifeCycle Model for Privacy Preserving Record Linkage
Individuals increasingly leave behind information in resources managed by disparate organizations.  There is an interest in making this information available for a wide array of endeavors (e.g., policy assessment, discovery-based research, and surveillance activities).  Given the distribution of data, it is critical to ensure that it is sufficiently integrated before conducting any statistical investigation, to prevent duplication (and thus overcounting of events) and fragmentation (and thus undercounting of events).  This problem is resolved through record linkage procedures, techniques that have been refined for over half a century.  However, these methods often rely on explicit or potentially identifying features, which often conflict with the expectations of privacy regulations.  More recently, privacy preserving record linkage (PPRL) methods have been proposed which rely on randomized transformations of data, as well as cryptographically secure processing methods. However, it is often unclear how the various steps of a record linkage lifecycle, including standardization, parameter estimation, blocking, record pair comparison, and communication between all of the various parties involved in the process, can take place. In this talk, I will review recent developments in PPRL methods, discuss how they have been engineered into working software systems, and provide examples of how they have been applied in several distributed networks in the healthcare community to facilitate biomedical research and epidemiological investigations.
INI 1
12:00 to 12:30 Luiza Antonie (University of Guelph)
Tracking people over time in 19th century Canada: Challenges, Bias and Results
Co-author: Kris Inwood (University of Guelph)

Linking multiple databases to create longitudinal data is an important research problem with multiple applications. Longitudinal data allow analysts to perform studies that would be infeasible otherwise. In this talk, I discuss a system we designed to link historical census databases in order to create longitudinal data for tracking people over time. Data imprecision in historical census data and the lack of unique personal identifiers make this task a challenging one. We design and employ a record linkage system that incorporates a supervised learning module for classifying pairs of records as matches and non-matches. In addition, we disambiguate ambiguous links by taking into account the family context. We report results on linking four Canadian census collections, from 1871 to 1901, and identify and discuss the impact on precision and bias when family context is employed. We show that our system performs large-scale linkage, producing high-quality links and generating sufficient longitudinal data to allow meaningful social science studies.
INI 1
12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 14:00 Katie Harron (London School of Hygiene and Tropical Medicine)
Handling identifier error rate variation in data linkage of large administrative data sources
Co-authors: Gareth Hagger-Johnson (Administrative Data Research Centre for England, University College London), Ruth Gilbert (Institute of Child Health, University College London), Harvey Goldstein (University of Bristol and University College London)

Background: Linkage of administrative data with no unique identifier often relies on probabilistic linkage. Variation in data quality at the individual or organisational level can adversely affect match weight estimation, and can potentially introduce selection bias to the linked data if some subgroups of individuals are more likely to link than others. We quantified individual and organisational variation in identifier error in a large administrative dataset (Hospital Episode Statistics; HES) and incorporated this information within a match probability estimation model.

Methods: A stratified sample of 10,000 admission records was extracted from 2011/2012 HES for three cohorts aged 0-1, 5-6 and 18-19 years. A reference standard was created by linking via NHS number with the Personal Demographic Service for up-to-date identifiers. Based on aggregate tables, we calculated identifier error rates for sex, date of birth and postcode, investigated whether these errors depended on individual characteristics, and evaluated variation between organisations. We used a log-linear model to estimate match probabilities, and a simulation study to compare readmission rates based on traditional match weights.

Results: Match probabilities differed significantly according to age (p<0.0001), ethnicity (p=0.0005) and sex (p<0.0001). Match probabilities were lower for males compared with females (odds ratio 0.84; 95% CI 0.81-0.86), lower for the older cohorts compared with infants (OR 0.39; 95% CI 0.37-0.40 and 0.37; 95% CI 0.36-0.39 respectively), and lowest for Asian ethnicity (odds ratio 0.89; 95% CI 0.84-0.94 compared with White ethnicity). Results from the simulation study will be presented.

Discussion: We provide empirical evidence on identifier error variation in a widely-used administrative dataset. We propose that modelling identifier error rates and covariates, and incorporating these characteristics into match probability estimation, can improve the quality of linkage.
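
The mechanism being modelled can be illustrated with Fellegi-Sunter-style match weights in which the agreement probabilities vary by subgroup; the following sketch uses invented m- and u-probabilities, not the estimates from this study.

    import math

    # m = P(identifier agrees | true match), varying with an age-group covariate;
    # u = P(identifier agrees | non-match). All values below are invented.
    m_rates = {"infant": {"sex": 0.99, "dob": 0.98, "postcode": 0.95},
               "teen":   {"sex": 0.98, "dob": 0.95, "postcode": 0.85}}
    u_rates = {"sex": 0.5, "dob": 0.001, "postcode": 0.0005}

    def match_weight(agreements, group):
        """Sum of log2 likelihood ratios over identifier agreement indicators."""
        w = 0.0
        for field, agree in agreements.items():
            m, u = m_rates[group][field], u_rates[field]
            w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
        return w

    # Groups with higher identifier error rates agree less often, so fewer of
    # their true matches clear any fixed weight threshold: the selection-bias
    # mechanism quantified in the talk.
    pattern = {"sex": True, "dob": True, "postcode": False}
    print(match_weight(pattern, "infant"), match_weight(pattern, "teen"))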
INI 1
14:00 to 14:30 Ruth Gilbert (University College London)
GUILD: GUidance for Information about Linked Datasets
Co-authors: Lafferty, Hagger-Johnson, Harron, Smith, Zhang, Dibben, Goldstein

Linkage of large administrative datasets often involves different teams at the various steps of the linkage pathway. Information is rarely shared throughout the pathway about processes that might contribute to linkage error and potentially bias results. However, improved awareness about the type of information that should be shared could lead to greater transparency and more robust methods, including analyses that take into account linkage error.  

The Administrative Data Research Centre for England convened a series of three face-to-face meetings of data linkage experts to develop the GUILD guidance (GUidance for Information about Linked Datasets). GUILD recommends key items of information that should be shared at four steps in the data linkage pathway: data provision (how data were generated, extracted, processed and quality controlled), data linkage, analyses of linked data, and report writing. The guidance aims to improve transparency in data linkage processes and reporting of analyses, and to improve the validity of results based on linked data. GUILD guidance is designed to be used by data providers, linkers, and analysts, but will also be relevant to policy makers, funders and legislators responsible for widening use of linked data for research, services and policy. The GUILD recommendations will be presented and discussed.
INI 1
14:30 to 15:00 Dinusha Vatsalan (Australian National University)
Advanced Techniques for Privacy-Preserving Linking of Multiple Large Databases
Co-author: Peter Christen (The Australian National University)

In the era of Big Data, the collection of person-specific data disseminated in diverse databases provides enormous opportunities for businesses and governments by exploiting data linked across these databases. Linked data empowers quality analysis and decision making that is not possible on individual databases. Therefore, linking databases is increasingly being required in many application areas, including healthcare, government services, crime and fraud detection, national security, and business applications. Linking data from different databases requires comparison of quasi-identifiers (QIDs), such as names and addresses. These QIDs are personal identifying attributes that contain sensitive and confidential information about the entities represented in these databases. The exchange or sharing of QIDs across organisations for linkage is often prohibited due to laws and business policies. Privacy-preserving record linkage (PPRL) has been an active research area over the past two decades addressing this problem through the development of techniques that facilitate the linkage on masked (encoded) records such that no private or confidential information needs to be revealed.

Most of the work in PPRL thus far has concentrated on linking two databases only. Linking multiple databases has only recently received more attention as it is being required in a variety of application areas. We have developed several advanced techniques for practical PPRL of multiple large databases addressing the scalability, linkage quality, and privacy challenges. Our approaches perform linkage on masked records using Bloom filter encoding, which is a widely used masking technique for PPRL. In this talk, we will first highlight the challenges of PPRL of multiple databases, then describe our developed approaches, and then discuss future research directions required to leverage the huge potential that linked data from multiple databases can provide for businesses and government services.
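
For comparison of masked records across more than two databases, one widely used building block is a generalized Dice coefficient over Bloom filters; a minimal sketch (with toy bit sets) follows, though the talk's techniques go well beyond this single step.

    def multi_dice(filters):
        """Generalized Dice similarity of P Bloom filters (sets of set-bit positions)."""
        common = set.intersection(*filters)
        total = sum(len(f) for f in filters)
        return len(filters) * len(common) / total if total else 1.0

    a, b, c = {1, 5, 9, 12}, {1, 5, 9, 40}, {1, 5, 9, 77}
    print(multi_dice([a, b, c]))  # 3 * 3 / 12 = 0.75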
INI 1
15:00 to 15:30 Harvey Goldstein (University of Bristol; University College London)
A scaling approach to record linkage
Co-authors: Mario Cortina-Borja (UCL), Katie Harron (LSHTM)

With the increasing availability of large datasets derived from administrative and other sources, there is growing demand for successful linkage of these to provide rich sources of data for further analysis. The very large size of such datasets, and the variation in the quality of the identifiers used to carry out linkage, mean that existing approaches based upon ‘probabilistic’ models can make heavy computational demands. They are also based upon questionable assumptions. In this paper we suggest a new approach, based upon a scaling algorithm, that is computationally fast, requires only moderate amounts of storage and has intuitive appeal. A comparison with existing methods is given.
INI 1
15:30 to 16:00 Afternoon Tea
16:00 to 17:00 Intra-disciplinary “speed-dating” sessions INI 1
Wednesday 14th September 2016
09:00 to 10:00 Erhard Rahm (Universität Leipzig)
Big data integration: challenges and new approaches
Data integration is a key challenge for Big Data applications to semantically enrich and combine large sets of heterogeneous data for enhanced data analysis. In many cases, there is also a need to deal with a very high number of data sources, e.g., product offers from many e-commerce websites. We will discuss approaches to deal with the key data integration tasks of (large-scale) entity resolution and schema matching. In particular, we discuss parallel blocking and entity resolution on Hadoop platforms together with load balancing techniques to deal with data skew. We also discuss challenges and recent approaches for holistic data integration of many data sources, e.g., to create knowledge graphs or to make use of huge collections of web tables.
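
One simple load-balancing idea for skewed blocks can be sketched as follows (a toy, single-machine analogy to the Hadoop-based methods discussed in the talk): any block whose pair count exceeds a per-worker budget is carved into chunks, and both within-chunk and cross-chunk comparison tasks are emitted so no candidate pair is lost.

    from collections import defaultdict

    MAX_PAIRS = 1_000_000  # per-task comparison budget (illustrative)

    def tasks(records, key):
        """Yield comparison tasks sized for roughly one worker each."""
        blocks = defaultdict(list)
        for r in records:
            blocks[key(r)].append(r)
        for block in blocks.values():
            n = len(block)
            if n * (n - 1) // 2 <= MAX_PAIRS:
                yield (block,)                      # small block: one task
            else:                                   # skewed block: split it
                size = max(2, int(MAX_PAIRS ** 0.5))
                parts = [block[i:i + size] for i in range(0, n, size)]
                for i, p in enumerate(parts):
                    yield (p,)                      # within-chunk pairs
                    for q in parts[i + 1:]:
                        yield (p, q)                # cross-chunk pairs (~MAX_PAIRS each)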
INI 1
10:00 to 10:30 Vassilios Verykios (Hellenic Open University)
Space Embedding of Records for Privacy Preserving Linkage
Massive amounts of data, collected by a wide variety of organizations, need to be integrated and matched in order to facilitate data analyses that may be highly beneficial to businesses, governments, and academia. Record linkage, also known as entity resolution, is the process of identifying records that refer to the same real-world entity across disparate data sets. Privacy Preserving Record Linkage (PPRL) techniques are employed to perform the linkage process in a secure manner when the data that need to be matched are sensitive. In PPRL, input records undergo an anonymization process that embeds the records into a space where the underlying data can be matched but not inspected directly.

The PPRL problem has attracted growing attention lately due to a ubiquitous need for cross-matching records that usually lack common unique identifiers and whose field values contain variations, errors, misspellings, and typos. The PPRL process, as applied to massive amounts of data, comprises an anonymization phase, a searching phase and a matching phase.

Several searching and anonymization approaches have been developed with the aim of scaling the PPRL process to big data without sacrificing quality of the results. Recently, redundant randomized methods have been proposed, which insert each record into multiple independent blocks in order to amplify the probability of bringing together similar records for comparison. The key feature of these methods is the formal guarantees they provide in terms of the accuracy of the generated results.
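
A sketch of such redundant randomized blocking, in the style of Hamming-based locality-sensitive hashing over Bloom filter encodings (parameters invented): each masked record is assigned L independent block keys, and records sharing any key become candidate pairs, so increasing L amplifies the probability that similar records meet at least once.

    import random

    BITS, L, SAMPLE = 1000, 10, 18        # filter length, tables, bits per key
    rng = random.Random(42)               # projections shared by all parties
    PROJECTIONS = [rng.sample(range(BITS), SAMPLE) for _ in range(L)]

    def block_keys(bloom_bits):
        """Return L block keys for a Bloom filter given as a set of set-bit positions."""
        return [(i, tuple(pos in bloom_bits for pos in proj))
                for i, proj in enumerate(PROJECTIONS)]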

In this talk, we present both state-of-the-art private searching methods and anonymization techniques, describing their characteristics, including their strengths and weaknesses, and we also present a comparative evaluation.
INI 1
10:30 to 11:00 Andy Boyd (University of Bristol)
‘Data Safe Havens’ as a framework to support record linkage in observational studies: evidence from the Project to Enhance ALSPAC through Record Linkage (PEARL).
The health research community are engaged in projects which require a wealth of data. These data can be drawn directly from research participants, or via linkage to participants’ routine records. Frequently, investigators require information from multiple sources with multiple legal owners. A fundamental challenge for data managers, such as those maintaining cohort study databanks, is to establish data processing and analysis pipelines that meet the legal, ethical and privacy expectations of participants and data owners alike. This demands socio-technical solutions that may easily become enmeshed in protracted debate and controversy as they encounter the norms, values, expectations and concerns of diverse stakeholders. In this context, ‘Data Safe Havens’ can provide a framework for repositories in which sensitive data are kept securely within governance and informatics systems that are fit-for-purpose, appropriately tailored to the data, while being accessible to legitimate users for legitimate purposes (see Burton et al, 2015. http://www.ncbi.nlm.nih.gov/pubmed/26112289).

In this paper I will describe our data linkage experiences gained through the Project to Enhance ALSPAC through Record Linkage (PEARL), a project aiming to establish linkages between participants of the ALSPAC birth cohort study and their routine records. This exemplar illustrates how the governance and technical solutions encompassed within the ALSPAC Data Safe Haven have helped counter and address the real-world data linkage challenges we have faced.
INI 1
11:00 to 11:30 Morning Coffee
11:30 to 12:00 Mauricio Sadinle (Duke University)
A Bayesian Partitioning Approach to Duplicate Detection and Record Linkage
Record linkage techniques allow us to combine different sources of information from a common population in the absence of unique identifiers. Linking multiple files is an important task in a wide variety of applications, since it permits us to gather information that would not otherwise be available, or that would be too expensive to collect. In practice, an additional complication appears when the datafiles to be linked contain duplicates. Traditional approaches to duplicate detection and record linkage output independent decisions on the coreference status of each pair of records, which often leads to non-transitive decisions that have to be reconciled in some ad-hoc fashion. The joint task of linking multiple datafiles and finding duplicate records within them can alternatively be posed as partitioning the datafiles into groups of coreferent records. We present an approach that targets this partition as the parameter of interest, thereby ensuring transitive decisions. Our Bayesian implementation allows us to incorporate prior information on the reliability of the fields in the datafiles, which is especially useful when no training data are available, and it also provides a proper account of the uncertainty in the duplicate detection and record linkage decisions. We show how this uncertainty can be incorporated in certain models for population size estimation. Throughout the talk we present a case study to detect killings that were reported multiple times to organizations recording human rights violations during the civil war in El Salvador.
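
The representation point, that a partition makes transitivity automatic while independent pairwise decisions do not, can be seen in a few lines (illustrative only):

    from itertools import combinations

    # Independent pairwise decisions may be non-transitive:
    pairwise = {("a", "b"): True, ("b", "c"): True, ("a", "c"): False}

    # A partition, encoded as cluster labels, cannot be:
    labels = {"a": 1, "b": 1, "c": 1, "d": 2}

    def coreferent(x, y):
        return labels[x] == labels[y]   # label equality is transitive by construction

    for x, y, z in combinations(labels, 3):
        if coreferent(x, y) and coreferent(y, z):
            assert coreferent(x, z)     # never fails, no ad-hoc reconciliation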
INI 1
12:00 to 12:30 Changyu Dong (University of Strathclyde)
From Private Set Intersection to Private Record Linkage
Record linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, records should ideally be linked in a private way, such that no information other than the matching result is leaked in the process. One approach to Private Record Linkage (PRL) is to use cryptographic protocols. In this talk, I will introduce Private Set Intersection (PSI), a type of cryptographic protocol that enables two parties to obtain the intersection of their private sets. It is almost trivial to build an exact PRL protocol from a PSI protocol. With more effort, it is also possible to build an approximate PRL protocol from PSI that allows linking records based on certain similarity metrics. In this talk, I will present efficient PSI protocols, and show how to obtain PRL protocols that are practically efficient and effective.
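
One classic PSI construction from the Diffie-Hellman family can be sketched as below; this toy version runs both parties in one process, uses an illustrative prime, and omits the authenticated channels and hardened group choices a real deployment needs.

    import hashlib
    import secrets

    P = 2**521 - 1  # a Mersenne prime, for illustration only

    def h2g(item):
        """Hash an item into the multiplicative group mod P."""
        return int.from_bytes(hashlib.sha512(item.encode()).digest(), "big") % P

    def psi(set_a, set_b):
        ka = secrets.randbelow(P - 2) + 1               # A's secret exponent
        kb = secrets.randbelow(P - 2) + 1               # B's secret exponent
        a1 = [pow(h2g(x), ka, P) for x in set_a]        # A -> B: H(x)^ka
        b1 = {y: pow(h2g(y), kb, P) for y in set_b}     # B -> A: H(y)^kb
        a2 = {pow(v, kb, P) for v in a1}                # B computes (H(x)^ka)^kb
        b2 = {y: pow(v, ka, P) for y, v in b1.items()}  # A computes (H(y)^kb)^ka
        return {y for y, v in b2.items() if v in a2}    # equal iff the items match

    print(psi({"alice", "bob", "carol"}, {"bob", "dave"}))  # {'bob'}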
INI 1
12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 15:30 Discipline-based discussion and synthesis of workshop outcomes: a few paragraphs per discipline (manifesto) INI 1
15:30 to 16:00 Afternoon Tea
16:00 to 17:00 Presentation and discussion of findings; synthesis of findings; workshop closure INI 1
19:30 to 22:00 Formal Dinner at Emmanuel College
Thursday 15th September 2016
09:00 to 17:00 Hackathon / Zeeland challenge INI 1
11:00 to 11:30 Morning Coffee
12:30 to 13:30 Lunch @ Wolfson Court
15:30 to 16:00 Afternoon Tea
Friday 16th September 2016
09:30 to 17:00 ‘Perspectives on Data Linkage - Techniques, Challenges and Applications - Open for Business Event’ INI 1