Scoping Meeting on Data Linkage and Anonymisation

16 - 18 January 2013

Organisers: Chris Skinner (London School of Economics) and David Hand (Imperial College)

Theme:

Recent decades have seen data developments of huge potential analytic value to social science research, especially as a result of the movement away from paper to electronic records and the rapid spread of the internet and related communications technologies. Examples arise in many fields, including education, health and tax records; customer transactions and commercial databases; and digital data generated by mobile telephones, remote sensing devices and social networks. For the social science research opportunities afforded by such data sources to be realized, however, at least two challenges need to be addressed.

First, there is a need to be able to link different data sources so that the same entities - individuals, organisations etc. - can be identified for analytic purposes and so that any damage to analysis as a result of duplicate records or other issues of data quality can be avoided. Second, there is a need to be able to protect privacy and confidentiality, by devising ways of making data available to researchers whereby identification of individuals or other kinds of disclosure is prevented. These two topics are connected since methods of record linkage represent a threat as well as an opportunity, in the sense that the potential linkage of research data records to known individuals by mischievous data users represents a threat to confidentiality.

A mathematical framework for methods of record linkage was established by Fellegi and Sunter in 1969, formalizing earlier work by Newcombe. This framework has been further developed in the statistical literature and, in the last decade or more, in the computer science literature, where the method has also been called data matching, data linkage, entity resolution or object identification. Much recent work in the area has focussed on the practical implementation of these methods in different domains. Such applied work has generated a variety of challenges at a more mathematical level, for example how to link across multiple (rather than pairs of) data sources, possibly at different levels, such as persons and families, or across time via multigenerational genealogies/pedigrees. Other existing challenges include the estimation of patterns of linkage error and the development of methods for the analysis of linked data which take account of linkage error.
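As a rough illustration of the kind of rule the Fellegi and Sunter framework leads to, the sketch below scores a pair of records by summing per-field log-likelihood-ratio weights and then compares the total with two thresholds. The field names, the m- and u-probabilities and the thresholds are all hypothetical values chosen for illustration; in practice they would be estimated from data, and comparison would rarely be exact equality.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum agreement/disagreement weights over fields (log base 2)."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Three-way decision; the thresholds here are assumed, not estimated."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


a = {"surname": "Smith", "birth_year": 1970, "postcode": "AB1 2CD"}
b = {"surname": "Smyth", "birth_year": 1970, "postcode": "AB1 2CD"}
print(classify(match_weight(a, b)))
```

Pairs falling between the two thresholds are the ones that would typically be sent for clerical review, which is one place where linkage error enters any later analysis.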

Methodology for data anonymisation has developed over a similar time scale. Methods vary according to the environments in which the data is accessed by the researcher, for example through visiting a data enclave, or connecting remotely, or by more open access to data files. Challenges may be formulated as constrained optimisation problems, where measures of the analytic utility of the data are maximized across alternative forms of data protection, subject to measures of disclosure risk meeting specified conditions. More recent developments include the use of simulation and imputation methods to create synthetic datasets and the use of differential privacy ideas to measure disclosure risk and to evaluate alternative disclosure control methods.
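One way to write down the constrained optimisation view described above is sketched below. The notation is assumed here for illustration and is not taken from any particular paper: D is the original dataset, G a set of candidate protection methods, U a utility measure, R a disclosure risk measure and τ a risk tolerance.

```latex
\max_{g \in \mathcal{G}} \; U\bigl(g(D)\bigr)
\quad \text{subject to} \quad
R\bigl(g(D)\bigr) \le \tau
```

For comparison, the standard definition of ε-differential privacy requires that a randomised release mechanism M satisfies, for every pair of datasets D and D' differing in one record and every set S of possible outputs,

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[M(D') \in S\bigr]
```

so that ε plays a role comparable to the risk tolerance τ in the first formulation.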

The first and principal objective of this meeting is to identify key areas of mathematical sciences research on data linkage and anonymisation that can be of value to social and economic research using large and complex datasets. Presentations will be given by those engaged in social and economic research regarding what they see as challenges for research on data linkage and anonymisation and by those engaged more in mathematical sciences research in these areas, regarding what they see as promising lines for development. The general aim will be to find common ground between the mathematical and the social sciences.

Participation is by invitation only and will include: those engaged in social and economic research with awareness and experience of the challenges represented by data linkage and/or anonymisation; those with expertise at an international level in the development of methodology in these areas in the mathematical sciences, including statistics and computer science; individuals with a good knowledge of the relevant parts of the mathematical sciences community in the UK. Social and economic research will include both academic social sciences research and research in the private sector making use of large data sources.
