Seminars (DLAW01)

Videos and presentation materials from other INI events are also available.

Event When Speaker Title Presentation Material
DLAW01 5th July 2016
09:00 to 10:30
Peter Christen Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
Tutorial outline:
The tutorial consists of four parts:
(1)  Data linkage introduction, short history of data linkage, example applications, and the data linkage process (overview of the main steps).
(2)  Detailed discussion of all steps of the data linkage process (data cleaning and standardisation, indexing/blocking, field and record comparisons, classification, and evaluation), and core techniques used in these steps.
(3)  Advanced data linkage techniques, including collective, group and graph linking techniques, as well as advanced indexing techniques that enable large-scale data linkage.
(4)  Major concepts, protocols and challenges in privacy-preserving data linkage, which aims to link databases across organisations without revealing any private or confidential information.  

Assumed knowledge: The aim is to make this tutorial as accessible as possible to a wide ranging audience from various backgrounds. The content will focus on concepts and techniques rather than details of algorithms. Basic understanding in databases, algorithms, and probabilities will be beneficial but not required. The tutorial will loosely be based on the book “Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution and Duplicate Detection” (Springer, 2012) written by the presenter.
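The core linkage steps the tutorial covers (blocking/indexing, field comparison, classification) can be sketched in a few lines. This is a toy illustration only; the record fields, the bigram similarity measure, and the threshold are our own choices, not the tutorial's:

```python
# Toy sketch of the data linkage steps: blocking, field comparison,
# and threshold-based classification.  All values are illustrative.

def block_key(rec):
    # Blocking/indexing: only compare records sharing a cheap key
    # (here, first letter of surname plus birth year).
    return (rec["surname"][0].lower(), rec["year"])

def jaccard(a, b):
    # A simple field comparison on character bigrams.
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def link(db_a, db_b, threshold=0.25):
    links = []
    for ra in db_a:
        for rb in db_b:
            if block_key(ra) != block_key(rb):
                continue  # pruned by the blocking step
            score = (jaccard(ra["surname"], rb["surname"])
                     + jaccard(ra["given"], rb["given"])) / 2
            if score >= threshold:  # classification step
                links.append((ra["id"], rb["id"], round(score, 2)))
    return links

db_a = [{"id": "a1", "surname": "Smith", "given": "Jon", "year": 1980}]
db_b = [{"id": "b1", "surname": "Smyth", "given": "John", "year": 1980},
        {"id": "b2", "surname": "Jones", "given": "Mary", "year": 1975}]
print(link(db_a, db_b))
```

Note how blocking prevents the (a1, b2) comparison entirely; in a real system the blocking key, comparators, and classifier would each be far more sophisticated, as parts (2) and (3) of the outline discuss.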
DLAW01 5th July 2016
11:00 to 12:30
Peter Christen Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
DLAW01 5th July 2016
13:30 to 15:00
Adam Smith Tutorial 2: Defining ‘privacy’ for statistical databases
The tutorial will introduce differential privacy, a widely studied definition of privacy for statistical databases.

We will begin with the motivation for rigorous definitions of privacy in statistical databases, covering several examples of how seemingly aggregate, high-level statistics can leak information about individuals in a data set. We will then define differential privacy, illustrate the definition with several examples, and discuss its properties. The bulk of the tutorial will cover the principal techniques used for the design of differentially private algorithms. Time permitting, we will touch on applications of differential privacy to problems having no immediate connection to privacy, such as equilibrium selection in game theory and adaptive data analysis in statistics and machine learning.
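The simplest of the design techniques the tutorial covers is the Laplace mechanism. A minimal sketch for a counting query (function and parameter names are ours, not the tutorial's):

```python
import numpy as np

def dp_count(data, predicate, epsilon, rng=None):
    """Release a counting query under epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one individual's
    record changes it by at most 1), so adding Laplace noise with
    scale 1/epsilon satisfies the definition.
    """
    if rng is None:
        rng = np.random.default_rng()
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 61, 45, 52, 38]
noisy_over_40 = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Smaller epsilon means stronger privacy but noisier answers; the released value is unbiased, so repeated independent releases would concentrate around the true count of 3.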
DLAW01 5th July 2016
15:30 to 17:00
Adam Smith Tutorial 2: Defining ‘privacy’ for statistical databases
DLAW01 6th July 2016
10:00 to 11:00
Adam Smith tba
DLAW01 6th July 2016
11:30 to 12:30
Jerry Reiter Data Dissemination: A Survey of Recent Approaches, Challenges, and Connections to Data Linkage
I introduce common strategies for reducing disclosure risks when releasing public use microdata, i.e., data on individuals. I discuss some of their pros and cons in terms of data quality and disclosure risks, connecting to data linkage where possible. I also talk about a key challenge in data dissemination: how to give feedback to users on the quality of analyses of disclosure-protected data. Such feedback is essential if analysts are to trust results from (heavily) redacted microdata. It is also essential for query systems that report (perturbed) outputs from statistical models. However, such feedback leaks information about confidential data values. I discuss approaches for feedback that satisfy the differential privacy risk criterion when releasing diagnostics in regression models.


DLAW01 6th July 2016
13:30 to 14:30
Cynthia Dwork Marginals and Malice
In 2008, Homer et al. rocked the genomics community with a discovery that altered the publication policies of the US NIH and the Wellcome Trust, showing that mere allele frequency statistics would permit a forensic analyst -- or a privacy attacker -- to determine the presence of an individual's DNA in a forensic mix -- or a case group.  These results were seen as particularly problematic for Genome-Wide Association Studies (GWAS), where the marginals are SNP minor allele frequency statistics (MAFs).

In this talk, we review the lessons of Homer et al. and report on recent generalizations and strengthenings of the attack, establishing the impossibility of privately reporting "too many" MAFs with any reasonable notion of accuracy.

We then present a differentially private approach to finding significant SNPs that controls the false discovery rate.  The apparent contradiction with the impossibility result is resolved by a relaxation of the problem, in which we limit the total number of potentially significant SNPs that are reported.  

Joint work with Smith, Steinke, Ullman, and Vadhan (lower bounds); and Su and Zhang (FDR control).
DLAW01 6th July 2016
14:30 to 15:30
John Abowd The Challenge of Privacy Protection for Statistical Agencies
Since the field of statistical disclosure limitation (SDL) was first formalized by Ivan Fellegi in 1972, official statistical agencies have recognized that their publications posed confidentiality risks for the households and businesses that responded. For even longer, agencies have protected the source data for those publications by using secure storage methods and access authorization systems. In SDL, Dalenius (1977) and, in computer science, Goldwasser and Micali (1982) formalized what has become the modern approach to privacy protection in data publication: inferential disclosure limitation/semantic security. The modern approach to physical data security centers on firewall and encryption technologies. And the two sets of risks (disclosure through publication and disclosure through unauthorized access) have become increasingly interrelated. It is important to recognize the distinct issues, however. Secure multiparty computing and the stronger fully homomorphic encryption are formal solutions to the problem of permitting statistical computations without granting access to the decrypted data. Privacy-protected query publication is a formal solution to the problem of ensuring that inferential disclosures are bounded and that the bound is respected in all published tables. There are now tractable systems that combine secure multiparty computing with formal privacy protection of the computed statistics (e.g., Shokri and Shmatikov 2015). The challenge to statistical agencies is to learn how these systems work, and move their own protection technologies in this direction. Private companies like Google and Microsoft already do this. Statistical agencies must be prepared to explain the differences in their publication requirements and security protocols that distinguish their chosen data storage methods and publications from those used by private companies.

DLAW01 6th July 2016
16:00 to 17:00
Peter Christen Recent developments and research challenges in data linkage
Techniques for linking and integrating data from different sources are becoming increasingly important in many application areas, including health, census, taxation, immigration, social welfare, in crime and fraud detection, in the assembly of national security intelligence, for businesses, in bibliometrics, as well as in the social sciences.

In today's Big Data era, data linkage (also known as entity resolution, duplicate detection, and data matching) not only faces computational challenges due to the increasing size of data collections and their complexity, but also operational challenges as many applications move from static environments into real-time processing and analysis of potentially very large and dynamically changing data streams, where real-time linking of records is required. Additionally, with the growing concerns by the public of the use of their sensitive data, privacy and confidentiality often need to be considered when personal information is being linked and shared between organisations.

In this talk I will present a short introduction to data linkage, highlight recent developments in advanced data linkage techniques and methods - with an emphasis on work conducted in the computer science domain - and discuss future research challenges and directions.
DLAW01 7th July 2016
10:00 to 11:00
Christine O'Keefe Measuring risk and utility in remote analysis and online data centres – why isn’t this problem already solved?
Remote analysis servers and online data centres have been around for quite a few years now, appearing both in the academic literature and in a range of large scale implementations. Such systems are considered to provide good confidentiality protection for protecting privacy in the case of data about people, and for protecting commercial sensitivity in the case of data about businesses and enterprises. A variety of different methods for protecting confidentiality in the outputs of such systems have been proposed, and a range of them have been implemented and used in practice. However, much less common are quantitative assessments of the risk to confidentiality, and of the usefulness of the system outputs for the purpose for which they are generated. Indeed, it has been suggested that perhaps such quantitative assessments are trying to measure the wrong things. In this talk we will provide an overview of the current state of the literature and practice, and compare it with the overall problem objective, with a view to determining key open challenges and research frontiers in the area, possibly within a redefined statement of the overall challenge.
DLAW01 7th July 2016
11:30 to 12:30
Josep Domingo-Ferrer New Directions in Anonymization: Permutation Paradigm, Verifiability by Subjects and Intruders, Transparency to Users
Co-author: Krishnamurty Muralidhar (University of Oklahoma)

There are currently two approaches to anonymization: "utility first" (use an anonymization method with suitable utility features, then empirically evaluate the disclosure risk and, if necessary, reduce the risk by possibly sacrificing some utility) or "privacy first" (enforce a target privacy level via a privacy model, e.g., k-anonymity or differential privacy, without regard to utility). To get formal privacy guarantees, the second approach must be followed, but it yields data releases with no utility guarantees. Also, in general it is unclear how verifiable anonymization is by the data subject (how safely released is the record she has contributed?), what type of intruder is being considered (what does he know and want?), and how transparent anonymization is towards the data user (what is the user told about methods and parameters used?).

We show that, using a generally applicable reverse mapping transformation, any anonymization for microdata can be viewed as a permutation plus (perhaps) a small amount of noise; permutation is thus shown to be the essential principle underlying any anonymization of microdata, which allows giving simple utility and privacy metrics. From this permutation paradigm, a new privacy model naturally follows, which we call (d,v,f)-permuted privacy. The privacy ensured by this method can be verified via record linkage by each subject contributing an original record (subject-verifiability) and also at the data set level by the data protector. We then proceed to define a maximum-knowledge intruder model, which we argue should be the one considered in anonymization. Finally, we make the case for anonymization transparent to the data user, that is, compliant with Kerckhoffs's assumption (only the randomness used, if any, must stay secret).
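The reverse-mapping idea can be illustrated on a single numeric attribute: replace each anonymized value by the original value of the same rank, so the release becomes exactly a permutation of the original values. This is only an illustrative sketch of the concept, not the authors' exact algorithm:

```python
def reverse_map(original, anonymized):
    """Illustrative reverse-mapping transformation: substitute each
    anonymized value with the original value holding the same rank.
    The result is a permutation of the original attribute values;
    the residual difference from the actual anonymized output is the
    'small noise' in the permutation-plus-noise view."""
    ranked_orig = sorted(original)
    # indices of the anonymized column, ordered by value (i.e. by rank)
    order = sorted(range(len(anonymized)), key=lambda i: anonymized[i])
    permuted = [None] * len(original)
    for rank, i in enumerate(order):
        permuted[i] = ranked_orig[rank]
    return permuted

x = [7, 2, 9, 4]          # original attribute values
y = [6.5, 3.1, 3.9, 8.2]  # some hypothetical anonymized release of x
print(reverse_map(x, y))  # → [7, 2, 4, 9]
```

Because the anonymization swapped the ranks of the last two records, the reverse-mapped release is a permutation of the original values that differs from the original ordering, which is exactly what the permutation paradigm makes explicit.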
DLAW01 7th July 2016
13:30 to 15:30
Christopher Dibben, Natalie Shlomo Perspectives on user needs for academic, government and commercial data
This session will explore a brief history of commercial applications of large data sets in marketing, retailing and consumer targeting. It will investigate the implications of new data sets, how anonymisation may be achieved, and what commercial businesses are considering with new open data standards.


DLAW01 7th July 2016
16:00 to 17:00
Rebecca Steorts Modern Bayesian Record Linkage: Some Recent Developments and Open Challenges
Record linkage, also known as de-duplication, entity resolution, and coreference resolution, is the process of merging together noisy databases to remove duplicate entities. Record linkage is becoming more essential in the age of big data, where duplicates are ever present in such applications as official statistics, human rights, genetics, electronic medical data, and so on. We briefly review the genesis of record linkage with the work of Newcombe in 1959, and then move to recent Bayesian developments using novel clustering approaches. We discuss recent challenges that have been overcome, and others that remain open and need guidance and attention.
DLAW01 8th July 2016
10:00 to 11:00
Harvey Goldstein Probabilistic anonymisation of microdatasets and models for analysis
The general idea is to add random noise with known properties to some or all variables in a released dataset, typically following linkage, where the values of some identifier variables for individuals of interest are also available to an external ‘attacker’ who wishes to identify those individuals so that they can interrogate their records in the dataset. The noise is tuned to achieve any given degree of anonymity, preventing identification by an attacker via the linking of patterns based on the values of such variables. The noise so generated can then be ‘removed’ at the analysis stage since its characteristics are known; this requires the linking agency to disclose those characteristics. This leads to consistent parameter estimates, although with some loss of efficiency, but the data themselves are not degraded by any form of coarsening such as grouping.
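The 'remove at analysis' step can be illustrated with the classical measurement-error correction for a regression slope: noise of known variance attenuates the naive estimate, but because the variance is disclosed, a consistent estimate is recovered. This is a simulated sketch of the general idea, not the talk's specific method:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate y = 2*x + error, then release x with added noise whose
# variance sigma2 is *known* (disclosed by the linking agency).
n = 50_000
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 0.5, n)
sigma2 = 0.25
x_released = x + rng.normal(0, np.sqrt(sigma2), n)

# The naive slope on the noisy data is attenuated towards zero ...
naive = np.cov(x_released, y)[0, 1] / np.var(x_released, ddof=1)

# ... but since sigma2 is known, its contribution to the variance can
# be subtracted, 'removing' the noise and restoring consistency
# (at some loss of efficiency, as the abstract notes).
corrected = (np.cov(x_released, y)[0, 1]
             / (np.var(x_released, ddof=1) - sigma2))
```

With these illustrative parameters the naive slope sits near 2/(1 + sigma2) = 1.6, while the corrected slope recovers the true value of 2, and no grouping or coarsening of the released values was needed.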
DLAW01 8th July 2016
11:30 to 12:30
Ray Chambers Statistical Modelling using Linked Data - Issues and Opportunities
Probabilistic linkage of multiple data sets is now popular and widespread. Unfortunately, there appears to be little corresponding enthusiasm for adjusting standard methods of statistical analysis when they are used with these linked data sets, even though there is plenty of evidence from simulation studies that both incorrect links as well as informative missed links can lead to biased inference. In this presentation I will describe the key issues that need to be addressed when analysing such linked data and some of the methods that can help. In this context, I will focus in particular on the simple linear regression model as a vehicle for demonstrating how knowledge about the statistical properties of the linkage process as well as summary information about the population distribution of the analysis variables can be used to correct for (or at least alleviate) these inferential problems. Recent research at the Australian Bureau of Statistics on a potential weighting/imputation approach to implementing these solutions will also be presented.
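The bias from incorrect links, and the kind of adjustment knowledge of the linkage process makes possible, can be shown in a small simulation. This assumes the simplest 'exchangeable' error model with an illustrative correct-link probability; it is not the ABS implementation mentioned in the talk:

```python
import numpy as np

rng = np.random.default_rng(7)

# Each record keeps its correct link with probability lam and is
# otherwise matched to a randomly chosen record (illustrative values).
n, lam = 20_000, 0.9
x = rng.normal(0, 1, n)
y = 3.0 * x + rng.normal(0, 1, n)

wrong = rng.random(n) > lam
y_linked = y.copy()
y_linked[wrong] = y[rng.integers(0, n, wrong.sum())]  # mislinked y values

# Naive OLS on the linked data is biased towards zero ...
naive = np.cov(x, y_linked)[0, 1] / np.var(x, ddof=1)

# ... but for centred x, knowing the correct-link probability gives a
# crude correction of the kind the talk's methods refine.
corrected = naive / lam
```

The naive slope lands near lam * 3 = 2.7 rather than the true 3, confirming the abstract's point that incorrect links bias inference unless the linkage process is modelled.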
DLAW01 8th July 2016
13:30 to 14:30
Aaron Roth Using Differential Privacy to Control False Discovery in Adaptive Data Analysis
DLAW01 8th July 2016
14:30 to 15:30
Jared Murray Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
When linking two databases (or deduplicating a single database) the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of indexing or blocking methods have been developed for this task. However, methods for inferring linkage structure that account for indexing, blocking, and filtering steps have not seen commensurate development. I review the implications of indexing, blocking and filtering, focusing primarily on the popular Fellegi-Sunter framework and proposing a new model to account for particular forms of indexing and filtering. 
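In the Fellegi-Sunter framework named above, each surviving record pair is scored by a sum of log likelihood ratios over compared fields. A minimal sketch, with invented m- and u-probabilities (the talk's point is that, strictly, these should be conditional on the pair having survived indexing, blocking, and filtering):

```python
import math

# m = P(field agrees | records are a true match)
# u = P(field agrees | records are a non-match)
# Values below are invented for illustration.
M = {"surname": 0.95, "postcode": 0.90}
U = {"surname": 0.01, "postcode": 0.05}

def match_weight(agreements):
    """Fellegi-Sunter match weight: sum of log2 likelihood ratios,
    one term per compared field, for a pair that survived blocking."""
    w = 0.0
    for field, agree in agreements.items():
        m, u = M[field], U[field]
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

w_match = match_weight({"surname": True, "postcode": True})   # high weight
w_mixed = match_weight({"surname": True, "postcode": False})  # lower weight
```

Pairs with weights above an upper threshold are declared links and those below a lower threshold non-links; ignoring the conditioning on the indexing step distorts the m- and u-probabilities, which is the gap the proposed model addresses.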