Seminars (DLA)

Videos and presentation materials from other INI events are also available.


Event When Speaker Title Presentation Material
DLAW01 5th July 2016
09:00 to 10:30
Peter Christen Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
Tutorial outline:
The tutorial consists of four parts:
(1)  Data linkage introduction, short history of data linkage, example applications, and the data linkage process (overview of the main steps).
(2)  Detailed discussion of all steps of the data linkage process (data cleaning and standardisation, indexing/blocking, field and record comparisons, classification, and evaluation), and core techniques used in these steps.
(3)  Advanced data linkage techniques, including collective, group and graph linking techniques, as well as advanced indexing techniques that enable large-scale data linkage.
(4)  Major concepts, protocols and challenges in privacy-preserving data linkage, which aims to link databases across organisations without revealing any private or confidential information.  

Assumed knowledge: The aim is to make this tutorial as accessible as possible to a wide ranging audience from various backgrounds. The content will focus on concepts and techniques rather than details of algorithms. Basic understanding in databases, algorithms, and probabilities will be beneficial but not required. The tutorial will loosely be based on the book “Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution and Duplicate Detection” (Springer, 2012) written by the presenter.
DLAW01 5th July 2016
11:00 to 12:30
Peter Christen Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
DLAW01 5th July 2016
13:30 to 15:00
Adam Smith Tutorial 2: Defining ‘privacy’ for statistical databases
The tutorial will introduce differential privacy, a widely studied definition of privacy for statistical databases.

We will begin with the motivation for rigorous definitions of privacy in statistical databases, covering several examples of how seemingly aggregate, high-level statistics can leak information about individuals in a data set. We will then define differential privacy, illustrate the definition with several examples, and discuss its properties. The bulk of the tutorial will cover the principal techniques used for the design of differentially private algorithms. Time permitting, we will touch on applications of differential privacy to problems having no immediate connection to privacy, such as equilibrium selection in game theory and adaptive data analysis in statistics and machine learning.
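As a minimal sketch of the kind of mechanism such a tutorial typically covers (not part of the talk itself), the Laplace mechanism releases a counting query with epsilon-differential privacy by adding noise scaled to the query's sensitivity; the data and parameters below are illustrative:

```python
import math
import random

def dp_count(data, predicate, epsilon):
    """Release a counting query with epsilon-differential privacy via the
    Laplace mechanism: a count has sensitivity 1, so noise has scale 1/epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ages = [23, 37, 45, 52, 61, 70, 18, 29]        # illustrative data
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Smaller epsilon means larger noise and stronger privacy; the true count here is 4, and the released value fluctuates around it.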
DLAW01 5th July 2016
15:30 to 17:00
Adam Smith Tutorial 2: Defining ‘privacy’ for statistical databases
DLAW01 6th July 2016
10:00 to 11:00
Adam Smith tba
DLAW01 6th July 2016
11:30 to 12:30
Jerry Reiter Data Dissemination: A Survey of Recent Approaches, Challenges, and Connections to Data Linkage
I introduce common strategies for reducing disclosure risks when releasing public use microdata, i.e., data on individuals. I discuss some of their pros and cons in terms of data quality and disclosure risks, connecting to data linkage where possible. I also talk about a key challenge in data dissemination: how to give feedback to users on the quality of analyses of disclosure-protected data. Such feedback is essential if analysts are to trust results from (heavily) redacted microdata. It is also essential for query systems that report (perturbed) outputs from statistical models. However, such feedback leaks information about confidential data values. I discuss approaches to feedback that satisfy the differential privacy risk criterion for releasing diagnostics in regression models.


DLAW01 6th July 2016
13:30 to 14:30
Cynthia Dwork Marginals and Malice
In 2008 Homer et al. rocked the genomics community with a discovery that altered the publication policies of the US NIH and the Wellcome Trust, showing that mere allele frequency statistics would permit a forensic analyst -- or a privacy attacker -- to determine the presence of an individual's DNA in a forensic mix -- or a case group. These results were seen as particularly problematic for Genome-Wide Association Studies (GWAS), where the marginals are SNP minor allele frequency statistics (MAFs).

In this talk, we review the lessons of Homer et al. and report on recent generalizations and strengthenings of the attack, establishing the impossibility of privately reporting "too many" MAFs with any reasonable notion of accuracy.

We then present a differentially private approach to finding significant SNPs that controls the false discovery rate.  The apparent contradiction with the impossibility result is resolved by a relaxation of the problem, in which we limit the total number of potentially significant SNPs that are reported.  

Joint work with Smith, Steinke, Ullman, and Vadhan (lower bounds); and Su and Zhang (FDR control).
DLAW01 6th July 2016
14:30 to 15:30
John Abowd The Challenge of Privacy Protection for Statistical Agencies
Since the field of statistical disclosure limitation (SDL) was first formalized by Ivan Fellegi in 1972, official statistical agencies have recognized that their publications posed confidentiality risks for the households and businesses that responded. For even longer, agencies have protected the source data for those publications by using secure storage methods and access authorization systems. In SDL, Dalenius (1977) and, in computer science, Goldwasser and Micali (1982) formalized what has become the modern approach to privacy protection in data publication: inferential disclosure limitation/semantic security. The modern approach to physical data security centers on firewall and encryption technologies. And the two sets of risks (disclosure through publication and disclosure through unauthorized access) have become increasingly inter-related. It is important to recognize the distinct issues, however. Secure multiparty computing and the stronger fully homomorphic encryption are formal solutions to the problem of permitting statistical computations without granting access to the decrypted data. Privacy-protected query publication is a formal solution to the problem of ensuring that inferential disclosures are bounded and that the bound is respected in all published tables. There are now tractable systems that combine secure multi-party computing with formal privacy protection of the computed statistics (e.g., Shokri and Shmatikov 2015). The challenge to statistical agencies is to learn how these systems work, and to move their own protection technologies in this direction. Private companies like Google and Microsoft already do this. Statistical agencies must be prepared to explain the differences in their publication requirements and security protocols that distinguish their chosen data storage methods and publications from those used by private companies.

Related Links
DLAW01 6th July 2016
16:00 to 17:00
Peter Christen Recent developments and research challenges in data linkage
Techniques for linking and integrating data from different sources are becoming increasingly important in many application areas, including health, census, taxation, immigration and social welfare, in crime and fraud detection, in the assembly of national security intelligence, in business, in bibliometrics, and in the social sciences.

In today's Big Data era, data linkage (also known as entity resolution, duplicate detection, and data matching) not only faces computational challenges due to the increasing size of data collections and their complexity, but also operational challenges as many applications move from static environments into real-time processing and analysis of potentially very large and dynamically changing data streams, where real-time linking of records is required. Additionally, with the growing concerns by the public of the use of their sensitive data, privacy and confidentiality often need to be considered when personal information is being linked and shared between organisations.

In this talk I will present a short introduction to data linkage, highlight recent developments in advanced data linkage techniques and methods - with an emphasis on work conducted in the computer science domain - and discuss future research challenges and directions.
DLAW01 7th July 2016
10:00 to 11:00
Christine O'Keefe Measuring risk and utility in remote analysis and online data centres – why isn’t this problem already solved?
Remote analysis servers and online data centres have been around for quite a few years now, appearing both in the academic literature and in a range of large scale implementations. Such systems are considered to provide good confidentiality protection for protecting privacy in the case of data about people, and for protecting commercial sensitivity in the case of data about businesses and enterprises. A variety of different methods for protecting confidentiality in the outputs of such systems have been proposed and a range of them has been implemented and used in practice. However, much less common are quantitative assessments of risk to confidentiality, and of the usefulness of the system outputs for the purpose for which they are generated. Indeed, it has been suggested that perhaps such quantitative assessments are trying to measure the wrong things. In this talk we will provide an overview of the current state of literature and practice, and compare it with the overall problem objective with a view to determining key open challenges and research frontiers in the area, possibly within a redefined statement of the overall challenge.
DLAW01 7th July 2016
11:30 to 12:30
Josep Domingo-Ferrer New Directions in Anonymization: Permutation Paradigm, Verifiability by Subjects and Intruders, Transparency to Users
Co-author: Krishnamurty Muralidhar (University of Oklahoma)

There are currently two approaches to anonymization: "utility first" (use an anonymization method with suitable utility features, then empirically evaluate the disclosure risk and, if necessary, reduce the risk by possibly sacrificing some utility) or "privacy first" (enforce a target privacy level via a privacy model, e.g., k-anonymity or differential privacy, without regard to utility). To get formal privacy guarantees, the second approach must be followed, but then data releases with no utility guarantees are obtained. Also, in general it is unclear how verifiable is anonymization by the data subject (how safely released is the record she has contributed?), what type of intruder is being considered (what does he know and want?) and how transparent is anonymization towards the data user (what is the user told about methods and parameters used?).

We show that, using a generally applicable reverse mapping transformation, any anonymization of microdata can be viewed as a permutation plus (perhaps) a small amount of noise; permutation is thus shown to be the essential principle underlying any anonymization of microdata, which yields simple utility and privacy metrics. From this permutation paradigm, a new privacy model naturally follows, which we call (d,v,f)-permuted privacy. The privacy ensured by this method can be verified via record linkage by each subject contributing an original record (subject-verifiability) and also at the data set level by the data protector. We then proceed to define a maximum-knowledge intruder model, which we argue should be the one considered in anonymization. Finally, we make the case for anonymization transparent to the data user, that is, compliant with Kerckhoffs' assumption (only the randomness used, if any, must stay secret).
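The reverse mapping idea can be sketched for a single numerical attribute: replace each released value by the original value of the same rank, yielding a permutation of the original data that differs from the release only by small residuals. This toy version (illustrative values, one attribute, no ties) is an assumption-laden simplification of the transformation described above:

```python
def reverse_map(original, masked):
    """Replace each masked value by the original value of the same rank;
    the result is a permutation of the original column."""
    ranked_orig = sorted(original)
    order = sorted(range(len(masked)), key=lambda i: masked[i])
    z = [0] * len(masked)
    for rank, idx in enumerate(order):
        z[idx] = ranked_orig[rank]
    return z

x = [47, 12, 30, 25, 61]              # original attribute values (illustrative)
y = [44.1, 15.2, 20.0, 32.7, 58.9]    # some anonymized release of x
z = reverse_map(x, y)                 # a permutation of x "explaining" y
# y differs from z only by small residuals, so the release can be viewed
# as a permutation of the original data plus a little noise.
```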
DLAW01 7th July 2016
13:30 to 15:30
Christopher Dibben, Natalie Shlomo Perspectives on user needs for academic, government and commercial data
This session will explore a brief history of commercial applications of large data sets in marketing, retailing and consumer targeting.
It will investigate the implications of new data sets, how anonymisation may be achieved, and what commercial businesses are considering under new open data standards.


DLAW01 7th July 2016
16:00 to 17:00
Rebecca Steorts Modern Bayesian Record Linkage: Some Recent Developments and Open Challenges
Record linkage, also known as de-duplication, entity resolution, and coreference resolution, is the process of merging together noisy databases to remove duplicate entities. Record linkage is becoming more essential in the age of big data, where duplicates are ever present in such applications as official statistics, human rights, genetics, electronic medical data, and so on. We briefly review the genesis of record linkage with the work of Newcombe in 1959, and then move to recent Bayesian developments using novel clustering approaches. We discuss challenges that have been overcome and open challenges that still need guidance and attention.
DLAW01 8th July 2016
10:00 to 11:00
Harvey Goldstein Probabilistic anonymisation of microdatasets and models for analysis
The general idea is to add random noise with known properties to some or all variables in a released dataset, typically following linkage, where the values of some identifier variables for individuals of interest are also available to an external ‘attacker’ who wishes to identify those individuals so that they can interrogate their records in the dataset. The noise is tuned to achieve any given degree of anonymity, preventing identification by an ‘attacker’ via the linking of patterns based on the values of such variables. The noise so generated can then be ‘removed’ at the analysis stage since its characteristics are known; this requires disclosure of those characteristics by the linking agency. This leads to consistent parameter estimates, although with a loss of efficiency, but the data themselves are not degraded by any form of coarsening such as grouping.
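A toy sketch of this kind of scheme, under illustrative assumptions not taken from the talk (Gaussian masking noise with a disclosed variance, a simple linear regression): the naive slope estimated on the noise-masked variable is attenuated towards zero, and the classical measurement-error correction using the known noise variance recovers a consistent estimate:

```python
import random

random.seed(1)

n, beta, sigma2 = 10_000, 2.0, 4.0   # sigma2: the disclosed masking-noise variance
x = [random.gauss(0.0, 3.0) for _ in range(n)]
y = [beta * xi + random.gauss(0.0, 1.0) for xi in x]
x_rel = [xi + random.gauss(0.0, sigma2 ** 0.5) for xi in x]   # released, masked x

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

naive = slope(x_rel, y)                        # attenuated towards zero
mean_rel = sum(x_rel) / n
var_rel = sum((a - mean_rel) ** 2 for a in x_rel) / n
corrected = naive * var_rel / (var_rel - sigma2)   # uses the disclosed sigma2
```

With these parameters the naive slope sits well below the true value of 2, while the corrected estimate recovers it, at the cost of a larger standard error.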
DLAW01 8th July 2016
11:30 to 12:30
Ray Chambers Statistical Modelling using Linked Data - Issues and Opportunities
Probabilistic linkage of multiple data sets is now popular and widespread. Unfortunately, there appears to be little corresponding enthusiasm for adjusting standard methods of statistical analysis when they are used with these linked data sets, even though there is plenty of evidence from simulation studies that both incorrect links as well as informative missed links can lead to biased inference. In this presentation I will describe the key issues that need to be addressed when analysing such linked data and some of the methods that can help. In this context, I will focus in particular on the simple linear regression model as a vehicle for demonstrating how knowledge about the statistical properties of the linkage process as well as summary information about the population distribution of the analysis variables can be used to correct for (or at least alleviate) these inferential problems. Recent research at the Australian Bureau of Statistics on a potential weighting/imputation approach to implementing these solutions will also be presented.
DLAW01 8th July 2016
13:30 to 14:30
Aaron Roth Using Differential Privacy to Control False Discovery in Adaptive Data Analysis
DLAW01 8th July 2016
14:30 to 15:30
Jared Murray Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
When linking two databases (or deduplicating a single database) the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of indexing or blocking methods have been developed for this task. However, methods for inferring linkage structure that account for indexing, blocking, and filtering steps have not seen commensurate development. I review the implications of indexing, blocking and filtering, focusing primarily on the popular Fellegi-Sunter framework and proposing a new model to account for particular forms of indexing and filtering. 
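Standard blocking, as discussed above, can be sketched as follows; the records and the blocking key (surname initial plus city) are hypothetical:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "smith", "city": "leeds"},
    {"id": 2, "surname": "smyth", "city": "leeds"},
    {"id": 3, "surname": "jones", "city": "york"},
    {"id": 4, "surname": "jones", "city": "york"},
    {"id": 5, "surname": "brown", "city": "hull"},
]

def candidate_pairs(recs, key):
    """Group records into blocks by `key`; only within-block pairs are compared."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[key(r)].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

full = set(combinations(range(1, 6), 2))       # 10 pairs without blocking
blocked = candidate_pairs(records, lambda r: (r["surname"][0], r["city"]))
```

Here blocking cuts 10 candidate pairs down to 2, at the risk of missing true matches that disagree on the blocking key, which is exactly the effect a linkage model must account for.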
DLA 28th July 2016
15:30 to 16:30
Anne-Sophie Charest Privacy for Bayesian modelling
The literature now contains a large set of methods to privately estimate parameters from a classical statistical model, or to conduct a data mining or machine learning task. However, little is known about how to perform Bayesian statistics privately. In this talk, I will share my thoughts, and a few results, about ways in which Bayesian modelling could be performed to offer some privacy guarantee. In particular, I will discuss some attempts at sampling from posterior predictive distributions under the constraint of differential privacy (DP). I will also discuss empirical differential privacy, a criterion designed to estimate the DP privacy level offered by a certain Bayesian model, and present some recent results on the meaning and limits of this privacy measure. A lot of what I will present is work in progress, and I am hoping that some of you may want to collaborate with me on this research topic.
DLA 11th August 2016
15:30 to 16:00
Peter Christen Talk 1: Advanced methods for linking complex historical birth, death, marriage and census data
In this talk I will provide a short overview of our recent work aimed at linking historical records from the Isle of Skye in Scotland. I'll discuss our linkage approach and present initial results using a variety of linkage techniques.   More details:  http://www.ipdlnconference2016.org/Programme/Abstract/90
DLA 11th August 2016
16:00 to 17:00
Peter Christen Talk 2: Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases
With: Thilina Ranbaduge, Dinusha Vatsalan, and Sean Randall   Abstract: In this talk I will present results from an empirical study comparing the scalability, linkage quality, and privacy of a standard linkage approach compared to state-of-the-art multi-party privacy-preserving record linkage techniques on real Australian health databases.   Details:  http://www.ipdlnconference2016.org/Programme/Abstract/89
DLA 8th September 2016
15:30 to 16:30
David Hand, Peter Christen A note on the F-measure for evaluating record linkage algorithms (and classification methods and information retrieval systems)
Record linkage is the process of identifying and linking records about the same entities from one or more databases. If applied on a single database the process is known as deduplication. Record linkage can be viewed as a classification problem where the aim is to decide if a pair of records is a match (the two records refer to the same real-world entity) or a non-match (the two records refer to two different entities). Various classification techniques – including supervised, unsupervised, semi-supervised and active learning based – have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval, are used. These are often combined into the popular F-measure, which is normally presented as the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals the measure to have a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the user, not of the particular instrument being used. We suggest alternative measures which do not suffer from this fundamental flaw.
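The reformulation described above can be checked numerically: the harmonic mean of precision p and recall r equals a weighted arithmetic mean whose weight on precision, r/(p+r), depends on the method's own precision and recall, so different linkage methods are implicitly scored with different weights:

```python
def f_measure(p, r):
    """F-measure as usually presented: the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def weighted_sum(p, r):
    """The same quantity as a weighted arithmetic mean; the weight on
    precision, r / (p + r), depends on the method's own p and r."""
    w = r / (p + r)
    return w * p + (1 - w) * r
```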
DLAW02 12th September 2016
10:00 to 11:00
Rainer Schnell Hardening Bloom Filter PPRL by modifying identifier encodings
Co-author: Christian Borgs (University of Duisburg Essen)

Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. By the application of suitable blocking strategies, linking can be done in reasonable time. However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straight application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers using different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance in terms of precision and recall on large databases.
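For context, a minimal sketch of the basic (unhardened) Bloom filter encoding that such hardening techniques modify: identifiers are split into bigrams, each hashed k times into an m-bit array (double hashing over two digests is one common construction; m, k and the hash choices here are illustrative assumptions), and encodings are compared with the Dice coefficient:

```python
import hashlib

def bigrams(s):
    """Character bigrams of a lowercased, padded string."""
    s = f"_{s.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(name, m=64, k=4):
    """Map each bigram into a length-m bit array using k hash functions,
    derived by double hashing over SHA-1 and MD5 digests."""
    bits = [0] * m
    for g in bigrams(name):
        h1 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for i in range(k):
            bits[(h1 + i * h2) % m] = 1
    return bits

def dice(a, b):
    """Dice similarity of two bit arrays."""
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

sim = dice(bloom_encode("christen"), bloom_encode("christensen"))
```

Similar names share most bigrams and hence most set bits, which is what makes approximate matching work on the encodings, and also what the published frequency attacks exploit.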
DLAW02 12th September 2016
11:30 to 12:00
Gerrit Bloothooft Historical life cycle reconstruction by indexing
Co-authors: Jelte van Boheemen (Utrecht University), Marijn Schraagen (Utrecht University)

Historical information about individuals is usually scattered across many sources. An integrated use of all available information is then needed to reconstruct their life cycles. Rather than comparing records between pairs of sources, it will be shown to be computationally effective to combine all data in a single table. In such a table, each record summarizes the information that can be deduced for a person who shows up in a source event. The idea is that this table should be ordered in such a way that consecutive records describe the life cycle events of a unique individual, for one individual after another, where each individual has their own ID. To arrive at this situation, it is necessary to filter and index the table in two ways, depending on the possible roles of an individual: the first as ego in focus (at birth, marriage and decease), the second as parent at the same life events of children. The results of both indexes (in terms of preliminary record clusters and IDs) should be combined, while resulting clusters should be tested for validity of the life cycle.

The success of such a procedure strongly depends on the available data and its quality. The Dutch civil registration, introduced by the French in 1811 and now largely digitized, provides near-optimal conditions. Remaining problems of data fuzziness can be circumvented by name standardization (to various levels of name reduction) and by testing different sequences of the available information in records for indexing. Both approaches are only effective when there is more information available than needed to identify an individual uniquely – which often holds for the Dutch civil registration. An example of the procedure will be given for data for the province of Zeeland, while options for application of the method to older data of (much) lower quality and completeness will be discussed. The latter touches upon the limits of historical life cycle reconstruction.
DLAW02 12th September 2016
12:00 to 12:30
Patrick Ball Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering
Violent inter-state and civil wars are documented with lists of the casualties, each of which constitutes a partial, non-probability sample of the universe of deaths. There are often several lists, with duplicate entries within each list and among the lists, requiring record linkage to deduplicate the lists and create a unique enumeration of the known dead.

This talk will explore how we do record linkage, including: new advances in generating and learning from training data; an adaptive blocking approach; pairwise classification with string, date, and integer features and several classifiers; and a hybrid clustering method. Assessment metrics will be proposed for each stage, with real-world results from deduplicating more than 420,000 records of Syrian people killed since 2011.

DLAW02 12th September 2016
13:30 to 14:30
Bill Winkler Computational Methods for Linking Sets of National Files
A combination of faster hardware and new computational algorithms makes it possible to link two or more national files having suitable quasi-identifying information such as name, address, date-of-birth and other non-uniquely identifying information far faster than methods of a decade earlier. The methods (Winkler, Yancey, and Porter 2010) were used for matching 10^17 pairs (300 million x 300 million) using 40 CPUs of an SGI machine (with 2006 Itanium chips) in less than 30 hours during the 2010 U.S. Decennial Census. The methods are 50 times as fast as PSwoosh parallel software (Kawai et al. 2006) from Stanford University. The methods are ~10 times as fast as recent parallel software that applies new methods of load balancing (Rahm and Kolb 2013, Yan et al. 2013, Karapiperis and Verykios 2014). This talk will describe how this software bypasses the need for system sorts and provides highly optimized search-retrieval-comparison for a narrow range of situations needed for record linkage.

Related Links
DLAW02 12th September 2016
14:30 to 15:30
Amy O'Hara The U.S. Census Bureau’s Linkage Infrastructure: Overview and New Challenges
The U.S. Census Bureau makes extensive use of administrative records and other third-party data to produce statistics on our population and economy.  We use these data to evaluate survey quality, to produce new statistics about the population and economy, and to support evaluation of federal and state programs. Our success hinges on our ability to link external data with data already held at Census, usually at the individual person or housing unit level. We carry out this linkage using a standardized set of practices that has been in place since the early 2000s. This presentation focuses on the lifecycle of the production data linkage carried out at the U.S. Census Bureau, including authorities to obtain identified data, the types of data acquired, ingest and initial processing, linkage practices, evaluations of linkage quality, and the documentation, governance, and uses of linked data files. The presentation will conclude with a discussion of new demands on Census’s linked data infrastructure, and the need to modernize and further streamline governance and processes.
DLAW02 12th September 2016
16:00 to 17:00
Intra-disciplinary “speed-dating”
DLAW02 13th September 2016
09:30 to 10:00
Hye-Chung Kum Privacy Preserving Interactive Record Linkage (PPIRL)
Record linkage to integrate uncoordinated databases is critical to population informatics research. Balancing privacy protection against the need for high quality record linkage requires a human-machine hybrid system to safely manage uncertainty in the ever changing streams of chaotic big data. We review the literature in record linkage and privacy. In the computer science literature, private record linkage, which investigates how to apply a known linkage function safely, is the most published area. However, in practice, the linkage function is rarely known. Thus, there are many data linkage centers whose main role is to be the trusted third party that determines the linkage function manually and links data for research via a master population list for a designated region. Most recently, a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on (1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake data) and universe manipulation, and (3) minimum incremental information disclosure via recoding. Based on these findings, we formalize a new framework for privacy preserving interactive record linkage (PPIRL) with tractable privacy and utility properties. Human-based third-party linkage centers for privacy preserving record linkage are the accepted norm internationally. We find that a computer-based third-party platform that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process is an effective human-machine hybrid system that significantly improves on the linkage center model both in terms of privacy and utility.

Related Links
DLAW02 13th September 2016
10:00 to 10:30
James Boyd Technical Challenges associated with record linkage
Background: The task of record linkage is increasingly being undertaken by dedicated record linkage units with secure environments and specialised linkage personnel. In addition to the complexity of undertaking record linkage, these units face additional technical challenges in providing record linkage ‘as a service’. The extent of this functionality, and approaches to solving these issues, have received little attention in the record linkage literature.

Methods: This session identifies and discusses the range of functions that are required or undertaken in the provision of record linkage services. These include managing routine, on-going linkage; storing and handling changing data; handling different linkage scenarios; accommodating ever increasing datasets. Current linkage methods also include analysis of data attributes such as data completeness, consistency, constancy and field discriminating power. This information is used to develop appropriate linkage strategies.

Results: In order to maximise matching quality and efficiency, linkage systems must address real-world operational requirements to manage linked data over time. By maintaining a full history of links, and storing pairwise information, many of the challenges around handling ‘open’ records, and providing automated managed extractions are solved. Automation of linkage processes (including clerical processes) is another way of ensuring consistency of results and scalability of service. Several of these solutions have been implemented as part of developments by the PHRN Centre for Data Linkage in Australia.

Conclusions: Increasing demand for, and complexity of, linkage services present challenges to linkage units as they seek to offer accurate and efficient services to government and the research community. Linkage units need to be both flexible and scalable to meet this demand. It is hoped that the solutions presented will help overcome these difficulties.
DLAW02 13th September 2016
10:30 to 11:00
Evan Roberts Record linkage with complete-count historical census data
Many areas of social science research benefit from being able to follow individuals and families across time, to observe changes in social behavior across at least part of the life course. Since the 1920s, and particularly since World War II, longitudinal social surveys have become a common tool of social scientists. Despite their many benefits, these surveys only allow us to study a limited number of birth cohorts, and few of these cohorts are entirely deceased. Comparison across multiple cohorts, and across long periods of the life course, is not always possible as social scientists must follow their cohorts in real time.

Historical data on past populations allows us to reconstruct life-course panels for past cohorts. In the past few years, complete transcriptions of data from sequential censuses have become available in several countries, including Britain, Canada, Iceland, Norway, Sweden, and the United States. The Minnesota Population Center is developing tools to create large datasets of people linked between at least two censuses. There are multiple challenges in creating this form of historical data, centering around the lack of unique identifiers. People must be identified by a combination of characteristics recorded with error, including names, birthplaces, date of birth, and ethnic background. Although linkage rates are low by comparison with modern longitudinal surveys, it has proved possible to create samples that are reasonably representative of the origin or terminal population. This paper describes the sources used in creating linked census datasets, the domain-specific issues in data linkage, and demonstrates some of the applications of historical longitudinal data in studying social mobility and mortality in the past.
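The kind of multi-field comparison described above, matching error-prone names, birthplaces and birth years without unique identifiers, can be illustrated with a toy scoring function. This is a hypothetical sketch using Python's standard-library difflib, not the Minnesota Population Center's actual method; the field names and weights are invented:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def census_match_score(rec1: dict, rec2: dict) -> float:
    """Toy composite score over error-prone identifying fields.

    Weights are illustrative, not calibrated to any real census linkage.
    """
    score = 0.0
    score += 2.0 * name_similarity(rec1["surname"], rec2["surname"])
    score += 1.5 * name_similarity(rec1["forename"], rec2["forename"])
    # Reported birth year often varies between censuses: allow a tolerance.
    if abs(rec1["birth_year"] - rec2["birth_year"]) <= 2:
        score += 1.0
    if rec1["birthplace"] == rec2["birthplace"]:
        score += 1.0
    return score

# The same person enumerated in two censuses, with transcription error:
a = {"surname": "Johnston", "forename": "Wm", "birth_year": 1852, "birthplace": "Ontario"}
b = {"surname": "Johnstone", "forename": "Wm", "birth_year": 1851, "birthplace": "Ontario"}
print(census_match_score(a, b) > 4.0)   # True
```

Real systems combine many more fields and calibrate weights from training data; the point here is only that agreement must be assessed approximately, field by field.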

Related Links
DLAW02 13th September 2016
11:30 to 12:00
Bradley Malin A LifeCycle Model for Privacy Preserving Record Linkage
Individuals increasingly leave behind information in resources managed by disparate organizations.  There is an interest in making this information available for a wide array of endeavors (e.g., policy assessment, discovery-based research, and surveillance activities).  Given the distribution of data, it is critical to ensure that it is sufficiently integrated before conducting any statistical investigation to prevent duplication (and thus overcounting of events) and fragmentation (and thus undercounting of events).  This problem is resolved through record linkage procedures, techniques that have been refined for over half a century.  However, these methods often rely on explicitly identifying or potentially identifying features, which often conflict with the expectations of privacy regulations.  More recently, privacy preserving record linkage (PPRL) methods have been proposed which rely on randomized transformations of data, as well as cryptographically secure processing methods. However, it is often unclear how the various steps of a record lifecycle, including standardization, parameter estimation, blocking, record pair comparison, and communication between all of the various parties involved in the process, can take place. In this talk, I will review recent developments in PPRL methods, discuss how they have been engineered into working software systems, and provide examples of how they have been applied in several distributed networks in the healthcare community to facilitate biomedical research and epidemiological investigations.
DLAW02 13th September 2016
12:00 to 12:30
Luiza Antonie Tracking people over time in 19th century Canada: Challenges, Bias and Results
Co-author: Kris Inwood (University of Guelph)

Linking multiple databases to create longitudinal data is an important research problem with multiple applications. Longitudinal data allows analysts to perform studies that would be unfeasible otherwise. In this talk, I discuss a system we designed to link historical census databases in order to create longitudinal data that allow tracking people over time. Data imprecision in historical census data and the lack of unique personal identifiers make this task a challenging one. We design and employ a record linkage system that incorporates a supervised learning module for classifying pairs of records as matches and non-matches. In addition, we disambiguate ambiguous links by taking into account the family context. We report results on linking four Canadian census collections, from 1871 to 1901, and identify and discuss the impact on precision and bias when family context is employed. We show that our system performs large scale linkage producing high quality links and generating sufficient longitudinal data to allow meaningful social science studies.
DLAW02 13th September 2016
13:30 to 14:00
Katie Harron Handling identifier error rate variation in data linkage of large administrative data sources
Co-authors: Gareth Hagger-Johnson (Administrative Data Research Centre for England, University College London), Ruth Gilbert (Institute of Child Health, University College London), Harvey Goldstein (University of Bristol and University College London)

Background: Linkage of administrative data with no unique identifier often relies on probabilistic linkage. Variation in data quality on individual or organisational levels can adversely affect match weight estimation, and potentially introduce selection bias to the linked data if subgroups of individuals are more likely to link than others. We quantified individual and organisational variation in identifier error in a large administrative dataset (Hospital Episode Statistics; HES) and incorporated this information within a match probability estimation model. Methods: A stratified sample of 10,000 admission records was extracted from 2011/2012 HES for three cohorts of ages 0-1, 5-6 and 18-19 years. A reference standard was created by linking via NHS number with the Personal Demographic Service for up-to-date identifiers. Based on aggregate tables, we calculated identifier error rates for sex, date of birth and postcode, investigated whether these errors were dependent on individual characteristics, and evaluated variation between organisations. We used a log-linear model to estimate match probabilities, and used a simulation study to compare readmission rates based on traditional match weights. Results: Match probabilities differed significantly according to age (p
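The traditional match weights referred to above come from the classical Fellegi-Sunter framework, which can be sketched in a few lines. The m- and u-probabilities below are illustrative values, not the authors' HES estimates:

```python
from math import log2

def fs_weight(agrees: bool, m: float, u: float) -> float:
    """Fellegi-Sunter log2 likelihood-ratio weight for one identifier.

    m: P(identifier agrees | records are a true match)
    u: P(identifier agrees | records are not a match)
    """
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

# Illustrative error rates: sex agrees half the time even for non-matches
# (u high, so agreement is weak evidence), while date of birth rarely
# agrees by chance (u low, so agreement is strong evidence).
fields = {
    "sex":           (0.98, 0.50),
    "date_of_birth": (0.95, 0.01),
    "postcode":      (0.90, 0.001),
}

agreement = {"sex": True, "date_of_birth": True, "postcode": False}
total = sum(fs_weight(agreement[f], m, u) for f, (m, u) in fields.items())
print(round(total, 2))   # 4.22
```

A record pair's total weight is compared against thresholds to classify it as a link, a non-link, or a case for clerical review; variation in identifier error rates across organisations, the subject of this talk, shifts the m- and u-probabilities and hence the weights.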
DLAW02 13th September 2016
14:00 to 14:30
Ruth Gilbert GUILD: GUidance for Information about Linked Datasets
Co-authors: Lafferty, Hagger-Johnson, Harron, Smith, Zhang, Dibben, Goldstein

Linkage of large administrative datasets often involves different teams at the various steps of the linkage pathway. Information is rarely shared throughout the pathway about processes that might contribute to linkage error and potentially bias results. However, improved awareness about the type of information that should be shared could lead to greater transparency and more robust methods, including analyses that take into account linkage error.  

The Administrative Data Research Centre for England convened a series of 3 face-to-face meetings between data linkage experts to develop the GUILD guidance (GUidance for Information about Linked Datasets). GUILD recommends key items of information that should be shared at 4 steps in the data linkage pathway: data provision (how data was generated, extracted, processed and quality controlled), data linkage, analyses of linked data, and report writing. The guidance aims to improve transparency in data linkage processes and reporting of analyses, and to improve the validity of results based on linked data.  GUILD guidance is designed to be used by data providers, linkers, and analysts, but will also be relevant to policy makers, funders and legislators responsible for widening use of linked data for research, services and policy. The GUILD recommendations will be presented and discussed.
DLAW02 13th September 2016
14:30 to 15:00
Dinusha Vatsalan Advanced Techniques for Privacy-Preserving Linking of Multiple Large Databases
Co-author: Peter Christen (The Australian National University)

In the era of Big Data the collection of person-specific data disseminated in diverse databases provides enormous opportunities for businesses and governments by exploiting data linked across these databases. Linked data empowers quality analysis and decision making that is not possible on individual databases. Therefore, linking databases is increasingly being required in many application areas, including healthcare, government services, crime and fraud detection, national security, and business applications. Linking data from different databases requires comparison of quasi-identifiers (QIDs), such as names and addresses. These QIDs are personal identifying attributes that contain sensitive and confidential information about the entities represented in these databases. The exchange or sharing of QIDs across organisations for linkage is often prohibited due to laws and business policies. Privacy-preserving record linkage (PPRL) has been an active research area over the past two decades addressing this problem through the development of techniques that facilitate the linkage on masked (encoded) records such that no private or confidential information needs to be revealed.

Most of the work in PPRL thus far has concentrated on linking two databases only. Linking multiple databases has only recently received more attention as it is being required in a variety of application areas. We have developed several advanced techniques for practical PPRL of multiple large databases addressing the scalability, linkage quality, and privacy challenges. Our approaches perform linkage on masked records using Bloom filter encoding, which is a widely used masking technique for PPRL. In this talk, we will first highlight the challenges of PPRL of multiple databases, then describe our developed approaches, and then discuss future research directions required to leverage the huge potential that linked data from multiple databases can provide for businesses and government services.
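The Bloom filter encoding mentioned above can be sketched as follows: each quasi-identifier is split into character bigrams, each bigram is hashed into a bit array, and masked records are compared with a set-based similarity such as the Dice coefficient. The parameters and hashing scheme below are illustrative, not the authors' implementation:

```python
import hashlib

def bigrams(s: str) -> set:
    s = f"_{s.lower()}_"           # pad so first/last letters form bigrams
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value: str, m: int = 1000, k: int = 10) -> set:
    """Map a string's bigrams into an m-bit Bloom filter via k hash
    functions, represented here as the set of bit positions set to 1."""
    bits = set()
    for gram in bigrams(value):
        for i in range(k):
            digest = hashlib.sha256(f"{i}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % m)
    return bits

def dice(b1: set, b2: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two encodings."""
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

# Similarity survives the masking: a spelling variant of the same name
# scores higher than a different name, comparing only the encodings.
enc = bloom_encode
print(dice(enc("christen"), enc("kristen")) >
      dice(enc("christen"), enc("goldstein")))
```

Because similar strings share bigrams, and shared bigrams set the same bits, approximate matching is possible without revealing the underlying values, which is exactly the property PPRL needs; the privacy of such encodings against cryptanalysis is itself an active research topic.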
DLAW02 13th September 2016
15:00 to 15:30
Harvey Goldstein A scaling approach to record linkage
Co-authors: Mario Cortina-Borja (UCL), Katie Harron (LSHTM)

With increasing availability of large data sets derived from administrative and other sources, there is an increasing demand for the successful linking of these to provide rich sources of data for further analysis. The very large size of such datasets and the variation in the quality of the identifiers used to carry out linkage mean that existing approaches based upon ‘probabilistic’ models can make heavy computational demands. They are also based upon questionable assumptions. In this paper we suggest a new approach, based upon a scaling algorithm, that is computationally fast, requires only moderate amounts of storage and has intuitive appeal. A comparison with existing methods is given.
DLAW02 13th September 2016
16:00 to 17:00
'speed-dating' sessions
DLAW02 14th September 2016
09:00 to 10:00
Erhard Rahm Big data integration: challenges and new approaches
Data integration is a key challenge for Big Data applications to semantically enrich and combine large sets of heterogeneous data for enhanced data analysis. In many cases, there is also a need to deal with a very high number of data sources, e.g., product offers from many e-commerce websites. We will discuss approaches to deal with the key data integration tasks of (large-scale) entity resolution and schema matching. In particular, we discuss parallel blocking and entity resolution on Hadoop platforms together with load balancing techniques to deal with data skew. We also discuss challenges and recent approaches for holistic data integration of many data sources, e.g., to create knowledge graphs or to make use of huge collections of web tables.
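The pair-reduction idea behind blocking, which the parallel Hadoop approaches above distribute and load-balance, can be shown with a toy single-machine sketch. The blocking key here is invented for illustration; real systems use phonetic codes and multiple blocking passes:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    """Toy blocking key: first letter of surname plus birth year."""
    return f"{record['surname'][0].upper()}|{record['year']}"

def candidate_pairs(records: list):
    """Generate comparison pairs only within blocks, instead of over the
    full cross product -- the basic idea behind (parallel) blocking."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r["id"])
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

records = [
    {"id": 1, "surname": "Rahm",  "year": 1960},
    {"id": 2, "surname": "Rahm",  "year": 1960},
    {"id": 3, "surname": "Smith", "year": 1960},
    {"id": 4, "surname": "Smith", "year": 1985},
]
print(list(candidate_pairs(records)))   # [(1, 2)] -- 1 pair instead of 6
```

In a MapReduce setting each block becomes a reduce group, which is why data skew (one very large block) motivates the load balancing techniques the talk discusses.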
DLAW02 14th September 2016
10:00 to 10:30
Vassilios Verykios Space Embedding of Records for Privacy Preserving Linkage
Massive amounts of data, collected by a wide variety of organizations, need to be integrated and matched in order to facilitate data analyses that may be highly beneficial to businesses, governments, and academia. Record linkage, also known as entity resolution, is the process of identifying records that refer to the same real-world entity from disparate data sets. Privacy Preserving Record Linkage (PPRL) techniques are employed to perform the linkage process in a secure manner, when the data that need to be matched are sensitive. In PPRL, input records undergo an anonymization process that embeds the records into a space where the underlying data can be matched but not understood by the naked eye.

The PPRL problem is picking up a lot of steam lately due to a ubiquitous need for cross-matching of records that usually lack common unique identifiers and whose field values contain variations, errors, misspellings, and typos. The PPRL process, as applied to massive amounts of data, comprises an anonymization phase, a searching phase and a matching phase.

Several searching and anonymization approaches have been developed with the aim of scaling the PPRL process to big data without sacrificing quality of the results. Recently, redundant randomized methods have been proposed, which insert each record into multiple independent blocks in order to amplify the probability of bringing together similar records for comparison. The key feature of these methods is the formal guarantees they provide in terms of the accuracy of the generated results.

In this talk, we present both state-of-the-art private searching methods and anonymization techniques, exposing their characteristics, including their strengths and weaknesses, and we also present a comparative evaluation.
DLAW02 14th September 2016
10:30 to 11:00
Andy Boyd ‘Data Safe Havens’ as a framework to support record linkage in observational studies: evidence from the Project to Enhance ALSPAC through Record Linkage (PEARL).
The health research community are engaged in projects which require a wealth of data. These data can be drawn directly from research participants, or via linkage to participants’ routine records. Frequently, investigators require information from multiple sources with multiple legal owners. A fundamental challenge for data managers – such as those maintaining cohort study databanks - is to establish data processing and analysis pipelines that meet the legal, ethical and privacy expectations of participants and data owners alike. This demands socio-technical solutions that may easily become enmeshed in protracted debate and controversy as they encounter the norms, values, expectations and concerns of diverse stakeholders. In this context, ‘Data Safe Havens’ can provide a framework for repositories in which sensitive data are kept securely within governance and informatics systems that are fit-for-purpose, appropriately tailored to the data, while being accessible to legitimate users for legitimate purposes (see Burton et al, 2015. http://www.ncbi.nlm.nih.gov/pubmed/26112289).  

In this paper I will describe our data linkage experiences gained through the Project to Enhance ALSPAC through Record Linkage (PEARL); a project aiming to establish linkages between participants of the ALSPAC birth cohort study and their routine records. This exemplar illustrates how the governance and technical solutions encompassed within the ALSPAC Data Safe Haven have helped counter and address the real world data linkage challenges we have faced.
DLAW02 14th September 2016
11:30 to 12:00
Mauricio Sadinle A Bayesian Partitioning Approach to Duplicate Detection and Record Linkage
Record linkage techniques allow us to combine different sources of information from a common population in the absence of unique identifiers. Linking multiple files is an important task in a wide variety of applications, since it permits us to gather information that would not be otherwise available, or that would be too expensive to collect. In practice, an additional complication appears when the datafiles to be linked contain duplicates. Traditional approaches to duplicate detection and record linkage output independent decisions on the coreference status of each pair of records, which often leads to non-transitive decisions that have to be reconciled in some ad-hoc fashion. The joint task of linking multiple datafiles and finding duplicate records within them can be alternatively posed as partitioning the datafiles into groups of coreferent records. We present an approach that targets this partition as the parameter of interest, thereby ensuring transitive decisions. Our Bayesian implementation allows us to incorporate prior information on the reliability of the fields in the datafiles, which is especially useful when no training data are available, and it also provides a proper account of the uncertainty in the duplicate detection and record linkage decisions. We show how this uncertainty can be incorporated in certain models for population size estimation. Throughout the document we present a case study to detect killings that were reported multiple times to organizations recording human rights violations during the civil war of El Salvador. 
DLAW02 14th September 2016
12:00 to 12:30
Changyu Dong From Private Set Intersection to Private Record Linkage
Record linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, ideally records should be linked in a private way such that no information other than the matching result is leaked in the process. One approach for Private Record Linkage (PRL) is by using cryptographic protocols. In this talk, I will introduce Private Set Intersection (PSI), which is a type of cryptographic protocol that enables two parties to obtain the intersection of their private sets. It is almost trivial to build an exact PRL protocol from a PSI protocol. With more efforts, it is also possible to build an approximate PRL protocol from PSI that allows linking records based on certain similarity metrics. In this talk, I will present efficient PSI protocols, and how to obtain PRL protocols that are practically efficient and effective.     
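The flavour of a PSI protocol can be conveyed by the classic commutative-exponentiation construction, in which each party blinds hashed elements with a private exponent and doubly-blinded values can be compared without revealing non-matching elements. This is a toy sketch with insecurely small parameters, not the efficient protocols discussed in the talk:

```python
import hashlib
import secrets

# WARNING: illustrative parameters only. 2**127 - 1 (a Mersenne prime) is
# far too small for real security, and a production implementation needs a
# vetted group, hashing-to-group, and constant-time arithmetic.
P = 2**127 - 1

def h(element: str) -> int:
    """Hash an element into the multiplicative group mod P."""
    return int(hashlib.sha256(element.encode()).hexdigest(), 16) % P

def blind(elements, secret):
    """Raise each hashed element to a private exponent."""
    return {pow(h(e), secret, P): e for e in elements}

alice_set = {"alice@x.org", "bob@y.org", "carol@z.net"}
bob_set = {"bob@y.org", "dave@w.com"}

a = secrets.randbelow(P - 2) + 1   # Alice's private exponent
b = secrets.randbelow(P - 2) + 1   # Bob's private exponent

# Each party blinds its own set, then the *other* party blinds it again.
# h(x)^(a*b) mod P is the same regardless of exponentiation order, so
# matching doubly-blinded values reveals the intersection and nothing else.
alice_once = blind(alice_set, a)
bob_once = blind(bob_set, b)
alice_twice = {pow(v, b, P) for v in alice_once}            # Bob re-blinds
bob_twice = {pow(v, a, P): e for v, e in bob_once.items()}  # Alice re-blinds

intersection = {e for v, e in bob_twice.items() if v in alice_twice}
print(intersection)   # {'bob@y.org'}
```

Moving from this exact-match sketch to approximate PRL, linking despite typos and variations, is precisely the "more effort" step the abstract alludes to.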
DLAW02 15th September 2016
09:00 to 17:00
Hackathon / Zeeland challenge
DLA 29th September 2016
15:30 to 16:30
Grigorios Loukides Sanitization for sequential data
Organizations disseminate sequential data to support applications in domains ranging from marketing to healthcare. Such data are typically modeled as a collection of sequences, or a series of time-stamped events, and they are mined by data recipients aiming to discover actionable knowledge. However, the mining of sequential data may expose sensitive patterns that leak confidential knowledge, or lead to intrusive inferences about groups of individuals.   In this talk, I will review the problem and present two approaches that prevent it, while retaining the usefulness of data in mining tasks. The first approach is applicable to a collection of sequences and sanitizes sensitive patterns by permuting their events. The selected permutations avoid changes in the set of frequent non-sensitive patterns and in the ordering information of the sequences. The second approach is applicable to a series of time-stamped events and sanitizes sensitive events by deleting them from carefully selected time points. The deletion of events is guided by a model that captures changes to the probability distribution of events across the sequence.  
DLA 20th October 2016
15:30 to 16:30
Tom Dalton Evaluating Data Linkage: Creating longitudinal synthetic data to provide a gold-standard linked dataset
When performing probabilistic data linkage on real-world data we, by the very fact that we need to link it, do not know the true linkage. Therefore, the success of our linkage approach is difficult to evaluate. Often small hand-linked datasets are used as a ‘gold standard’ for the linkage approach to be evaluated against. However, errors in the hand-linkage and the limited size and number of these datasets do not allow for robust evaluation. This research focuses on the creation of longitudinal synthetic datasets for the domain of population reconstruction. In this talk I will cover the previous and current models we have created to achieve this and detail our approaches to how we: define the desired behaviour in the model to avoid clashes between input distributions, verify the statistical correctness of the population, and initialise the model such that the starting population meets the temporal requirements of the desired behaviour. To conclude I will outline the model’s intended use for linkage evaluation, its other potential uses, and also take questions.

DLAW04 28th October 2016
10:00 to 11:00
Laura Brandimarte How does Government surveillance affect perceived online privacy/security and online information disclosure?
Disclosure behaviors in the digital world are affected by perceived privacy and security just as much as, or arguably more than, by the actual privacy/security features of the digital environment. Several Governments have recently been at the center of attention for secret surveillance programs that have affected the sense of privacy and security people experience online. In this talk, I will discuss evidence from two research projects showing how privacy concerns and disclosure behaviors are affected by perceived privacy/security intrusions associated with Government monitoring and surveillance. These two interdisciplinary projects bring together methodologies from different disciplines: information systems, machine learning, psychology, and economics.

The first project is in collaboration with the Census Bureau, and studies geo-location and its effects on willingness to disclose personal information. The U.S. Census Bureau has begun a transition from a paper-based questionnaire to an Internet-based one. Online data collection would not only allow for a more efficient gathering of information; it would also, through geo-location technologies, allow for the automated inference of the location from which the citizen is responding. Geo-location features in Census forms, however, may raise privacy concerns and even backfire, as they allow for the collection of a sensitive piece of information without explicit consent of the individual. Four online experiments investigate individuals’ reactions to geo-location by measuring willingness to disclose personal information as a function of geo-location awareness and the entity requesting information: research or Governmental institutions. The experiments also explicitly test how surveillance primes affect the relationship between geo-location awareness and disclosure. Consistent with theories of perceived risk, contextual integrity, and fairness in social exchanges, we find that awareness of geo-location increases privacy concerns and perceived sensitivity of requested information, thus decreasing willingness to disclose sensitive information, especially when participants did not have a prior expectation that the institution would collect that data. No significant interaction effects are found for a surveillance prime.

The second project is ongoing research about the “chilling effects” of Government surveillance on social media disclosures, or the tendency to self-censor in order to cope with mass monitoring systems raising privacy concerns. Until now, such effects have only been estimated using either Google/Bing search terms, Wikipedia articles, or survey data. In this research in progress, we propose a new method in order to test for chilling effects in online social media platforms. We use a unique, large dataset of Tweets and propose the use of new statistical machine learning techniques in order to detect anomalous trends in user behavior (use of predetermined, sensitive sets of keywords) after Snowden’s revelations made users aware of existing surveillance programs.
DLAW04 28th October 2016
11:30 to 12:30
Ian Schmutte Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
Co-author: John M. Abowd (Cornell University and U.S. Census Bureau)

We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees, using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε,δ)-differential privacy with (α,β)-accuracy for the Private Multiplicative Weights query release mechanism, to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.


Related Links
DLAW04 28th October 2016
13:30 to 14:30
Alessandro Acquisti The Economics of Privacy (remote presentation)
In the policy and scholarly debate over privacy, the protection of personal information is often set against the benefits society is expected to gain from large scale analytics applied to individual data. An implicit assumption underlies the contrast between privacy and 'big data': economic research is assumed to univocally predict that the increasing collection and analysis of personal data will be an economic win-win for data holders and data subjects alike - some sort of unalloyed public good. Using a recently published review of the economic literature on privacy, I will work from within traditional economic frameworks to investigate this notion. In so doing, I will highlight how results from economic research on data sharing and data protection actually paint a nuanced picture of the economic benefits and costs of privacy.
DLAW04 28th October 2016
14:30 to 15:30
Katrina Ligett Buying Private Data without Verification
Joint work with Arpita Ghosh, Aaron Roth, and Grant Schoenebeck

We consider the problem of designing a survey to aggregate non-verifiable information from a privacy-sensitive population: an analyst wants to compute some aggregate statistic from the private bits held by each member of a population, but cannot verify the correctness of the bits reported by participants in his survey. Individuals in the population are strategic agents with a cost for privacy, i.e., they not only account for the payments they expect to receive from the mechanism, but also for the privacy costs of any information revealed about them by the mechanism’s outcome (the computed statistic as well as the payments) when determining their utilities. How can the analyst design payments to obtain an accurate estimate of the population statistic when individuals strategically decide both whether to participate and whether to truthfully report their sensitive information?

In this talk, we will discuss an approach to this problem based on ideas from peer prediction and differential privacy.
DLAW04 28th October 2016
16:00 to 17:00
Mallesh Pai The Strange Case of Privacy in Equilibrium Models
Joint work with Rachel Cummings, Katrina Ligett and Aaron Roth

The literature on differential privacy by and large takes the data set being analyzed as exogenously given. As a result, by varying a privacy parameter in his algorithm, the analyst straightforwardly chooses the potential privacy loss of any single entry in the data set.  Motivated by privacy concerns on the internet, we consider a stylized setting where the dataset is endogenously generated, depending on the privacy parameter chosen by the analyst. In our model, an agent chooses whether to purchase a product. This purchase decision is recorded, and a differentially private version of his purchase decision may be used by an advertiser to target the consumer. A change in the privacy parameter therefore affects, in equilibrium, the agent's purchase decision, the price of the product, and the targeting rule used by the advertiser. We demonstrate that the comparative statics with respect to the privacy parameter may be exactly reversed relative to the exogenous data set benchmark; for example, a higher privacy parameter may nevertheless result in a more informative outcome.  More care is needed in understanding the effects of private analysis of a data set that is endogenously generated.
DLA 31st October 2016
11:00 to 11:20
Robin Mitra tba
DLA 31st October 2016
11:20 to 11:40
Cong Chen tba
DLA 31st October 2016
12:00 to 12:20
Anne-Sophie Charest tba
DLA 31st October 2016
12:20 to 12:40
Christine O'Keefe Synthetic data - more questions than answers
DLA 31st October 2016
12:40 to 13:00
Natalie Shlomo tba
DLA 31st October 2016
13:00 to 13:20
Peter Christen Generating realistic personal data for data linkage research
DLA 31st October 2016
13:50 to 14:00
Mark Elliot GA Approaches to Synthetic data
In this talk
DLA 31st October 2016
14:00 to 14:20
Gillian Raab Analysis methods and utility measures
DLA 31st October 2016
14:20 to 14:40
Beata Nowok Challenges in generating and communicating synthetic data
DLA 31st October 2016
14:40 to 15:00
Joshua Snoke Beyond Microdata: Synthetic tweets
DLA 31st October 2016
15:00 to 15:20
Joerg Drechsler My View on the Key Research Questions for Synthetic Data
DLA 31st October 2016
16:40 to 17:00
Mark Elliot Plan for the rest of the week
DLA 3rd November 2016
15:30 to 16:30
Gillian Raab Measures of Utility for Synthetic Data
When synthetic data are produced to overcome potential disclosure they can be used either in place of the original data or, more commonly, to allow researchers to develop code that will ultimately be run on the original data.  The utility of synthetic data can be measured by comparing the results of the final analysis with the synthetic and original data. This is not possible until the final analysis is complete.  General utility measures that measure the overall differences between the original and synthetic data are more useful for those creating synthetic data. This presentation will discuss two such measures. The first is a propensity score measure originally proposed by Woo et al., 2009 and the second is one based on comparing tables, suggested by Voas and Williamson, 2001. Their null distributions, when the synthesis model is "correct", will be discussed as well as their practical implementation as part of the synthpop package.
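For categorical data, the propensity-score measure can be sketched quite compactly: stack the original and synthetic records, estimate the propensity of each record being synthetic, and average the squared deviations from the overall synthetic share. The sketch below uses a saturated propensity model (cell proportions) for simplicity; it is an illustration of the idea, not the synthpop implementation:

```python
from collections import Counter

def pmse(original, synthetic):
    """Propensity-score mean-squared-error utility (after Woo et al., 2009).

    With a propensity model saturated on the (categorical) records, the
    propensity of a record being synthetic within each distinct cell is
    just the synthetic share of that cell. 0 means the two datasets are
    indistinguishable; larger values indicate lower utility.
    """
    n = len(original) + len(synthetic)
    c = len(synthetic) / n              # overall synthetic share
    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)
    total = 0.0
    for cell in orig_counts.keys() | syn_counts.keys():
        o, s = orig_counts[cell], syn_counts[cell]
        p = s / (o + s)                 # estimated propensity in this cell
        total += (o + s) * (p - c) ** 2
    return total / n

orig = [("F", "<30")] * 40 + [("M", "<30")] * 60
good = [("F", "<30")] * 41 + [("M", "<30")] * 59   # close to the original
bad  = [("F", "<30")] * 90 + [("M", "<30")] * 10   # badly skewed
print(pmse(orig, orig) == 0.0, pmse(orig, good) < pmse(orig, bad))   # True True
```

In practice the propensity model is fitted by logistic regression or CART over stacked data, which is what allows the null distribution under a "correct" synthesis model to be characterised.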
DLA 10th November 2016
15:30 to 16:30
Graham Cormode Engineering Privacy for Small Groups
Concern about how to collect sensitive user data without compromising individual privacy is a major barrier to greater availability of data. The model of Local Differential Privacy has recently gained favor and enjoys widespread adoption for data gathering from browsers and apps. Deployed methods use Randomized Response, which applies only when the user data is a single bit.  We study general mechanisms for data release which allow the release of statistics from small groups. We formalize this by introducing a set of desirable properties that such mechanisms can obey. Any combination of these can be satisfied by solving a linear program which minimizes a cost function. We also provide explicit constructions that are optimal for certain combinations of properties, and show a closed form for their cost. In the end, there are only three distinct optimal mechanisms to choose between: one is the well-known (truncated) geometric mechanism; the second a novel mechanism that we introduce, and the third is found as the solution to a particular LP. Through a set of experiments on real and synthetic data we determine which is preferable in practice, for different combinations of data distributions and privacy parameters.   Joint work with Tejas Kulkarni and Divesh Srivastava
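The single-bit Randomized Response primitive referred to above can be sketched as follows (a textbook illustration, not the deployed browser mechanisms; all names are hypothetical):

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (1 + e^eps),
    otherwise report its flip; epsilon-local differential privacy."""
    p_true = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if random.random() < p_true else 1 - bit

def debiased_proportion(reports, epsilon):
    """Unbiased estimate of the true proportion of 1s from the
    noisy reports, inverting the known flipping probability."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed + p - 1) / (2 * p - 1)
```

The aggregator never sees any individual's true bit, yet can still recover the population proportion.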
DLA 22nd November 2016
15:30 to 16:30
Thomas Steinke Generalisation for Adaptive Data Analysis
Adaptivity is an important aspect of data analysis -- that is, the choice of questions to ask about a dataset is often informed by previous use of the same dataset. However, statistical validity is typically only guaranteed in a non-adaptive model, in which the questions must be specified before the dataset is collected. A recent line of work initiated by Dwork et al. (STOC 2015) provides a formal model for studying the power of adaptive data analysis.   This talk will show that there are sophisticated techniques -- using tools from information theory and differential privacy -- that enable us to ensure that adaptive data analysis provides statistically valid answers that generalise to the overall population from which the dataset was drawn. This talk will also discuss how adaptive data analysis is inherently more powerful than non-adaptive data analysis, namely there is an exponential separation between the number of adaptive queries needed to overfit a dataset and the number of non-adaptive queries needed.  
DLA 30th November 2016
16:00 to 17:00
Cynthia Dwork Rothschild Lecture: The Promise of Differential Privacy
The rise of "Big Data" has been accompanied by an increase in the twin risks of spurious scientific discovery and privacy compromise.  A great deal of effort has been devoted to the former, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing.  However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, selected non-adaptively before the data are gathered, whereas in practice data are shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. Privacy-preserving data analysis also has a large literature, spanning several disciplines. However, many attempts have proved problematic either in practice or on paper.   
"Differential privacy" – a recent notion tailored to situations in which data are plentiful – has provided a theoretically sound and powerful framework, giving rise to an explosion of research. We will review the definition of differential privacy, describe some basic algorithmic techniques for achieving it, and see that it also prevents false discoveries arising from adaptivity in data analysis.  
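One of the basic algorithmic techniques for achieving differential privacy is the Laplace mechanism; a minimal sketch for a numeric query (illustrative only, with a hypothetical function name):

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace(sensitivity / epsilon) noise;
    the difference of two iid Exp(1) draws is standard Laplace noise."""
    scale = sensitivity / epsilon
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_value + noise
```

For a counting query the sensitivity is 1, so the noise scale is simply 1/epsilon: smaller epsilon means stronger privacy and noisier answers.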
DLA 1st December 2016
15:30 to 16:30
Joerg Drechsler Strategies to facilitate access to detailed geocoding information based on synthetic data
In this seminar we investigate if generating synthetic data can be a viable strategy to provide access to detailed geocoding information for external researchers without compromising the confidentiality of the units included in the database. This research was motivated by a recent project at the Institute for Employment Research (IAB) that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. Based on these data we evaluate the performance of several synthesizers in terms of addressing the trade-off between preserving analytical validity and limiting the risk of disclosure. We propose strategies for making the synthesizers scalable for such large files, introduce analytical validity measures for the generated data and provide general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geographical information.
DLA 1st December 2016
16:30 to 17:30
Atikur Khan A Risk-Utility Balancing Approach to Generate Synthetic Microdata
Generation of synthetic microdata is one of the promising approaches to statistical disclosure control of microdata. Methods for generating synthetic microdata for multivariate continuous variables include noise addition and decomposition of the data matrix. We present a framework to generate a synthetic data matrix based on resampling from the Stiefel manifold and application of Slutsky's theorem in singular value decomposition. We also derive utility and risk measures based on this theorem and present a risk-utility balancing approach to generate synthetic continuous microdata. We apply the proposed methods to some reference microdata sets and demonstrate their usefulness.


DLA 2nd December 2016
14:00 to 15:00
Jordi Soria-Comas Topics in differential privacy: optimal noise and record-perturbation-based data sets
We explore two different aspects of differential privacy. First we explore the optimality of noise distributions in noise addition. In particular, we show that the Laplace distribution is nearly optimal in the univariate case, but not in the multivariate case. Optimal distributions are described. Then we explore the generation of differentially private data sets via perturbative masking of the original records. This approach is remarkably more efficient than histogram-based approaches, but a naive application of it may completely damage the data utility. In particular, we analyze the use of microaggregation to reduce the sensitivity and, thus, the amount of noise required to attain differential privacy.
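The univariate microaggregation step can be sketched as follows (a simplified illustration: the final group may be undersized, real schemes merge it, and the multivariate case needs a distance-based partition):

```python
def microaggregate(values, k):
    """Univariate microaggregation: sort, cut into consecutive groups
    of k records, and replace each record by its group mean. Returns
    the aggregated values in sorted order (a sketch simplification)."""
    ordered = sorted(values)
    result = []
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        result.extend([sum(group) / len(group)] * len(group))
    return result
```

Because every record is replaced by a mean over k records, changing one record moves each released value by at most a k-th of what it otherwise could, which is what reduces the sensitivity before noise is added.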
DLAW03 7th December 2016
10:00 to 10:45
Bradley Malin Gaming the System for Privacy
Over the past several decades, there has been a cat-and-mouse game played between data protectors and data users.  It seems as though every time a new model of data privacy is posited, a new attack is published along with high-profile demonstrations of its failure to guarantee protection.  This has, upon many occasions, led to outcries about how privacy is either dead, is dying, or was a mere myth and never existed in the first place.  The goal of this talk is to review how we got to this juncture in time and suggest that data privacy may not, in fact, be dead, but that our technical definitions of the problem may require augmentation to account for real world adversarial settings.  In doing so, I will introduce a new direction in privacy, which is rooted in a formal game theoretic framework and will provide examples of how this view can provide for greater flexibility with respect to several classic privacy problems, including the publication of individual-level records and summary statistics.
DLAW03 7th December 2016
11:30 to 12:00
Anna Oganian Combining statistical disclosure limitation methods to preserve relationships and data-specific constraints in survey data.
Applications of data swapping and noise are among the most widely used methods for Statistical Disclosure Limitation (SDL) by statistical agencies for public-use non-interactive data release. The core ideas of swapping and noise are conceptually easy to understand and are naturally suited for masking purposes. We believe that they are worth revisiting with a special emphasis given to the utility aspects of these methods and to the ways of combining the methods to increase their efficiency and reliability.  Indeed, many data collecting agencies use complex sample designs to increase the precision of their estimates and often allocate additional funds to obtain larger samples for particular groups in the population. Thus, it is particularly undesirable and counterproductive when SDL methods applied to these data significantly change the magnitude of estimates and/or their levels of precision. We will present and discuss two methods of disclosure limitation based on swapping and noise, which can work together in synergy while protecting continuous and categorical variables. The first method is a version of multiplicative noise that preserves means and covariance together with some structural constraints in the data. The second method is loosely based on swapping. It is designed with the goal of preserving the relationships between strata-defining variables with other variables in the survey. We will show how these methods can be applied together enhancing each other’s efficiency.
DLAW03 7th December 2016
12:00 to 12:30
Yulia Gel Bootstrapped Inference for Degree Distributions in Large Sparse Networks
We propose a new method of nonparametric bootstrap to quantify estimation uncertainties in functions of network degree distribution in large ultra sparse networks. Both network degree distribution and network order are assumed to be unknown. The key idea is based on adaptation of the ``blocking'' argument, developed for bootstrapping of time series and re-tiling of spatial data, to random networks. We first sample a set of multiple ego networks of varying orders that form a patch, or a network block analogue, and then resample the data within patches. To select an optimal patch size, we develop a new computationally efficient and data-driven cross-validation algorithm. In our simulation study, we show that the new fast patchwork bootstrap (FPB) outperforms competing approaches by providing sharper and better calibrated confidence intervals for functions of a network degree distribution, including the cases of networks in an ultra sparse regime. We illustrate the FPB in application to analysis of social networks and discuss its potential utility for nonparametric anomaly detection and privacy-preserving data mining.
DLAW03 7th December 2016
13:30 to 14:15
Sheila Bird Barred from work in Scottish prisons, data science to the rescue: discoveries about drugs-related deaths and women via record linkage - A part of Women in Data Science (WiDS)
Willing Anonymous HIV Salivary (WASH) surveillance studies in Scottish prisons changed the focus on drugs in prisons - not always for the better. Barred from work in Scottish prisons, we turned to powerful record-linkage studies to quantify drugs-related deaths soon after prison-release (and how to reduce them); reveal that female injectors' risk of drugs-related death was half that of male injectors; but that the female advantage narrowed with age; and is not evident for methadone-specific deaths. Explanations for these strong, validated empirical findings are the next step.
DLAW03 7th December 2016
14:15 to 15:00
Christine O'Keefe A new relaxation of differential privacy - A part of Women in Data Science (WiDS)
Co-author: Anne-Sophie Charest 

Agencies and organisations around the world are increasingly seeking to realise the value embodied in their growing data holdings, including by making data available for research and policy analysis. On the other hand, access to data must be provided in a way that protects the privacy of individuals represented in the data. In order to achieve a justifiable trade-off between these competing objectives, appropriate measures of privacy protection and data usefulness are needed.  

In recent years, the formal differential privacy condition has emerged as a verifiable privacy protection standard. While differential privacy has had a marked impact on theory and literature, it has had far less impact in practice. Some concerns include the possibility that the differential privacy standard is so strong that statistical outputs are altered to the point where they are no longer useful. Various relaxations have been proposed to increase the utility of outputs, although none has yet achieved widespread adoption. In this paper we describe a new relaxation of the differential privacy condition, and demonstrate some of its properties. 
DLAW03 7th December 2016
15:30 to 16:15
Cynthia Dwork Theory for Society: Fairness in Classification
The uses of data and algorithms are publicly under debate on an unprecedented scale. Listening carefully to the concerns of rights advocates, researchers, and other data consumers and collectors suggests many possible current and future problems in which theoretical computer science can play a positive role.  The talk will discuss the nascent mathematically rigorous study of fairness in classification and scoring.
DLAW03 8th December 2016
09:30 to 10:15
Yosi (Joseph) Rinott Attempts to apply Differential Privacy for comparison of standard Table Builder type schemes (O'Keefe, Skinner, Shlomo)
We try to compare different practical perturbation schemes for frequency tables used by certain agencies, with different parameters, using the parameters of Differential Privacy guaranteed by these schemes. At the same time we look at the utility of perturbed data in terms of loss functions that are common in statistics, and also by studying the conditions required for certain statistical properties of the original table, such as independence of certain variables, to be preserved under the perturbation. The `worst case' nature of Differential Privacy appears problematic, and so do alternative definitions which rely on scenarios that are not the worst. This is ongoing work, and we still have many unanswered questions, which I will raise in the talk in the hope of help from the audience.
DLAW03 8th December 2016
10:15 to 11:00
Josep Domingo-Ferrer, Jordi Soria-Comas Individual Differential Privacy
Differential privacy is well-known because of the strong privacy guarantees it offers: the results of a data analysis must be indistinguishable between data sets that differ in one record. However, the use of differential privacy may limit the accuracy of the results significantly. Essentially, we are limited to data analyses with small global sensitivity (although some workarounds have been proposed that improve the accuracy of the results when the local sensitivity is small). We introduce individual differential privacy (iDP), a relaxation of differential privacy that: (i) preserves the strong privacy guarantees that differential privacy gives to individuals, and (ii) improves the accuracy of the results significantly. The improvement in the accuracy comes from the fact that the trusted party does a more precise assessment of the risk associated to a given data analysis. This is possible because we allow the trusted party to take advantage of all the available information, namely the actual data set.
DLAW03 8th December 2016
11:30 to 12:00
Marco Gaboardi PSI : a Private data Sharing Interface
Co-authors: James Honaker (Harvard University) , Gary King (Harvard University) , Jack Murtagh (Harvard University) , Kobbi Nissim (Ben-Gurion University and CRCS Harvard University) , Jonathan Ullman (Northeastern University) , Salil Vadhan (Harvard University)

We provide an overview of the design of PSI (“a Private data Sharing Interface”), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
PSI is designed so that none of its users need expertise in privacy, computer science, or statistics. PSI enables them to make informed decisions about the appropriate use of differential privacy, the setting of privacy parameters, the partitioning of a privacy budget across different statistics, and the interpretation of errors introduced for privacy.
Additionally, PSI is designed to be integrated with existing and widely used data repository infrastructures as part of a broader collection of mechanisms for the handling of privacy-sensitive data, including an approval process for accessing raw data (e.g. through IRB review), access control, and secure storage.
Its initial set of differentially private algorithms were chosen to include statistics that have wide use in the social sciences, and are integrated with existing statistical software designed for modeling, interpreting, and exploring social science data.
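The partitioning of a privacy budget across statistics, as described above, reduces under basic sequential composition to splitting the total epsilon; a minimal sketch (a hypothetical helper, not PSI's actual interface):

```python
def split_budget(total_epsilon, weights):
    """Partition a total privacy budget across statistics in proportion
    to the given weights; under basic sequential composition the
    per-statistic epsilons simply sum back to the total."""
    s = sum(weights)
    return [total_epsilon * w / s for w in weights]
```

Statistics given larger weights receive a larger share of the budget and so are answered with less noise.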

DLAW03 8th December 2016
12:00 to 12:30
Mark Bun The Price of Online Queries in Differential Privacy
Co-authors: Thomas Steinke (IBM Research - Almaden), Jonathan Ullman (Northeastern University)
 
We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set of allowable queries via one of three interactive models. These models capture whether the queries are given to the mechanism all in a single batch (“offline”), whether they are chosen in advance but presented to the mechanism one at a time (“online”), or whether they may be chosen by an analyst adaptively (“adaptive”).
 
Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models might be equivalent.
 
We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that many more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries, such as threshold queries over the real line.
 

DLAW03 8th December 2016
13:30 to 14:15
Ross Anderson Can we have medical privacy, cloud computing and genomics all at the same time?
"The collection, linking and use of data in biomedical research and health care: ethical issues" is a report from the Nuffield Bioethics Council, published last year. It took over a year to write. Our working group came from the medical profession, academics, insurers and drug companies. As the information we gave to our doctors in private to help them treat us is now collected and treated as an industrial raw material, there has been scandal after scandal. From failures of anonymisation through unethical sales to the care.data catastrophe, things just seem to get worse. Where is it all going, and what must a medical data user do to behave ethically?

We put forward four principles. First, respect persons; do not treat their confidential data as if it were coal or bauxite. Second, respect established human-rights and data-protection law, rather than trying to find ways round it. Third, consult people who’ll be affected or who have morally relevant interests. And fourth, tell them what you’ve done – including errors and security breaches.

Since medicine is the canary in the mine, we hope that the privacy lessons can be of value elsewhere – from consumer data to law enforcement and human rights.

DLAW03 8th December 2016
14:15 to 15:00
Paul Burton DataSHIELD: taking the analysis to the data not the data to the analysis
Research in modern biomedicine and social science often requires sample sizes so large that they can only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important governance questions and can be controversial. These reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that circumvents some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data. Commands are sent from a central analysis computer (AC) to several data computers (DCs) that store the data to be co-analysed. Each DC is located at one of the studies contributing data to the analysis. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands that are transmitted back and forth between the DCs and the AC. Technical implementation of DataSHIELD employs a specially modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is then controlled through a standard R environment at the AC. DataSHIELD is most often configured to carry out a – typically fully-efficient – analysis that is mathematically equivalent to placing all data from all studies in one central database and analysing them all together (with centre-effects, of course, where required). Alternatively, it can be set up for study-level meta-analysis: estimates and standard errors are derived independently from each study and are subject to centralized random effects meta-analysis at the AC. 
DataSHIELD is being developed as a flexible, easily extendible, open-source way to provide secure data access to a single study or data repository as well as for settings involving several studies. Although the talk will focus on the version of DataSHIELD that represents our current standard implementation, it will also explore some of our recent thinking in relation to issues such as vertically partitioned (record linkage) data, textual data and non-disclosive graphical visualisation. 
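The flavour of linking parallel analyses through non-disclosive summary statistics can be sketched for the simplest case, a pooled mean (an illustration only, not the DataSHIELD implementation; the function name is hypothetical):

```python
def pooled_mean(summaries):
    """Combine non-disclosive per-study summaries (sum, n) into the
    same estimate a pooled individual-level analysis would give."""
    total = sum(s for s, n in summaries)
    count = sum(n for s, n in summaries)
    return total / count
```

Each data computer releases only its (sum, n) pair, never individual records, yet the analysis computer recovers exactly the mean of the virtually pooled data.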
DLAW03 8th December 2016
15:30 to 16:00
Grigorios Loukides Anonymization of high-dimensional datasets
Organizations collect increasing amounts of high-dimensional data about individuals. Examples are health record datasets containing diagnosis information, marketing datasets containing products purchased by customers, and web datasets containing check-ins in social networks. The sharing of such data is increasingly needed to support applications and/or satisfy policies and legislation. However, the high dimensionality of data makes their anonymization difficult, both from an effectiveness and from an efficiency point of view. In this talk, I will illustrate the problem and briefly review the main techniques used in the anonymization of high-dimensional data. Subsequently, I will present a class of methods we have been developing for anonymizing complex, high-dimensional data and their application to the healthcare domain. 
DLAW03 8th December 2016
16:00 to 16:30
Mark Elliot An empirical measure of attribution risk (for fully synthetic data)
DLAW03 8th December 2016
16:30 to 17:00
Natalie Shlomo Assessing Re-identification Risk in Sample Microdata
Co-author: Chris Skinner      

Abstract: Disclosure risk occurs when there is a high probability that an intruder can identify an individual in released sample microdata and confidential information may be revealed. A probabilistic modelling framework based on the Poisson log-linear model is used for quantifying disclosure risk in terms of population uniqueness when population counts are unknown. This method does not account for measurement error arising either naturally from survey processes or purposely introduced as a perturbative disclosure limitation technique. The probabilistic modelling framework for assessing disclosure risk is expanded to take into account the misclassification/perturbation and demonstrated on sample microdata which has undergone perturbation procedures. Finally, we adapt the probabilistic modelling framework to assess the disclosure risk of samples from sub-populations and show some initial results.
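One standard building block of such risk assessment can be sketched under a pure Poisson cell model with Bernoulli sampling (a simplified illustration of the probability that a sample unique is population unique; the talk's log-linear modelling and measurement-error extensions go well beyond this):

```python
import math

def risk_of_sample_unique(lam, sampling_fraction):
    """P(population unique | sample unique) when the cell's population
    count is Poisson(lam) and records are sampled independently with
    the given fraction: the unseen remainder is Poisson(lam * (1 - f)),
    so the risk is the probability that the remainder is zero."""
    return math.exp(-lam * (1 - sampling_fraction))
```

In a census (sampling fraction 1) a sample unique is certainly population unique, and the risk falls as the expected cell size grows or the sampling fraction shrinks.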
DLAW03 8th December 2016
17:00 to 17:30
Discussion of the Future
This session is for open discussion, to share ideas about opportunities for follow-up to the DLA programme:
- both scientific, regarding what has emerged from the programme and ways that this can be taken forward;
- and more practical, regarding e.g. future events, research programmes and opportunities for interaction.
DLAW03 9th December 2016
09:30 to 10:15
Ross Gayler Linkage and anonymisation in consumer finance
The operation of the consumer finance industry has important social and economic consequences, and is heavily regulated. The industry makes millions of decisions based on personal data that must be protected. The people who create these decision processes seek to make industry decision making practices as rational and well informed as possible, but sometimes this work is surprisingly hard. Major difficulties occur around the issues of linkage and anonymisation. I will describe how some of these practically important issues arise and play out. This is intended as a motivating example for developments in data linkage and anonymisation (no maths involved).
DLAW03 9th December 2016
10:15 to 10:45
David Hand On anonymisation and discrimination
The perspective of anonymisation is one of ‘I don’t know who you are, but I know this about you’, while the perspective of anti-discrimination legislation is the complementary ‘I don’t know this about you, but I know who you are’. I look at how organisations have attempted to comply with the law, and show that this has led to confusion and lack of compliance. The fundamental problem arises from ambiguous and incompatible definitions, and recent changes to the law have made it worse. I illustrate some of the damaging adverse consequences, for both individuals and for society, that have arisen from this confusion.
DLAW03 9th December 2016
11:15 to 12:00
Daniel Kifer Statistical Asymptotics with Differential Privacy
Differential privacy introduces non-ignorable noise into synthetic data and query answers. A proper statistical analysis must account for both the sampling noise in the data and the additional privacy noise. In order to accomplish this, it is often necessary to modify the asymptotic theory of statistical estimators. In this talk, we will present a formal approach to this problem, with applications to confidence intervals and hypothesis tests.
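The idea of widening a confidence interval to account for both sampling noise and privacy noise can be sketched for a mean released via the Laplace mechanism (a rough normal-approximation illustration with hypothetical names, not the formal asymptotics of the talk):

```python
import math

def dp_mean_ci(noisy_mean, sample_var, n, sensitivity, epsilon, z=1.96):
    """Approximate 95% CI for a mean released with Laplace noise on the
    total: total variance = sampling variance of the mean plus the
    privacy noise variance 2 * (sensitivity / (n * epsilon)) ** 2."""
    noise_var = 2.0 * (sensitivity / (n * epsilon)) ** 2
    half_width = z * math.sqrt(sample_var / n + noise_var)
    return (noisy_mean - half_width, noisy_mean + half_width)
```

Ignoring the second variance term, as a naive analysis would, yields intervals that are too narrow and undercover the true parameter.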
DLAW03 9th December 2016
12:00 to 12:30
Steven Murdoch Decentralising Data Collection and Anonymisation
A frequent approach for anonymising datasets is for individuals to submit sensitive data records to a central authority. The central authority is then responsible for safely storing and sharing the data, for example by aggregating or perturbing records. However, this approach introduces the risk that the central authority may be compromised, whether from an externally originated hacking attempt or as a result of an insider attack. As a result, central authorities responsible for handling sensitive data records must be well protected, often at great expense, and even then the risk of compromise will not be eliminated.

In this talk I will discuss an alternative anonymisation approach, where sensitive data records have identifiable information removed before being submitted to the central authority. In order for this approach to work, not only must this first-stage anonymisation prevent the data from disclosing the identity of the submitter, but also the data records must be submitted in such a way as to prevent the central authority from being able to establish the identity of the submitter from submission metadata. I will show how advances in network metadata anonymisation can be applied to facilitate this approach, including techniques to preserve validity of data despite not knowing the identity of contributors.
DLAW03 9th December 2016
13:30 to 14:15
Moni Naor tba
DLAW03 9th December 2016
14:15 to 15:00
Silvia Polettini Mixed effects models with covariates perturbed for SDC
Co-author: Serena Arima (Sapienza Università di Roma)

We focus on mixed effects models with data subject to PRAM; an instance of this is a small area model. We assume that categorical covariates have been perturbed by Post Randomization, whereas the level identifier is not perturbed. We also assume that a continuous response is available, and consider a nested linear regression model:
$$
y_{ij} = X_{ij}^{'}\beta + v_{i} + e_{ij}, \quad j=1,\dots,n_{i}; \,\, i=1,\dots,m
$$
where $v_{i} \stackrel{iid}{\sim} N(0,\sigma^{2}_{v})$ (model error) and $e_{ij} \stackrel{iid}{\sim} N(0,\sigma^{2}_{e})$ (design error).

We resort to a measurement error model and define a unit-level small area model accounting for measurement error in discrete covariates. PRAM is defined in terms of a transition matrix $P$ modelling the changes in categories; we consider both the case of known $P$ and the case where $P$ is unknown and is estimated from the data.

A small simulation study is conducted to assess the effectiveness of the proposed Bayesian measurement error model in estimating the model parameters and to investigate the protection provided by PRAM in this context.
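The PRAM perturbation itself, given a known transition matrix $P$, can be sketched as follows (an illustration with hypothetical names):

```python
import random

def pram(records, P, levels):
    """Post Randomization: each record with category levels[i] is
    replaced by levels[j] with probability P[i][j] (rows sum to 1)."""
    index = {c: i for i, c in enumerate(levels)}
    out = []
    for c in records:
        u, cum = random.random(), 0.0
        for j, p in enumerate(P[index[c]]):
            cum += p
            if u < cum:
                out.append(levels[j])
                break
        else:  # guard against rounding in the row sum
            out.append(levels[-1])
    return out
```

An identity matrix leaves the data unchanged; off-diagonal mass is what provides the disclosure protection studied in the simulation.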