Videos and presentation materials from other INI events are also available.
Event | When | Speaker | Title | Presentation Material |
---|---|---|---|---|
DLAW01 |
5th July 2016 09:00 to 10:30 |
Peter Christen |
Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
Tutorial outline:
The tutorial consists of four parts: (1) Data linkage introduction, short history of data linkage, example applications, and the data linkage process (overview of the main steps). (2) Detailed discussion of all steps of the data linkage process (data cleaning and standardisation, indexing/blocking, field and record comparisons, classification, and evaluation), and core techniques used in these steps. (3) Advanced data linkage techniques, including collective, group and graph linking techniques, as well as advanced indexing techniques that enable large-scale data linkage. (4) Major concepts, protocols and challenges in privacy-preserving data linkage, which aims to link databases across organisations without revealing any private or confidential information. Assumed knowledge: The aim is to make this tutorial as accessible as possible to a wide-ranging audience from various backgrounds. The content will focus on concepts and techniques rather than details of algorithms. A basic understanding of databases, algorithms, and probability will be beneficial but is not required. The tutorial will be loosely based on the book “Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution and Duplicate Detection” (Springer, 2012) written by the presenter. |
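The main steps of the linkage process named in the tutorial outline (blocking/indexing, field comparison, classification) can be sketched with a toy example; all records, blocking keys, and thresholds below are invented for illustration:

```python
from itertools import combinations

# Hypothetical records to be linked.
records = [
    {"id": 1, "first": "Christine", "last": "Smith",  "zip": "2600"},
    {"id": 2, "first": "Christina", "last": "Smith",  "zip": "2600"},
    {"id": 3, "first": "Peter",     "last": "Miller", "zip": "2602"},
]

# Indexing/blocking: only records sharing a blocking key are compared,
# which shrinks the otherwise quadratic comparison space.
blocks = {}
for r in records:
    key = (r["last"][0], r["zip"])          # a simple invented blocking key
    blocks.setdefault(key, []).append(r)

def jaccard(a, b, q=2):
    """q-gram (here bigram) set similarity of two strings, in [0, 1]."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(ga & gb) / len(ga | gb)

# Field comparison followed by a toy threshold classifier.
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        sim = jaccard(r1["first"].lower(), r2["first"].lower())
        label = "match" if sim >= 0.4 else "non-match"
        print(r1["id"], r2["id"], round(sim, 2), label)
```

Only the two Smith records fall in the same block, so the Miller record is never compared against them, illustrating how blocking trades a small risk of missed matches for scalability.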
DLAW01 |
5th July 2016 11:00 to 12:30 |
Peter Christen |
Tutorial 1: Data Linkage – Introduction, Recent Advances, and Privacy Issues
(Tutorial outline as for the 09:00 to 10:30 session above.) |
DLAW01 |
5th July 2016 13:30 to 15:00 |
Adam Smith |
Tutorial 2: Defining ‘privacy’ for statistical databases
The tutorial will introduce differential privacy, a widely studied definition of privacy for statistical databases. We will begin with the motivation for rigorous definitions of privacy in statistical databases, covering several examples of how seemingly aggregate, high-level statistics can leak information about individuals in a data set. We will then define differential privacy, illustrate the definition with several examples, and discuss its properties. The bulk of the tutorial will cover the principal techniques used for the design of differentially private algorithms. Time permitting, we will touch on applications of differential privacy to problems having no immediate connection to privacy, such as equilibrium selection in game theory and adaptive data analysis in statistics and machine learning. |
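As a rough sketch of the kind of mechanism such a tutorial covers (illustrative only, not taken from the talk materials): the Laplace mechanism answers a counting query under ε-differential privacy, and the guarantee can be checked directly on the ratio of the noise densities for two neighbouring databases.

```python
import math
import random

def laplace_count(data, predicate, eps):
    """Release a counting query under eps-differential privacy.

    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace noise with scale 1/eps suffices.
    """
    true_count = sum(1 for x in data if predicate(x))
    # A Laplace(0, 1/eps) draw: the difference of two unit exponentials.
    noise = (random.expovariate(1.0) - random.expovariate(1.0)) / eps
    return true_count + noise

def laplace_density(x, scale):
    return math.exp(-abs(x) / scale) / (2 * scale)

# The eps-DP guarantee for neighbouring counts c and c + 1: at every
# possible output t, the density ratio is bounded by exp(eps).
eps, c = 0.5, 100
for t in (90.0, 100.5, 117.3):
    ratio = laplace_density(t - c, 1 / eps) / laplace_density(t - (c + 1), 1 / eps)
    assert ratio <= math.exp(eps) + 1e-12
```

The bound holds because the absolute values in the two exponents differ by at most the sensitivity, here 1.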
DLAW01 |
5th July 2016 15:30 to 17:00 |
Adam Smith |
Tutorial 2: Defining ‘privacy’ for statistical databases
(Abstract as for the 13:30 to 15:00 session above.) |
DLAW01 |
6th July 2016 10:00 to 11:00 |
Adam Smith | tba |
DLAW01 |
6th July 2016 11:30 to 12:30 |
Jerry Reiter |
Data Dissemination: A Survey of Recent Approaches, Challenges, and Connections to Data Linkage
I introduce common strategies for reducing disclosure risks when releasing public use microdata, i.e., data on individuals. I discuss some of their pros and cons in terms of data quality and disclosure risks, connecting to data linkage where possible. I also talk about a key challenge in data dissemination: how to give feedback to users on the quality of analyses of disclosure-protected data. Such feedback is essential if analysts are to trust results from (heavily) redacted microdata. It is also essential for query systems that report (perturbed) outputs from statistical models. However, such feedback leaks information about confidential data values. I discuss approaches to feedback that satisfy the differential privacy risk criterion when releasing diagnostics for regression models. |
DLAW01 |
6th July 2016 13:30 to 14:30 |
Cynthia Dwork |
Marginals and Malice
In 2008, Homer et al. rocked the genomics community with a discovery that altered the publication policies of the US NIH and the Wellcome Trust, showing that mere allele frequency statistics would permit a forensic analyst -- or a privacy attacker -- to determine the presence of an individual's DNA in a forensic mix -- or a case group. These results were seen as particularly problematic for Genome-Wide Association Studies (GWAS), where the marginals are SNP minor allele frequency statistics (MAFs). In this talk, we review the lessons of Homer et al. and report on recent generalizations and strengthenings of the attack, establishing the impossibility of privately reporting "too many" MAFs with any reasonable notion of accuracy. We then present a differentially private approach to finding significant SNPs that controls the false discovery rate. The apparent contradiction with the impossibility result is resolved by a relaxation of the problem, in which we limit the total number of potentially significant SNPs that are reported. Joint work with Smith, Steinke, Ullman, and Vadhan (lower bounds); and Su and Zhang (FDR control). |
DLAW01 |
6th July 2016 14:30 to 15:30 |
John Abowd |
The Challenge of Privacy Protection for Statistical Agencies
Since the field of statistical disclosure limitation (SDL) was first formalized by Ivan Fellegi in 1972, official statistical agencies have recognized that their publications posed confidentiality risks for the households and businesses who responded. For even longer, agencies have protected the source data for those publications by using secure storage methods and access authorization systems. In SDL, Dalenius (1977) and, in computer science, Goldwasser and Micali (1982) formalized what has become the modern approach to privacy protection in data publication: inferential disclosure limitation/semantic security. The modern approach to physical data security centers on firewall and encryption technologies. And the two sets of risks (disclosure through publication and disclosure through unauthorized access) have become increasingly interrelated. It is important to recognize the distinct issues, however. Secure multiparty computation and the stronger fully homomorphic encryption are formal solutions to the problem of permitting statistical computations without granting access to the decrypted data. Privacy-protected query publication is a formal solution to the problem of ensuring that inferential disclosures are bounded and that the bound is respected in all published tables. There are now tractable systems that combine secure multiparty computation with formal privacy protection of the computed statistics (e.g., Shokri and Shmatikov 2015). The challenge to statistical agencies is to learn how these systems work, and to move their own protection technologies in this direction. Private companies like Google and Microsoft already do this. Statistical agencies must be prepared to explain the differences in their publication requirements and security protocols that distinguish their chosen data storage methods and publications from those used by private companies.
|
DLAW01 |
6th July 2016 16:00 to 17:00 |
Peter Christen |
Recent developments and research challenges in data linkage
Techniques for linking and integrating data from different sources are becoming increasingly important in many application areas, including health, census, taxation, immigration and social welfare, crime and fraud detection, national security intelligence, business, bibliometrics, and the social sciences. In today's Big Data era, data linkage (also known as entity resolution, duplicate detection, and data matching) not only faces computational challenges due to the increasing size of data collections and their complexity, but also operational challenges, as many applications move from static environments to real-time processing and analysis of potentially very large and dynamically changing data streams, where real-time linking of records is required. Additionally, with the public's growing concerns about the use of their sensitive data, privacy and confidentiality often need to be considered when personal information is linked and shared between organisations. In this talk I will present a short introduction to data linkage, highlight recent developments in advanced data linkage techniques and methods - with an emphasis on work conducted in the computer science domain - and discuss future research challenges and directions. |
DLAW01 |
7th July 2016 10:00 to 11:00 |
Christine O'Keefe |
Measuring risk and utility in remote analysis and online data centres – why isn’t this problem already solved?
Remote analysis servers and online data centres have been around for quite a few years now, appearing both in the academic literature and in a range of large-scale implementations. Such systems are considered to provide good confidentiality protection for protecting privacy in the case of data about people, and for protecting commercial sensitivity in the case of data about businesses and enterprises. A variety of different methods for protecting confidentiality in the outputs of such systems have been proposed, and a range of them have been implemented and used in practice. However, much less common are quantitative assessments of the risk to confidentiality and of the usefulness of the system outputs for the purpose for which they are generated. Indeed, it has been suggested that perhaps such quantitative assessments are trying to measure the wrong things. In this talk we will provide an overview of the current state of literature and practice, and compare it with the overall problem objective, with a view to determining key open challenges and research frontiers in the area, possibly within a redefined statement of the overall challenge.
|
DLAW01 |
7th July 2016 11:30 to 12:30 |
Josep Domingo-Ferrer |
New Directions in Anonymization: Permutation Paradigm, Verifiability by Subjects and Intruders, Transparency to Users
Co-author: Krishnamurty Muralidhar (University of Oklahoma). There are currently two approaches to anonymization: "utility first" (use an anonymization method with suitable utility features, then empirically evaluate the disclosure risk and, if necessary, reduce the risk by possibly sacrificing some utility) or "privacy first" (enforce a target privacy level via a privacy model, e.g., k-anonymity or differential privacy, without regard to utility). To get formal privacy guarantees, the second approach must be followed, but then data releases come with no utility guarantees. Also, in general it is unclear how verifiable anonymization is by the data subject (how safely released is the record she has contributed?), what type of intruder is being considered (what does he know and want?) and how transparent anonymization is towards the data user (what is the user told about the methods and parameters used?). We show that, using a generally applicable reverse-mapping transformation, any anonymization of microdata can be viewed as a permutation plus (perhaps) a small amount of noise; permutation is thus shown to be the essential principle underlying any anonymization of microdata, which allows simple utility and privacy metrics to be given. From this permutation paradigm, a new privacy model naturally follows, which we call (d,v,f)-permuted privacy. The privacy ensured by this method can be verified via record linkage by each subject contributing an original record (subject-verifiability) and also at the data set level by the data protector. We then proceed to define a maximum-knowledge intruder model, which we argue should be the one considered in anonymization. Finally, we make the case for anonymization that is transparent to the data user, that is, compliant with Kerckhoffs' assumption (only the randomness used, if any, must stay secret). |
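The reverse-mapping observation at the heart of the permutation paradigm can be illustrated numerically; the masking method and noise level below are hypothetical, since the point is that any anonymization would do:

```python
import random

random.seed(1)
original   = [random.gauss(50, 10) for _ in range(8)]
# Any masking method: here, additive noise stands in for "some anonymization".
anonymized = [x + random.gauss(0, 3) for x in original]

# Reverse mapping: replace each anonymized value by the original value
# holding the same rank. The result is exactly a permutation of the
# original data, so whatever the masking did reduces to that permutation
# plus a small residual.
rank = sorted(range(len(anonymized)), key=lambda i: anonymized[i])
sorted_orig = sorted(original)
reverse_mapped = [0.0] * len(original)
for r, i in enumerate(rank):
    reverse_mapped[i] = sorted_orig[r]

assert sorted(reverse_mapped) == sorted(original)   # a pure permutation
residual = [a - z for a, z in zip(anonymized, reverse_mapped)]
```

How far the permutation scrambles the original order is then a natural handle for both utility and privacy metrics, which is the direction the abstract takes.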
DLAW01 |
7th July 2016 13:30 to 15:30 |
Christopher Dibben, Natalie Shlomo |
Perspectives on user needs for academic, government and commercial data
This session will explore a brief history of commercial applications of large data sets in marketing, retailing and consumer targeting. It will investigate the implications of new data sets, how anonymisation may be achieved, and what commercial business is considering with new open data standards. |
DLAW01 |
7th July 2016 16:00 to 17:00 |
Rebecca Steorts |
Modern Bayesian Record Linkage: Some Recent Developments and Open Challenges
Record linkage, also known as de-duplication, entity
resolution, and coreference resolution, is the process of merging together noisy
databases to remove duplicate entities. Record linkage is becoming more
essential in the age of big data, where duplicates are ever present in such
applications as official statistics, human rights, genetics, electronic medical
data, and so on. We briefly review the genesis of record linkage with the work
of Newcombe in 1959, and then move to recent Bayesian developments using novel
clustering approaches. We discuss challenges that have been overcome and open
ones that still need guidance and attention.
|
DLAW01 |
8th July 2016 10:00 to 11:00 |
Harvey Goldstein |
Probabilistic anonymisation of microdatasets and models for analysis
The general idea is to add random noise with known properties to some or all variables in a released dataset, typically following linkage, where the values of some identifier variables for individuals of interest are also available to an external ‘attacker’ who wishes to identify those individuals so that they can interrogate their records in the dataset. The noise is tuned to achieve any given degree of anonymity, avoiding identification by an ‘attacker’ via the linking of patterns based on the values of such variables. The noise so generated can then be ‘removed’ at the analysis stage since its characteristics are known; this requires disclosure of those characteristics by the linking agency. This leads to consistent parameter estimates, although a loss of efficiency will occur, but the data themselves are not degraded by any form of coarsening such as grouping.
|
|
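A minimal numerical sketch of this idea (all noise levels hypothetical): ordinary regression on the noise-added variable gives a slope attenuated towards zero, and because the noise variance is disclosed, the analyst can undo the attenuation and recover a consistent estimate.

```python
import random
import statistics

random.seed(42)
n, beta = 20000, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 0.5) for xi in x]

# The agency releases x* = x + noise, disclosing only the noise variance.
noise_sd = 0.5          # hypothetical
xs = [xi + random.gauss(0, noise_sd) for xi in x]

def slope(xv, yv):
    """Ordinary least squares slope of yv on xv."""
    mx, my = statistics.fmean(xv), statistics.fmean(yv)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xv, yv))
    sxx = sum((a - mx) ** 2 for a in xv)
    return sxy / sxx

naive = slope(xs, y)                 # attenuated towards zero
var_xs = statistics.pvariance(xs)
# 'Remove' the noise using its known variance (classical measurement-
# error correction): divide by the reliability ratio.
corrected = naive * var_xs / (var_xs - noise_sd ** 2)

print(round(naive, 3), round(corrected, 3))   # roughly 1.6 and 2.0
```

The naive slope is biased by the reliability factor var(x)/(var(x) + var(noise)) = 0.8 here, and the correction restores the true slope of 2 up to sampling error, at the cost of a larger standard error, matching the efficiency loss the abstract mentions.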
DLAW01 |
8th July 2016 11:30 to 12:30 |
Ray Chambers |
Statistical Modelling using Linked Data - Issues and Opportunities
Probabilistic linkage of multiple data sets is now popular
and widespread. Unfortunately, there appears to be little corresponding enthusiasm
for adjusting standard methods of statistical analysis when they are used with these
linked data sets, even though there is plenty of evidence from simulation
studies that both incorrect links and informative missed links can lead
to biased inference. In this presentation I will describe the key issues that
need to be addressed when analysing such linked data and some of the methods
that can help. In this context, I will focus in particular on the simple linear
regression model as a vehicle for demonstrating how knowledge about the statistical
properties of the linkage process as well as summary information about the population
distribution of the analysis variables can be used to correct for (or at least
alleviate) these inferential problems. Recent research at the Australian Bureau
of Statistics on a potential weighting/imputation approach to implementing
these solutions will also be presented.
|
DLAW01 |
8th July 2016 13:30 to 14:30 |
Aaron Roth | Using Differential Privacy to Control False Discovery in Adaptive Data Analysis |
DLAW01 |
8th July 2016 14:30 to 15:30 |
Jared Murray |
Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
When linking two databases (or deduplicating a single database) the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of indexing or blocking methods have been developed for this task. However, methods for inferring linkage structure that account for indexing, blocking, and filtering steps have not seen commensurate development. I review the implications of indexing, blocking and filtering, focusing primarily on the popular Fellegi-Sunter framework and proposing a new model to account for particular forms of indexing and filtering.
|
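The Fellegi-Sunter framework referred to in this abstract scores a candidate record pair by a sum of log-likelihood ratios over field agreements; a toy sketch with invented m- and u-probabilities:

```python
import math

# Hypothetical m- and u-probabilities per field: the chance the field
# agrees for a true match (m) versus for a non-match (u).
fields = {
    "surname":    {"m": 0.95, "u": 0.01},
    "first_name": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}

def fs_weight(agreements):
    """Fellegi-Sunter composite log-likelihood-ratio weight for a pair."""
    w = 0.0
    for f, p in fields.items():
        if agreements[f]:
            w += math.log2(p["m"] / p["u"])            # agreement weight
        else:
            w += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return w

# A pair agreeing on every field scores far above one agreeing on none;
# pairs are classified as matches, possible matches, or non-matches by
# comparing the weight against two thresholds.
all_agree = fs_weight({f: True for f in fields})
none_agree = fs_weight({f: False for f in fields})
print(round(all_agree, 2), round(none_agree, 2))
```

Indexing and blocking, the subject of the talk, restrict which pairs ever get scored this way, which is exactly why the abstract argues the model should account for them.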
DLA |
28th July 2016 15:30 to 16:30 |
Anne-Sophie Charest |
Privacy for Bayesian modelling
The literature now contains a large set of methods to
privately estimate parameters from a classical statistical model, or to conduct
a data mining or machine learning task. However, little is known about how to
perform Bayesian statistics privately.
In this talk, I will share my thoughts, and a few
results, about ways in which Bayesian modelling could be performed to offer
some privacy guarantee. In particular, I will discuss some attempts at sampling
from posterior predictive distributions under the constraint of differential
privacy (DP). I will also discuss empirical differential privacy, a criterion
designed to estimate the DP privacy level offered by a certain Bayesian model,
and present some recent results on the meaning and limits of this privacy measure.
A lot of what I will present is work in progress, and I am hoping that some of
you may want to collaborate with me on this research topic.
|
DLA |
11th August 2016 15:30 to 16:00 |
Peter Christen |
Talk 1: Advanced methods for linking complex historical birth, death, marriage and census data
In this talk I will provide a short overview of our
recent work aimed at linking historical
records from the Isle of Skye in Scotland. I'll discuss
our linkage approach and present initial
results using a variety of linkage techniques.
More details: http://www.ipdlnconference2016.org/Programme/Abstract/90
|
DLA |
11th August 2016 16:00 to 17:00 |
Peter Christen |
Talk 2: Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases
With: Thilina Ranbaduge, Dinusha Vatsalan, and Sean
Randall
Abstract: In this talk I will present results from an
empirical study comparing the scalability,
linkage quality, and privacy of a standard linkage
approach compared to state-of-the-art multi-party
privacy-preserving record linkage techniques on real
Australian health databases.
Details: http://www.ipdlnconference2016.org/Programme/Abstract/89
|
DLA |
8th September 2016 15:30 to 16:30 |
David Hand, Peter Christen |
A note on the F-measure for evaluating record linkage algorithms (and classification methods and information retrieval systems)
Record linkage is the process of identifying and linking
records about the same entities from one or more databases. If applied to a single
database, the process is known as deduplication. Record linkage can be viewed as
a classification problem where the aim is to decide if a pair of records is a
match (the two records refer to the same real-world
entity) or a non-match (the two records refer to two
different entities). Various classification techniques – including supervised,
unsupervised, semi-supervised and active learning based – have been employed
for record linkage. If ground truth data in the form of known true matches and
non-matches are available, the quality of classified links can be evaluated.
Due to the generally high class imbalance in record linkage problems, standard
accuracy or misclassification rate are not meaningful for assessing the quality
of a set of linked records.
Instead, precision and recall, as commonly used in
information retrieval, are used. These are often combined into the popular
F-measure, which is normally presented as the harmonic mean of precision and
recall. We show that the F-measure can be expressed as a weighted sum of precision
and recall, with weights which depend on the linkage method being used. This
reformulation reveals the measure to have a major conceptual weakness: the
relative importance assigned to precision and recall should be an aspect of the
problem and the user, but not of the particular instrument being used. We
suggest alternative measures which do not suffer from this fundamental flaw.
|
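The reformulation described in this abstract is easy to verify numerically: the harmonic mean of precision and recall equals a weighted arithmetic mean with weight w = r/(p + r), a weight that depends on the method's own output. A small check (illustrative values only):

```python
def f_measure(p, r):
    """The usual F-measure: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

def weighted_sum(p, r):
    """The same quantity as a weighted arithmetic mean of p and r.

    The weight w = r / (p + r) depends on the linkage method's own
    precision and recall, which is the conceptual weakness the talk
    discusses: the relative importance of p and r should come from the
    problem and the user, not from the instrument being evaluated.
    """
    w = r / (p + r)
    return w * p + (1 - w) * r

for p, r in [(0.9, 0.3), (0.8, 0.8), (0.5, 0.7)]:
    assert abs(f_measure(p, r) - weighted_sum(p, r)) < 1e-12
```

The algebra is one line: with w = r/(p + r), w·p + (1 − w)·r = rp/(p + r) + pr/(p + r) = 2pr/(p + r).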
DLAW02 |
12th September 2016 10:00 to 11:00 |
Rainer Schnell |
Hardening Bloom Filter PPRL by modifying identifier encodings
Co-author: Christian Borgs (University of Duisburg-Essen). Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. With suitable blocking strategies, linking can be done in reasonable time. However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters carries a non-zero re-identification risk. We will present new results on recently developed techniques to defeat all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers using different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance in terms of precision and recall on large databases. |
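A basic, unhardened Bloom filter encoding of the kind this talk builds on can be sketched as follows. All parameters and the shared key are invented, and none of the hardening/diffusion techniques from the talk are shown; the point is only that q-gram similarity survives the encoding, which is what makes both linkage and attacks possible:

```python
import hashlib

def bigrams(name, q=2):
    s = f"_{name.lower()}_"                 # pad so the ends are encoded too
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_encode(name, m=1000, k=20, key=b"shared-secret"):
    """Map a name's q-grams into an m-bit Bloom filter with k keyed hashes."""
    bits = [0] * m
    for g in bigrams(name):
        for i in range(k):
            h = hashlib.sha256(key + f"{i}|{g}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
    return bits

def dice(a, b):
    """Dice coefficient of two bit vectors: 2|A & B| / (|A| + |B|)."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Similar names keep a high similarity after encoding; dissimilar ones do not.
enc = {n: bloom_encode(n) for n in ("peter", "pete", "maria")}
assert dice(enc["peter"], enc["peter"]) == 1.0
assert dice(enc["peter"], enc["pete"]) > dice(enc["peter"], enc["maria"])
```

Because frequent q-grams produce recognisable bit patterns, this plain construction is exactly what frequency attacks target, motivating the diffusion-based hardening the abstract announces.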
DLAW02 |
12th September 2016 11:30 to 12:00 |
Gerrit Bloothooft |
Historical life cycle reconstruction by indexing
Co-authors: Jelte van Boheemen (Utrecht University), Marijn Schraagen (Utrecht University). Historical information about individuals is usually scattered across many sources. An integrated use of all available information is then needed to reconstruct their life cycles. Rather than comparing records between pairs of sources, it will be shown to be computationally effective to combine all data in a single table. In such a table, each record summarizes the information that can be deduced for a person who shows up in a source event. The idea is that this table should be ordered in such a way that consecutive records describe the life cycle events of a unique individual, for one individual after another, where each individual has its own ID. To arrive at this situation, it is necessary to filter and index the table in two ways, depending on the possible roles of an individual: the first as ego in focus (at birth, marriage and decease), the second as parent at the same life events of children. The results of both indexes (in terms of preliminary record clusters and IDs) should be combined, while the resulting clusters should be tested for validity of the life cycle. The success of such a procedure strongly depends on the available data and their quality. The Dutch civil registration, introduced by the French in 1811 and now largely digitized, provides close to optimal conditions. Remaining problems of data fuzziness can be circumvented by name standardization (to various levels of name reduction) and by testing different sequences of the available information in records for indexing. Both approaches are only effective when there is more information available than needed to identify an individual uniquely – which in many cases seems to be true for the Dutch civil registration. An example of the procedure will be given for data from the province of Zeeland, and options for applying the method to older data of (much) lower quality and completeness will be discussed. The latter touches upon the limits of historical life cycle reconstruction. |
DLAW02 |
12th September 2016 12:00 to 12:30 |
Patrick Ball |
Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering
Violent inter-state and civil wars are documented with lists of the casualties, each of which constitutes a partial, non-probability sample of the universe of deaths. There are often several lists, with duplicate entries within each list and among the lists, requiring record linkage to deduplicate the lists to create a unique enumeration of the known dead. This talk will explore how we do record linkage, including: new advances in generating and learning from training data; an adaptive blocking approach; pairwise classification with string, date, and integer features and several classifiers; and a hybrid clustering method. Assessment metrics will be proposed for each stage, with real-world results from deduplicating more than 420,000 records of Syrian people killed since 2011. |
DLAW02 |
12th September 2016 13:30 to 14:30 |
Bill Winkler |
Computational Methods for Linking Sets of National Files
A combination of faster hardware and new computational algorithms makes it
possible to link two or more national files having suitable quasi-identifying
information such as name, address, date-of-birth and other non-uniquely
identifying information far faster than methods of a decade earlier. The methods
(Winkler, Yancey, and Porter 2010) were used for matching 10^17 pairs (300
million x 300 million) using 40 CPUs of an SGI machine (with 2006 Itanium chips)
in less than 30 hours during the 2010 U.S. Decennial Census. The methods are 50
times as fast as PSwoosh parallel software (Kawai et al. 2006) from Stanford
University. The methods are ~10 times as fast as recent parallel software that
applies new methods of load balancing (Rahm and Kolb 2013, Yan et al. 2013,
Karapiperis and Verykios 2014). This talk will describe how this software
bypasses the needs for system sorts and provides highly optimized
search-retrieval-comparison for a narrow range of situations needed for record
linkage.
|
DLAW02 |
12th September 2016 14:30 to 15:30 |
Amy O'Hara |
The U.S. Census Bureau’s Linkage Infrastructure: Overview and New Challenges
The U.S. Census Bureau makes extensive use of administrative
records and other third-party data to produce statistics on our population and
economy. We use these data to evaluate
survey quality, to produce new statistics about the population and economy, and
to support evaluation of federal and state programs. Our success hinges on our
ability to link external data with data already held at Census, usually at the individual
person or housing unit level. We carry out this linkage using a standardized
set of practices that has been in place since the early 2000s. This
presentation focuses on the lifecycle of the production data linkage carried
out at the U.S. Census Bureau, including authorities to obtain identified data,
the types of data acquired, ingest and initial processing, linkage practices, evaluations
of linkage quality, and the documentation, governance, and uses of linked data
files. The presentation will conclude with a discussion of new demands on
Census’s linked data infrastructure, and the need to modernize and further
streamline governance and processes.
|
DLAW02 |
12th September 2016 16:00 to 17:00 |
Intra-disciplinary “speed-dating” | ||
DLAW02 |
13th September 2016 09:30 to 10:00 |
Hye-Chung Kum |
Privacy Preserving Interactive Record Linkage (PPIRL)
Record linkage to integrate uncoordinated databases is critical to population
informatics research. Balancing privacy protection against the need for high
quality record linkage requires a human-machine hybrid system to manage
uncertainty in the ever-changing streams of chaotic big data safely. We review
the literature in record linkage and privacy. In the computer science
literature, private record linkage, which investigates how to apply a known
linkage function safely, is the most published area. However, in practice, the
linkage function is rarely known. Thus, there are many data linkage centers
whose main role is to be the trusted third party to determine the linkage
function manually and link data for research via a master population list for a
designated region. Most recently, a more flexible computerized third party
linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on
(1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake
data) and universe manipulation, and (3) minimum incremental information
disclosure via recoding. Based on these findings, we formalize a new framework
for privacy preserving interactive record linkage (PPIRL) with tractable privacy
and utility properties. Human based third-party linkage centers for privacy
preserving record linkage are the accepted norm internationally. We find that a
computer based third-party platform that can precisely control the information
disclosed at the micro level and allow frequent human interaction during the
linkage process, is an effective human-machine hybrid system that significantly
improves on the linkage center model in terms of both privacy and utility.
|
DLAW02 |
13th September 2016 10:00 to 10:30 |
James Boyd |
Technical Challenges associated with record linkage
Background: The task of record linkage is increasingly being undertaken by
dedicated record linkage units with secure environments and specialised linkage
personnel. In addition to the complexity of undertaking record linkage, these
units face additional technical challenges in providing record linkage ‘as a
service’. The extent of this functionality, and approaches to solving these
issues, has had little focus in the record linkage literature. Methods: This session identifies and discusses the range of functions that are required or undertaken in the provision of record linkage services. These include managing routine, on-going linkage; storing and handling changing data; handling different linkage scenarios; and accommodating ever-increasing datasets. Current linkage methods also include analysis of data attributes such as data completeness, consistency, constancy and field discriminating power. This information is used to develop appropriate linkage strategies. Results: In order to maximise matching quality and efficiency, linkage systems must address real-world operational requirements to manage linked data over time. By maintaining a full history of links, and storing pairwise information, many of the challenges around handling ‘open’ records and providing automated managed extractions are solved. Automation of linkage processes (including clerical processes) is another way of ensuring consistency of results and scalability of service. Several of these solutions have been implemented as part of developments by the PHRN Centre for Data Linkage in Australia. Conclusions: Increasing demand for, and complexity of, linkage services present challenges to linkage units as they seek to offer accurate and efficient services to government and the research community. Linkage units need to be both flexible and scalable to meet this demand. It is hoped that the solutions presented will help overcome these difficulties. |
![]() |
DLAW02 |
13th September 2016 10:30 to 11:00 |
Evan Roberts |
Record linkage with complete-count historical census data
Many areas of social science research benefit from being able to follow
individuals and families across time, to observe changes in social behavior
across at least part of the life course. Since the 1920s, and particularly since World War II, longitudinal social surveys have become a common tool of social scientists. Despite their many benefits, these surveys only allow us to study a
limited number of birth cohorts, and few of these cohorts are entirely deceased.
Comparison across multiple cohorts, and across long periods of the life course
is not always possible as social scientists must follow their cohorts in real
time. Historical data on past populations allows us to reconstruct life-course panels for past cohorts. In the past few years complete transcriptions of census data from sequential censuses have become available in several countries, including Britain, Canada, Iceland, Norway, Sweden, and the United States. The Minnesota Population Center is developing tools to create large datasets of people linked between at least two censuses. There are multiple challenges in creating this form of historical data, centering on the lack of unique identifiers. People must be identified by a combination of characteristics recorded with error, including names, birthplaces, dates of birth, and ethnic background. Although linkage rates are low by comparison with modern longitudinal surveys, it has proved possible to create samples that are reasonably representative of the origin or terminal population. This paper describes the sources used in creating linked census datasets and the domain-specific issues in data linkage, and demonstrates some of the applications of historical longitudinal data in studying social mobility and mortality in the past. Related Links
|
![]() |
DLAW02 |
13th September 2016 11:30 to 12:00 |
Bradley Malin |
A LifeCycle Model for Privacy Preserving Record Linkage
Individuals increasingly leave behind information in
resources managed by disparate organizations. There is an interest in
making this information available for a wide array of endeavors (e.g., policy
assessment, discovery-based research, and surveillance activities).
Given the distribution of data, it is critical to ensure that it is
sufficiently integrated before conducting any statistical investigation to
prevent duplication (and thus overcounting of events) and fragmentation (and
thus undercounting of events). This problem is resolved through record
linkage procedures, techniques that have been refined for over half a
century. However, these methods often rely on explicit- or potentially
identifying features, which often conflict with the expectations of privacy
regulations. More recently, privacy preserving record linkage (PPRL)
methods have been proposed which rely on randomized transformations of data, as
well as cryptographically secure processing methods. However, it is often unclear how the various steps of a record lifecycle, including standardization, parameter estimation, blocking, record pair comparison, and communication between the various parties involved in the process, can take place. In this talk, I will review
recent developments in PPRL methods, discuss how they have been engineered into
working software systems, and provide examples of how they have been applied in
several distributed networks in the healthcare community to facilitate
biomedical research and epidemiological investigations.
|
![]() |
DLAW02 |
13th September 2016 12:00 to 12:30 |
Luiza Antonie |
Tracking people over time in 19th century Canada: Challenges, Bias and Results
Co-author: Kris Inwood (University of Guelph) Linking multiple databases to create longitudinal data is an important research problem with multiple applications. Longitudinal data allows analysts to perform studies that would be infeasible otherwise. In this talk, I discuss a system we designed to link historical census databases in order to create longitudinal data that allows people to be tracked over time. Data imprecision in historical census data and the lack of unique personal identifiers make this task a challenging one. We design and employ a record linkage system that incorporates a supervised learning module for classifying pairs of records as matches and non-matches. In addition, we disambiguate ambiguous links by taking into account the family context. We report results on linking four Canadian census collections, from 1871 to 1901, and identify and discuss the impact on precision and bias when family context is employed. We show that our system performs large-scale linkage, producing high-quality links and generating sufficient longitudinal data to allow meaningful social science studies. |
![]() |
DLAW02 |
13th September 2016 13:30 to 14:00 |
Katie Harron |
Handling identifier error rate variation in data linkage of large administrative data sources
Co-authors: Gareth Hagger-Johnson (Administrative Data
Research Centre for England, University College London), Ruth Gilbert
(Institute of Child Health, University College London), Harvey
Goldstein (University of Bristol and University College London) Background: Linkage of administrative data with no unique identifier often relies on probabilistic linkage. Variation in data quality at individual or organisational levels can adversely affect match weight estimation, and potentially introduce selection bias to the linked data if subgroups of individuals are more likely to link than others. We quantified individual and organisational variation in identifier error in a large administrative dataset (Hospital Episode Statistics; HES) and incorporated this information within a match probability estimation model. Methods: A stratified sample of 10,000 admission records was extracted from 2011/2012 HES for three cohorts aged 0-1, 5-6 and 18-19 years. A reference standard was created by linking via NHS number with the Personal Demographic Service for up-to-date identifiers. Based on aggregate tables, we calculated identifier error rates for sex, date of birth and postcode, investigated whether these errors depended on individual characteristics, and evaluated variation between organisations. We used a log-linear model to estimate match probabilities, and a simulation study to compare readmission rates based on traditional match weights. Results: Match probabilities differed significantly according to age (p |
|
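The match weights referred to in this abstract can be illustrated with a toy Fellegi-Sunter calculation. This is a minimal sketch with invented m-probabilities (probability a field agrees given a true match) and u-probabilities (agreement given a non-match); the numbers are illustrative, not the study's estimates:

```python
import math

# Invented m- and u-probabilities for three identifiers (illustrative only).
probs = {
    "sex":           {"m": 0.98, "u": 0.50},
    "date_of_birth": {"m": 0.95, "u": 0.01},
    "postcode":      {"m": 0.90, "u": 0.001},
}

def match_weight(agreements):
    """Sum of log2 likelihood ratios over the compared fields."""
    w = 0.0
    for field, agrees in agreements.items():
        m, u = probs[field]["m"], probs[field]["u"]
        if agrees:
            w += math.log2(m / u)               # agreement weight
        else:
            w += math.log2((1 - m) / (1 - u))   # disagreement penalty
    return w

# A pair agreeing on all three identifiers:
w_all = match_weight({"sex": True, "date_of_birth": True, "postcode": True})
# The same pair with an identifier error on postcode:
w_err = match_weight({"sex": True, "date_of_birth": True, "postcode": False})
print(round(w_all, 2), round(w_err, 2))
```

A single identifier error sharply reduces the pair's weight, which is why the variation in error rates across individuals and organisations described above matters for match probability estimation.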
DLAW02 |
13th September 2016 14:00 to 14:30 |
Ruth Gilbert |
GUILD: GUidance for Information about Linked Datasets
Co-authors:
Lafferty, Hagger-Johnson, Harron, Smith, Zhang, Dibben, Goldstein Linkage of large administrative datasets often involves different teams at the various steps of the linkage pathway. Information is rarely shared throughout the pathway about processes that might contribute to linkage error and potentially bias results. However, improved awareness about the type of information that should be shared could lead to greater transparency and more robust methods, including analyses that take into account linkage error. The Administrative Data Research Centre for England convened a series of 3 face-to-face meetings between data linkage experts to develop the GUILD guidance (GUidance for Information about Linked Datasets). GUILD recommends key items of information that should be shared at 4 steps in the data linkage pathway: data provision (how data was generated, extracted, processed and quality controlled), data linkage, analyses of linked data, and report writing. The guidance aims to improve transparency in data linkage processes and reporting of analyses, and to improve the validity of results based on linked data. GUILD guidance is designed to be used by data providers, linkers, and analysts, but will also be relevant to policy makers, funders and legislators responsible for widening use of linked data for research, services and policy. The GUILD recommendations will be presented and discussed. |
![]() |
DLAW02 |
13th September 2016 14:30 to 15:00 |
Dinusha Vatsalan |
Advanced Techniques for Privacy-Preserving Linking of Multiple Large Databases
Co-author: Peter Christen (The Australian National
University) In the era of Big Data, the collection of person-specific data disseminated in diverse databases provides enormous opportunities for businesses and governments by exploiting data linked across these databases. Linked data empowers quality analysis and decision making that is not possible on individual databases. Therefore, linking databases is increasingly being required in many application areas, including healthcare, government services, crime and fraud detection, national security, and business applications. Linking data from different databases requires comparison of quasi-identifiers (QIDs), such as names and addresses. These QIDs are personal identifying attributes that contain sensitive and confidential information about the entities represented in these databases. The exchange or sharing of QIDs across organisations for linkage is often prohibited due to laws and business policies. Privacy-preserving record linkage (PPRL) has been an active research area over the past two decades addressing this problem through the development of techniques that facilitate the linkage on masked (encoded) records such that no private or confidential information needs to be revealed. Most of the work in PPRL thus far has concentrated on linking two databases only. Linking multiple databases has only recently received more attention as it is being required in a variety of application areas. We have developed several advanced techniques for practical PPRL of multiple large databases addressing the scalability, linkage quality, and privacy challenges. Our approaches perform linkage on masked records using Bloom filter encoding, which is a widely used masking technique for PPRL.
In this talk, we will first highlight the challenges of PPRL of multiple databases, then describe our developed approaches, and then discuss future research directions required to leverage the huge potential that linked data from multiple databases can provide for businesses and government services. |
![]() |
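The Bloom filter masking mentioned in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: the 64-bit filter length, the two SHA-1-derived hash functions, and the bigram padding are invented parameters (real deployments use much longer filters), but it shows how masked records can still be compared by similarity:

```python
import hashlib

BITS = 64          # filter length (illustrative; real systems use ~1000)
NUM_HASHES = 2     # hash functions per q-gram

def qgrams(s, q=2):
    s = f"_{s.lower()}_"   # pad so first/last characters form bigrams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom(s):
    """Encode a string's bigram set into a bit-filter."""
    bf = 0
    for g in qgrams(s):
        for i in range(NUM_HASHES):
            h = int(hashlib.sha1(f"{i}{g}".encode()).hexdigest(), 16)
            bf |= 1 << (h % BITS)
    return bf

def dice(a, b):
    """Dice coefficient of two bit-filters: 2|A & B| / (|A| + |B|)."""
    inter = bin(a & b).count("1")
    return 2 * inter / (bin(a).count("1") + bin(b).count("1"))

# Similar names yield similar filters, so matching needs no raw QIDs.
sim = dice(bloom("christen"), bloom("christan"))
print(sim)
```

The linkage unit sees only the filters, yet approximate matching on the underlying names remains possible.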
DLAW02 |
13th September 2016 15:00 to 15:30 |
Harvey Goldstein |
A scaling approach to record linkage
Co-authors: Mario Cortina-Borja (UCL), Katie Harron
(LSHTM) With increasing availability of large data sets derived from administrative and other sources, there is an increasing demand for the successful linking of these to provide rich sources of data for further analysis. The very large size of such datasets and the variation in the quality of the identifiers used to carry out linkage means that existing approaches based upon ‘probabilistic’ models can make heavy computational demands. They are also based upon questionable assumptions. In this paper we suggest a new approach, based upon a scaling algorithm, that is computationally fast, requires only moderate amounts of storage and has intuitive appeal. A comparison with existing methods is given. |
![]() |
DLAW02 |
13th September 2016 16:00 to 17:00 |
'speed-dating' sessions | ||
DLAW02 |
14th September 2016 09:00 to 10:00 |
Erhard Rahm |
Big data integration: challenges and new approaches
Data integration is a key challenge for Big Data applications to semantically
enrich and combine large sets of heterogeneous data for enhanced data analysis.
In many cases, there is also a need to deal with a very high number of data
sources, e.g., product offers from many e-commerce websites. We will discuss
approaches to deal with the key data integration tasks of (large-scale) entity
resolution and schema matching. In particular, we discuss parallel blocking and
entity resolution on Hadoop platforms together with load balancing techniques to
deal with data skew. We also discuss challenges and recent approaches for
holistic data integration of many data sources, e.g., to create knowledge graphs
or to make use of huge collections of web tables.
|
![]() |
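The blocking step discussed above reduces the quadratic number of record comparisons by comparing only records that share a cheap key. A minimal single-machine sketch (the records and the first-letter blocking key are invented for illustration; production systems use more discriminating keys and, as the talk notes, parallel Hadoop implementations with load balancing to handle data skew):

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (id, surname, city). Invented for illustration.
records = [
    (1, "Mueller", "Berlin"),
    (2, "Muller",  "Berlin"),
    (3, "Smith",   "London"),
    (4, "Smyth",   "London"),
    (5, "Smith",   "Leeds"),
]

def blocking_key(rec):
    # A deliberately cheap key: first letter of the surname.
    return rec[1][0].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Only pairs sharing a block are compared in detail later.
candidate_pairs = [
    (a[0], b[0])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # 4 candidate pairs instead of all 10
```

Skewed blocks (one huge surname initial) are exactly what the load-balancing techniques mentioned in the abstract address.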
DLAW02 |
14th September 2016 10:00 to 10:30 |
Vassilios Verykios |
Space Embedding of Records for Privacy Preserving Linkage
Massive amounts of data, collected by a wide variety of organizations, need to be integrated and matched in order to facilitate data analyses that may be highly beneficial to businesses, governments, and academia. Record linkage, also known as entity resolution, is the process of identifying records that refer to the same real-world entity from disparate data sets. Privacy Preserving Record Linkage (PPRL) techniques are employed to perform the linkage process in a secure manner, when the data that need to be matched are sensitive. In PPRL, input records undergo an anonymization process that embeds the records into a space, where the underlying data can be matched but not understood by the naked eye. The PPRL problem is picking up a lot of steam lately due to a ubiquitous need for cross-matching of records that usually lack common unique identifiers and whose field values contain variations, errors, misspellings, and typos. The PPRL process as it is applied to massive amounts of data comprises an anonymization phase, a searching phase and a matching phase. Several searching and anonymization approaches have been developed with the aim of scaling the PPRL process to big data without sacrificing quality of the results. Recently, redundant randomized methods have been proposed, which insert each record into multiple independent blocks in order to amplify the probability of bringing together similar records for comparison. The key feature of these methods is the formal guarantees they provide in terms of accuracy of the generated results. In this talk, we present both state-of-the-art private searching methods and anonymization techniques, by exposing their characteristics, including their strengths and weaknesses, and we also present a comparative evaluation. |
![]() |
DLAW02 |
14th September 2016 10:30 to 11:00 |
Andy Boyd |
‘Data Safe Havens’ as a framework to support record linkage in observational studies: evidence from the Project to Enhance ALSPAC through Record Linkage (PEARL).
The
health research community are engaged in projects which require a wealth of
data. These data can be drawn directly from research participants, or via linkage
to participants’ routine records. Frequently, investigators require
information from multiple sources with multiple legal owners. A fundamental
challenge for data managers – such as those maintaining cohort study databanks
– is to establish data processing and analysis pipelines that meet the legal, ethical and privacy
expectations of participants and data owners alike. This demands
socio-technical solutions that may easily become enmeshed in protracted debate
and controversy as they encounter the norms, values, expectations and concerns
of diverse stakeholders. In this context, ‘Data Safe Havens’ can provide a
framework for repositories in which sensitive data are kept securely within governance
and informatics systems that are fit-for-purpose, appropriately tailored to the
data, while being accessible to legitimate users for legitimate purposes (see
Burton et al., 2015: http://www.ncbi.nlm.nih.gov/pubmed/26112289).
In this paper I will describe our data linkage experiences gained through the Project to Enhance ALSPAC through Record Linkage (PEARL); a project aiming to establish linkages between participants of the ALSPAC birth cohort study and their routine records. This exemplar illustrates how the governance and technical solutions encompassed within the ALSPAC Data Safe Haven have helped counter and address the real world data linkage challenges we have faced. |
|
DLAW02 |
14th September 2016 11:30 to 12:00 |
Mauricio Sadinle |
A Bayesian Partitioning Approach to Duplicate Detection and Record Linkage
Record linkage techniques allow us to combine different sources of
information from a common population in the absence of unique identifiers.
Linking multiple files is an important task in a wide variety of applications,
since it permits us to gather information that would not be otherwise available,
or that would be too expensive to collect. In practice, an additional
complication appears when the datafiles to be linked contain duplicates.
Traditional approaches to duplicate detection and record linkage output
independent decisions on the coreference status of each pair of records, which
often leads to non-transitive decisions that have to be reconciled in some
ad-hoc fashion. The joint task of linking multiple datafiles and finding
duplicate records within them can be alternatively posed as partitioning the
datafiles into groups of coreferent records. We present an approach that targets
this partition as the parameter of interest, thereby ensuring transitive
decisions. Our Bayesian implementation allows us to incorporate prior
information on the reliability of the fields in the datafiles, which is
especially useful when no training data are available, and it also provides a
proper account of the uncertainty in the duplicate detection and record linkage
decisions. We show how this uncertainty can be incorporated in certain models
for population size estimation. Throughout the document we present a case study
to detect killings that were reported multiple times to organizations recording
human rights violations during the civil war of El Salvador.
|
![]() |
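The non-transitivity problem this abstract describes can be made concrete: if a pairwise classifier links A-B and B-C but not A-C, treating the partition as the object of interest forces a consistent answer. Below is a deterministic union-find sketch of that closure step (this is not the talk's Bayesian approach, which additionally quantifies uncertainty over partitions; the records and match decisions are invented):

```python
def find(parent, x):
    # Path-compressing find for the union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def partition(records, matched_pairs):
    """Group records into the clusters implied by pairwise match decisions."""
    parent = {r: r for r in records}
    for a, b in matched_pairs:
        parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for r in records:
        clusters.setdefault(find(parent, r), set()).add(r)
    return sorted(map(frozenset, clusters.values()), key=min)

# The pairwise classifier said A~B and B~C but (inconsistently) not A~C;
# the partition view places all three in one cluster regardless.
print(partition(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Taking the partition itself as the parameter, as the talk proposes, makes such transitive decisions automatic rather than an ad-hoc reconciliation step.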
DLAW02 |
14th September 2016 12:00 to 12:30 |
Changyu Dong |
From Private Set Intersection to Private Record Linkage
Record linkage allows data from different sources to be integrated to facilitate data mining tasks. However, in many cases, records have to be linked by personally identifiable information. To prevent privacy breaches, ideally records should be linked in a private way such that no information other than the matching result is leaked in the process. One approach for Private Record Linkage (PRL) is to use cryptographic protocols. In this talk, I will introduce Private Set Intersection (PSI), a type of cryptographic protocol that enables two parties to obtain the intersection of their private sets. It is almost trivial to build an exact PRL protocol from a PSI protocol. With more effort, it is also possible to build an approximate PRL protocol from PSI that allows linking records based on certain similarity metrics. I will present efficient PSI protocols, and show how to obtain PRL protocols that are practically efficient and effective. |
![]() |
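The commutative-masking idea behind one classic family of PSI protocols can be sketched in a few lines. This is a toy Diffie-Hellman-style illustration only: the 127-bit modulus is far too small to be secure, the message flow is simplified, and none of the efficiency techniques the talk covers appear here; every parameter is invented. It shows the core trick that H(x)^(ab) is the same whichever party exponentiates first:

```python
import hashlib
import secrets

P = 2 ** 127 - 1   # a Mersenne prime as toy modulus (insecurely small)

def to_elem(item):
    """Hash an item to an element mod P."""
    return int(hashlib.sha256(item.encode()).hexdigest(), 16) % P

def mask(values, key):
    return [pow(v, key, P) for v in values]

def psi(set_a, set_b):
    a_key = secrets.randbelow(P - 2) + 1   # Alice's secret exponent
    b_key = secrets.randbelow(P - 2) + 1   # Bob's secret exponent
    a_items = sorted(set_a)
    # Round 1: Alice sends H(x)^a; Bob returns (H(x)^a)^b in order.
    double_a = mask(mask([to_elem(x) for x in a_items], a_key), b_key)
    # Bob also sends H(y)^b; Alice computes (H(y)^b)^a herself.
    double_b = set(mask(mask([to_elem(y) for y in set_b], b_key), a_key))
    # Equal double-masked values reveal (only) the intersection to Alice.
    return {x for x, v in zip(a_items, double_a) if v in double_b}

print(psi({"alice@x.org", "bob@x.org", "carol@x.org"},
          {"bob@x.org", "dave@x.org"}))
```

Because exponentiation mod P commutes, matching values identify common items without either party revealing its raw set.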
DLAW02 |
15th September 2016 09:00 to 17:00 |
Hackathon / Zeeland challenge | ||
DLA |
29th September 2016 15:30 to 16:30 |
Grigorios Loukides |
Sanitization for sequential data
Organizations disseminate sequential data to support
applications in domains ranging from marketing to healthcare. Such data are
typically modeled as a collection of sequences, or a series of time-stamped
events, and they are mined by data recipients aiming to discover actionable
knowledge. However, the mining of sequential data may expose sensitive patterns
that leak confidential knowledge, or lead to intrusive inferences about groups
of individuals.
In this talk, I will review the problem and present two
approaches that prevent it, while retaining the usefulness of data in mining
tasks.
The first approach is applicable to a collection of
sequences and sanitizes sensitive patterns by permuting their events. The
selected permutations avoid changes in the set of frequent non-sensitive
patterns and in the ordering information of the sequences. The second approach
is applicable to a series of time-stamped events and sanitizes sensitive events
by deleting them from carefully selected time points. The deletion of events is
guided by a model that captures changes to the probability distribution of
events across the sequence.
|
![]() |
DLA |
20th October 2016 15:30 to 16:30 |
Tom Dalton |
Evaluating Data Linkage: Creating longitudinal synthetic data to provide a gold-standard linked dataset
When performing probabilistic data linkage on real-world data we do not, by the very fact that we need to link it, know the true linkage. The success of our linkage approach is therefore difficult to evaluate. Often small hand-linked datasets are used as a ‘gold standard’ against which the linkage approach is evaluated. However, errors in the hand-linkage, and the limited size and number of these datasets, do not allow for robust evaluation. The research
focuses on the creation of longitudinal synthetic datasets for the domain of
population reconstruction. In this talk I will cover the previous and current
models we have created to achieve this and detail the approaches to how we:
define the desired behaviour in the model to avoid clashes between input
distributions, verify the statistical correctness of the population, and
initialise the model such that the starting population meets the temporal requirements
of the desired behaviour. To conclude I will outline the model’s intended use
for linkage evaluation, its other potential uses and also take questions.
|
![]() |
DLAW04 |
28th October 2016 10:00 to 11:00 |
Laura Brandimarte |
How does Government surveillance affect perceived online privacy/security and online information disclosure?
Disclosure behaviors in the digital world are affected
by perceived privacy and security just as much, or arguably more than
they are by actual privacy/security features of the digital environment.
Several Governments have recently been at the center of attention for secret surveillance
programs that have affected the sense of privacy and security people experience
online. In this talk, I will discuss evidence from two research projects
showing how privacy concerns and disclosure behaviors are affected by perceived
privacy/security intrusions associated with Government monitoring and
surveillance. These two interdisciplinary projects bring together methodologies
from different disciplines: information systems, machine learning, psychology,
and economics.
The first project is in collaboration with the Census Bureau, and studies geo-location and its effects on willingness to disclose personal information. The U.S. Census Bureau has begun a transition from a paper-based questionnaire to an Internet-based one. Online data collection would not only allow for a more efficient gathering of information; it would also, through geo-location technologies, allow for the automated inference of the location from which the citizen is responding. Geo-location features in Census forms, however, may raise privacy concerns and even backfire, as they allow for the collection of a sensitive piece of information without explicit consent of the individual. Four online experiments investigate individuals’ reactions to geo-location by measuring willingness to disclose personal information as a function of geo-location awareness and the entity requesting information: research or Governmental institutions. The experiments also explicitly test how surveillance primes affect the relationship between geo-location awareness and disclosure. Consistent with theories of perceived risk, contextual integrity, and fairness in social exchanges, we find that awareness of geo-location increases privacy concerns and perceived sensitivity of requested information, thus decreasing willingness to disclose sensitive information, especially when participants did not have a prior expectation that the institution would collect that data. No significant interaction effects are found for a surveillance prime. The second project is ongoing research about the “chilling effects” of Government surveillance on social media disclosures, or the tendency to self-censor in order to cope with mass monitoring systems raising privacy concerns. Until now, such effects have only been estimated using either Google/Bing search terms, Wikipedia articles, or survey data.
In this research in progress, we propose a new method in order to test for chilling effects in online social media platforms. We use a unique, large dataset of Tweets and propose the use of new statistical machine learning techniques in order to detect anomalous trends in user behavior (use of predetermined, sensitive sets of keywords) after Snowden’s revelations made users aware of existing surveillance programs. |
|
DLAW04 |
28th October 2016 11:30 to 12:30 |
Ian Schmutte |
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
Co-author: John M. Abowd (Cornell University and U.S.
Census Bureau) We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε,δ) Related Links
|
![]() |
DLAW04 |
28th October 2016 13:30 to 14:30 |
Alessandro Acquisti |
The Economics of Privacy (remote presentation)
In
the policy and scholarly debate over privacy, the protection of personal
information is often set against the benefits society is expected to gain from
large scale analytics applied to individual data. An implicit assumption
underlies the contrast between privacy and 'big data': economic research is
assumed to univocally predict that the increasing collection and analysis of
personal data will be an economic win-win for data holders and data subjects
alike - some sort of unalloyed public good. Using a recently published review
of the economic literature on privacy, I will work from within traditional
economic frameworks to investigate this notion. In so doing, I will highlight
how results from economic research on data sharing and data protection actually
paint a nuanced picture of the economic benefits and costs of privacy.
|
![]() |
DLAW04 |
28th October 2016 14:30 to 15:30 |
Katrina Ligett |
Buying Private Data without Verification
Joint
work with Arpita
Ghosh, Aaron Roth, and Grant Schoenebeck We consider the problem of designing a survey to aggregate non-verifiable information from a privacy-sensitive population: an analyst wants to compute some aggregate statistic from the private bits held by each member of a population, but cannot verify the correctness of the bits reported by participants in his survey. Individuals in the population are strategic agents with a cost for privacy, i.e., they not only account for the payments they expect to receive from the mechanism, but also their privacy costs from any information revealed about them by the mechanism’s outcome—the computed statistic as well as the payments—to determine their utilities. How can the analyst design payments to obtain an accurate estimate of the population statistic when individuals strategically decide both whether to participate and whether to truthfully report their sensitive information? In this talk, we will discuss an approach to this problem based on ideas from peer prediction and differential privacy. |
![]() |
DLAW04 |
28th October 2016 16:00 to 17:00 |
Mallesh Pai |
The Strange Case of Privacy in Equilibrium Models
Joint work
with Rachel Cummings, Katrina Ligett and Aaron Roth The literature on differential privacy by and large takes the data set being analyzed as exogenously given. As a result, by varying a privacy parameter in his algorithm, the analyst straightforwardly chooses the potential privacy loss of any single entry in the data set. Motivated by privacy concerns on the internet, we consider a stylized setting where the dataset is endogenously generated, depending on the privacy parameter chosen by the analyst. In our model, an agent chooses whether to purchase a product. This purchase decision is recorded, and a differentially private version of his purchase decision may be used by an advertiser to target the consumer. A change in the privacy parameter therefore affects, in equilibrium, the agents' purchase decision, the price of the product, and the targeting rule used by the advertiser. We demonstrate that the comparative statics with respect to the privacy parameter may be exactly reversed relative to the exogenous data set benchmark; for example, a higher privacy parameter may nevertheless be more informative. More care is needed in understanding the effects of private analysis of a data set that is endogenously generated. |
![]() |
DLA |
31st October 2016 11:00 to 11:20 |
Robin Mitra | tba | |
DLA |
31st October 2016 11:20 to 11:40 |
Cong Chen | tba | |
DLA |
31st October 2016 12:00 to 12:20 |
Anne-Sophie Charest | tba | |
DLA |
31st October 2016 12:20 to 12:40 |
Christine O'Keefe | Synthetic data - more questions than answers | |
DLA |
31st October 2016 12:40 to 13:00 |
Natalie Shlomo | tba | |
DLA |
31st October 2016 13:00 to 13:20 |
Peter Christen | Generating realistic personal data for data linkage research | |
DLA |
31st October 2016 13:50 to 14:00 |
Mark Elliot |
GA Approaches to Synthetic data
In this talk |
|
DLA |
31st October 2016 14:00 to 14:20 |
Gillian Raab | Analysis methods and utility measures | |
DLA |
31st October 2016 14:20 to 14:40 |
Beata Nowok | Challenges in generating and communicating synthetic data | |
DLA |
31st October 2016 14:40 to 15:00 |
Joshua Snoke | Beyond Microdata: Synthetic tweets | |
DLA |
31st October 2016 15:00 to 15:20 |
Joerg Drechsler | My View on the Key Research Questions for Synthetic Data | |
DLA |
31st October 2016 16:40 to 17:00 |
Mark Elliot | Plan for the rest of the week | |
DLA |
3rd November 2016 15:30 to 16:30 |
Gillian Raab |
Measures of Utility for Synthetic Data
When synthetic data are produced to overcome potential disclosure they can be used either in place of the original data or, more commonly, to allow researchers to develop code that will ultimately be run on the original data. The utility of synthetic data can be measured by comparing the results of the final analysis with the synthetic and original data. This is not possible until the final analysis is complete. General utility measures that capture the overall differences between the original and synthetic data are more useful for those creating synthetic data. This presentation will discuss two such measures. The first is a propensity score measure originally proposed by Woo et al., 2009, and the second is one based on comparing tables, suggested by Voas and Williamson, 2001. Their null distributions when the synthesis model is "correct" will be discussed, as well as their practical implementation as part of the synthpop package.
|
![]() |
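The propensity score measure of Woo et al. works by pooling original and synthetic rows, fitting a model to predict which is which, and averaging the squared deviation of the predicted propensities from 0.5 (the pMSE; near zero means the two datasets are hard to tell apart). A minimal pure-Python sketch with a single feature and invented Gaussian data — the synthpop implementation is far more general:

```python
import math
import random

random.seed(1)

def logistic_pmse(original, synthetic, epochs=1000, lr=0.1):
    """pMSE: fit a logistic model to distinguish synthetic rows from
    original ones, then average (propensity - 0.5)^2 over all rows."""
    xs = original + synthetic
    ys = [0] * len(original) + [1] * len(synthetic)
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    xs = [(x - mu) / sd for x in xs]        # standardise for stable descent
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):                 # full-batch gradient descent
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    props = [1 / (1 + math.exp(-(w * x + b))) for x in xs]
    return sum((p - 0.5) ** 2 for p in props) / n

original = [random.gauss(50, 10) for _ in range(200)]
good_syn = [random.gauss(50, 10) for _ in range(200)]   # same generating model
poor_syn = [random.gauss(70, 10) for _ in range(200)]   # badly mis-specified
print(logistic_pmse(original, good_syn) < logistic_pmse(original, poor_syn))
```

A well-specified synthesis model yields propensities near 0.5 and a small pMSE; the mis-specified synthesiser is easy to detect and scores much worse.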
DLA |
10th November 2016 15:30 to 16:30 |
Graham Cormode |
Engineering Privacy for Small Groups
Concern about how to collect sensitive user data without
compromising individual privacy is a major barrier to greater availability of
data. The model of Local Differential Privacy has recently gained favor and
enjoys widespread adoption for data gathering from browsers and apps. Deployed
methods use Randomized Response, which applies only when the user data is a
single bit. We study general mechanisms
for data release which allow the release of statistics from small groups.
We formalize this by introducing a set of desirable
properties that such mechanisms can obey. Any combination of these can be
satisfied by solving a linear program which minimizes a cost function. We also
provide explicit constructions that are optimal for certain combinations of
properties, and show a closed form for their cost. In the end, there are only
three distinct optimal mechanisms to choose between: one is the well-known
(truncated) geometric mechanism; the second is a novel mechanism that we introduce; and the third is found as the solution to a particular LP. Through a
set of experiments on real and synthetic data we determine which is preferable
in practice, for different combinations of data distributions and privacy
parameters.
Joint work with Tejas Kulkarni and Divesh Srivastava.
|
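The single-bit Randomized Response primitive that the talk takes as its starting point can be sketched as follows (an illustrative toy, not the paper's general mechanisms):

```python
import math
import random

def randomized_response(bit, epsilon):
    """One-bit randomized response: report the true bit with probability
    e^eps / (e^eps + 1), otherwise flip it; satisfies eps-local DP."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p else 1 - bit

def debias(reports, epsilon):
    """Unbiased estimator of the true proportion of 1s from noisy reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(1)
truth = [1] * 300 + [0] * 700                    # true proportion of 1s: 0.3
reports = [randomized_response(b, epsilon=1.0) for b in truth]
estimate = debias(reports, epsilon=1.0)          # close to 0.3 on average
```

The limitation the abstract points to is visible here: the primitive handles one bit per user, and generalising it to statistics over small groups is what the proposed LP-based mechanisms address.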
![]() |
DLA |
22nd November 2016 15:30 to 16:30 |
Thomas Steinke |
Generalisation for Adaptive Data Analysis
Adaptivity is an important aspect of data analysis --
that is, the choice of questions to ask about a dataset is often informed by
previous use of the same dataset. However, statistical validity is typically
only guaranteed in a non-adaptive model, in which the questions must be specified
before the dataset is collected. A recent line of work initiated by Dwork et
al. (STOC 2015) provides a formal model for studying the power of adaptive data
analysis.
This talk will show that there are sophisticated
techniques -- using tools from information theory and differential privacy --
that enable us to ensure that adaptive data analysis provides statistically
valid answers that generalise to the overall population from which the dataset
was drawn. This talk will also discuss how adaptive data analysis is inherently more powerful than non-adaptive data analysis: there is an exponential separation between the number of adaptive queries needed to overfit a dataset and the number of non-adaptive queries needed.
|
![]() |
DLA |
30th November 2016 16:00 to 17:00 |
Cynthia Dwork |
Rothschild Lecture: The Promise of Differential Privacy
The rise of
"Big Data" has been accompanied by an increase in the twin risks of spurious
scientific discovery and privacy compromise. A great deal of effort has
been devoted to the former, from the use of sophisticated validation
techniques, to deep statistical methods for controlling the false discovery
rate in multiple hypothesis testing. However, there is a fundamental
disconnect between the theoretical results and the practice of data analysis:
the theory of statistical inference assumes a fixed collection of hypotheses to
be tested, selected non-adaptively before the data are gathered, whereas in
practice data are shared and reused with hypotheses and new analyses being
generated on the basis of data exploration and the outcomes of previous
analyses. Privacy-preserving data analysis also has a large literature,
spanning several disciplines. However, many attempts have proved problematic
either in practice or on paper.
"Differential privacy" – a recent notion tailored to situations in which data are plentiful – has provided a theoretically sound and powerful framework, giving rise to an explosion of research. We will review the definition of differential privacy, describe some basic algorithmic techniques for achieving it, and see that it also prevents false discoveries arising from adaptivity in data analysis. |
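One of the basic algorithmic techniques such a review typically covers is the Laplace mechanism; a minimal sketch for a counting query, with illustrative parameter values:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value + Laplace(0, sensitivity/epsilon) noise.
    For a counting query one person changes the count by at most 1,
    so the sensitivity is 1."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(42)
true_count = 1234
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller epsilon means stronger privacy and a larger noise scale; the released count is accurate to within a few multiples of sensitivity/epsilon with high probability.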
![]() |
DLA |
1st December 2016 15:30 to 16:30 |
Joerg Drechsler |
Strategies to facilitate access to detailed geocoding information based on synthetic data
In this seminar we investigate if generating synthetic
data can be a viable strategy to provide access to detailed geocoding
information for external researchers without compromising the confidentiality
of the units included in the database. This research was motivated by a recent
project at the Institute for Employment Research (IAB) that linked exact
geocodes to the Integrated Employment Biographies, a large administrative
database containing several million records. Based on these data we evaluate
the performance of several synthesizers in terms of addressing the trade-off
between preserving analytical validity and limiting the risk of disclosure. We
propose strategies for making the synthesizers scalable for such large files,
introduce analytical validity measures for the generated data and provide
general recommendations for statistical agencies considering the synthetic data
approach for disseminating detailed geographical information.
|
![]() |
DLA |
1st December 2016 16:30 to 17:30 |
Atikur Khan |
A Risk-Utility Balancing Approach to Generate Synthetic Microdata
Generation of synthetic microdata is one of the
promising approaches to statistical disclosure control of microdata. Methods
for generating synthetic microdata for multivariate continuous variables
include noise addition and decomposition of the data matrix. We present a framework to generate a synthetic data matrix based on resampling from the Stiefel manifold and
application of Slutsky’s theorem in singular value decomposition. We also
derive utility and risk measures based on this theorem and present a
risk-utility balancing approach to generate synthetic continuous microdata. We
apply our proposed methods to some reference microdata sets and demonstrate the
usefulness of our proposed methods. |
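As a rough illustration of the decomposition idea (one simple construction in this spirit, not necessarily the authors' procedure): replace the left singular vectors of the centred data matrix with a fresh draw from the Stiefel manifold, keeping the singular values and right singular vectors, so the cross-product matrix about the original mean is preserved exactly:

```python
import numpy as np

def synthesize_via_svd(X, rng):
    """Illustrative sketch only: decompose the centred data X = U S V^T,
    then replace U by a random point on the Stiefel manifold (orthonormal
    columns, via QR of a Gaussian matrix). Keeping S and V preserves the
    second-moment structure of the data about the original mean."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    G = rng.normal(size=U.shape)
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))      # fix column signs for a uniform draw
    return mu + Q @ np.diag(s) @ Vt

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Y = synthesize_via_svd(X, rng)
```

Individual records in Y bear no relation to individual records in X, while the moment structure used by many multivariate analyses is retained.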
![]() |
DLA |
2nd December 2016 14:00 to 15:00 |
Jordi Soria-Comas |
Topics in differential privacy: optimal noise and record-perturbation-based data sets
We explore two different aspects of differential privacy. First, we examine the optimality of noise distributions in noise addition. In particular, we show that the Laplace distribution is nearly optimal in the univariate case, but not in the multivariate case, and we describe the optimal distributions. Then we explore the generation
of differentially private data sets via perturbative masking of the original
records. This approach is remarkably more efficient than histogram-based approaches but a naive application of it may completely damage the data utility. In particular, we analyze the use of microaggregation to reduce the sensitivity and, thus, the amount of noise required to attain differential privacy. |
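The sensitivity-reduction idea can be fixed with a deliberately naive univariate sketch (the abstract itself warns that naive application can damage utility, and the authors' actual method is more careful): averaging groups of k values divides the sensitivity of each released value, and hence the required Laplace noise, by k:

```python
import numpy as np

def microagg_then_noise(x, k, epsilon, lo, hi, rng):
    """Deliberately naive univariate sketch of sensitivity reduction via
    microaggregation: sort, average groups of k values, then add Laplace
    noise scaled to the reduced sensitivity (hi - lo) / k rather than to
    the full range (hi - lo). Shown only to fix ideas."""
    x = np.clip(np.sort(np.asarray(x, dtype=float)), lo, hi)
    n = len(x) - len(x) % k                # drop the remainder group for simplicity
    means = x[:n].reshape(-1, k).mean(axis=1)
    sensitivity = (hi - lo) / k            # one record moves a k-mean by at most this
    return means + rng.laplace(0.0, sensitivity / epsilon, size=means.shape)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
released = microagg_then_noise(x, k=5, epsilon=1.0, lo=0.0, hi=10.0, rng=rng)
```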
![]() |
DLAW03 |
7th December 2016 10:00 to 10:45 |
Bradley Malin |
Gaming the System for Privacy
Over the past several decades, a cat-and-mouse game has been played between data protectors and data users. It seems as though every time a new model of data privacy is posited, a new attack is published, along with high-profile demonstrations of its failure to guarantee protection. This has, on many occasions, led to outcries that privacy is either dead, dying, or was a mere myth that never existed in the first place. The goal of this talk is to review how we got to this juncture and to suggest that not only may data privacy be dead, but our technical definitions of the problem may require augmentation to account for real-world adversarial settings. In doing so, I will introduce a new direction in privacy, rooted in a formal game-theoretic framework, and will provide examples of how this view can provide greater flexibility with respect to several classic privacy problems, including the publication of individual-level records and summary statistics.
|
![]() |
DLAW03 |
7th December 2016 11:30 to 12:00 |
Anna Oganian |
Combining statistical disclosure limitation methods to preserve relationships and data-specific constraints in survey data.
Applications of data swapping and noise are among the most
widely used methods for Statistical Disclosure Limitation (SDL) by statistical
agencies for public-use non-interactive data release. The core ideas of
swapping and noise are conceptually easy to understand and are naturally suited
for masking purposes. We believe that they are worth revisiting with a special
emphasis given to the utility aspects of these methods and to the ways of
combining the methods to increase their efficiency and reliability. Indeed,
many data collecting agencies use complex sample designs to increase the
precision of their estimates and often allocate additional funds to obtain larger
samples for particular groups in the population. Thus, it is particularly undesirable
and counterproductive when SDL methods applied to these data significantly
change the magnitude of estimates and/or their levels of precision. We will
present and discuss two methods of disclosure limitation based on swapping and
noise, which can work together in synergy while protecting continuous and
categorical variables. The first method is a version of multiplicative noise
that preserves means and covariance together with some structural constraints
in the data. The second method is loosely based on swapping. It is designed
with the goal of preserving the relationships between strata-defining variables
with other variables in the survey. We will show how these methods can be
applied together enhancing each other’s efficiency.
|
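For the first method, one simple way to obtain the stated property (a hypothetical sketch, not necessarily the authors' scheme) is to apply multiplicative noise and then linearly recalibrate so that the masked data reproduce the original sample mean and covariance exactly:

```python
import numpy as np

def mask_preserving_moments(X, noise_sd, rng):
    """Hypothetical sketch: perturb each entry with multiplicative noise,
    then linearly recalibrate the masked data so that its sample mean and
    covariance are exactly those of the original data."""
    M = X * rng.normal(1.0, noise_sd, size=X.shape)
    mu_x, S_x = X.mean(axis=0), np.cov(X, rowvar=False)
    mu_m, S_m = M.mean(axis=0), np.cov(M, rowvar=False)
    L_x = np.linalg.cholesky(S_x)
    L_m = np.linalg.cholesky(S_m)
    Z = (M - mu_m) @ np.linalg.inv(L_m).T    # whiten the masked data
    return mu_x + Z @ L_x.T                  # recolour with the original moments

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3)) + 10.0
Y = mask_preserving_moments(X, noise_sd=0.1, rng=rng)
```

Because survey estimates of means and covariances are untouched, the precision bought by a complex sample design is not wasted by the masking.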
![]() |
DLAW03 |
7th December 2016 12:00 to 12:30 |
Yulia Gel |
Bootstrapped Inference for Degree Distributions in Large Sparse Networks
We propose a new method of nonparametric bootstrap to quantify
estimation uncertainties in functions of network degree distribution in large
ultra sparse networks. Both network degree distribution and network order are
assumed to be unknown. The key idea is based on adaptation of the "blocking"
argument, developed for bootstrapping of time series and re-tiling of spatial
data, to random networks. We first sample a set of multiple ego networks of
varying orders that form a patch, or a network block analogue, and then
resample the data within patches. To select an optimal patch size, we develop a
new computationally efficient and data-driven cross-validation algorithm. In
our simulation study, we show that the new fast patchwork bootstrap (FPB)
outperforms competing approaches by providing sharper and better calibrated
confidence intervals for functions of a network degree distribution, including
the cases of networks in an ultra sparse regime. We illustrate the FPB
in application to analysis of social networks and discuss its potential utility
for nonparametric anomaly detection and privacy-preserving data mining.
|
|
DLAW03 |
7th December 2016 13:30 to 14:15 |
Sheila Bird |
Barred from work in Scottish prisons, data science to the rescue: discoveries about drugs-related deaths and women via record linkage - A part of Women in Data Science (WiDS)
Willing Anonymous HIV Salivary (WASH) surveillance
studies in Scottish prisons changed the focus on drugs in prisons - not always
for the better. Barred from work in Scottish prisons, we turned to powerful
record-linkage studies to quantify drugs-related deaths soon after
prison-release (and how to reduce them); reveal that female injectors'
risk of drugs-related death was half that of male
injectors; but that the female advantage narrowed with age; and is not evident
for methadone-specific deaths. Explanations for these strong, validated
empirical findings are the next step.
|
![]() |
DLAW03 |
7th December 2016 14:15 to 15:00 |
Christine O'Keefe |
A new relaxation of differential privacy - A part of Women in Data Science (WiDS)
Co-author: Anne-Sophie Charest
Agencies and organisations around the world are increasingly seeking to realise the value embodied in their growing data holdings, including by making data available for research and policy analysis. On the other hand, access to data must be provided in a way that protects the privacy of individuals represented in the data. In order to achieve a justifiable trade-off between these competing objectives, appropriate measures of privacy protection and data usefulness are needed. In recent years, the formal differential privacy condition has emerged as a verifiable privacy protection standard. While differential privacy has had a marked impact on theory and literature, it has had far less impact in practice. Some concerns include the possibility that the differential privacy standard is so strong that statistical outputs are altered to the point where they are no longer useful. Various relaxations have been proposed to increase the utility of outputs, although none has yet achieved widespread adoption. In this paper we describe a new relaxation of the differential privacy condition, and demonstrate some of its properties. |
![]() |
DLAW03 |
7th December 2016 15:30 to 16:15 |
Cynthia Dwork |
Theory for Society: Fairness in Classification
The uses of data and algorithms are publicly under debate on an unprecedented scale. Listening carefully to the concerns of rights advocates, researchers, and other data consumers and collectors suggests many possible current and future problems in which theoretical computer science can play a positive role. The talk will discuss the nascent mathematically rigorous study of fairness in classification and scoring.
|
![]() |
DLAW03 |
8th December 2016 09:30 to 10:15 |
Yosi (Joseph) Rinott |
Attempts to apply Differential Privacy for comparison of standard Table Builder type schemes (O'Keefe, Skinner, Shlomo)
We attempt to compare different practical perturbation schemes for frequency tables used by certain agencies, with different parameters, using the Differential Privacy parameters guaranteed by these schemes. At the same time we look at the utility of the perturbed data in terms of loss functions that are common in statistics, and also by studying the conditions required for certain statistical properties of the original table, such as independence of certain variables, to be preserved under the perturbation. The "worst case" nature of Differential Privacy appears problematic, and so do alternative definitions that rely on scenarios that are not the worst. This is ongoing work, and we still have many unanswered questions, which I will raise in the talk in the hope of help from the audience.
|
![]() |
DLAW03 |
8th December 2016 10:15 to 11:00 |
Josep Domingo-Ferrer, Jordi Soria-Comas |
Individual Differential Privacy
Differential privacy is well-known because of the strong privacy guarantees
it offers: the results of a data analysis must be indistinguishable
between data sets that differ in one record. However, the use of differential
privacy may limit the accuracy of the results significantly. Essentially, we
are limited to data analyses with small global sensitivity (although some
workarounds have been proposed that improve the accuracy of the results
when the local sensitivity is small). We introduce individual differential
privacy (iDP), a relaxation of differential privacy that: (i) preserves the
strong privacy guarantees that differential privacy gives to individuals,
and (ii) improves the accuracy of the results significantly. The improvement
in the accuracy comes from the fact that the trusted party performs a more precise assessment of the risk associated with a given data analysis.
This is possible because we allow the trusted party to take advantage of
all the available information, namely the actual data set.
|
![]() |
DLAW03 |
8th December 2016 11:30 to 12:00 |
Marco Gaboardi |
PSI : a Private data Sharing Interface
Co-authors: James Honaker (Harvard University) , Gary King (Harvard University) , Jack Murtagh (Harvard University) , Kobbi Nissim (Ben-Gurion University and CRCS Harvard University) , Jonathan Ullman (Northeastern University) , Salil Vadhan (Harvard University) We provide an overview of the design of PSI (“a Private data Sharing Interface”), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy. PSI is designed so that none of its users need expertise in privacy, computer science, or statistics. PSI enables them to make informed decisions about the appropriate use of differential privacy, the setting of privacy parameters, the partitioning of a privacy budget across different statistics, and the interpretation of errors introduced for privacy. Additionally, PSI is designed to be integrated with existing and widely used data repository infrastructures as part of a broader collection of mechanisms for the handling of privacy-sensitive data, including an approval process for accessing raw data (e.g. through IRB review), access control, and secure storage. Its initial set of differentially private algorithms were chosen to include statistics that have wide use in the social sciences, and are integrated with existing statistical software designed for modeling, interpreting, and exploring social science data. Related Links
|
![]() |
DLAW03 |
8th December 2016 12:00 to 12:30 |
Mark Bun |
The Price of Online Queries in Differential Privacy
Co-authors: Thomas Steinke
(IBM Research - Almaden), Jonathan Ullman
(Northeastern University)
We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set of allowable queries via one of three interactive models. These models capture whether the queries are given to the mechanism all in a single batch (“offline”), whether they are chosen in advance but presented to the mechanism one at a time (“online”), or whether they may be chosen by an analyst adaptively (“adaptive”). Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models might be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that many more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries, such as threshold queries over the real line. Joint work with Thomas Steinke and Jonathan Ullman. Related Links
|
![]() |
DLAW03 |
8th December 2016 13:30 to 14:15 |
Ross Anderson |
Can we have medical privacy, cloud computing and genomics all at the same time?
"The collection, linking and use of data in biomedical research and health
care: ethical issues" is a report from the Nuffield Council on Bioethics, published
last year. It took over a year to write. Our working group came from the medical
profession, academics, insurers and drug companies. As the information we gave
to our doctors in private to help them treat us is now collected and treated as
an industrial raw material, there has been scandal after scandal. From failures
of anonymisation through unethical sales to the care.data catastrophe, things
just seem to get worse. Where is it all going, and what must a medical data user
do to behave ethically? We put forward four principles. First, respect persons; do not treat their confidential data as if it were coal or bauxite. Second, respect established human-rights and data-protection law, rather than trying to find ways round it. Third, consult people who will be affected or who have morally relevant interests. And fourth, tell them what you have done – including errors and security breaches. Since medicine is the canary in the mine, we hope that the privacy lessons can be of value elsewhere – from consumer data to law enforcement and human rights. Related Links |
![]() |
DLAW03 |
8th December 2016 14:15 to 15:00 |
Paul Burton |
DataSHIELD: taking the analysis to the data not the data to the analysis
Research in modern biomedicine and social science often requires
sample sizes so large that they can only be achieved through a pooled
co-analysis of data from several studies. But the pooling of information from
individuals in a central database that may be queried by researchers raises
important governance questions and can be controversial. These reflect
important societal and professional concerns about privacy, confidentiality and
intellectual property. DataSHIELD provides a novel technological solution that
circumvents some of the most basic challenges in facilitating the access of
researchers and other healthcare professionals to individual-level data.
Commands are sent from a central analysis computer (AC) to several data
computers (DCs) that store the data to be co-analysed. Each DC is located at
one of the studies contributing data to the analysis. The data sets are
analysed simultaneously but in parallel. The separate parallelized analyses are
linked by non-disclosive summary statistics and commands that are transmitted
back and forth between the DCs and the AC. Technical implementation of
DataSHIELD employs a specially modified R statistical environment linked to an
Opal database deployed behind the computer firewall of each DC. Analysis is
then controlled through a standard R environment at the AC. DataSHIELD is most
often configured to carry out a – typically fully-efficient – analysis that is mathematically
equivalent to placing all data from all studies in one central database and
analysing them all together (with centre-effects, of course, where required).
Alternatively, it can be set up for study-level meta-analysis: estimates and
standard errors are derived independently from each study and are subject to centralized
random effects meta-analysis at the AC. DataSHIELD is being developed as a
flexible, easily extendible, open-source way to provide secure data access to a
single study or data repository as
well as for settings involving several studies. Although the talk will focus on
the version of DataSHIELD that represents our current standard implementation,
it will also explore some of our recent thinking in relation to issues such as vertically
partitioned (record linkage) data, textual data and non-disclosive graphical
visualisation.
|
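The core mechanic, in which only non-disclosive summary statistics travel between the DCs and the AC, can be illustrated with a toy pooled mean (real DataSHIELD uses a modified R environment and Opal databases; this is a schematic sketch only):

```python
def pooled_mean(studies, min_count=5):
    """Toy sketch of the DataSHIELD mechanic: each data computer (DC)
    releases only non-disclosive summaries (a sum and a count), and the
    analysis computer (AC) combines them, which for the mean is
    mathematically identical to analysing the pooled records."""
    total, n = 0.0, 0
    for dc_data in studies:              # each DC computes locally
        if len(dc_data) < min_count:     # refuse to release small, disclosive cells
            raise ValueError("cell too small to release")
        total += sum(dc_data)            # only these summaries leave the DC
        n += len(dc_data)
    return total / n                     # combined at the AC

study_a = [1.0, 2.0, 3.0, 4.0, 5.0]       # held at DC 1
study_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # held at DC 2
pooled = pooled_mean([study_a, study_b])
```

The same pattern, iterated (e.g. exchanging score vectors and information matrices), is what makes the fully-efficient pooled GLM analyses mentioned above possible without individual records ever leaving a study.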
![]() |
DLAW03 |
8th December 2016 15:30 to 16:00 |
Grigorios Loukides |
Anonymization of high-dimensional datasets
Organizations collect increasing amounts of high-dimensional data about
individuals. Examples are health record datasets containing diagnosis
information, marketing datasets containing products purchased by customers, and
web datasets containing check-ins in social networks. The sharing of such data
is increasingly needed to support applications and/or satisfy policies and
legislation. However, the high dimensionality of data makes their anonymization
difficult, both from an effectiveness and from an efficiency point of view. In
this talk, I will illustrate the problem and briefly review the main techniques used in the anonymization of high-dimensional
data. Subsequently, I will present a class of methods we have been developing
for anonymizing complex, high-dimensional data and their application to the
healthcare domain.
|
![]() |
DLAW03 |
8th December 2016 16:00 to 16:30 |
Mark Elliot | An empirical measure of attribution risk (for fully synthetic data) |
![]() |
DLAW03 |
8th December 2016 16:30 to 17:00 |
Natalie Shlomo |
Assessing Re-identification Risk in Sample Microdata
Co-author: Chris Skinner
Disclosure risk occurs when there is a high probability that an intruder can identify an individual in released sample microdata, so that confidential information may be revealed. A probabilistic modelling framework based on the Poisson log-linear model is used for quantifying disclosure risk in terms of population uniqueness when population counts are unknown. This method does not account for measurement error arising either naturally from survey processes or purposely introduced as a perturbative disclosure limitation technique. The probabilistic modelling framework for assessing disclosure risk is expanded to take into account the misclassification/perturbation, and demonstrated on sample microdata which has undergone perturbation procedures. Finally, we adapt the probabilistic modelling framework to assess the disclosure risk of samples from sub-populations and show some initial results. |
![]() |
DLAW03 |
8th December 2016 17:00 to 17:30 |
Discussion of the Future
This session is for open discussion, to share ideas about opportunities for follow-up to the DLA programme:
- both scientific, regarding what has emerged from the programme and ways that this can be taken forward;
- and more practical, regarding e.g. future events, research programmes and opportunities for interaction. |
||
DLAW03 |
9th December 2016 09:30 to 10:15 |
Ross Gayler |
Linkage and anonymisation in consumer finance
The operation of the consumer finance industry has important social and economic consequences, and is heavily regulated. The industry makes millions of decisions based on personal data that must be protected. The people who create these decision processes seek to make industry decision making practices as rational and well informed as possible, but sometimes this work is surprisingly hard. Major difficulties occur around the issues of linkage and anonymisation. I will describe how some of these practically important issues arise and play out. This is intended as a motivating example for developments in data linkage and anonymisation (no maths involved). |
|
DLAW03 |
9th December 2016 10:15 to 10:45 |
David Hand |
On anonymisation and discrimination
The perspective of anonymisation is one of ‘I don’t know who
you are, but I know this about you’, while the perspective of anti-discrimination
legislation is the complementary ‘I don’t know this about you, but I know who
you are’. I look at how organisations have attempted to comply with the law,
and show that this has led to confusion and lack of compliance. The fundamental
problem arises from ambiguous and incompatible definitions, and recent changes
to the law have made it worse. I illustrate some of the damaging adverse
consequences, for both individuals and for society, that have arisen from this
confusion.
|
![]() |
DLAW03 |
9th December 2016 11:15 to 12:00 |
Daniel Kifer |
Statistical Asymptotics with Differential Privacy
Differential privacy introduces non-ignorable noise into synthetic data and query answers. A proper statistical analysis must account for both the sampling noise in the data and the additional privacy noise. In order to accomplish this, it is often necessary to modify the asymptotic theory of statistical estimators. In this talk, we will present a formal approach to this problem, with applications to confidence intervals and hypothesis tests.
|
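The general point can be illustrated with a toy confidence interval for a Laplace-noised sample mean (a sketch of the principle, not the talk's specific estimators): the privacy noise contributes variance 2b², with b = sensitivity/epsilon, which must be added to the usual sampling variance:

```python
import math

def dp_mean_ci(noisy_mean, n, data_sd, sensitivity, epsilon, z=1.96):
    """Sketch: a valid interval around a Laplace-noised sample mean must
    add the privacy-noise variance 2*b^2 (b = sensitivity/epsilon) to the
    usual sampling variance sigma^2 / n."""
    b = sensitivity / epsilon
    se = math.sqrt(data_sd ** 2 / n + 2.0 * b ** 2)
    return noisy_mean - z * se, noisy_mean + z * se

# mean of n values bounded in [0, 1]: the sensitivity of the mean is 1/n
n = 1000
lo, hi = dp_mean_ci(noisy_mean=0.52, n=n, data_sd=0.5,
                    sensitivity=1.0 / n, epsilon=0.5)
```

Ignoring the 2b² term yields intervals that are too narrow and hypothesis tests with inflated size, which is exactly the failure mode a privacy-aware asymptotic theory is designed to avoid.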
![]() |
DLAW03 |
9th December 2016 12:00 to 12:30 |
Steven Murdoch |
Decentralising Data Collection and Anonymisation
A frequent approach for anonymising datasets is for individuals to submit
sensitive data records to a central authority. The central authority then is
responsible for safely storing and sharing the data, for example by aggregating
or perturbing records. However, this approach introduces the risk that the
central authority may be compromised, whether from an externally originated hacking attempt or as a result of an insider attack. As a result, central
authorities responsible for handling sensitive data records must be well
protected, often at great expense, and even then the risk of compromise will not
be eliminated. In this talk I will discuss an alternative anonymisation approach, where sensitive data records have identifiable information removed before being submitted to the central authority. In order for this approach to work, not only must this first-stage anonymisation prevent the data from disclosing the identity of the submitter, but also the data records must be submitted in such a way as to prevent the central authority from being able to establish the identity of the submitter from submission metadata. I will show how advances in network metadata anonymisation can be applied to facilitate this approach, including techniques to preserve validity of data despite not knowing the identity of contributors. |
![]() |
DLAW03 |
9th December 2016 13:30 to 14:15 |
Moni Naor | tba |
![]() |
DLAW03 |
9th December 2016 14:15 to 15:00 |
Silvia Polettini |
Mixed effects models with covariates perturbed for SDC
Co-author: Serena Arima (Sapienza Università di Roma)
We focus on mixed effects models with data subject to PRAM. An instance of this is a small area model. We assume that categorical covariates have been perturbed by Post Randomization, whereas the level identifier is not perturbed. We also assume that a continuous response is available, and consider a nested linear regression model: $$ y_{ij}= X_{ij}'\beta + v_{i} + e_{ij}, \quad j=1,\dots,n_{i}; \; i=1,\dots,m $$ where $v_{i} \overset{iid}{\sim} N(0,\sigma^{2}_{v})$ (model error) and $e_{ij} \overset{iid}{\sim} N(0,\sigma^{2}_{e})$ (design error). We resort to a measurement error model and define a unit-level small area model accounting for measurement error in discrete covariates. PRAM is defined in terms of a transition matrix $P$ modelling the changes in categories; we consider both the case of known $P$ and the case when $P$ is unknown and is estimated from the data. A small simulation study is conducted to assess the effectiveness of the proposed Bayesian measurement error model in estimating the model parameters, and to investigate the protection provided by PRAM in this context. |
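The PRAM perturbation itself is easy to sketch: each category is recoded by a draw from the corresponding row of the transition matrix P (the 3-category P below, with retention probability 0.8, is purely illustrative):

```python
import numpy as np

def pram(categories, P, rng):
    """Post Randomization (PRAM): each category c is replaced by a draw
    from row c of the transition matrix P, so P[c, d] is the probability
    of recoding category c as category d. Rows of P must sum to 1."""
    assert np.allclose(P.sum(axis=1), 1.0)
    return np.array([rng.choice(len(P), p=P[c]) for c in categories])

# 3 categories: keep with probability 0.8, move to either other with 0.1
P = np.full((3, 3), 0.1) + np.eye(3) * 0.7
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)
x_pram = pram(x, P, rng)
```

Because P is known (or estimable), analyses such as the measurement-error model above can correct for the perturbation rather than simply absorbing it as bias.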
![]() |