Timetable (DLAW03)

New developments in data privacy

Monday 5th December 2016 to Friday 9th December 2016

Monday 5th December 2016
09:30 to 17:00 'New Approaches to Anonymisation - Open for Business Event' INI 1
Tuesday 6th December 2016
09:30 to 17:00 'Engaging People in Data Privacy - Open for Business Event' INI 1
Wednesday 7th December 2016
09:00 to 09:50 Registration
09:50 to 10:00 Introduction by the Organisers INI 1
10:00 to 10:45 Bradley Malin
Gaming the System for Privacy
Over the past several decades, there has been a cat-and-mouse game played between data protectors and data users.  It seems as though every time a new model of data privacy is posited, a new attack is published along with high-profile demonstrations of its failure to guarantee protection.  This has, on many occasions, led to outcries about how privacy is either dead, is dying, or was a mere myth and never existed in the first place.  The goal of this talk is to review how we got to this juncture in time and suggest that data privacy may not be dead, but that our technical definitions of the problem may require augmentation to account for real-world adversarial settings.  In doing so, I will introduce a new direction in privacy, which is rooted in a formal game-theoretic framework, and will provide examples of how this view can provide greater flexibility with respect to several classic privacy problems, including the publication of individual-level records and summary statistics.
10:45 to 11:15 Morning Coffee
11:15 to 11:30 Introduction - A part of Women in Data Science (WiDS) INI 1
11:30 to 12:00 Anna Oganian
Combining statistical disclosure limitation methods to preserve relationships and data-specific constraints in survey data.
Applications of data swapping and noise are among the most widely used methods for Statistical Disclosure Limitation (SDL) by statistical agencies for public-use non-interactive data release. The core ideas of swapping and noise are conceptually easy to understand and are naturally suited for masking purposes. We believe that they are worth revisiting with a special emphasis given to the utility aspects of these methods and to the ways of combining the methods to increase their efficiency and reliability.  Indeed, many data collecting agencies use complex sample designs to increase the precision of their estimates and often allocate additional funds to obtain larger samples for particular groups in the population. Thus, it is particularly undesirable and counterproductive when SDL methods applied to these data significantly change the magnitude of estimates and/or their levels of precision. We will present and discuss two methods of disclosure limitation based on swapping and noise, which can work together in synergy while protecting continuous and categorical variables. The first method is a version of multiplicative noise that preserves means and covariance together with some structural constraints in the data. The second method is loosely based on swapping. It is designed with the goal of preserving the relationships between strata-defining variables with other variables in the survey. We will show how these methods can be applied together enhancing each other’s efficiency.
12:00 to 12:30 Yulia Gel
Bootstrapped Inference for Degree Distributions in Large Sparse Networks
We propose a new method of nonparametric bootstrap to quantify estimation uncertainties in functions of network degree distribution in large ultra sparse networks. Both network degree distribution and network order are assumed to be unknown. The key idea is based on adaptation of the "blocking" argument, developed for bootstrapping of time series and re-tiling of spatial data, to random networks. We first sample a set of multiple ego networks of varying orders that form a patch, or a network block analogue, and then resample the data within patches. To select an optimal patch size, we develop a new computationally efficient and data-driven cross-validation algorithm. In our simulation study, we show that the new fast patchwork bootstrap (FPB) outperforms competing approaches by providing sharper and better calibrated confidence intervals for functions of a network degree distribution, including the cases of networks in an ultra sparse regime. We illustrate the FPB in application to analysis of social networks and discuss its potential utility for nonparametric anomaly detection and privacy-preserving data mining.
12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 14:15 Sheila Bird
Barred from work in Scottish prisons, data science to the rescue: discoveries about drugs-related deaths and women via record linkage - A part of Women in Data Science (WiDS)
Willing Anonymous HIV Salivary (WASH) surveillance studies in Scottish prisons changed the focus on drugs in prisons - not always for the better. Barred from work in Scottish prisons, we turned to powerful record-linkage studies to quantify drugs-related deaths soon after prison-release (and how to reduce them); reveal that female injectors' risk of drugs-related death was half that of male injectors; but that the female advantage narrowed with age; and is not evident for methadone-specific deaths. Explanations for these strong, validated empirical findings are the next step.
14:15 to 15:00 Christine O'Keefe
A new relaxation of differential privacy - A part of Women in Data Science (WiDS)
Co-author: Anne-Sophie Charest 

Agencies and organisations around the world are increasingly seeking to realise the value embodied in their growing data holdings, including by making data available for research and policy analysis. On the other hand, access to data must be provided in a way that protects the privacy of individuals represented in the data. In order to achieve a justifiable trade-off between these competing objectives, appropriate measures of privacy protection and data usefulness are needed.  

In recent years, the formal differential privacy condition has emerged as a verifiable privacy protection standard. While differential privacy has had a marked impact on theory and literature, it has had far less impact in practice. Some concerns include the possibility that the differential privacy standard is so strong that statistical outputs are altered to the point where they are no longer useful. Various relaxations have been proposed to increase the utility of outputs, although none has yet achieved widespread adoption. In this paper we describe a new relaxation of the differential privacy condition, and demonstrate some of its properties. 
15:00 to 15:30 Afternoon Tea
15:30 to 16:15 Cynthia Dwork
Theory for Society: Fairness in Classification
The uses of data and algorithms are publicly under debate on an unprecedented scale. Listening carefully to the concerns of rights advocates, researchers, and other data consumers and collectors suggests many possible current and future problems in which theoretical computer science can play a positive role.  The talk will discuss the nascent mathematically rigorous study of fairness in classification and scoring.
16:30 to 18:30 Women's Round Table Event and Reception
19:30 to 22:00 Formal Dinner at Christ's College
Thursday 8th December 2016
09:30 to 10:15 Yosi (Joseph) Rinott
Attempts to apply Differential Privacy for comparison of standard Table Builder type schemes (O'Keefe, Skinner, Shlomo)
We try to compare different practical perturbation schemes for frequency tables used by certain agencies, with different parameters, using the parameters of Differential Privacy guaranteed by these schemes. At the same time we look at the utility of the perturbed data in terms of loss functions that are common in statistics, and also by trying to study the conditions required for certain statistical properties of the original table, such as independence of certain variables, to be preserved under the perturbation. The 'worst case' nature of Differential Privacy appears problematic, and so do alternative definitions, which rely on scenarios that are not the worst. This is ongoing work, and we still have many unanswered questions, which I will raise in the talk in the hope of help from the audience.
10:15 to 11:00 Josep Domingo-Ferrer ; Jordi Soria-Comas
Individual Differential Privacy
Differential privacy is well known because of the strong privacy guarantees it offers: the results of a data analysis must be indistinguishable between data sets that differ in one record. However, the use of differential privacy may limit the accuracy of the results significantly. Essentially, we are limited to data analyses with small global sensitivity (although some workarounds have been proposed that improve the accuracy of the results when the local sensitivity is small). We introduce individual differential privacy (iDP), a relaxation of differential privacy that: (i) preserves the strong privacy guarantees that differential privacy gives to individuals, and (ii) improves the accuracy of the results significantly. The improvement in accuracy comes from the fact that the trusted party makes a more precise assessment of the risk associated with a given data analysis. This is possible because we allow the trusted party to take advantage of all the available information, namely the actual data set.
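To make the role of global sensitivity concrete, here is a minimal Python sketch of the standard Laplace mechanism applied to a counting query. It illustrates ordinary ε-differential privacy only, not the iDP relaxation introduced in the talk; the function names, the toy data and the choice ε = 0.5 are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records satisfying `predicate`.

    A count changes by at most 1 when a single record changes, so its
    global sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of seven respondents.
rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 48]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of respondents aged 40+: {noisy:.2f}")
```

The noise scale is the sensitivity divided by ε, so a query whose answer can change a great deal when one record changes needs proportionally more noise. That is the accuracy limitation the abstract refers to.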
11:00 to 11:30 Morning Coffee
11:30 to 12:00 Marco Gaboardi
PSI: a Private data Sharing Interface
Co-authors: James Honaker (Harvard University), Gary King (Harvard University), Jack Murtagh (Harvard University), Kobbi Nissim (Ben-Gurion University and CRCS Harvard University), Jonathan Ullman (Northeastern University), Salil Vadhan (Harvard University)

We provide an overview of the design of PSI (“a Private data Sharing Interface”), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
PSI is designed so that none of its users need expertise in privacy, computer science, or statistics. PSI enables them to make informed decisions about the appropriate use of differential privacy, the setting of privacy parameters, the partitioning of a privacy budget across different statistics, and the interpretation of errors introduced for privacy.
Additionally, PSI is designed to be integrated with existing and widely used data repository infrastructures as part of a broader collection of mechanisms for the handling of privacy-sensitive data, including an approval process for accessing raw data (e.g. through IRB review), access control, and secure storage.
Its initial set of differentially private algorithms was chosen to include statistics that have wide use in the social sciences, and these algorithms are integrated with existing statistical software designed for modeling, interpreting, and exploring social science data.

12:00 to 12:30 Mark Bun
The Price of Online Queries in Differential Privacy
Co-authors: Thomas Steinke (IBM Research - Almaden), Jonathan Ullman (Northeastern University)
We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set of allowable queries via one of three interactive models. These models capture whether the queries are given to the mechanism all in a single batch (“offline”), whether they are chosen in advance but presented to the mechanism one at a time (“online”), or whether they may be chosen by an analyst adaptively (“adaptive”).
Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models might be equivalent.
We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that many more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries, such as threshold queries over the real line.
Joint work with Thomas Steinke and Jonathan Ullman.

12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 14:15 Ross Anderson
Can we have medical privacy, cloud computing and genomics all at the same time?
"The collection, linking and use of data in biomedical research and health care: ethical issues" is a report from the Nuffield Bioethics Council, published last year. It took over a year to write. Our working group came from the medical profession, academics, insurers and drug companies. As the information we gave to our doctors in private to help them treat us is now collected and treated as an industrial raw material, there has been scandal after scandal. From failures of anonymisation through unethical sales to the catastrophe, things just seem to get worse. Where is it all going, and what must a medical data user do to behave ethically?

We put forward four principles. First, respect persons; do not treat their confidential data as if it were coal or bauxite. Second, respect established human-rights and data-protection law, rather than trying to find ways round it. Third, consult people who’ll be affected or who have morally relevant interests. And fourth, tell them what you’ve done – including errors and security breaches.

Since medicine is the canary in the mine, we hope that the privacy lessons can be of value elsewhere – from consumer data to law enforcement and human rights.

14:15 to 15:00 Paul Burton
DataSHIELD: taking the analysis to the data not the data to the analysis
Research in modern biomedicine and social science often requires sample sizes so large that they can only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important governance questions and can be controversial. These reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that circumvents some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data. Commands are sent from a central analysis computer (AC) to several data computers (DCs) that store the data to be co-analysed. Each DC is located at one of the studies contributing data to the analysis. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands that are transmitted back and forth between the DCs and the AC. Technical implementation of DataSHIELD employs a specially modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is then controlled through a standard R environment at the AC. DataSHIELD is most often configured to carry out a – typically fully-efficient – analysis that is mathematically equivalent to placing all data from all studies in one central database and analysing them all together (with centre-effects, of course, where required). Alternatively, it can be set up for study-level meta-analysis: estimates and standard errors are derived independently from each study and are subject to centralized random effects meta-analysis at the AC. 
DataSHIELD is being developed as a flexible, easily extendible, open-source way to provide secure data access to a single study or data repository as well as for settings involving several studies. Although the talk will focus on the version of DataSHIELD that represents our current standard implementation, it will also explore some of our recent thinking in relation to issues such as vertically partitioned (record linkage) data, textual data and non-disclosive graphical visualisation. 
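The idea of linking parallel analyses through summary statistics can be sketched in a few lines of Python. This is an illustration of the principle only, not the DataSHIELD API (which is R-based): each simulated data computer returns just a count, a sum and a sum of squares, and the analysis computer combines them into the same mean and variance that a central database would give. The function names, the toy data and the `MIN_SITE_SIZE` threshold are assumptions.

```python
import statistics

# Hypothetical disclosure control: refuse to summarise very small sites.
MIN_SITE_SIZE = 3


def site_summary(values):
    """Run at a data computer: return only aggregate statistics."""
    if len(values) < MIN_SITE_SIZE:
        raise ValueError("site too small to release summary statistics")
    return len(values), sum(values), sum(v * v for v in values)


def pooled_mean_variance(summaries):
    """Run at the analysis computer: combine the per-site aggregates."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


# Toy data held separately at three studies; never pooled in the analysis.
sites = [[5.1, 6.3, 4.8], [7.0, 5.5, 6.2], [6.1, 6.4, 5.9, 5.2]]
mean, variance = pooled_mean_variance([site_summary(s) for s in sites])
print(f"pooled mean = {mean:.3f}, pooled variance = {variance:.3f}")
```

The pooled results match what `statistics.mean` and `statistics.variance` would return on the concatenated data, which is the sense in which such an analysis is mathematically equivalent to a centralised one.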
15:00 to 15:30 Afternoon Tea
15:30 to 16:00 Grigorios Loukides
Anonymization of high-dimensional datasets
Organizations collect increasing amounts of high-dimensional data about individuals. Examples are health record datasets containing diagnosis information, marketing datasets containing products purchased by customers, and web datasets containing check-ins in social networks. The sharing of such data is increasingly needed to support applications and/or satisfy policies and legislation. However, the high dimensionality of data makes their anonymization difficult, both from an effectiveness and from an efficiency point of view. In this talk, I will illustrate the problem and briefly review the main techniques used in the anonymization of high-dimensional data. Subsequently, I will present a class of methods we have been developing for anonymizing complex, high-dimensional data and their application to the healthcare domain. 
16:00 to 16:30 Mark Elliot
An empirical measure of attribution risk (for fully synthetic data)
16:30 to 17:00 Natalie Shlomo
Assessing Re-identification Risk in Sample Microdata
Co-author: Chris Skinner      

Disclosure risk occurs when there is a high probability that an intruder can identify an individual in released sample microdata and confidential information may be revealed. A probabilistic modelling framework based on the Poisson log-linear model is used for quantifying disclosure risk in terms of population uniqueness when population counts are unknown. This method does not account for measurement error arising either naturally from survey processes or purposely introduced as a perturbative disclosure limitation technique. The probabilistic modelling framework for assessing disclosure risk is expanded to take into account the misclassification/perturbation and demonstrated on sample microdata which has undergone perturbation procedures. Finally, we adapt the probabilistic modelling framework to assess the disclosure risk of samples from sub-populations and show some initial results.
17:00 to 17:30 Discussion of the Future
This session is for open discussion, to share ideas about opportunities for follow-up to the DLA programme:
- both scientific, regarding what has emerged from the programme and ways that this can be taken forward;
- and more practical, regarding e.g. future events, research programmes and opportunities for interaction.
Friday 9th December 2016
09:30 to 10:15 Ross Gayler
Linkage and anonymisation in consumer finance
The operation of the consumer finance industry has important social and economic consequences, and is heavily regulated. The industry makes millions of decisions based on personal data that must be protected. The people who create these decision processes seek to make industry decision making practices as rational and well informed as possible, but sometimes this work is surprisingly hard. Major difficulties occur around the issues of linkage and anonymisation. I will describe how some of these practically important issues arise and play out. This is intended as a motivating example for developments in data linkage and anonymisation (no maths involved).
10:15 to 10:45 David Hand
On anonymisation and discrimination
The perspective of anonymisation is one of ‘I don’t know who you are, but I know this about you’, while the perspective of anti-discrimination legislation is the complementary ‘I don’t know this about you, but I know who you are’. I look at how organisations have attempted to comply with the law, and show that this has led to confusion and lack of compliance. The fundamental problem arises from ambiguous and incompatible definitions, and recent changes to the law have made it worse. I illustrate some of the damaging adverse consequences, for both individuals and for society, that have arisen from this confusion.
10:45 to 11:15 Morning Coffee
11:15 to 12:00 Daniel Kifer
Statistical Asymptotics with Differential Privacy
Differential privacy introduces non-ignorable noise into synthetic data and query answers. A proper statistical analysis must account for both the sampling noise in the data and the additional privacy noise. In order to accomplish this, it is often necessary to modify the asymptotic theory of statistical estimators. In this talk, we will present a formal approach to this problem, with applications to confidence intervals and hypothesis tests.
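A simple worked case shows why both noise sources matter. This is a sketch under stated assumptions, not the approach presented in the talk: suppose the mean $\bar{x}$ of $n$ independent values in $[0,1]$ with variance $\sigma^{2}$ is released with independent Laplace noise $Z$ of scale $b = 1/(n\varepsilon)$. The symbols $n$, $\sigma^{2}$, $b$ and $\varepsilon$ are introduced here for illustration.

$$\tilde{x} = \bar{x} + Z, \qquad \operatorname{Var}(\tilde{x}) = \frac{\sigma^{2}}{n} + 2b^{2} = \frac{\sigma^{2}}{n} + \frac{2}{n^{2}\varepsilon^{2}}.$$

A confidence interval built from $\sigma^{2}/n$ alone would therefore be too narrow. The privacy term shrinks faster than the sampling term as $n$ grows, but it can dominate when $n$ or $\varepsilon$ is small.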
12:00 to 12:30 Steven Murdoch
Decentralising Data Collection and Anonymisation
A frequent approach for anonymising datasets is for individuals to submit sensitive data records to a central authority. The central authority is then responsible for safely storing and sharing the data, for example by aggregating or perturbing records. However, this approach introduces the risk that the central authority may be compromised, whether from an externally originated hacking attempt or as a result of an insider attack. As a result, central authorities responsible for handling sensitive data records must be well protected, often at great expense, and even then the risk of compromise will not be eliminated.

In this talk I will discuss an alternative anonymisation approach, where sensitive data records have identifiable information removed before being submitted to the central authority. In order for this approach to work, not only must this first-stage anonymisation prevent the data from disclosing the identity of the submitter, but also the data records must be submitted in such a way as to prevent the central authority from being able to establish the identity of the submitter from submission metadata. I will show how advances in network metadata anonymisation can be applied to facilitate this approach, including techniques to preserve validity of data despite not knowing the identity of contributors.
12:30 to 13:30 Lunch @ Wolfson Court
13:30 to 14:15 Moni Naor
14:15 to 15:00 Silvia Polettini
Mixed effects models with covariates perturbed for SDC
Co-author: Serena Arima (Sapienza Università di Roma)

We focus on mixed effects models with data subject to PRAM. An instance of this is a small area model. We assume that categorical covariates have been perturbed by Post Randomization, whereas the level identifier is not perturbed. We also assume that a continuous response is available, and consider a nested linear regression model:
$$y_{ij} = X_{ij}'\beta + v_{i} + e_{ij}, \qquad j=1,\dots,n_{i}; \quad i=1,\dots,m,$$
where $v_{i} \overset{\text{iid}}{\sim} N(0,\sigma^{2}_{v})$ (model error) and $e_{ij} \overset{\text{iid}}{\sim} N(0,\sigma^{2}_{e})$ (design error).

We resort to a measurement error model and define a unit-level small area model accounting for measurement error in discrete covariates. PRAM is defined in terms of a transition matrix $P$ modeling the changes in categories; we consider both the case of known $P$, and the case when $P$ is unknown and is estimated from the data.

A small simulation study is conducted to assess the effectiveness of the proposed Bayesian measurement error model in estimating the model parameters and to investigate the protection provided by PRAM in this context.
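For readers unfamiliar with PRAM, the Python sketch below shows how a known transition matrix $P$ perturbs a categorical covariate, and how knowing $P$ lets an analyst correct a simple frequency estimate. It uses a moment estimator for a two-category variable, not the Bayesian measurement error model of the talk; the category labels, the entries of `P` and the function names are assumptions for illustration.

```python
import random


def pram(values, categories, P, rng):
    """Replace each value by a category drawn from the row of P
    that corresponds to its true category."""
    index = {c: k for k, c in enumerate(categories)}
    return [rng.choices(categories, weights=P[index[v]])[0] for v in values]


def estimate_true_share(perturbed, categories, P):
    """Moment estimator of the true share of categories[0] for a known
    2x2 matrix P, using q0 = (1 - b) + p0 * (a + b - 1)."""
    a, b = P[0][0], P[1][1]
    q0 = perturbed.count(categories[0]) / len(perturbed)
    return (q0 - (1.0 - b)) / (a + b - 1.0)


rng = random.Random(42)
categories = ["urban", "rural"]
# P[k][l] = probability that true category k is released as category l.
P = [[0.9, 0.1],
     [0.2, 0.8]]
original = ["urban"] * 6000 + ["rural"] * 14000  # true urban share = 0.3
perturbed = pram(original, categories, P, rng)
share_hat = estimate_true_share(perturbed, categories, P)
print(f"observed urban share:  {perturbed.count('urban') / len(perturbed):.3f}")
print(f"corrected urban share: {share_hat:.3f}")
```

The observed share is biased towards the mixing implied by `P`, while the corrected estimate recovers the true share up to sampling error. When `P` is unknown, as in the second case the abstract considers, this correction is not available directly and `P` has to be estimated as well.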
15:00 to 15:30 Afternoon Tea