Data compression with statistical guarantees

Presented by: Sylvia Richardson
Monday 3rd July 2017 - 13:30 to 14:15
INI Seminar Room 1
Joint talk with Daniel Ahfock (MRC Biostatistics Unit @ University of Cambridge)

The talk is concerned with translating recent ideas from computer science on probabilistic data-compression techniques into a statistical framework that can be ‘safely’ applied to speed up linear regression analyses for very large sample sizes in biomedicine.

Our motivation is to facilitate the use of multivariate regression and model exploration in tall data sets so that, for example, genetic association analyses carried out on hundreds of thousands of subjects can investigate multivariate effects for a set of explanatory features, rather than being restricted to one-feature-at-a-time associations for reasons of computational feasibility.

Among the many approaches to dealing with tall data, probabilistic data-compression techniques based on random linear mappings, developed in the computer science community and known as sketching, are particularly suitable for linear regression problems. In the first part of the talk, we will present a hierarchical representation of sketching, which allows the distributional properties of different sketching algorithms to be derived. In particular, we will discuss how the signal-to-noise ratio in the original data set affects the choice of sketching algorithm. In the second part of the talk, we will further refine some of the approximation guarantees and consider iterative sketches. The talk will be illustrated with a genetic analysis of the link between a blood cell trait and the HLA region, involving a sample of 130,000 people.
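
As a rough illustration of what sketching means for linear regression, the following minimal Python example compresses a tall data set with a random linear mapping and then fits ordinary least squares on the compressed data. The abstract does not say which sketching algorithms the talk covers, so the Gaussian sketch used here is an assumption chosen for simplicity. The function name `gaussian_sketch_ols`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np


def gaussian_sketch_ols(X, y, k, rng):
    """Least squares on Gaussian-sketched data (illustrative assumption,
    not necessarily a sketch discussed in the talk)."""
    n = X.shape[0]
    # Random k x n mapping with i.i.d. N(0, 1/k) entries, so E[S.T @ S] = I.
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    # Compress n rows down to k rows; X and y must share the same S.
    X_s = S @ X
    y_s = S @ y
    # Ordinary least squares on the compressed data set.
    beta_s, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    return beta_s


# Simulated tall data (hypothetical): n observations, p features.
rng = np.random.default_rng(0)
n, p, k = 5000, 5, 500
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=n)

# Full-data fit for comparison with the sketched fit.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = gaussian_sketch_ols(X, y, k, rng)

print("full data:", np.round(beta_full, 3))
print("sketched :", np.round(beta_sketch, 3))
```

The sketched estimate is noisier than the full-data estimate because the random mapping adds its own variability on top of the sampling noise. A dense Gaussian mapping also costs as much to apply as solving the original problem, so it serves here only to show the idea; structured or sparse random mappings are normally used when speed is the goal.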