# Timetable (STSW01)

## Theoretical and algorithmic underpinnings of Big Data

Monday 15th January 2018 to Friday 19th January 2018

### Monday 15th January 2018

**09:30 to 09:50** Registration

**09:50 to 10:00** Welcome from Christie Marr (INI Deputy Director)

**10:00 to 10:45** Michael Jordan (University of California, Berkeley): *On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex* (INI 1)

Many new theoretical challenges have arisen in the area of gradient-based optimization for large-scale statistical data analysis, driven by the needs of applications and the opportunities provided by new hardware and software platforms. I discuss several related, recent results in this area: (1) a new framework for understanding Nesterov acceleration, obtained by taking a continuous-time, Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to escape saddle points efficiently in nonconvex optimization, and (3) the acceleration of Langevin diffusion.

**10:45 to 11:10** Morning Coffee

**11:10 to 11:55** Rina Foygel Barber (University of Chicago): *Robust inference with the knockoff filter* (INI 1)

In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating knockoff copies of each feature and testing the model selection algorithm for its ability to distinguish true from false covariates, in order to control the false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, as we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL-divergence-type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candès and Richard Samworth.

**11:55 to 12:40** Jonas Peters (University of Copenhagen): *The Hardness of Conditional Independence Testing* (INI 1)

Testing for conditional independence between continuous random variables is considered to be a hard statistical problem. We provide a formalization of this statement and show that a test with correct size does not have power against any alternative. It is thus impossible to obtain any guarantee if the relationships between the random variables can be arbitrarily complex. We propose a practical test that achieves the correct size if the conditional expectations are smooth enough that they can be estimated from data. This is joint work with Rajen Shah.

**12:40 to 14:10** Lunch @ Churchill College

**14:10 to 14:55** Jianqing Fan (Princeton University): *Uniform perturbation analysis of eigenspaces and its applications to Community Detection, Ranking and Beyond* (INI 1)

Co-authors: Emmanuel Abbe (Princeton University), Kaizheng Wang (Princeton University), Yiqiao Zhong (Princeton University)

Spectral methods have been widely used for a large class of challenging problems, ranging from top-K ranking via pairwise comparisons to community detection and factor analysis, among others. Analyses of these spectral methods require sup-norm perturbation analysis of top eigenvectors. This allows us to *uniformly* approximate elements in eigenvectors by linear functions of the observed random matrix that can be analyzed further.

We first establish such an infinity-norm perturbation bound for top eigenvectors and apply the idea to several challenging problems such as top-K ranking, community detection, $\mathbb{Z}_2$-synchronization and matrix completion. We show that the spectral methods are indeed optimal for these problems. We illustrate these methods via simulations.

**14:55 to 15:40** Nicolai Meinshausen (ETH Zürich): *Instruments, Images, and Anchors* (INI 1)

Co-authors: Dominik Rothenhäusler, Peter Bühlmann, Christina Heinze (ETH Zurich), and Jonas Peters (University of Copenhagen)

I will present some ongoing work on two related projects. First, joint work with Jonas Peters, Dominik Rothenhäusler and Peter Bühlmann on anchor regression, which puts both standard regression and causal regression into a common framework and allows one to interpolate between the two in a continuous fashion. Anchor regression can be seen as a least-squares term with a causal, instrumental-variable-type penalty. The chosen penalty is shown to correspond to the expected future magnitude of interventions, and some risk bounds can be derived for the high-dimensional setting. Second, in joint work with Christina Heinze, I will show an application of related ideas to a (non-causal) setting in image classification.

**15:40 to 16:10** Afternoon Tea

**16:10 to 16:55** Gareth Roberts (University of Warwick): *The zig-zag and super-efficient sampling for Bayesian analysis of big data* (INI 1)

Standard MCMC methods can scale poorly to big data settings due to the need to evaluate the likelihood at each iteration. There have been a number of approximate MCMC algorithms that use sub-sampling ideas to reduce this computational burden, but with the drawback that these algorithms no longer target the true posterior distribution. The talk will discuss a new family of Monte Carlo methods based upon a multi-dimensional version of the Zig-Zag process of Bierkens and Roberts (2016), a continuous-time piecewise deterministic Markov process. While traditional MCMC methods are reversible by construction, the Zig-Zag process offers a flexible non-reversible alternative. The dynamics of the Zig-Zag process correspond to a constant-velocity model, with the velocity of the process switching at events from a point process. The rate of this point process can be related to the invariant distribution of the process: if we wish to target a given posterior distribution, then rates need to be set equal to the gradient of the log of the posterior. Unlike traditional MCMC, the Zig-Zag process can be simulated without discretisation error, and I will give conditions for the process to be ergodic. Most importantly, I will discuss two generalisations which have good scaling properties for big data: firstly, a sub-sampling version of the Zig-Zag process that is an example of an exact approximate scheme; and secondly, a control-variate variant of the sub-sampling idea to reduce the variance of our unbiased estimator. Very recent ergodic theory will also be described. (A toy one-dimensional sketch of the basic Zig-Zag dynamics appears after this day's listing.)

**17:00 to 18:00** Welcome Wine Reception at INI
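The Zig-Zag dynamics described in the 16:10 abstract are simple enough to illustrate in a few lines. Below is a minimal sketch, assuming a one-dimensional standard normal target: the state moves at constant velocity $\pm 1$ and the velocity flips at events of a Poisson process with rate $\max(0, \theta U'(x))$ for the potential $U(x) = x^2/2$. It only illustrates the basic process, is not the speaker's code, and omits the multi-dimensional, sub-sampling and control-variate versions discussed in the talk; all function names are illustrative.

```python
import numpy as np

# Toy 1-D Zig-Zag sampler for a standard normal target, pi(x) ~ exp(-x^2/2).
# State (x, theta) with velocity theta in {-1, +1}; the velocity flips at events
# of a Poisson process with rate lambda(x, theta) = max(0, theta * U'(x)),
# where U(x) = x^2/2, so lambda(x, theta) = max(0, theta * x).
def zigzag_gaussian(T=10_000.0, x0=0.0, theta0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, theta, t = x0, theta0, 0.0
    skeleton = [(t, x)]                                   # event times and positions
    while t < T:
        a = theta * x                                     # rate at the current state
        E = rng.exponential(1.0)
        tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * E)    # exact first-event time
        x += theta * tau                                  # constant-velocity move
        t += tau
        theta = -theta                                    # flip velocity at the event
        skeleton.append((t, x))
    return np.array(skeleton)

if __name__ == "__main__":
    skel = zigzag_gaussian()
    times, xs = skel[:, 0], skel[:, 1]
    dt = np.diff(times)
    # Exact integral of x(t)^2 over each linear segment; the time average should be ~1.
    seg = (xs[:-1] ** 2 + xs[:-1] * xs[1:] + xs[1:] ** 2) / 3.0
    print("E[x^2] estimate:", np.sum(seg * dt) / times[-1])
```

For this Gaussian potential the event times can be sampled exactly by inverting the integrated switching rate, so no discretisation error is introduced; for a general posterior one would instead bound the rate and simulate events by Poisson thinning.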
### Tuesday 16th January 2018

**09:00 to 09:45** Yi Yu (University of Cambridge; University of Bristol): *Optimal Covariance Change Point Detection in High Dimension* (INI 1)

Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University)

In this paper, we study the covariance change point detection problem in high dimensions. Specifically, we assume that the time series $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, are independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and only change at certain time points. Our generic model setting allows $p$ to grow with $n$, and we do not place any additional structural assumptions on the covariance matrices. We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish consistency results under suitable conditions. To improve the detection performance in high dimensions, we propose Wild Binary Segmentation through Independent Projection (WBSIP). We show that WBSIP can optimally estimate the locations of the change points. Our analysis also reveals a phase transition effect based on our generic model assumption; to the best of our knowledge, this type of result has not been established elsewhere in the change point detection literature.

**09:45 to 10:30** Alexandre Tsybakov (CREST: Centre de Recherche en Économie et Statistique): *Adaptive estimation of functionals under sparsity* (INI 1)

Adaptive estimation of functionals in the sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the $L_2$-norm of the target vector and of the variance of the noise. In the first problem, ignorance of the noise level causes changes in the rates. Moreover, the form of the noise distribution also influences the optimal rate. For example, the rates of estimating the variance differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution. Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, independently of how fast the polynomial decay is. Joint work with O. Collier and L. Comminges.

**10:30 to 11:00** Morning Coffee

**11:00 to 11:45** Claire Boyer (Université Pierre et Marie Curie, Paris): *On the gap between local recovery guarantees in structured compressed sensing and oracle estimates* (INI 1)

This is ongoing work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby). First we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees to ensure reconstruction of a structured sparse signal of interest while imposing structure in the acquisition (no Gaussian measurements here...). We will study how far those CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements. These results give insight into designing new optimal sampling strategies when realistic physical constraints are imposed in the acquisition.

**11:45 to 12:30** Ioannis Kontoyiannis (University of Cambridge; Athens University of Economics and Business): *Small Big Data: Temporal structure in discrete time series* (INI 1)

The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning.

Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modeling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets. We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC. We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including on-line prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al.'s VLMC, Ron et al.'s PST, and Raftery's MTD) will be discussed.

**12:30 to 14:00** Lunch @ Churchill College

**14:00 to 14:45** Judith Rousseau (University of Oxford; Université Paris-Dauphine): *Asymptotics for ABC algorithms* (INI 1)

Approximate Bayesian Computation (ABC) algorithms are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution, one possibility is to sample new data from the model and check whether these new data are close, in some sense, to the true data. The output of the algorithm thus depends on how we define this notion of closeness, which is based on a choice of summary statistics and on a threshold. In this work we study the behaviour of the algorithm under the assumption that the summary statistics concentrate on some deterministic quantity, and we characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated, where we show that surprising asymptotic behaviour appears. (A toy sketch of the basic rejection-ABC scheme appears after the next abstract.)

**14:45 to 15:30** Andreas Elsener (ETH Zürich): *Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators* (INI 1)

Co-author: Sara van de Geer (ETH Zurich)

Nonconvex loss functions are used in several areas of statistics and machine learning. They have several appealing properties, as in the case of robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss. The penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms different from the $\ell_1$-norm.
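As flagged in the 14:00 abstract above, here is a minimal sketch of the basic rejection-ABC scheme it describes: draw a parameter from the prior, simulate data from the model, and keep the draw if a summary of the simulated data falls within a threshold of the observed summary. The example (inferring a normal mean, with the sample mean as summary statistic) and all names in it are purely illustrative; the talk's interest is in how the resulting approximate posterior behaves as the threshold shrinks and the summaries concentrate.

```python
import numpy as np

# Toy rejection-ABC sketch: infer the mean mu of N(mu, 1) data from n observations,
# using the sample mean as summary statistic and a hard acceptance threshold eps.
def abc_rejection(y_obs, prior_sampler, simulator, summary, eps, n_draws, seed=0):
    rng = np.random.default_rng(seed)
    s_obs = summary(y_obs)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)                  # draw a parameter from the prior
        y_sim = simulator(theta, y_obs.size, rng)   # simulate a data set of the same size
        if abs(summary(y_sim) - s_obs) <= eps:      # keep draws with close summaries
            accepted.append(theta)
    return np.array(accepted)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(loc=2.0, scale=1.0, size=200)    # observed data, true mu = 2
    post = abc_rejection(
        y_obs=y,
        prior_sampler=lambda r: r.normal(0.0, 10.0),      # N(0, 10^2) prior on mu
        simulator=lambda mu, n, r: r.normal(mu, 1.0, n),  # model: N(mu, 1)
        summary=np.mean,
        eps=0.1,
        n_draws=50_000,
    )
    print(len(post), "accepted; posterior mean estimate:", post.mean())
```

Shrinking `eps` (or enriching the summary statistics) brings the accepted draws closer to the exact posterior at the cost of a lower acceptance rate, which is exactly the trade-off whose asymptotics the talk characterises.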
**15:30 to 16:00** Afternoon Tea

**16:00 to 16:45** Rajen Shah (University of Cambridge): *The xyz algorithm for fast interaction search in high-dimensional data*

When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close pairs of points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for $1 < u < 2$.

### Wednesday 17th January 2018

**09:00 to 09:45** Alessandro Rudi (INRIA; ENS - Paris): *Falkon: fast and optimal kernel method for large scale machine learning* (INI 1)

Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that allows one to efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures.

**09:45 to 10:30** Philippe Rigollet (Massachusetts Institute of Technology): *Learning determinantal point processes* (INI 1)

Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT)

Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically:

- Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points for the expected log-likelihood.
- Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity.
- A fast combinatorial algorithm to implement the method of moments efficiently.

The necessary background on DPPs will be given in the talk. Joint work with Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT) and John Urschel (MIT).
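To make the likelihood being maximised concrete: for an L-ensemble DPP with kernel $L$, a subset $J$ of the ground set has probability $\det(L_J)/\det(I+L)$, so the log-likelihood of observed subsets $J_1,\ldots,J_N$ is $\sum_{i=1}^N \log\det(L_{J_i}) - N\log\det(I+L)$. The sketch below fits a single scalar parameter of a toy kernel family by grid search on this log-likelihood; the kernel family, the data and the search procedure are purely illustrative and are not the estimators or the method-of-moments algorithm analysed in the talk.

```python
import numpy as np

# Toy maximum-likelihood fit for an L-ensemble DPP with a one-parameter kernel
# family L(alpha) = alpha * S, where S is a fixed positive-definite similarity matrix.
def dpp_log_likelihood(alpha, S, subsets):
    L = alpha * S
    n = S.shape[0]
    _, logdet_full = np.linalg.slogdet(np.eye(n) + L)    # log det(I + L): normalising term
    ll = -len(subsets) * logdet_full
    for J in subsets:
        idx = np.asarray(J)
        _, logdet_J = np.linalg.slogdet(L[np.ix_(idx, idx)])   # log det(L_J)
        ll += logdet_J
    return ll

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 6
    A = rng.normal(size=(n, n))
    S = A @ A.T + n * np.eye(n)                # fixed positive-definite similarity matrix
    # "Observed" subsets (in practice these would be samples from the DPP).
    subsets = [[0, 2], [1, 3, 4], [5], [0, 1], [2, 4]]
    grid = np.linspace(0.01, 2.0, 200)
    lls = [dpp_log_likelihood(a, S, subsets) for a in grid]
    print("alpha_hat:", grid[int(np.argmax(lls))])
```

With realistic, high-dimensional kernel parametrisations this likelihood surface is nonconvex, which is why the geometry of the expected log-likelihood and moment-based alternatives matter in the talk.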
**10:30 to 11:00** Morning Coffee

**11:00 to 11:45** Shahar Mendelson (Technion - Israel Institute of Technology; Australian National University): *An optimal unrestricted learning procedure* (INI 1)

The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence, and the identity of a procedure that attains that optimal tradeoff, have been studied extensively over the years. In this talk I present some of the ideas used in the recent solution of this problem: a procedure $\hat{f}$ that attains the best possible accuracy/confidence tradeoff for a triplet $(F,X,Y)$ consisting of an arbitrary class $F$, an unknown distribution $X$ and an unknown target $Y$. Specifically, I explain why, under minimal assumptions on $(F,X,Y)$, there is a natural critical level $r^*$ that depends on the triplet and the sample size $N$, such that for any accuracy parameter $r > r^*$, one has
$$E\bigl((\hat{f}(X)-Y)^2 \mid D\bigr) \leq \inf_{f \in F} E(f(X)-Y)^2 + r^2$$
with probability at least $1-2\exp(-cN \min\{1, r^2/\sigma^2\})$, where $\sigma^2 = \inf_{f \in F} E(f(X)-Y)^2$.

**11:45 to 12:30** Garvesh Raskutti (University of Wisconsin-Madison): *Variable selection and classification with large-scale presence only data* (INI 1)

Co-author: Hyebin Song (University of Wisconsin-Madison)

In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the *PUlasso* algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. (A generic toy illustration of an MM step appears at the end of this listing.)

Related links: https://arxiv.org/abs/1711.08129 (arXiv paper)

**12:30 to 14:00** Lunch @ Churchill College
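The PUlasso algorithm itself is described in the linked paper; as a generic illustration of the majorization-minimization principle the 11:45 abstract refers to, the sketch below minimises $\sum_i |x - a_i|$ (whose minimiser is a median of the $a_i$) by repeatedly minimising a quadratic surrogate that lies above the objective and touches it at the current iterate. This is a standard textbook MM example, not the authors' algorithm.

```python
import numpy as np

# Toy MM illustration: minimise f(x) = sum_i |x - a_i| by majorizing each |x - a_i|
# at the current iterate x_k with the quadratic
#   (x - a_i)^2 / (2 |x_k - a_i|) + |x_k - a_i| / 2,
# which lies above |x - a_i| and touches it at x_k. Each MM step is then a
# weighted least-squares update, and the objective decreases at every iteration.
def mm_median(a, x0=0.0, n_iter=50, tol=1e-12):
    x = float(x0)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(x - a), tol)   # weights from the quadratic majorizer
        x_new = np.sum(w * a) / np.sum(w)          # minimiser of the surrogate
        if abs(x_new - x) < 1e-10:
            break
        x = x_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(loc=3.0, scale=2.0, size=1001)
    print("MM estimate:", mm_median(a), " numpy median:", np.median(a))
```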