Theoretical and algorithmic underpinnings of Big Data
Monday 15th January 2018 to Friday 19th January 2018
09:30 to 09:50  Registration  
09:50 to 10:00  Welcome from Christie Marr (INI Deputy Director)  
10:00 to 10:45 
Michael Jordan On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex
Many new theoretical challenges have arisen in the area of gradient-based
optimization for large-scale statistical data analysis, driven by the needs of
applications and the opportunities provided by new hardware and software platforms.
I discuss several related, recent results in this area: (1) a new framework
for understanding Nesterov acceleration, obtained by taking a continuous-time,
Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to
escape saddle points efficiently in nonconvex optimization, and (3) the
acceleration of Langevin diffusion.
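As a concrete reference point for the discussion of acceleration, the following is a minimal sketch of Nesterov's accelerated gradient method, the discrete scheme whose continuous-time limit motivates the Lagrangian/Hamiltonian perspective. The quadratic test problem and all parameter choices are illustrative, not taken from the talk.

```python
import numpy as np

def nesterov(grad, x0, L, T):
    """Nesterov's accelerated gradient method for an L-smooth convex function.

    grad: gradient oracle; x0: starting point; L: smoothness constant;
    T: number of iterations.
    """
    x, y = x0.copy(), x0.copy()
    t = 1.0
    for _ in range(T):
        x_next = y - grad(y) / L                         # gradient step from the extrapolated point
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x

# illustrative quadratic: f(x) = 0.5 * ||A x - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)       # largest singular value = smoothness constant
x_star = np.linalg.solve(A, b)
x_hat = nesterov(grad, np.zeros(2), L, 500)
```

On this small quadratic the iterates converge to the least-squares solution; the continuous-time view mentioned in the abstract interprets the momentum recursion as a discretisation of a second-order ODE.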

INI 1  
10:45 to 11:10  Morning Coffee  
11:10 to 11:55 
Rina Foygel Barber Robust inference with the knockoff filter
In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating a knockoff copy of each feature; these copies act as negative controls that test the selection algorithm's ability to distinguish true from false covariates, thereby controlling the false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, since we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL-divergence-type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candès and Richard Samworth.
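For readers unfamiliar with the mechanics, the selection step of the knockoff filter reduces to a data-dependent threshold on feature statistics $W_j$ (Barber and Candès, 2015). The sketch below shows only the knockoff+ threshold; the toy statistics vector is invented for illustration, and the construction of the knockoff copies themselves is omitted.

```python
import numpy as np

def knockoff_threshold(W, q):
    """Knockoff+ data-dependent threshold.

    W: feature statistics (W_j > 0 favours feature j over its knockoff copy).
    Returns the smallest t among the |W_j| such that
    (1 + #{j : W_j <= -t}) / max(#{j : W_j >= t}, 1) <= q.
    """
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

# toy statistics: five strong positives, five near-zero/negative values
W = np.array([5.0, 4.0, 3.0, 2.5, 2.0, -0.5, 0.7, -1.0, 0.3, -0.2])
tau = knockoff_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]   # features selected at target FDR level q
```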

INI 1  
11:55 to 12:40 
Jonas Peters The Hardness of Conditional Independence Testing
Testing for conditional independence between continuous random variables is generally considered a hard statistical problem. We formalize this statement and show that any test with correct size has no power against any alternative: it is thus impossible to obtain any guarantee if the relationships between the random variables can be arbitrarily complex. We then propose a practical test that achieves the correct size whenever the conditional expectations are smooth enough to be estimated from data. This is joint work with Rajen Shah.

INI 1  
12:40 to 14:10  Lunch @ Churchill College  
14:10 to 14:55 
Jianqing Fan Uniform perturbation analysis of eigenspaces and its applications to Community Detection, Ranking and Beyond
Co-authors: Emmanuel Abbe (Princeton University), Kaizheng Wang (Princeton University), Yiqiao Zhong (Princeton University)
Spectral methods have been widely used for a large class of challenging problems, ranging from top-K ranking via pairwise comparisons to community detection and factor analysis. Analyses of these spectral methods require sup-norm perturbation analysis of top eigenvectors, which allows us to uniformly approximate elements of the eigenvectors by linear functions of the observed random matrix that can be analyzed further. We first establish such an infinity-norm perturbation bound for top eigenvectors and then apply the idea to several challenging problems, including top-K ranking, community detection, $\mathbb{Z}_2$-synchronization and matrix completion. We show that the spectral methods are indeed optimal for these problems, and we illustrate them via simulations.
INI 1  
14:55 to 15:40 
Nicolai Meinshausen Instruments, Images, and Anchors
Co-authors: Dominik Rothenhäusler, Peter Bühlmann, Christina Heinze (ETH Zurich), and Jonas Peters (Univ of Copenhagen)
I will present some ongoing work on two related projects. First, in joint work with Jonas Peters, Dominik Rothenhäusler and Peter Bühlmann, I will discuss anchor regression, which puts both standard regression and causal regression into a common framework and allows one to interpolate continuously between the two. Anchor regression can be seen as a least-squares term with a causal, instrumental-variable-type penalty. The chosen penalty is shown to correspond to the expected future magnitude of interventions, and risk bounds can be derived in the high-dimensional setting. Second, in joint work with Christina Heinze, I will show an application of related ideas to a (non-causal) image classification setting.

INI 1  
15:40 to 16:10  Afternoon Tea  
16:10 to 16:55 
Gareth Roberts The Zig-Zag and super-efficient sampling for Bayesian analysis of big data
Standard MCMC methods can scale poorly to big-data settings due to the need to evaluate the likelihood at each iteration. A number of approximate MCMC algorithms use subsampling ideas to reduce this computational burden, but with the drawback that they no longer target the true posterior distribution. This talk will discuss a new family of Monte Carlo methods based upon a multidimensional version of the Zig-Zag process of Bierkens and Roberts (2016), a continuous-time piecewise-deterministic Markov process. While traditional MCMC methods are reversible by construction, the Zig-Zag process offers a flexible non-reversible alternative. Its dynamics correspond to a constant-velocity model, with the velocity of the process switching at events from a point process whose rate can be related to the invariant distribution of the process: to target a given posterior distribution, the rates are set using the gradient of the log-posterior. Unlike traditional MCMC, the Zig-Zag process can be simulated without discretisation error, and we give conditions for the process to be ergodic. Most importantly, I will discuss two generalisations with good scaling properties for big data: first, a subsampling version of the Zig-Zag process that is an example of an exact-approximate scheme; and second, a control-variate variant of the subsampling idea that reduces the variance of our unbiased estimator. Very recent ergodic theory will also be described.
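To make the "no discretisation error" point concrete, here is a minimal one-dimensional Zig-Zag sampler for a standard Gaussian target, where the event times of the switching point process can be inverted in closed form. This toy example is for illustration only; the talk concerns the multidimensional and subsampled versions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zigzag_gaussian(n_events, x0=0.0):
    """Exact 1D Zig-Zag sampler for a standard Gaussian target.

    With U(x) = x^2 / 2, the switching rate along the trajectory is
    lambda(t) = max(0, a + t) where a = theta * x, so the first event
    time can be inverted in closed form -- no discretisation error.
    Returns the switch-point skeleton (times, positions).
    """
    x, theta, t = x0, 1.0, 0.0
    times, points = [0.0], [x0]
    for _ in range(n_events):
        a = theta * x
        E = rng.exponential()
        tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * E)  # exact first event time
        x += theta * tau        # move at constant velocity until the event
        theta = -theta          # flip the velocity
        t += tau
        times.append(t)
        points.append(x)
    return np.array(times), np.array(points)

times, points = zigzag_gaussian(20000)
dt = np.diff(times)
# time average over the piecewise-linear path estimates E[X] = 0
mean_path = np.sum((points[:-1] + points[1:]) / 2 * dt) / times[-1]
```

Note that expectations are taken along the continuous trajectory (time averages over the linear segments), not over the skeleton points themselves.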

INI 1  
17:00 to 18:00  Welcome Wine Reception at INI 
09:00 to 09:45 
Andrea Montanari An Instability in Variational Methods for Learning Topic Models
Topic models are extremely useful for extracting latent degrees of freedom from large unlabeled datasets. Variational Bayes algorithms are the approach most commonly used by practitioners to learn topic models. Their appeal lies in the promise of reducing the problem of variational inference to an optimization problem. I will show that, even within an idealized Bayesian scenario, variational methods display an instability that can lead to misleading results.
[Based on joint work with Behrooz Ghorbani and Hamid Javadi]

INI 1  
09:45 to 10:30 
Pradeep Ravikumar The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities
Nonparametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments in parametric graphical models to propose a class of nonparametric multivariate densities with very attractive computational and statistical properties. Our approach relies on the simple function-space assumption that the conditional distribution of each variable given the other variables has a nonparametric exponential family form.
Joint work with Mladen Kolar, Arun Sai Suggala.

INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Yingying Fan RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Co-authors: Emre Demirkaya (University of Southern California), Gaorong Li (Beijing University of Technology), Jinchi Lv (University of Southern California)
Power and reproducibility are key to enabling refined scientific discoveries in contemporary big-data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations for the power and robustness of the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in the high-dimensional setting where the covariate distribution is characterized by a Gaussian graphical model. We establish that, under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. Moving away from this ideal case, we suggest a modified model-free knockoffs method, called graphical nonlinear knockoffs (RANK), to accommodate an unknown covariate distribution. We provide theoretical justification for the robustness of the modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure. Simulation results demonstrate that, compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure.
INI 1  
11:45 to 12:30 
Andreas Krause Learning and inference in probabilistic submodular models
I will present our work on inference and learning in discrete probabilistic models defined through submodular functions. These models generalize pairwise graphical models and determinantal point processes, express natural notions such as attraction and repulsion, and can capture richly parameterized, long-range, high-order dependencies. The key idea is to use sub- and supergradients of submodular functions and exploit their combinatorial structure to efficiently optimize variational upper and lower bounds on the partition function. This approach allows one to perform efficient approximate inference in any probabilistic model that factorizes into log-submodular and log-supermodular potentials of arbitrary order. Our approximation is exact at the mode for log-supermodular distributions, and we provide bounds on the approximation quality of the log-partition function in terms of the curvature of the function. I will also discuss how to learn log-supermodular distributions via bilevel optimisation. In particular, we show how to compute gradients of the variational posterior, which allows integrating these models into modern deep architectures.
This talk is primarily based on joint work with Josip Djolonga.

INI 1  
12:30 to 14:00  Lunch @ Churchill College  
14:00 to 14:45 
Quentin Berthet Optimal Link Prediction with Matrix Logistic Regression
We consider the problem of link prediction, based on partial observation of a large network and on covariates associated with its vertices. The generative model is formulated as matrix logistic regression, and its performance is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius-norm risk is established, and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, under a computational complexity assumption, it is shown that this rate cannot be attained by any algorithm computable in polynomial time. (Joint work with Nicolai Baldin)

INI 1  
14:45 to 15:30 
Rainer von Sachs Selecting Groups of Variables for Prediction Problems in Chemometrics: Recent Regularization Approaches
This presentation addresses the problem of selecting important, potentially overlapping groups of predictor variables
in linear models such that the resulting model strikes a balance between interpretability and prediction performance.
This is motivated by data from the field of chemometrics where, due to correlation between predictors from different
groups (i.e. variable group “overlap”), identifying groups during model estimation is particularly challenging.
In particular, we will highlight some issues of existing methods when they are applied to high dimensional data with
overlapping groups of variables. This will be demonstrated through comparison of their optimization criteria and
their performance on simulated data.
This is joint work in progress with Rebecca Marion, ISBA, Université catholique de Louvain, Belgium.

INI 1  
15:30 to 16:00  Afternoon Tea  
16:00 to 17:00  Lightning talks by junior researchers introducing posters  INI 1  
17:00 to 18:00  Poster Session 
09:00 to 09:45 
Alex d'Aspremont Sharpness, Restart and Compressed Sensing Performance
Joint work with Vincent Roulet (University of Washington) and Nicolas Boumal (Princeton University).
We show that several classical quantities controlling compressed sensing performance directly match parameters controlling algorithmic complexity. We first describe linearly convergent restart schemes for first-order methods under a sharpness assumption. The Łojasiewicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost generically; here, we show that sharpness directly controls the performance of restart schemes. For sparse recovery problems, sharpness at the optimum can be written as a condition number, given by the ratio between the true signal sparsity and the largest signal size that can be recovered by the observation matrix. Overall, this means that in compressed sensing problems, a single parameter directly controls both computational complexity and recovery performance.

INI 1  
09:45 to 10:30 
Lorenzo Rosasco Optimal and efficient learning with random features
Random features approaches correspond to one-hidden-layer neural networks with random hidden units, and can be seen as approximate kernel methods. We study the statistical and computational properties of random features within a ridge regression scheme. We prove for the first time that a number of random features much smaller than the number of data points suffices for optimal statistical error, with a corresponding huge computational gain. We further analyze faster rates under refined conditions and the potential benefit of random features chosen according to adaptive sampling schemes.
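A minimal illustration of the setup studied in the talk: random Fourier features (a one-hidden-layer network with random units approximating a Gaussian kernel) plugged into ridge regression, with the number of features $M$ far smaller than the number of data points $n$. The toy data and all hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, W, b):
    """Map X to M random Fourier features approximating a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

# toy regression problem: y = sin(3x) + noise
n, M, lam = 500, 100, 1e-3
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

W = 3.0 * rng.standard_normal((1, M))          # random frequencies
b = rng.uniform(0, 2 * np.pi, size=M)          # random phases
Z = rff(X, W, b)                               # n x M feature matrix, M << n
alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ y)  # ridge solve, O(n M^2)

X_test = np.linspace(-1, 1, 50)[:, None]
pred = rff(X_test, W, b) @ alpha
mse = np.mean((pred - np.sin(3 * X_test[:, 0])) ** 2)
```

The ridge solve costs O(nM^2) rather than the O(n^3) of exact kernel ridge regression, which is the computational gain the abstract refers to.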

INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Rebecca Willett Nonlinear Models for Matrix Completion
The past decade of research on matrix completion has shown it is possible to leverage linear dependencies to impute missing values in a low-rank matrix. However, the corresponding assumption that the data lie in or near a low-dimensional linear subspace is not always met in practice. Extending matrix completion theory and algorithms to exploit low-dimensional nonlinear structure in data will allow missing data imputation in a far richer class of problems. In this talk, I will describe several models of low-dimensional nonlinear structure and how these models can be used for matrix completion. In particular, we will explore matrix completion in the context of three different nonlinear models: single index models, in which a latent subspace model is transformed by a nonlinear mapping; unions of subspaces, in which data points lie in or near one of several subspaces; and nonlinear algebraic varieties, a polynomial generalization of classical linear subspaces. In these settings, we will explore novel and efficient algorithms for imputing missing values and new bounds on the amount of missing data that can be accurately imputed. The proposed algorithms are able to recover synthetically generated data up to predicted sample complexity bounds and outperform standard low-rank matrix completion in experiments with real recommender system and motion capture data.

INI 1  
11:45 to 12:30 
Carola-Bibiane Schönlieb Learning from mistakes: learning optimally sparse image filters by quotient minimisation
Co-authors: Martin Benning (University of Cambridge), Guy Gilboa (Technion, Haifa), Joana Grah (University of Cambridge)
Learning approaches have recently become very popular in the field of inverse problems. A large variety of methods has been established in recent years, ranging from bilevel learning to high-dimensional machine learning techniques. Most learning approaches, however, only aim at fitting parametrised models to favourable training data whilst ignoring misfit training data completely. In this talk, we follow up on the idea of learning parametrised regularisation functions by quotient minimisation. We consider one- and higher-dimensional filter functions to be learned and allow for fit- and misfit-training data consisting of multiple functions. We first present results resembling the behaviour of well-established derivative-based sparse regularisers like total variation or higher-order total variation in one dimension. We then introduce novel families of non-derivative-based regularisers, accomplished by learning favourable scales and geometric properties while at the same time avoiding unfavourable ones.
INI 1  
12:30 to 14:00  Lunch @ Churchill College  
14:00 to 17:00  Free Afternoon  
19:30 to 22:00  Formal Dinner at St John's College (by invitation only) 
09:00 to 09:45 
Yi Yu Optimal Covariance Change Point Detection in High Dimension
Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University)
In this paper, we study the covariance change point detection problem in high dimensions. Specifically, we assume that the time series $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$ are independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and change only at certain time points. Our generic model setting allows $p$ to grow with $n$, and we place no additional structural assumptions on the covariance matrices. We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish consistency results under suitable conditions. To improve detection performance in high dimensions, we propose Wild Binary Segmentation through Independent Projection (WBSIP) and show that it can optimally estimate the locations of the change points. Our analysis also reveals a phase transition effect based on our generic model assumption; to the best of our knowledge, this type of result has not been established elsewhere in the change point detection literature.
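As background, the classical binary segmentation recursion that the talk builds on can be sketched for a univariate mean change (the covariance and independent-projection machinery of WBSIP is beyond a short example). The CUSUM form is standard; the threshold and toy data below are illustrative choices.

```python
import numpy as np

def cusum(x, s, e):
    """CUSUM statistics of x over the segment (s, e]; the peak estimates a change point."""
    n = e - s
    t = np.arange(1, n)
    cum = np.cumsum(x[s:e])
    left = cum[:-1] / t                   # mean of the first t points
    right = (cum[-1] - cum[:-1]) / (n - t)  # mean of the remaining points
    return np.sqrt(t * (n - t) / n) * np.abs(left - right)

def binary_segmentation(x, thresh, s=0, e=None, out=None):
    """Recursive binary segmentation: split at the CUSUM peak while it exceeds thresh."""
    if out is None:
        out = []
    if e is None:
        e = len(x)
    if e - s < 3:
        return out
    stat = cusum(x, s, e)
    k = int(np.argmax(stat))
    if stat[k] > thresh:
        cp = s + k + 1
        out.append(cp)
        binary_segmentation(x, thresh, s, cp, out)
        binary_segmentation(x, thresh, cp, e, out)
    return sorted(out)

# toy series with a single mean change at index 100
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
cps = binary_segmentation(x, thresh=4.0)
```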
INI 1  
09:45 to 10:30 
Alexandre Tsybakov Adaptive estimation of functionals under sparsity
Adaptive estimation of functionals in the sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the $L_2$-norm of the target vector and of the variance of the noise. In the first problem, ignorance of the noise level causes changes in the rates. Moreover, the form of the noise distribution also influences the optimal rate: for example, the rates of estimating the variance differ depending on whether the noise is Gaussian or sub-Gaussian without precise knowledge of the distribution. Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, regardless of how fast the polynomial decay is. Joint work with O. Collier and L. Comminges.

INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Claire Boyer On the gap between local recovery guarantees in structured compressed sensing and oracle estimates
This is an ongoing work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby).
First, we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees ensuring reconstruction of a structured sparse signal of interest while imposing structure in the acquisition (no Gaussian measurements here...).
We will study how far those CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements.
These results give an insight to design new optimal sampling strategies when realistic physical constraints are imposed in the acquisition.

INI 1  
11:45 to 12:30 
Ioannis Kontoyiannis Small Big Data: Temporal structure in discrete time series
The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning. Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modeling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets. We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC. We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including online prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al's VLMC, Ron et al's PST, and Raftery's MTD) will be discussed.
INI 1  
12:30 to 14:00  Lunch @ Churchill College  
14:00 to 14:45 
Judith Rousseau Asymptotics for ABC algorithms
Approximate Bayesian Computation (ABC) algorithms are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution, one possibility is to sample new data from the model and check whether these new data are close, in some sense, to the true data. The output of the algorithm thus depends on how we define this notion of closeness, which is based on a choice of summary statistics and on a threshold. In this work we study the behaviour of the algorithm under the assumption that the summary statistics concentrate on some deterministic quantity, and we characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated, where we show that surprising asymptotic behaviour can appear.
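The basic rejection-ABC scheme described above can be sketched as follows; the Gaussian toy model, summary statistic and threshold are illustrative choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(y_obs, prior_sample, simulate, summary, eps, n_draws):
    """Basic rejection ABC: keep parameters whose simulated summary
    statistic lands within eps of the observed summary statistic."""
    s_obs = summary(y_obs)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()          # draw from the prior
        s = summary(simulate(theta))    # simulate data and summarise
        if abs(s - s_obs) <= eps:       # closeness check with threshold eps
            accepted.append(theta)
    return np.array(accepted)

# toy example: infer a Gaussian mean with a uniform prior, summary = sample mean
n = 100
y_obs = rng.normal(2.0, 1.0, n)
post = abc_rejection(
    y_obs,
    prior_sample=lambda: rng.uniform(-5, 5),
    simulate=lambda th: rng.normal(th, 1.0, n),
    summary=np.mean,
    eps=0.1,
    n_draws=20000,
)
```

Both the threshold `eps` and the choice of summary statistic govern how well the accepted draws approximate the true posterior, which is exactly the asymptotic trade-off the talk analyses.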

INI 1  
14:45 to 15:30 
Andreas Elsener Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators
Co-author: Sara van de Geer (ETH Zurich)
Nonconvex loss functions are used in several areas of statistics and machine learning, and have several appealing properties, as in the case of robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss, where the penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms other than the $\ell_1$-norm.
INI 1  
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
Rajen Shah The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and, under mild conditions, has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close-pairs-of-points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for $1

INI 1 
09:00 to 09:45 
Alessandro Rudi FALKON: fast and optimal kernel method for large-scale machine learning
Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional-analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large-scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large-scale datasets shows that, even on a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures.
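To illustrate the subsampling ingredient, here is a plain Nyström kernel ridge regression sketch with $m$ random centers; actual FALKON additionally uses a preconditioned iterative solver, which this minimal example omits. All data and parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr(X, y, m, lam):
    """Nystrom kernel ridge regression with m random centers.

    Solves the m-dimensional system
        (K_nm^T K_nm + lam * n * K_mm) a = K_nm^T y,
    an O(n m^2) surrogate for the O(n^3) full kernel ridge regression --
    the subsampling idea that FALKON accelerates further.
    """
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)
    C = X[idx]                                  # random Nystrom centers
    K_nm = gaussian_kernel(X, C)
    K_mm = gaussian_kernel(C, C)
    a = np.linalg.solve(K_nm.T @ K_nm + lam * n * K_mm + 1e-10 * np.eye(m),
                        K_nm.T @ y)
    return C, a

# toy regression problem: y = sin(4x) + noise
X = rng.uniform(-1, 1, size=(1000, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(1000)
C, a = nystrom_krr(X, y, m=50, lam=1e-6)
pred = gaussian_kernel(X, C) @ a
mse = np.mean((pred - np.sin(4 * X[:, 0])) ** 2)
```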

INI 1  
09:45 to 10:30 
Philippe Rigollet Learning determinantal point processes
Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT)
Determinantal Point Processes (DPPs) are a family of probabilistic models that exhibit repulsive behavior and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically:
- Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function, we establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points of the expected log-likelihood.
- Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity.
- A fast combinatorial algorithm to implement the method of moments efficiently.
The necessary background on DPPs will be given in the talk.
INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
Shahar Mendelson An optimal unrestricted learning procedure
The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence, and the identity of a procedure that attains that optimal tradeoff, have been studied extensively over the years. In this talk I present some of the ideas used in the recent solution of this problem: a procedure $\hat{f}$ that attains the best possible accuracy/confidence tradeoff for a triplet $(F,X,Y)$ consisting of an arbitrary class $F$, an unknown distribution $X$ and an unknown target $Y$. Specifically, I explain why, under minimal assumptions on $(F,X,Y)$, there is a natural critical level $r^*$ depending on the triplet and the sample size $N$, such that for any accuracy parameter $r > r^*$, one has $E((\hat{f}(X)-Y)^2 \mid D) \leq \inf_{f \in F} E(f(X)-Y)^2 + r^2$ with probability at least $1 - 2\exp(-cN \min\{1, r^2/\sigma^2\})$, where $\sigma^2 = \inf_{f \in F} E(f(X)-Y)^2$.
INI 1  
11:45 to 12:30 
Garvesh Raskutti Variable selection and classification with largescale presence only data
Co-author: Hyebin Song (University of Wisconsin-Madison)
In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. On the theoretical side, we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate-$p$ settings in terms of classification performance. Finally, we demonstrate that PUlasso performs well on a biochemistry example.

INI 1  
12:30 to 14:00  Lunch @ Churchill College 