# Seminars (STSW01)

Videos and presentation materials from other INI events are also available.

Search seminar archive

Event When Speaker Title Presentation Material
STSW01 15th January 2018
10:00 to 10:45
Michael Jordan On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex
Many new theoretical challenges have arisen in the area of gradient-based optimization for large-scale statistical data analysis, driven by the needs of applications and the opportunities provided by new hardware and software platforms. I discuss several related, recent results in this area: (1) a new framework for understanding Nesterov acceleration, obtained by taking a continuous-time, Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to escape saddle points efficiently in nonconvex optimization, and (3) the acceleration of Langevin diffusion.
STSW01 15th January 2018
11:10 to 11:55
Rina Foygel Barber Robust inference with the knockoff filter
In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating knockoff copies of each feature, testing the model selection algorithm for its ability to distinguish true from false covariates to control the false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, as we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL divergence type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candès and Richard Samworth.
STSW01 15th January 2018
11:55 to 12:40
Jonas Peters The Hardness of Conditional Independence Testing
Testing for conditional independence between continuous random variables is considered to be a hard statistical problem. We provide a formalization of this result and show that a test with correct size does not have power against any alternative. It is thus impossible to obtain any guarantee if the relationships between the random variables can be arbitrarily complex. We propose a practical test that achieves the correct size if the conditional expectations are smooth enough such that they can be estimated from data. This is joint work with Rajen Shah.
STSW01 15th January 2018
14:10 to 14:55
Jianqing Fan Uniform pertubation analysis of eigenspaces and its applications to Community Detection, Ranking and Beyond
Co-authors: Emmanuel Abbe (Princeton University), Kaizheng Wang (Princeton University), Yiqiao Zhong (Princeton University)

Spectral methods have been widely used for a large class of challenging problems, ranging from top-K ranking via pairwise comparisons, community detection, factor analysis, among others. Analyses of these spectral methods require super-norm perturbation analysis of top eigenvectors. This allows us to UNIFORMLY approximate elements in eigenvectors by linear functions of the observed random matrix that can be analyzed further. We first establish such an infinity-norm pertubation bound for top eigenvectors and apply the idea to several challenging problems such as top-K ranking, community detections, Z_2-syncronization and matrix completion. We show that the spectral methods are indeed optimal for these problems. We illustrate these methods via simulations.
STSW01 15th January 2018
14:55 to 15:40
Nicolai Meinshausen Instruments, Images, and Anchors
Co-authors: Dominik Rothenhäusler, Peter Bühlmann, Christina Heinze (ETH Zurich), and Jonas Peters (Univ of Copenhagen) I will present some ongoing work on two related projects. Firstly, some joint work with Jonas Peters, Dominik Rothenhaeulser and Peter Buhlmann on anchor regression, which is putting both standard regression and causal regression into a common framework and allows to interpolate between both in a continuous fashion. The anchor regression can be seen as a least-squares term with a causal, instrumental-variable-type, penalty. The chosen penalty is shown to correspond to the expected future magnitude of interventions and some risk bounds can be derived for the high-dimensional setting. Second, in joint work with Christina Heinze, I will show an application of related ideas to a (non-causal) setting in image classification.
STSW01 15th January 2018
16:10 to 16:55
Gareth Roberts The zig-zag and super-efficient sampling for Bayesian analysis of big data
Standard MCMC methods can scale poorly to big data settings due to the need to evaluate the likelihood at each iteration. There have been a number of approximate MCMC algorithms that use sub-sampling ideas to reduce this computational burden, but with the drawback that these algorithms no longer target the true posterior distribution. The talk will discuss a new family of Monte Carlo methods based upon a multi-dimensional version of the Zig-Zag process of (Bierkens, Roberts, 2016), a continuous time piecewise deterministic Markov process. While traditional MCMC methods are reversible by construction the Zig-Zag process offers a flexible non-reversible alternative. The dynamics of the Zig-Zag process correspond to a constant velocity model, with the velocity of the process switching at events from a point process. The rate of this point process can be related to the invariant distribution of the process. If we wish to target a given posterior distribution, then rates need to be set equal to the gradient of the log of the posterior. Unlike traditional MCMC, Zig-Zag process can be simulated without discretisation error, and give conditions for the process to be ergodic. Most importantly, I will discuss two generalisations which have good scaling properties for big data: firstly a sub-sampling version of the Zig-Zag process that is an example of an exact approximate scheme; and secondly a control-variate variant of the sub-sampling idea to reduce the variance of our unbiased estimator. Very recent ergodic theory will also be described.
STSW01 16th January 2018
09:00 to 09:45
Andrea Montanari An Instability in Variational Methods for Learning Topic Models
Topic models are extremely useful to extract latent degrees of freedom form large unlabeled datasets. Variational Bayes algorithms are the approach most commonly used by practitioners to learn topic models. Their appeal lies in the promise of reducing the problem of variational inference to an optimization problem. I will show that, even within an idealized Bayesian scenario, variational methods display an instability that can lead to misleading results. [Based on joint work with Behroz Ghorbani and Hamid Javadi]
STSW01 16th January 2018
09:45 to 10:30
Pradeep Ravikumar The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities
Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments in parametric graphical models to propose a class of non-parametric multivariate densities which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. Joint work with Mladen Kolar, Arun Sai Suggala.
STSW01 16th January 2018
11:00 to 11:45
Yingying Fan RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Co-authors: Emre Demirkaya (University of Southern Califnornia), Gaorong Li (Beijing University of Technology), Jinchi Lv (University of Southern Califnornia)

Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness for the model- free knockoffs procedure introduced recently in Cand`es, Fan, Janson and Lv (2016) in high-dimensional setting when the covariate distribution is characterized by Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as sample size goes to infinity. When moving away from the ideal case, we suggest the modified model-free knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptoti cally controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power for the knock- offs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure.

STSW01 16th January 2018
11:45 to 12:30
Andreas Krause Learning and inference in probabilistic submodular models
I will present our work on inference and learning in discrete probabilistic models defined through submodular functions. These generalize pairwise graphical models and determinantal point processes, express natural notions such as attractiveness and repulsion and allow to capture richly parameterized, long-range, high-order dependencies. The key idea is to use sub- and supergradients of submodular functions, and exploit their combinatorial structure to efficiently optimize variational upper and lower bounds on the partition function. This approach allows to perform efficient approximate inference in any probabilistic model that factorizes into log-submodular and log-supermodular potentials of arbitrary order. Our approximation is exact at the mode for log-supermodular distributions, and we provide bounds on the approximation quality of the log-partition function with respect to the curvature of the function. I will also discuss how to learn log-supermodular distributions via bi-level optimisation. In particular, we show how to compute gradients of the variational posterior, which allows integrating the models into modern deep architectures. This talk is primarily based on joint work with Josip Djolonga
STSW01 16th January 2018
14:00 to 14:45
Quentin Berthet Optimal Link Prediction with Matrix Logistic Regression
We consider the problem of link prediction, based on partial observation of a large network and on covariates associated to its vertices. The generative model is formulated as matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under structural assumption. The minimax rate for the Frobenius norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any algorithm computable in polynomial time, under a computational complexity assumption. (Joint work with Nicolai Baldin)
STSW01 16th January 2018
14:45 to 15:30
Rainer von Sachs Selecting Groups of Variables for Prediction Problems in Chemometrics: Recent Regularization Approaches
This presentation addresses the problem of selecting important, potentially overlapping groups of predictor variables in linear models such that the resulting model satisfies a balance between interpretability and prediction performance. This is motivated by data from the field of chemometrics where, due to correlation between predictors from different groups (i.e. variable group “overlap”), identifying groups during model estimation is particularly challenging. In particular, we will highlight some issues of existing methods when they are applied to high dimensional data with overlapping groups of variables. This will be demonstrated through comparison of their optimization criteria and their performance on simulated data. This is joint work in progress with Rebecca Marion, ISBA, Université catholique de Louvain, Belgium.
STSW01 17th January 2018
09:00 to 09:45
Alex d'Aspremont Sharpness, Restart and Compressed Sensing Performance
Joint work with Vincent Roulet (University of Washington) and Nicolas Boumal (Princeton University). We show that several classical quantities controlling compressed sensing performance directly match parameters controlling algorithmic complexity. We first describe linearly convergent restart schemes on first-order methods using a sharpness assumption. The Lojasievicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost generically. Here, we show that sharpness directly controls the performance of restart schemes. For sparse recovery problems, sharpness at the optimum can be written as a condition number, given by the ratio between true signal sparsity and the largest signal size that can be recovered by the observation matrix. Overall, this means that in compressed sensing problems, a single parameter directly controls both computational complexity and recovery performance.
STSW01 17th January 2018
09:45 to 10:30
Lorenzo Rosasco Optimal and efficient learning with random features
Random features approaches correspond to one hidden layer neural networks with random hidden units, and can be seen as approximate kernel methods. We study the statistical and computational properties of random features within a ridge regression scheme. We prove for the first time that a number of random features much smaller than the number of data points suffices for optimal statistical error, with a corresponding huge computational gain. We further analyze faster rates under refined conditions and the potential benefit of random features chosen according to adaptive sampling schemes.
STSW01 17th January 2018
11:00 to 11:45
Rebecca Willett Nonlinear Models for Matrix Completion
The past decade of research on matrix completion has shown it is possible to leverage linear dependencies to impute missing values in a low-rank matrix. However, the corresponding assumption that the data lies in or near a low-dimensional linear subspace is not always met in practice. Extending matrix completion theory and algorithms to exploit low-dimensional nonlinear structure in data will allow missing data imputation in a far richer class of problems. In this talk, I will describe several models of low-dimensional nonlinear structure and how these models can be used for matrix completion. In particular, we will explore matrix completion in the context of three different nonlinear models: single index models, in which a latent subspace model is transformed by a nonlinear mapping; unions of subspaces, in which data points lie in or near one of several subspaces; and nonlinear algebraic varieties, a polynomial generalization of classical linear subspaces. In these settings, we will explore novel and efficient algorithms for imputing missing values and new bounds on the amount of missing data that can be accurately imputed. The proposed algorithms are able to recover synthetically generated data up to predicted sample complexity bounds and outperform standard low-rank matrix completion in experiments with real recommender system and motion capture data.
STSW01 17th January 2018
11:45 to 12:30
Carola-Bibiane Schönlieb Learning from mistakes - learning optimally sparse image filters by quotient minimisation
Co-authors: Martin Benning (University of Cambridge), Guy Gilboa (Technion, Haifa), Joana Grah (University of Cambridge)

Learning approaches have recently become very popular in the field of inverse problems. A large variety of methods has been established in recent years, ranging from bi-level learning to high-dimensional machine learning techniques. Most learning approaches, however, only aim at fitting parametrised models to favourable training data whilst ig- noring misfit training data completely. In this talk, we fol- low up on the idea of learning parametrised regularisation functions by quotient minimisation. We consider one- and higher-dimensional filter functions to be learned and allow for fit- and misfit-training data consisting of multiple func- tions. We first present results resembling behaviour of well- established derivative-based sparse regularisers like total variation or higher-order total variation in one-dimension. Then, we introduce novel families of non-derivative-based regularisers. This is accomplished by learning favourable scales and geometric properties while at the same time avoiding unfavourable ones.
STSW01 18th January 2018
09:00 to 09:45
Yi Yu Optimal Covariance Change Point Detection in High Dimension
Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University)

In this paper, we study covariance change point detection problem in high dimension. Specifically, we assume that the time series  $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$ are independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and only change at certain time points.  Our generic model setting allows $p$ grows with $n$ and we do not place any additional structural assumptions on the covariance matrices.  We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish the consistency results under suitable conditions.  To improve the detection performance in high dimension, we propose   Wild Binary Segmentation through Independent Projection (WBSIP).  We show that WBSIP can optimally estimate the locations of the change points.  Our analysis also reveals a phase transition effect based on our generic model assumption and to the best of our knowledge, this type of results have not been established elsewhere in the change point detection literature.
STSW01 18th January 2018
09:45 to 10:30
Alexandre Tsybakov Adaptive estimation of functionals under sparsity
Adaptive estimation of functionals in sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the L_2-norm of the target vector and of the variance of the noise. In the first problem, the ignorance of the noise level causes changes in the rates. Moreover, the form of the noise distribution also infuences the optimal rate. For example, the rates of estimating the variance differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution. Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, independently on how fast is the polynomial decay. Joint work with O.Collier and L.Comminges.
STSW01 18th January 2018
11:00 to 11:45
Claire Boyer On the gap between local recovery guarantees in structured compressed sensing and oracle estimates
This is an on-going work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby). First we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees to ensure reconstruction of a structured sparse signal of interest while imposing structure in the acquisition (no Gaussian measurements here...). We will study how far those CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements. These results give an insight to design new optimal sampling strategies when realistic physical constraints are imposed in the acquisition.
STSW01 18th January 2018
11:45 to 12:30
Ioannis Kontoyiannis Small Big Data: Temporal structure in discrete time series
The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning. Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modeling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets.

We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC.

We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including on-line prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al’s VLMC, Ron et al’s PST, and Raftery’s MTD) will be discussed.
STSW01 18th January 2018
14:00 to 14:45
Judith Rousseau Asymptotics for ABC algorithms
Approximate Bayesoan Computation algorithms (ABC) are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution a possiblity is to sample new data from the model and check is these new data are close in some sense to the true data. The output of this algorithms thus depends on how we define the notion of closeness, which is based on a choice of summary statistics and on a threshold. Inthis work we study the behaviour of the algorithm under the assumption that the summary statistics are concentrating on some deterministic quantity and characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated where we show that surprising asymptotic behaviour appears.
STSW01 18th January 2018
14:45 to 15:30
Andreas Elsener Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators
Co-author: Sara van de Geer (ETH Zurich)

Nonconvex loss functions are used in several areas of statistics and machine learning. They have several appealing properties as in the case of robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss. The penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms different from the l1-norm.
STSW01 18th January 2018
16:00 to 16:45
Rajen Shah The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close pairs of points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for $1 STSW01 19th January 2018 09:00 to 09:45 Alessandro Rudi Falkon: fast and optimal kernel method for large scale machine learning Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that allows to efficiently process millions of points. FALKON is derived combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n sqrt{n}) time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state of the art solutions, which exploit parallel/distributed architectures. STSW01 19th January 2018 09:45 to 10:30 Philippe Rigollet Learning determinantal point processes Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT) Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically - Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points for the expected log-likelihood. - Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity. - A fast combinatorial algorithm to implement the method of moments efficiently. The necessary background on DPPs will be given in the talk. Joint work with Victor-Emmanuel Brunel (M.I.T), Ankur Moitra (M.I.T) and John Urschel (M.I.T). STSW01 19th January 2018 11:00 to 11:45 Shahar Mendelson An optimal unrestricted learning procedure The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence and the identity of a procedure that attains that optimal tradeoff have been studied extensively over the years. In this talk I present some of ideas used in the recent solution of this problem: a procedure$\hat{f}$that attains the best possible accuracy/confidence tradeoff for a triplet$(F,X,Y)$consisting of an arbitrary class$F$, an unknown distribution$X$and an unknown target$Y$. Specifically, I explain why under minimal assumptions on$(F,X,Y)$, there is a natural critical level$r^*$that depends on the triplet and the sample size$N$, such that for any accuracy parameter$r>r^*$, one has$E((\hat{f}(X)-Y)^2|D) \leq \inf_{f \in F} E(f(X)-Y)^2+r^2$with probability at least$1-2\exp(-cN \min\{1,r^2/\sigma^2\})$, where$\sigma^2=\inf_{f \in F} E(f(X)-Y)^2$. STSW01 19th January 2018 11:45 to 12:30 Garvesh Raskutti Variable selection and classification with large-scale presence only data Co-author: Hyebin Song (University of Wisconsin-Madison) In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses where the number of covariates$p$is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm involves using the majorization-minimization (MM) framework which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of$\frac{s \log p}{n}$, where$s$is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm out-performs state-of-the-art algorithms in the moderate$p\$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.