# Seminars (STSW04)

Videos and presentation materials from other INI events are also available.

Search seminar archive

Event When Speaker Title Presentation Material
STSW04 25th June 2018
11:00 to 11:45
Iain Johnstone Eigenstructure in high dimensional random effects models
The eigenstructure of i.i.d. samples from high dimensional data is known to show phenomena unexpected from experience with low dimensions -- eigenvalue spreading, bias and eigenvector inconsistency. Motivated by some problems in quantitative genetics, we now consider high dimensional data with more than one level of variation, and more specifically the spectral properties of estimators of the various components of variance matrices. We briefly describe bulk and edge eigenvalue distributions in this setting, and then focus more attention on properties and estimation of 'spiked' models. This is joint work with Zhou Fan and Yi Sun.
STSW04 25th June 2018
11:45 to 12:30
Philippe Rigollet Uncoupled isotonic regression via minimum Wasserstein deconvolution
Isotonic regression is a standard problem in shape constrained estimation where the goal is to estimate an unknown nondecreasing regression function $f$ from independent pairs $(x_i,y_i)$ where $\E[y_i]=f(x_i), i=1, \ldots n$. While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given uncoupled $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$. In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moments conditions on $y_i$ together with an efficient algorithm. Both upper and lower bounds are articulated around moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution. [Joint work with Jonathan Weed (MIT)]
STSW04 25th June 2018
14:00 to 14:45
Matthew Stephens On applications of Empirical Bayes approaches to the Normal Means problem
The normal means problem is very simple: given normally-distributed observations with known variances and unknown means, estimate the means. That is, given X_j \sim N(\theta_j, \sigma_j^2, estimate \theta_j. A key idea is that one can do better than the maximum likelihood estimates, \hat{\theta}_j= \X_j, in particular by use of appropriate "shrinkage" estimators. One attractive way to perform shrinkage estimation in practice is to use Empirical Bayes methods. That is, to assume that \theta_j are independent and identically distributed from some distribution g that is to be estimated from the data. Then, given such an estimate \hat{g}, the posterior distributions of \theta_j can be computed to perform inference. We call this the "Empirical Bayes Normal Means" (EBNM) problem.

Despite its simplicity, solving the EBNM problem has a wide range of practical applications. Here we present some flexible non-parametric approaches we have recently developed for solving the EBNM problem, and describe their application to several different settings: false discovery rate (FDR) estimation, non-parametric smoothing, and sparse matrix factorization problems (ie sparse factor analysis and sparse principal components analysis).

STSW04 25th June 2018
14:45 to 15:30
Jana Jankova Asymptotic Inference for Eigenstructure of Large Covariance Matrices
A vast number of methods have been proposed in literature for point estimation of eigenstructure of covariance matrices in high-dimensional settings. In this work, we study uncertainty quantification and propose methodology for inference and hypothesis testing for individual loadings of the covariance matrix. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. Our results provide theoretical guarantees on asymptotic normality of the new estimators and may be used for valid hypothesis testing and variable selection. These results are achieved under a sparsity condition relating the number of non-zero loadings, sample size, dimensionality of the covariance matrix and spectrum of the covariance matrix. This talk is based on joint work with Sara van de Geer.
STSW04 25th June 2018
16:00 to 16:45
Flori Bunea A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations. The model links all cell probabilities, collected in a p x n matrix, via the assumption that this matrix can be factorized as the product of two non-negative matrices A and W. Topic models have been originally developed in text mining, when one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in this talk. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to K-dimensional canonical basis vectors. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic set-up when K is unknown. In this talk, we take a different view on anchor word estimation, and on the estimation of the word-topic matrix A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, p, K and document length N, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We illustrate, via simulations, that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
STSW04 26th June 2018
09:00 to 09:45
Guy Bresler Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure
The prototypical high-dimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit a curious phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these gaps by reducing from conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the applicability of this approach. In this work we introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. These include tight lower bounds for Planted Independent Set, Planted Dense Subgraph, Sparse Spiked Wigner, Sparse PCA, as well as for new models that we introduce. Joint work with Matthew Brennan and Wasim Huleihel.
STSW04 26th June 2018
09:45 to 10:30
Chao Gao Reduced Isotonic Regression
Consider an $n$-dimensional vector $X$ with mean $\theta^*$. In this talk, we consider $\theta^*$ that is both nondecreasing and has a piecewise constant structure. We establish the exact minimax rate of estimating such monotone functions, and thus give a non-trivial answer to an open problem in the shape-constrained analysis literature. The minimax rate involves an interesting iterated logarithmic dependence on the dimension. We then develop a penalized least-squares procedure for estimating $\theta^*$ adaptively. This estimator is shown to achieve the derived minimax rate without the knowledge of the number of pieces in linear time. We further allow the model to be misspecified and derive oracle inequalities with the optimal rates for the proposed estimator. This is a joint work with Fang Han and Cun-hui Zhang.
STSW04 26th June 2018
11:00 to 11:45
Ryan Tibshirani Dykstra’s Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions
We study connections between Dykstra’s algorithm for projecting onto an intersection of convex sets, the augmented Lagrangian method of multipliers or ADMM, and block coordinate descent. We prove that coordinate descent for a regularized regression problem, in which the penalty is a separable sum of support functions, is exactly equivalent to Dykstra’s algorithm applied to the dual problem. ADMM on the dual problem is also seen to be equivalent, in the special case of two sets, with one being a linear subspace. These connections, aside from being interesting in their own right, suggest new ways of analyzing and extending coordinate descent. For example, from existing convergence theory on Dykstra’s algorithm over polyhedra, we discern that coordinate descent for the lasso problem converges at an (asymptotically) linear rate. We also develop two parallel versions of coordinate descent, based on the Dykstra and ADMM connections. Finally, we discuss the implications of this work for backfitting in additive models.
STSW04 26th June 2018
11:45 to 12:30
Peter Bickel Two network scale challenges:Constructing and fitting hierarchical block models and fitting large block models using the mean field method
Work with S.Bhattacharyya,T.Li,E.Levina,S.Mukherjee,P.Sarkar Networks are a complex type of structure presenting itself in many applications . They are usually represented by a graph ,with possibly weighted edges plus additional covariates (such as directions).Block models have been studied for some time as basic approximations to ergodic stationary probability models for single graphs.A huge number of fitting methods have been developed for these models some of which we will touch on. The mean field method in which an increasing number of parameters must be fitted is used not only for multiple membership block models but also in applications such as LDA.if the graph is too large poor behaviour of the method can be seen.. We have developed what we call "patch methods " for fitting which help both computationally and inferentially in such situations bur much further analysis is needed. It is intuitively clear but mathematically unclear how knowledge of the model having nested scales helps in fitting large scale as opposed to small scale parameters.We will discuss this issue through an example,
STSW04 26th June 2018
14:00 to 14:45
Marten Herman Wegkamp Adaptive estimation of the rank of the regression coefficient matrix
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion does not require estimation of the unknown variance of the noise, nor depends on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure, that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. An extensive simulation study that confirms our theoretical findings in both low and high dimensional settings. This is joint work with Xin Bing.
STSW04 26th June 2018
14:45 to 15:30
Maryam Fazel Competitive Online Algorithms for Budgeted Allocation with Application to Online Experiment Design
We consider an online resource allocation problem, where the goal is to maximize a function of a positive semidefinite (PSD) matrix with a scalar budget constraint. The problem data arrives online, and the algorithm needs to make an irrevocable decision at each step. Of particular interest are classic experiment design problems in the online setting, with the algorithm deciding whether to allocate budget to each experiment as new experiments become available sequentially. We analyze two classes of primal-dual algorithms and provide bounds on their competitive ratios. Our analysis relies on a smooth surrogate of the objective function that needs to satisfy a new diminishing-returns property. We then formulate a convex optimization problem to directly optimize this competitive ratio bound; this design problem can be solved offline before the data start arriving. We give examples of computing the smooth surrogate for D-optimal and A-optimal experiment design. This is joint work with Reza Eghbali and James Saunderson.
STSW04 26th June 2018
16:00 to 16:45
Alex d'Aspremont An Approximate Shapley-Folkman Theorem.
with Thomas Kerdreux and Igor Colin. The Shapley-Folkman theorem shows that Minkowski averages of uniformly bounded sets tend to be convex when the number of terms in the sum becomes much larger than the ambient dimension. In optimization, \citet{Aubi76} show that this produces an a priori bound on the duality gap of separable nonconvex optimization problems involving finite sums. This bound is highly conservative and depends on unstable quantities, and we relax it in several directions to show that non convexity can have a much milder impact on finite sum minimization problems such as empirical risk minimization and multi-task classification. As a byproduct, we show a new version of Maurey's classical approximate Carath\'eodory lemma where we sample a significant fraction of the coefficients, without replacement.
STSW04 27th June 2018
09:00 to 09:45
Urvashi Oswal Selection and Clustering of Correlated variables using OWL/GrOWL regularizers
In high-dimensional linear regression problems, it is likely that several covariates are highly correlated e.g. in fMRI data, certain voxels or brain regions may have very correlated activation patterns. Using standard sparsity-inducing regularization (such as Lasso) in such scenarios is known to be unsatisfactory, as it leads to the selection of a subset of the highly correlated covariates. For engineering purposes and/or scientific interpretability, it is often desirable to explicitly identify all of the covariates that are relevant for modeling the data. In this talk, I will present clustering properties and error bounds of Ordered Weighted $\ell_1$ (OWL) regularization for linear regression, and a group variant for multi-task regression called Group OWL (GrOWL). I will present the application of OWL/GrOWL in three settings: (1) Brain networks (2) Subspace clustering and (3) Deep Learning. I will demonstrate, in theory and experiments, how OWL/GrOWL deal with strongly correlated covariates by automatically clustering and averaging regression coefficients associated with those covariates. This is joint work with Robert Nowak, Mário Figueiredo, Tim Rogers and Chris Cox.
STSW04 27th June 2018
09:45 to 10:30
Elizaveta Levina Matrix completion in network analysis
Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss three examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for cross-validation on network data, a long-standing problem in network analysis. Two other examples focus on reconstructing incompletely observed networks, with structured missingness resulting from network sampling mechanisms. One scenario we consider is egocentric sampling, where a set of nodes is selected first and then their connections to the entire network are observed. Another scenario focuses on data from surveys, where people are asked to name a given number of friends. We show that matrix completion, when used appropriately, can generally be very helpful in solving network problems.
STSW04 27th June 2018
11:00 to 11:45
Garvesh Raskutti Estimating sparse additive auto-regressive network models
Consider a multi-variate time series, which may correspond to spike train responses for multiple neurons in a brain, crime event data across multiple regions, and many others. An important challenge associated with these time series models is to estimate an influence network between the d variables, especially when the number of variables d is large meaning we are in the high-dimensional setting. Prior work has focused on parametric vector auto-regressive models. However, parametric approaches are somewhat restrictive in practice. In this paper, we use the non-parametric sparse additive model (SpAM) framework to address this challenge. Using a combination of beta and phi-mixing properties of Markov chains and empirical process techniques for reproducing kernel Hilbert spaces (RKHSs), we provide upper bounds on mean-squared error in terms of the sparsity s, logarithm of the dimension logd, number of time points T, and the smoothness of the RKHSs. Our rates are sharp up to logarithm factors in many cases. We also provide numerical experiments that support our theoretical results and display potential advantages of using our non-parametric SpAM framework for a Chicago crime dataset.
STSW04 27th June 2018
11:45 to 12:30
Peter Bartlett Representation, optimization and generalization properties of deep neural networks
Deep neural networks have improved the state-of-the-art performance for prediction problems across an impressive range of application areas. This talk describes some recent results in three directions. First, we investigate the impact of depth on representational properties of deep residual networks, which compute near-identity maps at each layer, showing how their representational power improves with depth and that the functional optimization landscape has the desirable property that stationary points are optimal. Second, we study the implications for optimization in deep linear networks, showing how the success of a family of gradient descent algorithms that regularize towards the identity function depends on a positivity condition of the regression function. Third, we consider how the performance of deep networks on training data compares to their predictive accuracy, we demonstrate deviation bounds that scale with a certain "spectral complexity," and we compare the behavior of these bounds with the observed performance of these networks in practical problems.

Joint work with Steve Evans, Dylan Foster, Dave Helmbold, Phil Long, and Matus Telgarsky.

STSW04 28th June 2018
09:00 to 09:45
Tong Zhang Candidates vs. Noises Estimation for Large Multi-Class Classification Problem
In practice, there has been sigificant interest in multi-class classification problems where the number of classes is large. Computationally such applications require statistical methods with run time sublinear in the number of classes. A number of methods such as Noise-Contrastive Estimation (NCE) and variations have been proposed in recent years to address this problem. However, the existing methods are not statistically efficient compared to multi-class logistic regression, which is the maximum likelihood estimate. In this talk, I will describe a new method called Candidate v.s. Noises Estimation (CANE) that selects a small subset of candidate classes and samples the remaining classes. We show that CANE is always consistent and computationally efficient. Moreover, the resulting estimator has low statistical variance approaching that of the maximum likelihood estimator, when the observed label belongs to the selected candidates with high probability. Extensive experimental results show that CANE achieves better prediction accuracy over a number of the state-of-the-art tree classifiers, while it gains significant speedup compared to standard multi-class logistic regression.
STSW04 28th June 2018
09:45 to 10:30
Aurore Delaigle Estimating a covariance function from fragments of functional data
Functional data are often observed only partially, in the form of fragments. In that case, the standard approaches for estimating the covariance function do not work because entire parts of the domain are completely unobserved. In previous work, Delaigle and Hall (2013, 2016) have suggested ways of estimating the covariance function, based for example on Markov assumptions. In this work we take a completely different approach which does not rely on such assumptions. We show that, using a tensor product approach, it is possible to reconstruct the covariance function using observations located only on the diagonal of its domain.
STSW04 28th June 2018
11:00 to 11:45
Francis Bach Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model. (Joint work with Loucas Pillaud-Vivien and Alessandro Rudi)

STSW04 28th June 2018
11:45 to 12:30
Edward Ionides Monte Carlo adjusted profile likelihood, with applications to spatiotemporal and phylodynamic inference.
Partially observed nonlinear stochastic dynamic systems raise inference challenges. Sequential Monte Carlo (SMC) methods provide a route to accessing the likelihood function. However, despite the advantage of applicability to a wide class of nonlinear models, standard SMC methods have a limitation that they scale poorly to large systems. We present a profile likelihood approach, properly adjusted for Monte Carlo uncertainty, that enables likelihood-based inference in systems for which Monte Carlo error remains large despite stretching the limits of available computational resources. Together with state-of-the-art SMC algorithms, this technique permits effective inference on some scientific problems in panel time series analysis, spatiotemporal modeling, and inferring population dynamic models from genetic sequence data. The results presented are joint work with Carles Breto, Joonha Park, Alex Smith and Aaron King.
STSW04 28th June 2018
14:00 to 14:45
Jinchi Lv Asymptotics of Eigenvectors and Eigenvalues for Large Structured Random Matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.
STSW04 28th June 2018
14:45 to 15:30
Rui Castro Are there needles in a (moving) haystack? Adaptive sensing for detection and estimation of static and dynamically evolving signals
STSW04 28th June 2018
16:00 to 16:45
Pradeep Ravikumar Robust Estimation via Robust Gradient Estimation
A common assumption in the training of machine learning systems is that the data is sufficiently clean and well-behaved: there are very few or no outliers, or that the distribution of the data does not have very long tails. As machine learning finds wider usage, these assumptions are increasingly indefensible. The key question then is how to perform estimation that is robust to departure from these assumptions; and which has been of classical interest, with seminal contributions due to Box, Tukey, Huber, Hampel, and several others. Loosely, there seemed to be a computation-robustness tradeoff, practical estimators did have strong robustness guarantees, while estimators with strong robustness guarantees were computationally impractical. In our work, we provide a new class of computationally-efficient class of estimators for risk minimization that are provably robust to a variety of robustness settings, such as arbitrary oblivious contamination, and heavy-tailed data, among others. Our workhorse is a novel robust variant of gradient descent, and we provide conditions under which our gradient descent variant provides accurate and robust estimators in any general convex risk minimization problem. These results provide some of the first computationally tractable and provably robust estimators for general statistical models. Joint work with Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan.
STSW04 29th June 2018
09:00 to 09:45
Alexandre Tsybakov Does data interpolation contradict statistical optimality?
We exhibit estimators that interpolate the data, and yet achieve optimal rates of convergence for the problems of nonparametric regression and prediction with square loss. This curious observation goes against the usual intuition that a good statistical procedure should forego the exact fit to data in favor of a more smooth representation. A motivation for this work is the recent focus within the machine learning community on the out-of-sample performance of neural networks. These flexible models are typically trained to fit the data exactly (either in their sign or in the actual value), and yet they have good prediction behaviour. This talk is based on joint work with Misha Belkin and Sasha Rakhlin.
STSW04 29th June 2018
09:45 to 10:30
Vincent Vu Group invariance and computational sufficiency
Statistical sufficiency formalizes the notion of data reduction. In the decision theoretic interpretation, once a model is chosen all inferences should be based on a sufficient statistic. However, suppose we start with a set of methods that share a sufficient statistic rather than a specific model. Is it possible to reduce the data beyond the statistic and yet still be able to compute all of the methods? In this talk, I'll present some progress towards a theory of "computational sufficiency" and show that strong reductions _can_ be made for large classes of penalized M-estimators by exploiting hidden symmetries in the underlying optimization problems. These reductions can (1) enable efficient computation and (2) reveal hidden connections between seemingly disparate methods. As a main example, I'll show how the theory provides a surprising answer to the following question: "What do the Graphical Lasso, sparse PCA, single-linkage clustering, and L1 penalized Ising model selection all have in common?"
STSW04 29th June 2018
11:00 to 11:45
Richard Samworth Data perturbation for data science
When faced with a dataset and a problem of interest, should we propose a statistical model and use that to inform an appropriate algorithm, or dream up a potential algorithm and then seek to justify it? The former is the more traditional statistical approach, but the latter appears to be becoming more popular. I will discuss a class of algorithms that belong in the second category, namely those that involve data perturbation (e.g. subsampling, random projections, artificial noise, knockoffs,...). As examples, I will consider Complementary Pairs Stability Selection for variable selection and sparse PCA via random projections. This will involve joint work with Rajen Shah, Milana Gataric and Tengyao Wang.
STSW04 29th June 2018
11:45 to 12:30
Domagoj Ćevid, Peter Bühlmann Deconfounding using Spectral Transformations
High-dimensional regression methods which rely on the sparsity of the ground truth, such as the Lasso, might break down in the presence of confounding variables. If a latent variable affects both the response and the predictors, the correlation between them changes. This phenomenon can be represented as a linear model where the sparse coefficient vector has been perturbed. We will present our work on this problem. We investigate and propose some spectral transformations for the data which serve as input for the Lasso. We discuss assumptions for achieving the optimal error rate and illustrate the performance on a genomic dataset. The approach is easy to use and leads to convincing results. The talk is based on joint work with Nicolai Meinshausen.