Seminars (STS)

Videos and presentation materials from other INI events are also available.

Event When Speaker Title Presentation Material
STSW01 15th January 2018
10:00 to 10:45
Michael Jordan On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex
Many new theoretical challenges have arisen in the area of gradient-based optimization for large-scale statistical data analysis, driven by the needs of applications and the opportunities provided by new hardware and software platforms. I discuss several related, recent results in this area: (1) a new framework for understanding Nesterov acceleration, obtained by taking a continuous-time, Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to escape saddle points efficiently in nonconvex optimization, and (3) the acceleration of Langevin diffusion.
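The continuous-time view in (1) is easiest to appreciate next to the discrete scheme it explains. Below is a minimal sketch of Nesterov's accelerated gradient method on an illustrative quadratic; the objective, step size and iteration count are assumptions for illustration, not taken from the talk.

import numpy as np

def nesterov_agd(grad, x0, step, n_iter=500):
    """Nesterov accelerated gradient descent with the standard (k-1)/(k+2) momentum."""
    x = x0.copy()
    y = x0.copy()
    for k in range(1, n_iter + 1):
        x_next = y - step * grad(y)                        # gradient step from the extrapolated point
        y = x_next + (k - 1.0) / (k + 2.0) * (x_next - x)  # momentum / extrapolation step
        x = x_next
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 x'Ax - b'x, so grad f(x) = Ax - b.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M.T @ M / 20 + np.eye(20)
b = rng.standard_normal(20)
x_hat = nesterov_agd(lambda x: A @ x - b, np.zeros(20), step=1.0 / np.linalg.norm(A, 2))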
STSW01 15th January 2018
11:10 to 11:55
Rina Foygel Barber Robust inference with the knockoff filter
In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating knockoff copies of each feature, testing the model selection algorithm for its ability to distinguish true from false covariates to control the false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, as we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL divergence type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candès and Richard Samworth.
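A schematic sketch of the knockoff filter's selection step, under the strong simplifying assumption that the features are independent Gaussians (so a knockoff copy is just an independent redraw from the estimated marginal distributions); the lasso coefficient-difference statistic and the target level q are illustrative choices, not the construction analysed in the talk.

import numpy as np
from sklearn.linear_model import Lasso

def knockoff_select(X, y, q=0.1, lasso_alpha=0.05, seed=0):
    """Schematic knockoff filter: build knockoffs, compute feature-vs-knockoff
    statistics W_j, then apply the knockoff+ threshold to control FDR at level q."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Knockoffs drawn from the estimated feature distribution; valid here only
    # because we pretend the features are independent Gaussians.
    X_ko = rng.standard_normal((n, p)) * X.std(axis=0) + X.mean(axis=0)
    beta = Lasso(alpha=lasso_alpha).fit(np.hstack([X, X_ko]), y).coef_
    W = np.abs(beta[:p]) - np.abs(beta[p:])
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return np.where(W >= t)[0]          # selected features
    return np.array([], dtype=int)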
STSW01 15th January 2018
11:55 to 12:40
Jonas Peters The Hardness of Conditional Independence Testing
Testing for conditional independence between continuous random variables is considered to be a hard statistical problem. We provide a formalization of this result and show that a test with correct size does not have power against any alternative. It is thus impossible to obtain any guarantee if the relationships between the random variables can be arbitrarily complex. We propose a practical test that achieves the correct size if the conditional expectations are smooth enough such that they can be estimated from data. This is joint work with Rajen Shah.
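One simple test in this spirit (a sketch, not necessarily the exact procedure from the talk) regresses X and Y on Z and tests whether the residual products have mean zero; the random-forest regressions and the normal approximation below are illustrative assumptions.

import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

def residual_ci_test(x, y, z, seed=0):
    """Test H0: X independent of Y given Z via the correlation of regression residuals."""
    rx = x - RandomForestRegressor(n_estimators=200, random_state=seed).fit(z, x).predict(z)
    ry = y - RandomForestRegressor(n_estimators=200, random_state=seed).fit(z, y).predict(z)
    prod = rx * ry
    # Studentised mean of the residual products; approximately N(0,1) under H0
    # provided the conditional expectations are estimated well enough.
    t_stat = np.sqrt(len(prod)) * prod.mean() / prod.std(ddof=1)
    return 2 * stats.norm.sf(abs(t_stat))

# Example: x and y are dependent only through z.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))
x = np.sin(z[:, 0]) + 0.3 * rng.normal(size=500)
y = np.cos(z[:, 0]) + 0.3 * rng.normal(size=500)
p_value = residual_ci_test(x, y, z)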
STSW01 15th January 2018
14:10 to 14:55
Jianqing Fan Uniform perturbation analysis of eigenspaces and its applications to Community Detection, Ranking and Beyond
Co-authors: Emmanuel Abbe (Princeton University), Kaizheng Wang (Princeton University), Yiqiao Zhong (Princeton University)

Spectral methods have been widely used for a large class of challenging problems, ranging from top-K ranking via pairwise comparisons to community detection and factor analysis, among others. Analyses of these spectral methods require sup-norm perturbation analysis of top eigenvectors. This allows us to uniformly approximate elements of eigenvectors by linear functions of the observed random matrix that can be analyzed further. We first establish such an infinity-norm perturbation bound for top eigenvectors and apply the idea to several challenging problems such as top-K ranking, community detection, Z_2-synchronization and matrix completion. We show that the spectral methods are indeed optimal for these problems. We illustrate these methods via simulations.
STSW01 15th January 2018
14:55 to 15:40
Nicolai Meinshausen Instruments, Images, and Anchors
Co-authors: Dominik Rothenhäusler, Peter Bühlmann, Christina Heinze (ETH Zurich), and Jonas Peters (University of Copenhagen)

I will present some ongoing work on two related projects. Firstly, some joint work with Jonas Peters, Dominik Rothenhäusler and Peter Bühlmann on anchor regression, which puts both standard regression and causal regression into a common framework and allows one to interpolate between the two in a continuous fashion. Anchor regression can be seen as a least-squares term with a causal, instrumental-variable-type penalty. The chosen penalty is shown to correspond to the expected future magnitude of interventions, and some risk bounds can be derived for the high-dimensional setting. Second, in joint work with Christina Heinze, I will show an application of related ideas to a (non-causal) setting in image classification.
STSW01 15th January 2018
16:10 to 16:55
Gareth Roberts The zig-zag and super-efficient sampling for Bayesian analysis of big data
Standard MCMC methods can scale poorly to big data settings due to the need to evaluate the likelihood at each iteration. There have been a number of approximate MCMC algorithms that use sub-sampling ideas to reduce this computational burden, but with the drawback that these algorithms no longer target the true posterior distribution. The talk will discuss a new family of Monte Carlo methods based upon a multi-dimensional version of the Zig-Zag process of Bierkens and Roberts (2016), a continuous-time piecewise deterministic Markov process. While traditional MCMC methods are reversible by construction, the Zig-Zag process offers a flexible non-reversible alternative. The dynamics of the Zig-Zag process correspond to a constant velocity model, with the velocity of the process switching at events from a point process. The rate of this point process can be related to the invariant distribution of the process. If we wish to target a given posterior distribution, then the rates need to be set equal to the gradient of the log of the posterior. Unlike traditional MCMC, the Zig-Zag process can be simulated without discretisation error, and I will give conditions for the process to be ergodic. Most importantly, I will discuss two generalisations which have good scaling properties for big data: firstly, a sub-sampling version of the Zig-Zag process that is an example of an exact approximate scheme; and secondly, a control-variate variant of the sub-sampling idea to reduce the variance of our unbiased estimator. Very recent ergodic theory will also be described.
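A minimal one-dimensional sketch of the basic (non-sub-sampled) Zig-Zag dynamics for a standard Gaussian target, where the event times of the switching process can be sampled in closed form; the target and the time horizon are illustrative assumptions.

import numpy as np

def zigzag_gaussian(T=1000.0, x0=0.0, seed=0):
    """1D Zig-Zag process targeting N(0,1): constant velocity +/-1, with the
    velocity flipping at events of a Poisson process with rate max(0, theta*x)."""
    rng = np.random.default_rng(seed)
    x, theta, t = x0, 1.0, 0.0
    skeleton = [(t, x)]
    while t < T:
        e = rng.exponential()
        a = theta * x
        # First event time of the inhomogeneous rate max(0, a + s), found by inversion.
        tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
        x += theta * tau      # deterministic constant-velocity move until the event
        t += tau
        theta = -theta        # velocity flip
        skeleton.append((t, x))
    return np.array(skeleton)  # (time, position) skeleton; the path is piecewise linear

path = zigzag_gaussian(T=200.0)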
STSW01 16th January 2018
09:00 to 09:45
Andrea Montanari An Instability in Variational Methods for Learning Topic Models
Topic models are extremely useful to extract latent degrees of freedom from large unlabeled datasets. Variational Bayes algorithms are the approach most commonly used by practitioners to learn topic models. Their appeal lies in the promise of reducing the problem of variational inference to an optimization problem. I will show that, even within an idealized Bayesian scenario, variational methods display an instability that can lead to misleading results. [Based on joint work with Behroz Ghorbani and Hamid Javadi]
STSW01 16th January 2018
09:45 to 10:30
Pradeep Ravikumar The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities
Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments in parametric graphical models to propose a class of non-parametric multivariate densities which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. Joint work with Mladen Kolar, Arun Sai Suggala.
STSW01 16th January 2018
11:00 to 11:45
Yingying Fan RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Co-authors: Emre Demirkaya (University of Southern California), Gaorong Li (Beijing University of Technology), Jinchi Lv (University of Southern California)

Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness for the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from the ideal case, we suggest a modified model-free knockoffs method, called graphical nonlinear knockoffs (RANK), to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power for the knockoffs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure.

STSW01 16th January 2018
11:45 to 12:30
Andreas Krause Learning and inference in probabilistic submodular models
I will present our work on inference and learning in discrete probabilistic models defined through submodular functions. These generalize pairwise graphical models and determinantal point processes, express natural notions such as attractiveness and repulsion, and can capture richly parameterized, long-range, high-order dependencies. The key idea is to use sub- and supergradients of submodular functions, and exploit their combinatorial structure to efficiently optimize variational upper and lower bounds on the partition function. This approach allows us to perform efficient approximate inference in any probabilistic model that factorizes into log-submodular and log-supermodular potentials of arbitrary order. Our approximation is exact at the mode for log-supermodular distributions, and we provide bounds on the approximation quality of the log-partition function with respect to the curvature of the function. I will also discuss how to learn log-supermodular distributions via bi-level optimisation. In particular, we show how to compute gradients of the variational posterior, which allows integrating the models into modern deep architectures. This talk is primarily based on joint work with Josip Djolonga.
STSW01 16th January 2018
14:00 to 14:45
Quentin Berthet Optimal Link Prediction with Matrix Logistic Regression
We consider the problem of link prediction, based on partial observation of a large network and on covariates associated with its vertices. The generative model is formulated as matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius-norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any algorithm computable in polynomial time, under a computational complexity assumption. (Joint work with Nicolai Baldin)
STSW01 16th January 2018
14:45 to 15:30
Rainer von Sachs Selecting Groups of Variables for Prediction Problems in Chemometrics: Recent Regularization Approaches
This presentation addresses the problem of selecting important, potentially overlapping groups of predictor variables in linear models such that the resulting model satisfies a balance between interpretability and prediction performance. This is motivated by data from the field of chemometrics where, due to correlation between predictors from different groups (i.e. variable group “overlap”), identifying groups during model estimation is particularly challenging. In particular, we will highlight some issues of existing methods when they are applied to high dimensional data with overlapping groups of variables. This will be demonstrated through comparison of their optimization criteria and their performance on simulated data. This is joint work in progress with Rebecca Marion, ISBA, Université catholique de Louvain, Belgium.
STSW01 17th January 2018
09:00 to 09:45
Alex d'Aspremont Sharpness, Restart and Compressed Sensing Performance
Joint work with Vincent Roulet (University of Washington) and Nicolas Boumal (Princeton University). We show that several classical quantities controlling compressed sensing performance directly match parameters controlling algorithmic complexity. We first describe linearly convergent restart schemes on first-order methods using a sharpness assumption. The Łojasiewicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost generically. Here, we show that sharpness directly controls the performance of restart schemes. For sparse recovery problems, sharpness at the optimum can be written as a condition number, given by the ratio between true signal sparsity and the largest signal size that can be recovered by the observation matrix. Overall, this means that in compressed sensing problems, a single parameter directly controls both computational complexity and recovery performance.
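A toy sketch of the restart idea: run an accelerated first-order method for a fixed budget, discard the momentum, and restart from the current point. The fixed schedule and the least-squares test problem are illustrative assumptions (the talk ties the optimal schedule to the sharpness constant).

import numpy as np

def agd(grad, x0, step, n_iter):
    """Plain Nesterov accelerated gradient descent."""
    x, y = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        x_next = y - step * grad(y)
        y = x_next + (k - 1.0) / (k + 2.0) * (x_next - x)
        x = x_next
    return x

def restarted_agd(grad, x0, step, inner=50, n_restarts=20):
    """Restart scheme: the momentum sequence is reset every `inner` iterations."""
    x = x0.copy()
    for _ in range(n_restarts):
        x = agd(grad, x, step, inner)
    return x

# Illustrative sharp problem: least squares with a well-conditioned design.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
b = rng.standard_normal(100)
step = 1.0 / np.linalg.norm(A.T @ A, 2)
x_hat = restarted_agd(lambda x: A.T @ (A @ x - b), np.zeros(30), step)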
STSW01 17th January 2018
09:45 to 10:30
Lorenzo Rosasco Optimal and efficient learning with random features
Random features approaches correspond to one-hidden-layer neural networks with random hidden units, and can be seen as approximate kernel methods. We study the statistical and computational properties of random features within a ridge regression scheme. We prove for the first time that a number of random features much smaller than the number of data points suffices for optimal statistical error, with a corresponding huge computational gain. We further analyze faster rates under refined conditions and the potential benefit of random features chosen according to adaptive sampling schemes.
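A small sketch of the construction behind these results: random Fourier features approximating a Gaussian kernel, followed by ridge regression on the features. The kernel width, the number of features and the regularisation level are illustrative assumptions.

import numpy as np

def rff_ridge(X, y, n_features=300, gamma=1.0, lam=1e-3, seed=0):
    """Ridge regression on random Fourier features approximating exp(-gamma*||x-x'||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)     # random feature map
    w = np.linalg.solve(Z.T @ Z + lam * len(y) * np.eye(n_features), Z.T @ y)
    return lambda X_new: np.sqrt(2.0 / n_features) * np.cos(X_new @ W + b) @ w

# Example: far fewer random features than data points.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=5000)
predict = rff_ridge(X, y)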
STSW01 17th January 2018
11:00 to 11:45
Rebecca Willett Nonlinear Models for Matrix Completion
The past decade of research on matrix completion has shown it is possible to leverage linear dependencies to impute missing values in a low-rank matrix. However, the corresponding assumption that the data lies in or near a low-dimensional linear subspace is not always met in practice. Extending matrix completion theory and algorithms to exploit low-dimensional nonlinear structure in data will allow missing data imputation in a far richer class of problems. In this talk, I will describe several models of low-dimensional nonlinear structure and how these models can be used for matrix completion. In particular, we will explore matrix completion in the context of three different nonlinear models: single index models, in which a latent subspace model is transformed by a nonlinear mapping; unions of subspaces, in which data points lie in or near one of several subspaces; and nonlinear algebraic varieties, a polynomial generalization of classical linear subspaces. In these settings, we will explore novel and efficient algorithms for imputing missing values and new bounds on the amount of missing data that can be accurately imputed. The proposed algorithms are able to recover synthetically generated data up to predicted sample complexity bounds and outperform standard low-rank matrix completion in experiments with real recommender system and motion capture data.
STSW01 17th January 2018
11:45 to 12:30
Carola-Bibiane Schönlieb Learning from mistakes - learning optimally sparse image filters by quotient minimisation
Co-authors: Martin Benning (University of Cambridge), Guy Gilboa (Technion, Haifa), Joana Grah (University of Cambridge)

Learning approaches have recently become very popular in the field of inverse problems. A large variety of methods has been established in recent years, ranging from bi-level learning to high-dimensional machine learning techniques. Most learning approaches, however, only aim at fitting parametrised models to favourable training data whilst ignoring misfit training data completely. In this talk, we follow up on the idea of learning parametrised regularisation functions by quotient minimisation. We consider one- and higher-dimensional filter functions to be learned and allow for fit- and misfit-training data consisting of multiple functions. We first present results resembling behaviour of well-established derivative-based sparse regularisers like total variation or higher-order total variation in one dimension. Then, we introduce novel families of non-derivative-based regularisers. This is accomplished by learning favourable scales and geometric properties while at the same time avoiding unfavourable ones.
STSW01 18th January 2018
09:00 to 09:45
Yi Yu Optimal Covariance Change Point Detection in High Dimension
Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University) 

In this paper, we study the covariance change point detection problem in high dimensions. Specifically, we assume that the time series $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$ are independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and only change at certain time points. Our generic model setting allows $p$ to grow with $n$ and we do not place any additional structural assumptions on the covariance matrices. We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish consistency results under suitable conditions. To improve the detection performance in high dimensions, we propose Wild Binary Segmentation through Independent Projection (WBSIP). We show that WBSIP can optimally estimate the locations of the change points. Our analysis also reveals a phase transition effect based on our generic model assumption and, to the best of our knowledge, results of this type have not been established elsewhere in the change point detection literature.
STSW01 18th January 2018
09:45 to 10:30
Alexandre Tsybakov Adaptive estimation of functionals under sparsity
Adaptive estimation of functionals in the sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the L_2-norm of the target vector and of the variance of the noise. In the first problem, ignorance of the noise level causes changes in the rates. Moreover, the form of the noise distribution also influences the optimal rate. For example, the rates of estimating the variance differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution. Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, regardless of how fast the polynomial decay is. Joint work with O. Collier and L. Comminges.
STSW01 18th January 2018
11:00 to 11:45
Claire Boyer On the gap between local recovery guarantees in structured compressed sensing and oracle estimates
This is ongoing work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby). First we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees to ensure reconstruction of a structured sparse signal of interest while imposing structure in the acquisition (no Gaussian measurements here...). We will study how far those CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements. These results give insight into designing new optimal sampling strategies when realistic physical constraints are imposed in the acquisition.
STSW01 18th January 2018
11:45 to 12:30
Ioannis Kontoyiannis Small Big Data: Temporal structure in discrete time series
The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning. Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modeling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets.

We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC.

We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including on-line prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al’s VLMC, Ron et al’s PST, and Raftery’s MTD) will be discussed. 
STSW01 18th January 2018
14:00 to 14:45
Judith Rousseau Asymptotics for ABC algorithms
Approximate Bayesian Computation (ABC) algorithms are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution, a possibility is to sample new data from the model and check whether these new data are close in some sense to the true data. The output of this algorithm thus depends on how we define the notion of closeness, which is based on a choice of summary statistics and on a threshold. In this work we study the behaviour of the algorithm under the assumption that the summary statistics concentrate on some deterministic quantity, and characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated, where we show that surprising asymptotic behaviour appears.
STSW01 18th January 2018
14:45 to 15:30
Andreas Elsener Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators
Co-author: Sara van de Geer (ETH Zurich)

Nonconvex loss functions are used in several areas of statistics and machine learning. They have several appealing properties as in the case of robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss. The penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms different from the l1-norm. 
STSW01 18th January 2018
16:00 to 16:45
Rajen Shah The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close pairs of points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for some $1 < u < 2$ depending on their strength.
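A rough sketch of the close-pairs idea (not the authors' implementation): with entries coded as ±1, the interaction strength of a pair (j, k) is the inner product between column j of Z = y∘X and column k of X, so strong interactions correspond to nearly equal columns, which can be found by hashing the columns' sign patterns on random subsets of rows. The subsample size and number of repetitions below are illustrative tuning choices.

import numpy as np

def xyz_style_search(X, y, n_repeats=20, subsample=15, top=10, seed=0):
    """Candidate interaction search via random-projection/close-pair collisions.
    Assumes X and y have entries in {-1, +1}; strength(j, k) = mean(y * X[:,j] * X[:,k])."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = X * y[:, None]
    candidates = set()
    weights = 1 << np.arange(subsample)                        # hash of a +/-1 pattern
    for _ in range(n_repeats):
        rows = rng.choice(n, size=subsample, replace=False)
        key_z = (Z[rows] > 0).astype(np.int64).T @ weights
        key_x = (X[rows] > 0).astype(np.int64).T @ weights
        buckets = {}
        for k, key in enumerate(key_x):
            buckets.setdefault(int(key), []).append(k)
        for j, key in enumerate(key_z):
            for k in buckets.get(int(key), []):                # columns agreeing on the subsample
                if j != k:
                    candidates.add((j, k))
    scored = [(abs(np.mean(y * X[:, j] * X[:, k])), j, k) for j, k in candidates]
    return sorted(scored, reverse=True)[:top]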
STSW01 19th January 2018
09:00 to 09:45
Alessandro Rudi Falkon: fast and optimal kernel method for large scale machine learning
Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n sqrt(n)) time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state of the art solutions, which exploit parallel/distributed architectures.
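Not the full FALKON solver (no preconditioning or conjugate gradient), but a sketch of the Nyström subsampling idea it accelerates: restrict kernel ridge regression to the span of m randomly chosen centres. The Gaussian kernel, m and the regularisation level are illustrative assumptions.

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_krr(X, y, m=200, gamma=1.0, lam=1e-3, seed=0):
    """Kernel ridge regression restricted to m Nystrom centres: solve
    (K_nm' K_nm + lam*n*K_mm) alpha = K_nm' y and predict with k(x, centres) @ alpha."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centres = X[rng.choice(n, size=min(m, n), replace=False)]
    K_nm = gaussian_kernel(X, centres, gamma)
    K_mm = gaussian_kernel(centres, centres, gamma)
    A = K_nm.T @ K_nm + lam * n * K_mm + 1e-9 * np.eye(len(centres))  # small jitter for stability
    alpha = np.linalg.solve(A, K_nm.T @ y)
    return lambda X_new: gaussian_kernel(X_new, centres, gamma) @ alpha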
STSW01 19th January 2018
09:45 to 10:30
Philippe Rigollet Learning determinantal point processes
Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT)

Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically:
- Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points for the expected log-likelihood.
- Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity.
- A fast combinatorial algorithm to implement the method of moments efficiently.

The necessary background on DPPs will be given in the talk.

Joint work with Victor-Emmanuel Brunel (M.I.T), Ankur Moitra (M.I.T) and John Urschel (M.I.T).
STSW01 19th January 2018
11:00 to 11:45
Shahar Mendelson An optimal unrestricted learning procedure
The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence and the identity of a procedure that attains that optimal tradeoff have been studied extensively over the years.

In this talk I present some of the ideas used in the recent solution of this problem: a procedure $\hat{f}$ that attains the best possible accuracy/confidence tradeoff for a triplet $(F,X,Y)$ consisting of an arbitrary class $F$, an unknown distribution $X$ and an unknown target $Y$.

Specifically, I explain why under minimal assumptions on $(F,X,Y)$, there is a natural critical level $r^*$ that depends on the triplet and the sample size $N$, such that for any accuracy parameter $r>r^*$, one has $E((\hat{f}(X)-Y)^2|D) \leq \inf_{f \in F} E(f(X)-Y)^2+r^2$ with probability at least $1-2\exp(-cN \min\{1,r^2/\sigma^2\})$,
where $\sigma^2=\inf_{f \in F} E(f(X)-Y)^2$.
STSW01 19th January 2018
11:45 to 12:30
Garvesh Raskutti Variable selection and classification with large-scale presence only data
Co-author: Hyebin Song (University of Wisconsin-Madison)

In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee: we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.

STS 23rd January 2018
11:00 to 12:00
Anru Zhang Singular Value Decomposition for High-dimensional High-order Data
High-dimensional high-order data arise in many modern scientific applications including genomics, brain imaging, and social science. In this talk, we consider the methods, theories, and computations for tensor singular value decomposition (tensor SVD), which aims to extract the hidden low-rank structure from high-dimensional high-order data. First, comprehensive results are developed on both the statistical and computational limits for tensor SVD under the general scenario. This problem exhibits three different phases according to the signal-to-noise ratio (SNR), and minimax-optimal statistical and/or computational results are developed in each of the regimes. In addition, we further consider the sparse tensor singular value decomposition, which allows more robust estimation under sparsity structural assumptions. A novel sparse tensor alternating thresholding algorithm is proposed. Both optimal theoretical results and numerical analyses are provided to guarantee the performance of the proposed procedure.



STS 25th January 2018
11:00 to 12:00
Nathalie Akakpo Multivariate intensity estimation via hyperbolic wavelet selection
We propose a new statistical procedure that can in some sense overcome the curse of dimensionality without structural assumptions on the function to estimate. It relies on a least-squares type penalized criterion and a new collection of models built from hyperbolic biorthogonal wavelet bases. We study its properties in a unifying intensity estimation framework, where an oracle-type inequality and adaptation to mixed dominating smoothness are shown to hold. We also explain how to implement the estimator with an algorithm whose complexity is manageable. (Reference: JMVA, Volume 161, p.32-57)



STS 30th January 2018
11:00 to 12:00
Chao Zheng Revisiting Huber’s M-Estimation: A Tuning-Free Approach
We introduce a novel, tuning-free scheme to choose the scale or robustification parameter in Huber's method for mean estimation and linear regression in both low and high dimensional settings. For robustly estimating the mean of a univariate distribution, we first consider the adaptive Huber estimator with the robustification parameter calibrated via the censored equation approach. Our theoretical results provide finite sample guarantees for both the estimation and calibration parts. To further reduce the computational complexity, we next develop an alternating M-estimation procedure, which simultaneously estimates the mean and variance in a sequential manner. This idea can be naturally extended to regression problems in both low and high dimensions. We provide simple and fast algorithms to implement this procedure under various scenarios and study the numerical performance through simulations.
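A rough sketch of the alternating idea for the simplest case (mean estimation): iterate a Huber update of the mean given the robustification parameter tau, and an update of tau from the current robust scale. The calibration tau ≈ sigma*sqrt(n/log n) used below is a common heuristic and an assumption here, not the censored-equation calibration from the talk.

import numpy as np

def adaptive_huber_mean(x, n_iter=30):
    """Alternating estimation of a mean and its Huber robustification parameter."""
    n = len(x)
    mu = np.median(x)
    tau = np.std(x) * np.sqrt(n / np.log(n))
    for _ in range(n_iter):
        psi = np.clip(x - mu, -tau, tau)                  # Huber psi applied to residuals
        mu = mu + psi.mean()                              # fixed-point step for the Huber mean
        sigma2 = np.mean(np.minimum((x - mu) ** 2, tau ** 2))
        tau = np.sqrt(sigma2 * n / np.log(n))             # recalibrate tau from the robust scale
    return mu, tau

# Example with heavy-tailed noise.
rng = np.random.default_rng(0)
sample = 1.0 + rng.standard_t(df=2.5, size=2000)
mu_hat, tau_hat = adaptive_huber_mean(sample)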



STS 1st February 2018
11:00 to 12:00
Qiwei Yao Testing for High-dimensional White Noise
Testing for white noise is a fundamental problem in statistical inference, as many testing problems in linear modelling can be transformed into a white noise test. While the celebrated Box-Pierce test and its variants are often applied for model diagnosis, their relevance in the context of high-dimensional modelling is not well understood, as the asymptotic null distributions are established for fixed dimensions. Furthermore, those tests typically lose power when the dimension of the time series is relatively large in relation to the sample size. In this talk, we introduce two new omnibus tests for high-dimensional time series.

The first method uses the maximum absolute autocorrelations and cross-correlations of the component series as the testing statistic. Based on an approximation by the L-infinity norm of a normal random vector, the critical value of the test can be evaluated by bootstrapping from a multivariate normal distribution. In contrast to the conventional white noise test, the new method is proved to be valid for testing departure from white noise that is not independent and identically distributed.

The second test statistic is defined as the sum of squared singular values of the first q lagged sample autocovariance matrices. Therefore, it encapsulates all the serial correlations (up to the time lag q) within and across all component series. Using the tools from random matrix theory, we derive the normal limiting distributions when both the dimension and the sample size diverge to infinity.
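Sketches of the two statistics only (the bootstrap calibration of the first test and the random-matrix limit of the second are not reproduced); the maximum lag and q below are illustrative choices.

import numpy as np

def lagged_autocov(X, k):
    """Sample autocovariance matrix at lag k for an (n x p) multivariate series."""
    Xc = X - X.mean(axis=0)
    n = X.shape[0]
    return Xc[k:].T @ Xc[:n - k] / n

def max_correlation_stat(X, max_lag=5):
    """First test statistic: maximum absolute auto-/cross-correlation over lags 1..max_lag."""
    s = np.sqrt(np.diag(lagged_autocov(X, 0)))
    return max(np.max(np.abs(lagged_autocov(X, k) / np.outer(s, s)))
               for k in range(1, max_lag + 1))

def singular_value_stat(X, q=5):
    """Second test statistic: sum of squared singular values of the first q lagged
    autocovariance matrices, i.e. the sum of their squared Frobenius norms."""
    return sum(np.sum(lagged_autocov(X, k) ** 2) for k in range(1, q + 1))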




STS 8th February 2018
11:00 to 12:00
Darren Wilkinson A compositional approach to scalable statistical modelling and computation
In statistics, and in life, we typically solve big problems by (recursively) breaking them down into smaller problems that we can solve more easily, and then compose the solutions of the smaller problems to provide a solution to the big problem that we are really interested in. This "divide and conquer" approach is necessary for the development of genuinely scalable models and algorithms. It is therefore unfortunate that statistical models and algorithms are not usually formulated in a composable way, and that the programming languages typically used for scientific and statistical computing also fail to naturally support composition of models, data and computation. The mathematical subject of category theory is in many ways the mathematical study of composition, and provides significant insight into the development of more compositional models of computation. Functional programming languages which are strongly influenced by category theory turn out to be much better suited to the development of scalable statistical algorithms than the imperative programming languages more commonly used. Expressing algorithms in a functional/categorical way is not only more elegant, concise and less error-prone, but provides numerous more tangible benefits, such as automatic parallelisation and distribution of algorithms. I will illustrate the concepts using examples such as the statistical analysis of streaming data, image analysis, numerical integration of PDEs, particle filtering, Gibbs sampling, and probabilistic programming, using concepts from category theory such as functors, monads and comonads. Illustrative code snippets will be given using the Scala programming language.
STS 13th February 2018
11:00 to 12:00
Arlene Kim Adaptation in log-concave density estimation
The log-concave maximum likelihood estimator of a density on R^d, based on a sample of size n, is known to attain the minimax optimal rate of convergence up to a log factor when d = 2 and d = 3. In this talk, I will review the univariate adaptation result, and will present new results on adaptation properties in the multivariate setting. This is based on joint work with Oliver Feng, Aditya Guntuboyina and Richard Samworth.



STS 15th February 2018
11:00 to 12:00
Clifford Lam Nonlinear Shrinkage Estimation in Quadratic Inference Function Analysis for Correlated Data
Quadratic inference function (QIF) analysis is more efficient than the generalized estimating equations (GEE) approach when the working covariance matrices for the data are misspecified. Since QIF naturally requires a weighting matrix which is the inverse of a sample covariance matrix of non-identically distributed data, finite sample performance can be greatly affected when the number of independent data points is not large enough, which is usually the case in cluster randomized trials or many longitudinal studies. While nonlinear shrinkage is very successful in regularizing the extreme eigenvalues of a sample covariance matrix, the method is only restricted to independent and identically distributed data. We propose a novel nonlinear shrinkage approach for a sample covariance matrix of non-identically distributed data, which improves finite sample performance of QIF, and gives room for increasing the potential number of working correlation structures for even better performance. Together with a nonlinearly shrunk weighting matrix, we derive the asymptotic normality of the parameter estimators, and give an estimator for the asymptotic covariance matrix. We demonstrate the performance of the proposed method through simulation experiments and a real data analysis.



STS 20th February 2018
11:00 to 12:00
Alastair Young A Hybrid Block Bootstrap For Sample Quantiles Under Weak Dependence
The subsampling bootstrap and the moving blocks bootstrap provide effective methods for nonparametric inference with weakly dependent data. Both are based on the notion of resampling (overlapping) blocks of successive observations from a data sample: in the former single blocks are sampled, while the latter splices together random blocks to yield bootstrap series of the same length as the original data sample. Here we discuss a general theory for block bootstrap distribution estimation for sample quantiles, under mild strong mixing assumptions. A hybrid between subsampling and the moving blocks bootstrap is shown to give theoretical benefits, and startling improvements in accuracy in distribution estimation in important practical settings. An intuitive procedure for empirical selection of the optimal number of blocks and their length is proposed. The conclusion that bootstrap samples should be of smaller size than the original data sample has significant implications for computational efficiency and scalability of bootstrap methodologies in dependent data settings. This is joint work with Todd Kuffner and Stephen Lee and is described at https://arxiv.org/abs/1710.02537.
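A bare-bones sketch of moving-blocks resampling for a sample quantile; the block length and, importantly, the bootstrap series length m (which the talk argues should be smaller than n) are tuning parameters left as assumptions here.

import numpy as np

def block_bootstrap_quantile(x, prob=0.5, block_len=20, boot_len=None, n_boot=2000, seed=0):
    """Moving-blocks bootstrap replicates of the prob-quantile of a weakly dependent series."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m = boot_len if boot_len is not None else n     # a hybrid/subsampling scheme would use m < n
    starts = np.arange(n - block_len + 1)
    n_blocks = int(np.ceil(m / block_len))
    reps = np.empty(n_boot)
    for b in range(n_boot):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        series = np.concatenate([x[s:s + block_len] for s in chosen])[:m]
        reps[b] = np.quantile(series, prob)
    return reps   # e.g. np.quantile(reps, [0.025, 0.975]) gives a percentile interval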



STS 22nd February 2018
11:00 to 12:00
Varun Jog Information-theoretic perspectives on learning algorithms
In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. We overview some recent work [Xu and Raginsky (2017)] that bounds generalization error of empirical risk minimization based on the mutual information I(S;W) between the algorithm input S and the algorithm output W. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure, such as stochastic gradient Langevin dynamics (SGLD). We describe certain shortcomings of mutual information-based bounds, and propose alternate bounds that employ the Wasserstein metric from optimal transport theory. We compare the Wasserstein metric-based bounds with the mutual information-based bounds and show that for a class of data generating distributions, the former leads to stronger bounds on the generalization error.



STS 27th February 2018
11:00 to 12:00
Danning Li Testing covariance matrices in high dimensions
Testing covariance structure is of significant interest in many areas of high-dimensional inference. Using extreme-value form statistics to test against sparse alternatives and using quadratic form statistics to test against dense alternatives are two important testing procedures for high-dimensional independence. However, quadratic form statistics suffer from low power against sparse alternatives, and extreme-value form statistics suffer from low power against dense alternatives with small disturbances. It would be important and appealing to derive powerful testing procedures against general alternatives (either dense or sparse), which is more realistic in real-world applications. Under the ultra high-dimensional setting, we propose two novel testing procedures with explicit limiting distributions to boost the power against general alternatives.



OFBW38 28th February 2018
10:00 to 10:10
Richard Samworth, Jane Leeks Welcome and Introduction
OFBW38 28th February 2018
10:10 to 10:50
Sergio Bacallado Making sense of Human Microbiome Data
OFBW38 28th February 2018
10:50 to 11:30
Po-Ling Loh Data Science for Networked Data
OFBW38 28th February 2018
11:50 to 12:20
Chris Harbron Delivering Better Treatments to the Right Patients Faster and More Safely : Opportunities for Big Data to Improve Pharmaceutical R&D
OFBW38 28th February 2018
12:20 to 12:50
Nick Goldman Genomics Research at EBI: Challenges in Statistical Scaling
OFBW38 28th February 2018
12:50 to 13:00
Questions and Discussions
OFBW38 28th February 2018
14:00 to 14:30
Andrew Richards Future Challenges for Scaling Big Data at the National Grid
OFBW38 28th February 2018
14:30 to 15:00
Sofia Olhede Scalable Inference for Complex Environmental Data
OFBW38 28th February 2018
15:00 to 15:10
Questions and Discussions
OFBW38 28th February 2018
15:30 to 16:00
Dave Yearling Big Data Challenges in the Performance of Large IP Networks
OFBW38 28th February 2018
16:00 to 16:30
GCHQ Speaker Taking Big Data Upstream: Trading Storage for Approximation
OFBW38 28th February 2018
16:30 to 16:50
Questions, Discussion and Wrap-up
STS 1st March 2018
11:00 to 12:00
Patrick Rubin-Delanchy Progress on the connection between spectral embedding and network models used by the probability, statistics and machine-learning communities
In this talk, I give theoretical and methodological results, based on work spanning Johns Hopkins, the Heilbronn Institute for Mathematical Research, Imperial and Bristol, regarding the connection between various graph spectral methods and commonly used network models which are popular in the probability, statistics and machine-learning communities. An attractive feature of the results is that they lead to very simple take-home messages for network data analysis: a) when using spectral embedding, consider eigenvectors from both ends of the spectrum; b) when implementing spectral clustering, use Gaussian mixture models, not k-means; c) when interpreting spectral embedding, think of "mixtures of behaviour" rather than "distance". Results are illustrated with cyber-security applications.



STS 6th March 2018
11:00 to 12:00
Rahul Mazumder New directions in solving structured nonconvex problems in multivariate statistics
Nonconvex problems arise frequently in modern applied statistics and machine learning, posing outstanding challenges from a computational and statistical viewpoint. Continuous, especially convex, optimization has played a key role in our computational understanding of (relaxations or approximations of) these problems. However, some other well-grounded techniques in mathematical optimization (for example, mixed integer optimization) have not been explored to their fullest potential. When the underlying statistical problem becomes difficult, simple convex relaxations and/or greedy methods have shortcomings. Fortunately, many of these can be ameliorated by using estimators that can be posed as solutions to structured discrete optimization problems. To this end, I will demonstrate how techniques in modern computational mathematical optimization (especially, discrete optimization) can be used to address the canonical problem of best-subset selection and its cousins. I will describe how recent algorithms based on local combinatorial optimization can lead to high quality solutions in times comparable to (or even faster than) the fastest algorithms based on L1-regularization. I will also discuss the relatively less understood low signal-to-noise ratio regime, where usual subset selection performs unfavorably from a statistical viewpoint, and propose simple alternatives that rely on nonconvex optimization. If time permits, I will outline problems arising in the context of robust statistics (least median squares/least trimmed squares), low-rank factor analysis and nonparametric function estimation where these techniques seem to be promising.



STS 8th March 2018
11:00 to 12:00
Tengyao Wang Isotonic regression in general dimensions
We study the least squares regression function estimator over the class of real-valued functions on [0,1]^d that are increasing in each coordinate.  For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order n^{-min(2/(d+2),1/d)} in the empirical L_2 loss, up to poly-logarithmic factors.  Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on $k$ hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of (k/n)^{min(1,2/d)}, again up to poly-logarithmic factors.  Previous results are confined to the case d≤2.  Finally, we establish corresponding bounds (which are new even in the case d=2) in the more challenging random design setting.  There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.



STS 13th March 2018
11:00 to 12:00
Regina Liu i-Fusion: Individualized Fusion Learning
Inferences from different data sources can often be fused together to yield more powerful findings than those from individual sources alone. We present a new fusion learning approach, called 'i-Fusion', for drawing efficient individualized inference by fusing the learnings from relevant data sources. i-Fusion is robust for handling heterogeneity arising from diverse data sources, and is ideally suited for goal-directed applications such as precision medicine. Specifically, i-Fusion summarizes individual inferences as confidence distributions (CDs), adaptively forms a clique of individuals that bear relevance to the target individual, and then suitably combines the CDs from those relevant individuals in the clique to draw inference for the target individual. In essence, i-Fusion strategically 'borrows strength' from relevant individuals to improve efficiency while retaining its inference validity. Computationally, i-Fusion is parallel in nature and scales up well in comparison with competing approaches. The performance of the approach is demonstrated by simulations and real data applications.

This is joint work with Jieli Shen and Minge Xie, Rutgers University.




STS 15th March 2018
11:00 to 12:00
Mihaela van der Schaar Causal Inference for Treatment Effects: A Theory and Associated Learning Algorithms
We investigate the problem of estimating the causal effect of a treatment on individual subjects from observational data; this is a central problem in various application domains, including healthcare, social sciences, and online advertising. We first develop a theoretical foundation of causal inference for individualized treatment effects based on information theory. Next, we use this theory to construct an information-optimal Bayesian causal inference algorithm. This algorithm embeds the potential outcomes in a vector-valued reproducing kernel Hilbert space and uses a multi-task Gaussian process prior over that space to infer the individualized causal effects. We show that our algorithm significantly outperforms the state-of-the-art causal inference algorithms. The talk will conclude with a discussion of the impact of this work on precision medicine and clinical trials.



STSW02 19th March 2018
10:00 to 11:00
Victor Panaretos Procrustes Analysis of Covariance Operators and Optimal Transport of Gaussian Processes
Covariance operators are fundamental in functional data analysis, providing the canonical means to analyse functional variation via the celebrated Karhunen-Loève expansion. These operators may themselves be subject to variation, for instance in contexts where multiple functional populations are to be compared. Statistical techniques to analyse such variation are intimately linked with the choice of metric on covariance operators, and the intrinsic infinite-dimensionality of these operators. We will describe the manifold-like geometry of the space of trace-class infinite-dimensional covariance operators and associated key statistical properties, under the recently proposed infinite-dimensional version of the Procrustes metric. In particular, we will identify this space with that of centred Gaussian processes equipped with the Wasserstein metric of optimal transportation. The identification allows us to provide a description of those aspects of the geometry that are important in terms of statistical inference, and establish key properties of the Fréchet mean of a random sample of covariances, as well as generative models that are canonical for such metrics. The latter will allow us to draw connections with the problem of registration of warped functional data. Based on joint work with V. Masarotto (EPFL) and Y. Zemel (Göttingen).
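A finite-dimensional sketch of the metric being discussed: for covariance matrices, the Procrustes distance coincides with the Wasserstein-2 distance between the corresponding centred Gaussians, which has a closed form. scipy's matrix square root is used purely for illustration.

import numpy as np
from scipy.linalg import sqrtm

def procrustes_wasserstein(S1, S2):
    """Wasserstein-2 distance between N(0, S1) and N(0, S2), equal to the Procrustes
    distance between the covariance operators in finite dimensions."""
    root1 = sqrtm(S1)
    cross = sqrtm(root1 @ S2 @ root1)
    d2 = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)
    return np.sqrt(max(float(np.real(d2)), 0.0))   # clip tiny negatives from numerical error

# Example with two random covariance matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); S1 = A @ A.T + np.eye(5)
B = rng.standard_normal((5, 5)); S2 = B @ B.T + np.eye(5)
dist = procrustes_wasserstein(S1, S2)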
STSW02 19th March 2018
11:30 to 12:30
Jim Ramsay Dynamic Smoothing Meets Gravity
Authors: Jim Ramsay (McGill University), Michelle Carey (University College Dublin) and Juan Li (McGill University)

Systems of differential equations are often used to model buffering processes that modulate a non-smooth high-energy input so as to produce an output that is smooth and that distributes the energy load over time and space. Handwriting is buffered in this way. We show that the smooth complex script that spells "statistics" in Chinese can be represented as a buffered version of a series of 46 equal-interval step inputs. The buffer consists of three undamped oscillating springs, one for each orthogonal coordinate. The periods of oscillation vary slightly over the three coordinates in a way that reflects the masses that are moved by muscle activations. Our analyses of data on juggling three balls and on lip motion during speech confirm that this model works for a wide variety of human motions. We use the term "dynamic smoothing" for the estimation of a structured input functional object along with the buffer characteristics.
STSW02 19th March 2018
13:30 to 14:30
Bodhisattva Sen Adaptive Confidence Bands for Shape-Restricted Regression in Multi-dimension using Multiscale Tests
Co-author: Pratyay Datta (Columbia University)

We consider a multidimensional (continuous) Gaussian white noise regression model. We define suitable multiscale tests in this scenario. Utilizing these tests we construct confidence bands for the underlying regression function with guaranteed coverage probability, assuming that the underlying function is isotonic or convex. These confidence bands are shown to be asymptotically optimal in an appropriate sense. Computational issues will also be discussed.
STSW02 19th March 2018
14:30 to 15:30
Richard Samworth Isotonic regression in general dimensions
Co-authors: Qiyang Han (University of Washington), Tengyao Wang (University of Cambridge), Sabyasachi Chatterjee (University of Illinois)

We study the least squares regression function estimator over the class of real-valued functions on $[0,1]^d$ that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order $n^{-\min\{2/(d+2),1/d\}}$ in the empirical $L_2$ loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on $k$ hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of $(k/n)^{\min\{1,2/d\}}$, again up to poly-logarithmic factors. Previous results are confined to the case $d\leq 2$. Finally, we establish corresponding bounds (which are new even in the case $d=2$) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.
STSW02 19th March 2018
16:00 to 17:00
Rainer von Sachs Intrinsic wavelet regression for curves and surfaces of Hermitian positive definite matrices
Co-author: Joris Chau (ISBA, UC Louvain)

In multivariate time series analysis, non-degenerate autocovariance and spectral density matrices are necessarily Hermitian and positive definite. An estimation methodology which preserves these properties is developed based on intrinsic wavelet transforms being applied to nonparametric wavelet regression for curves in the non-Euclidean space of Hermitian positive definite matrices. Via intrinsic average-interpolation in a Riemannian manifold equipped with a natural invariant Riemannian metric, we derive the wavelet coefficient decay and linear wavelet thresholding convergence rates of intrinsically smooth curves. Applying this more specifically to nonparametric spectral density estimation, an important property of the intrinsic linear or nonlinear wavelet spectral estimator under the invariant Riemannian metric is that it is independent of the choice of coordinate system of the time series, in contrast to most existing approaches. As a generalisation of this one-dimensional denoising of matrix-valued curves in the Riemannian manifold we also present higher-dimensional intrinsic wavelet transforms, applied in particular to time-varying spectral estimation of non-stationary multivariate time series, i.e. surfaces of Hermitian positive definite matrices.

STSW02 20th March 2018
09:00 to 10:00
Robert Nowak Learning Low-Dimensional Metrics
This talk discusses the problem of learning a low-dimensional Euclidean metric from distance comparisons. Specifically, consider a set of n items with high-dimensional features and suppose we are given a set of (possibly noisy) distance comparisons of the form sign(dist(x,y) − dist(x,z)), where x, y, and z are the features associated with three such items. The goal is to learn the distance function that generates such comparisons. The talk focuses on several key issues pertaining to the theoretical foundations of metric learning: 1) optimization methods for learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) upper and lower (minimax) bounds on prediction error; 3) quantification of the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) bounds on the accuracy of the learned metric relative to the underlying true generative metric. Our results involve novel mathematical approaches to the metric learning problem and shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling). This is joint work with Lalit Jain and Blake Mason.
STSW02 20th March 2018
10:00 to 11:00
Matthew Reimherr Manifold Data Analysis with Applications to High-Resolution 3D Imaging
Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. In this work we present a new subfield of functional data analysis, FDA, which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units. We propose a framework that converts manifolds into functional objects, an efficient 2-step functional principal component method, and a manifold-on-scalar regression model. This work is motivated by an anthropological application involving 3D facial imaging data, which is discussed extensively throughout. The proposed framework is used to understand how individual characteristics, such as age and genetic ancestry, influence the shape of the human face.
STSW02 20th March 2018
11:30 to 12:30
Davide Pigoli Speech as object data: exploring cross-linguistic changes in Romance languages
Exploring phonetic change between languages is of particular importance in the understanding of the history and geographical spread of languages. While many studies have considered differences in textual form or in phonetic transcription, it is somewhat more difficult to analyse speech recordings in this manner, although this is usually the dominant mode of transmission. Here, we propose a novel approach to explore phonetic changes, using log-spectrograms of speech recordings. After pre-processing the data to remove inherent individual differences, we identify time and frequency covariance functions as a feature of the language; in contrast, the mean depends mostly on the word that has been uttered. We use these means and covariances to postulate paths between languages, and we illustrate some preliminary results obtained when the model is applied to recordings of speakers of a few Romance languages. This is part of a joint work with P.Z. Hadjipantelis, J.S. Coleman and J.A.D. Aston.
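For readers unfamiliar with the data object, the sketch below (illustrative only, not the preprocessing pipeline used in the talk) computes a log-spectrogram of a synthetic signal with scipy; the chirp stands in for a recorded utterance, and all constants are arbitrary.

```python
import numpy as np
from scipy import signal

fs = 16_000                                   # sample rate in Hz
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(0)
# A chirp plus noise as a stand-in for a recorded word
x = signal.chirp(t, f0=200, f1=2000, t1=0.5) + 0.05 * rng.normal(size=t.size)

# Short-time power spectrogram, then the log transform
freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256, noverlap=192)
log_spec = np.log(Sxx + 1e-10)                # the log-spectrogram surface over (frequency, time)
print(log_spec.shape)                         # (n_frequencies, n_time_frames)
```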
STSW02 20th March 2018
13:30 to 14:30
Michelle Carey Uncertainty quantification for Geo-spatial process
Co-author: James Ramsay (Prof)

Geo-spatial data are observations of a process that are collected together with reference to their geographical location. This type of data is abundant in many scientific fields; some examples include population census, social and demographic (health, justice, education), economic (business surveys, trade, transport, tourism, agriculture, etc.) and environmental (atmospheric and oceanographic) data. They are often distributed over irregularly shaped spatial domains with complex boundaries and interior holes. Modelling approaches must account for the spatial dependence over these irregular domains as well as describe their temporal evolution.

Dynamic systems modelling has a huge potential in statistics, as evidenced by the amount of activity in functional data analysis. Many seemingly complex forms of functional variation can be more simply represented as a set of differential equations, either ordinary or partial.

In this talk, I will present a class of semiparametric regression models with differential regularization in the form of PDEs. This methodology will be called Data2PDE ("Data to Partial Differential Equations"). Data2PDE characterizes spatial processes that evolve over complex geometries in the presence of uncertain, incomplete and often noisy observations, using prior knowledge regarding the physical principles of the process in the form of a PDE.
STSW02 20th March 2018
14:30 to 15:30
Ian Dryden Object Data Driven Discovery
Object data analysis is an important tool in the many disciplines where the data have much richer structure than the usual numbers or vectors. An initial question to ask is: what are the most basic data units? i.e. what are the atoms of the data? We provide an introduction to this topic, where the statistical analysis of object data has a wide variety of applications. An important aspect of the analysis is to reduce the dimension to a small number of key features while respecting the geometry of the manifold in which objects lie. Three case studies are given which exemplify the types of issues that are encountered: i) Describing changes in variability in damaged DNA, ii) Testing for geometrical differences in carotid arteries, where patients are at high or low risk of aneurysm, iii) Clustering enzymes observed over time. In all three applications the structure of the data manifolds is important, in particular the manifold of covariance matrices, unlabelled size-and-shape space and shape space.
STSW02 20th March 2018
16:00 to 17:00
Fang Yao Functional regression on manifold with contamination
We propose a new perspective on functional regression with a predictor process via the concept of manifold that is intrinsically finite-dimensional and embedded in an infinite-dimensional functional space, where the predictor is contaminated with discrete/noisy measurements. By a novel method of functional local linear manifold smoothing, we achieve a polynomial rate of convergence that adapts to the intrinsic manifold dimension and the level of sampling/noise contamination with a phase transition phenomenon depending on their interplay. This is in contrast to the logarithmic convergence rate in the literature of functional nonparametric regression. We demonstrate that the proposed method enjoys favorable finite sample performance relative to commonly used methods via simulated and real data examples. (Joint with Zhenhua Lin)
STSW02 21st March 2018
09:00 to 10:00
Rebecca Willett Graph Total Variation for Inverse Problems with Highly Correlated Designs
Co-authors: Garvesh Raskutti (University of Wisconsin), Yuan Li (University of Wisconsin)

Sparse high-dimensional linear regression and inverse problems have received substantial attention over the past two decades. Much of this work assumes that explanatory variables are only mildly correlated. However, in modern applications ranging from functional MRI to genome-wide association studies, we observe highly correlated explanatory variables and associated design matrices that do not exhibit key properties (such as the restricted eigenvalue condition). In this talk, I will describe novel methods for robust sparse linear regression in these settings. Using side information about the strength of correlations among explanatory variables, we form a graph with edge weights corresponding to pairwise correlations. This graph is used to define a graph total variation regularizer that promotes similar weights for correlated explanatory variables. I will show how the graph structure encapsulated by this regularizer interacts with correlated design matrices to yield provably accurate estimates. The proposed approach outperforms standard methods in a variety of experiments on simulated and real fMRI data.

This is joint work with Yuan Li and Garvesh Raskutti.
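To make the penalty concrete, here is a small self-contained Python sketch (not the estimator or algorithm from the talk): edge weights come from pairwise column correlations, and a smoothed graph-total-variation term is added to a least-squares fit and minimised with a generic solver. All constants are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 80, 20

# Highly correlated design: adjacent columns are near-duplicates
Z = rng.normal(size=(n, p // 2))
X = np.repeat(Z, 2, axis=1) + 0.05 * rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = 1.0
y = X @ beta_true + 0.5 * rng.normal(size=n)

# Graph from side information: keep only strong pairwise correlations as edge weights
C = np.corrcoef(X, rowvar=False)
W = np.abs(C) * (np.abs(C) > 0.8)
np.fill_diagonal(W, 0.0)

lam, eps = 1.0, 1e-6
def objective(beta):
    fit = 0.5 * np.sum((y - X @ beta) ** 2)
    diffs = beta[:, None] - beta[None, :]
    # Smoothed graph total variation: sum_{ij} W_ij |beta_i - beta_j|
    gtv = 0.5 * np.sum(W * np.sqrt(diffs ** 2 + eps))
    return fit + lam * gtv

beta_hat = minimize(objective, np.zeros(p), method="L-BFGS-B").x
print(np.round(beta_hat, 2))   # correlated columns end up with similar coefficients
```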
STSW02 21st March 2018
10:00 to 11:00
Sofia Olhede Small and Large Scale Network Features
Comparing and contrasting networks is hindered by their strongly non-Euclidean structure. I will discuss how one determines “optimal” features to compare two different networks of different sparsity and size. As the topology of any complex system is key to understanding its structure and function, the result will be developed from topological ideas. Fundamentally, algebraic topology guarantees that any system represented by a network can be understood through its closed paths. The length of each path provides a notion of scale, which is vitally important in characterizing dominant modes of system behavior. Here, by combining topology with scale, we prove the existence of universal features which reveal the dominant scales of any network. We use these features to compare several canonical network types in the context of a social media discussion which evolves through the sharing of rumors, leaks and other news. Our analysis enables for the first time a universal understanding of the balance between loops and tree-like structure across network scales, and an assessment of how this balance interacts with the spreading of information online. Crucially, our results allow networks to be quantified and compared in a purely model-free way that is theoretically sound, fully automated, and inherently scalable.



STSW02 21st March 2018
11:30 to 12:30
Johannes Schmidt-hieber Statistical theory for deep neural networks with ReLU activation function
The universal approximation theorem states that neural networks are capable of approximating any continuous function up to a small error that depends on the size of the network. The expressive power of a network does not, however, guarantee that deep networks perform well on data. For that, control of the statistical estimation risk is needed. In the talk, we derive statistical theory for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential parameters being much bigger than the sample size. Interestingly, the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.

STSW02 22nd March 2018
09:00 to 10:00
Jingjing Zou Mixed Effects Model on Functional Manifolds / Sampling Directed Networks
I would like to talk about two projects.

Mixed Effects Model on Functional Manifolds (co-authors: John Aston (University of Cambridge), Lexin Li (UC Berkeley)). We propose a generalized mixed effects model to study effects of subject-specific covariates on geometric and functional features of the subjects' surfaces. Here the covariates include both time-invariant covariates which affect both the geometric and functional features, and time-varying covariates which result in longitudinal changes in the functional textures. In addition, we extend the usual mixed effects model to model the covariance between a subject's geometric deformation and functional textures on the surface.

Sampling Directed Networks (co-authors: Richard Davis (Columbia University), Gennady Samorodnitsky (Cornell University), Zhi-Li Zhang (University of Minnesota)). We propose a sampling procedure for the nodes in a network with the goal of estimating uncommon population features of the entire network. Such features might include tail behavior of the in-degree and out-degree distributions as well as their joint distribution. Our procedure is based on selecting random initial nodes and then following the path of linked nodes in a structured fashion. In this procedure, targeted nodes with desired features, such as large in-degree, will have a larger probability of being retained. In order to construct nearly unbiased estimates of the quantities of interest, weights associated with the sampled nodes must be calculated. We will illustrate this procedure and compare it with a sampling scheme based on multiple random walks on several data sets including webpage network data and Google+ social network data.
STSW02 22nd March 2018
10:00 to 11:00
Hao Chen New two-sample tests based on adjacency
Two-sample tests for multivariate data and non-Euclidean data are widely used in many fields. We study a nonparametric testing procedure that utilizes graphs representing the similarity among observations. It can be applied to any data type as long as an informative similarity measure on the sample space can be defined. Existing tests based on a similarity graph lack power either for location or for scale alternatives. A new test is proposed that utilizes a common pattern overlooked previously, and it works for both types of alternatives. The test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large data sets. Another new test statistic will also be discussed that addresses a problem with classic tests of this type under unequal sample sizes. Both tests are illustrated on an application of comparing networks under different conditions.
STSW02 22nd March 2018
11:30 to 12:30
Alexander Aue Limiting spectral distributions for a class of high-dimensional time series
This talk discusses extensions to the time series case of the Marcenko-Pastur law on limiting spectral distributions (LSDs) for the eigenvalues of high-dimensional sample covariance matrices. The main result will be on establishing a non-linear integral equation characterizing the LSD in terms of its Stieltjes transform. Intuition will be presented for the simple case of a first-order moving average time series and evidence will be provided, indicating the applicability of the result to problems involving the estimation of certain quadratic forms as they arise, for example, when dealing with the Markowitz portfolio problem. The talk is based on joint work with Haoyang Liu (Florida State) and Debashis Paul (UC Davis).
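As background (the classical i.i.d. case that the talk extends, not the time series result itself), the sketch below compares the eigenvalue histogram of a sample covariance matrix with the Marcenko-Pastur density; dimensions and the aspect ratio are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500                     # aspect ratio c = p/n = 0.25
c = p / n

X = rng.normal(size=(n, p))          # i.i.d. data: the classical Marcenko-Pastur setting
S = X.T @ X / n                      # sample covariance matrix
eigs = np.linalg.eigvalsh(S)

a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2   # support of the MP law

def mp_density(x):
    """Marcenko-Pastur density for unit-variance entries and aspect ratio c <= 1."""
    return np.sqrt(np.maximum((b - x) * (x - a), 0.0)) / (2 * np.pi * c * x)

# Compare the empirical eigenvalue histogram with the MP density, bin by bin
hist, edges = np.histogram(eigs, bins=10, range=(a, b), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.round(hist, 2))
print(np.round(mp_density(centers), 2))   # should be close to the histogram values
```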
STSW02 22nd March 2018
13:30 to 14:30
Regina Liu Fusion and Individualized Fusion Learning from Diverse Data Sources by Confidence Distribution
Inferences from different data sources can often be fused together to yield more powerful findings than those from individual sources alone. We present a new approach for fusion learning by using the so-called confidence distributions (CD). We further develop the individualized fusion learning, ‘iFusion’, for drawing efficient individualized inference by fusing the learnings from relevant data sources. This approach is robust for handling heterogeneity arising from diverse data sources, and is ideally suited for goal-directed applications such as precision medicine. In essence, iFusion strategically ‘borrows strength’ from relevant individuals to improve efficiency while retaining its inference validity. Computationally, the fusion approach here is parallel in nature and scales up well in comparison with competing approaches. The performance of the approach is demonstrated by simulation studies and risk valuation of aircraft landing data.


STSW02 22nd March 2018
14:30 to 15:30
Debashis Paul Spectral estimation for a class of high-dimensional linear processes
We present results about the limiting behavior of the empirical distribution of eigenvalues of weighted integrals of the sample periodogram for a class of high-dimensional linear processes. The processes under consideration are characterized by having simultaneously diagonalizable coefficient matrices. We make use of these asymptotic results, derived under the setting where the dimension and sample size are comparable, to formulate an estimation strategy for the distribution of eigenvalues of the coefficients of the linear process. This approach generalizes existing works on estimation of the spectrum of an unknown covariance matrix for high-dimensional i.i.d. observations.  

(Joint work with Jamshid Namdari and Alexander Aue)



STSW02 22nd March 2018
16:00 to 17:00
Eardi Lila Statistical Analysis of Functions on Surfaces, with an application to Medical Imaging
Co-author: John Aston (University of Cambridge)

In Functional Data Analysis, data are commonly assumed to be smooth functions on a fixed interval of the real line. In this work, we introduce a comprehensive framework for the analysis of functional data, whose domain is a two-dimensional manifold and the domain itself is subject to variability from sample to sample. We formulate a statistical model for such data, that we call Functions on Surfaces, which enables a joint representation of the geometric and functional aspects, and propose an associated estimation framework. We apply the proposed framework to the analysis of neuroimaging data of cortical thickness, acquired from the brains of different subjects, and thus lying on domains with different geometries.
STSW02 23rd March 2018
09:00 to 10:00
Sara Anna van de Geer A concentration interval for the Lasso
We consider the linear model and the Lasso estimator. Our goal is to provide upper and lower bounds for the prediction error that are close to each other. We assume that the active components of the vector of regression coefficients are sufficiently large in absolute value (in a sense that will be specified) and that the tuning parameter is suitably chosen. The bounds depend on so-called compatibility constants. We will present the definition of the compatibility constants and discuss their relation with restricted eigenvalues. As an example, we consider the least squares estimator with total variation penalty and present bounds with small gap. For the case of random design, we assume that the rows of the design matrix are i.i.d. copies of a Gaussian random vector. We assume that the largest eigenvalue of the covariance matrix remains bounded and establish under some mild compatibility conditions upper and lower bounds with ratio tending to one.
STSW02 23rd March 2018
10:00 to 11:00
Miguel del alamo Multiscale Bounded Variation Regularization
Co-authors: Housen Li (University of Goettingen), Axel Munk (University of Goettingen)

In nonparametric regression and inverse problems, variational methods based on bounded variation (BV) penalties are a well-known and established tool for yielding edge-preserving reconstructions, which is a desirable feature in many applications. Despite its practical success, the theory behind BV-regularization is poorly understood: most importantly, there is a lack of convergence guarantees in spatial dimension $d \geq 2$.

In this talk we present a variational estimator that combines a BV penalty and a multiscale constraint, and prove that it converges to the truth at the optimal rate. Our theoretical analysis relies on a proper analysis of the multiscale constraint, which is motivated by the statistical properties of the noise, and relates in a natural way to certain Besov spaces of negative smoothness. Further, the main novelty of our approach is the use of refined interpolation inequalities between function spaces. We also illustrate the performance of these variational estimators in simulations on signals and images.
STSW02 23rd March 2018
11:30 to 12:30
Piotr Fryzlewicz Multiscale methods and recursion in data science
The talk starts on a general note: we first attempt to define a "multiscale" method / algorithm as a recursive program acting on a dataset in a suitable way. Wavelet transformations, unbalanced wavelet transformations and binary segmentation are all examples of multiscale methods in this sense. Using the example of binary segmentation, we illustrate the benefits of the recursive formulation of multiscale algorithms from the software implementation and theoretical tractability viewpoints.

We then turn to a more specific problem and study the canonical problem of a-posteriori detection of multiple change-points in the mean of a piecewise-constant signal observed with noise. Even in this simple set-up, many publicly available state-of-the-art methods struggle for certain classes of signals. In particular, this poor performance is observed in methods that work by minimising a "fit to the data plus a penalty" criterion, the reason being that it is challenging to think of a penalty that works well over a wide range of signal classes. To overcome this issue, we propose a new approach whereby methods learn from the data as they proceed, and, as a result, operate differently for different signal classes. As an example of this approach, we revisit our earlier change-point detection algorithm, Wild Binary Segmentation, and make it data-adaptive by equipping it with a recursive mechanism for deciding "on the fly" how many sub-samples of the input data to draw, and where to draw them. This is in contrast to the original Wild Binary Segmentation, which is not recursive. We show that this significantly improves the algorithm, particularly for signals with frequent change-points.
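For orientation, here is a minimal Python sketch of plain (non-wild, non-adaptive) binary segmentation with a CUSUM statistic and an ad hoc universal-type threshold; it is only meant to show the recursive structure discussed above, not the WBS algorithm itself, and the signal and threshold are invented for the example.

```python
import numpy as np

def cusum(x):
    """CUSUM statistics for every candidate split point b = 1, ..., n-1 in a segment."""
    n = x.size
    b = np.arange(1, n)
    left_mean = np.cumsum(x)[:-1] / b
    right_mean = (np.sum(x) - np.cumsum(x)[:-1]) / (n - b)
    return np.sqrt(b * (n - b) / n) * np.abs(left_mean - right_mean)

def binary_segmentation(x, start, end, threshold, changepoints):
    """Recursively split [start, end) wherever the maximal CUSUM statistic exceeds the threshold."""
    if end - start < 2:
        return
    stats = cusum(x[start:end])
    b = int(np.argmax(stats))
    if stats[b] > threshold:
        cp = start + b + 1            # first index of the right segment
        changepoints.append(cp)
        binary_segmentation(x, start, cp, threshold, changepoints)
        binary_segmentation(x, cp, end, threshold, changepoints)

rng = np.random.default_rng(0)
mu = np.concatenate([np.zeros(100), 2 * np.ones(100), -1 * np.ones(100)])
x = mu + rng.normal(size=mu.size)

cps = []
# A conservative universal-type threshold (noise standard deviation assumed known and equal to 1)
binary_segmentation(x, 0, x.size, threshold=1.5 * np.sqrt(2 * np.log(x.size)), changepoints=cps)
print(sorted(cps))                    # expected near [100, 200]
```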

STS 27th March 2018
11:00 to 12:00
Adam Sykulski Spatiotemporal modelling and parameter estimation of anisotropic particle trajectories
Trajectories of moving objects are collected everywhere and in massive volumes. In the first part of this talk I will present a framework for stochastically modelling such trajectories in time over two-dimensional space. I will show an application to modelling fluid particles in ocean turbulence, where trajectories are typically anisotropic due to spherical dynamics. In the second part, I will discuss computationally-efficient parameter estimation for massive datasets, where we propose an important modification to the Whittle likelihood to significantly reduce bias at no additional computational cost. We extend these estimation methods to spatiotemporal trajectories, such that we can estimate parameters that reveal the nature of the anisotropic structure.
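As a reference point (the standard, uncorrected estimator rather than the debiased modification proposed in the talk), the following sketch fits an AR(1) model by minimising the Whittle likelihood built from the periodogram; the model, sample size and parameter values are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, phi_true, sigma = 1000, 0.7, 1.0

# Simulate an AR(1) process x_t = phi * x_{t-1} + eps_t
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + sigma * rng.normal()

# Periodogram at the positive Fourier frequencies (excluding 0 and Nyquist)
freqs = 2 * np.pi * np.arange(1, n // 2) / n
I = np.abs(np.fft.fft(x)[1:n // 2]) ** 2 / (2 * np.pi * n)

def whittle_neg_loglik(params):
    phi, s2 = params
    # AR(1) spectral density f(w) = s2 / (2 pi |1 - phi e^{-iw}|^2)
    f = s2 / (2 * np.pi * np.abs(1 - phi * np.exp(-1j * freqs)) ** 2)
    return np.sum(np.log(f) + I / f)

res = minimize(whittle_neg_loglik, x0=[0.0, 1.0],
               bounds=[(-0.99, 0.99), (1e-4, None)], method="L-BFGS-B")
print("estimated (phi, sigma^2):", np.round(res.x, 3))
```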



STS 29th March 2018
11:00 to 12:00
Steffen Grunewalder Compressed Empirical Measures
I will present results on compressed representations of expectation operators with a particular emphasis on expectations with respect to empirical measures. Such expectations are a cornerstone of non-parametric statistics and compressed representations are of great value when dealing with large sample sizes and computationally expensive methods. I will focus on a conditional-gradient-like algorithm to generate such representations in infinite dimensional function spaces. In particular, I will discuss extensions of classical convergence results to uniformly smooth Banach spaces (think Lp, 1 < p < ∞).


STS 3rd April 2018
11:00 to 12:00
Jon August Wellner Inference for the mode of a log-concave density: a likelihood ratio test and confidence intervals
I will discuss a likelihood ratio test for the mode of a log-concave density. The new test is based on comparison of the log-likelihoods corresponding to the unconstrained maximum likelihood estimator of a log-concave density and the constrained maximum likelihood estimator, where the constraint is that the mode of the density is fixed, say at m. The constrained estimators have many properties in common with the unconstrained estimators discussed by Walther (2001), Pal, Woodroofe, and Meyer (2007), Dümbgen and Rufibach (2009), and Balabdaoui, Rufibach and Wellner (2010), but they differ from the unconstrained estimator under the null hypothesis on n^{−1/5} neighborhoods of the mode m. Using joint limiting properties of the unconstrained and constrained estimators we show that under the null hypothesis (and strict curvature of - log f at the mode), the likelihood ratio statistic is asymptotically pivotal: that is, it converges in distribution to a limiting distribution which is free of nuisance parameters, thus playing the role of the chi-squared distribution in classical parametric statistical problems. By inverting this family of tests, we obtain new (likelihood ratio based) confidence intervals for the mode of a log-concave density f. These new intervals do not depend on any smoothing parameters. We study the new confidence intervals via Monte Carlo studies and illustrate them with several real data sets. The new confidence intervals seem to have several advantages over existing procedures.   This talk is based on joint work with Charles Doss.



STS 5th April 2018
11:00 to 12:00
Tobias Kley Sequential detection of structural changes in irregularly observed data
Online surveillance of time series is traditionally done with the aim of identifying changes in the marginal distribution under the assumption that the data between change-points is stationary and that new data is observed at constant frequency. In many situations of interest to data analysts, the classical approach is too restrictive to be used unmodified. We propose a unified system for the monitoring of structural changes in streams of data where we use generalised likelihood ratio-type statistics in the sequential testing problem, obtaining the flexibility to account for the various types of changes that are practically relevant (such as, for example, changes in the trend of the mean). The method is applicable to sequences where new observations are allowed to arrive irregularly. Early identification of changes in the trend of financial data can assist in making trading more profitable. In an empirical illustration we apply the procedure to intra-day prices of components of the NASDAQ-100 stock market index. This project is joint work with Piotr Fryzlewicz.



STS 10th April 2018
11:00 to 12:00
Heather Battey Large numbers of explanatory variables
The lasso and its variants are powerful methods for regression analysis when there are a small number of study individuals and a large number of potential explanatory variables. The result is a single model, although there may be several simple representations equally compatible with the data. I will outline a different approach, whose aim is essentially a confidence set of models. A probabilistic assessment of the method will be given.

The talk is based on joint work with David R Cox.




STS 12th April 2018
11:00 to 12:00
Georg Hahn Recent advances in quantum annealing and outlook on its potential in statistics
Since the 1970s, the potential of quantum computing has been a field of extensive research, particularly its advantages and disadvantages over classical computing. This research, however, was theoretical since physical quantum devices were unavailable. With the recent availability of the first adiabatic computers (or quantum annealers), computational mathematics and statistics (as all other computational sciences) are provided with a new means of great potential. This talk will begin with an introduction to quantum annealing and proceed with a presentation of recent advances in the field. Special focus will be given to two topics: Solving the NP-hard problem of finding cliques in a graph and the reduction of binary quadratic forms for scalable quantum annealing. Further relevant works will be discussed, especially those exploring the statistical properties of quantum annealing. To stimulate discussion, the talk will highlight future directions of research, in particular the statistical analysis of the (empirical) distribution of annealing solutions, the characterisation of classes of statistical methods allowing formulations suitable for quantum computation (and hence almost instant solutions), and the exploitation of the inherent randomness in adiabatic computing for statistical purposes.



STS 24th April 2018
11:00 to 12:00
Ian McKeague Causal trees and Wigner's semicircle law
Various aspects of standard model particle physics might be explained by a suitably rich algebra acting on itself, as suggested recently by Furey (2015). This talk discusses the statistical behavior of large causal tree diagrams that combine freely independent elements in such an algebra. It is shown that some of the familiar limiting distributions in random matrix theory (namely the Marchenko-Pastur law and Wigner's semicircle law) emerge in this setting as limits of normalized sums-over-paths of non-negative elements of the algebra assigned to the edges of the tree. These results are established in the setting of non-commutative probability. Trees with classically independent positive edge weights (random multiplicative cascades) were originally proposed by Mandelbrot as a model displaying the fractal features of turbulence. The novelty of the present approach is the use of non-commutative (free) probability to allow the edge weights to take values in an algebra. Potential applications in theoretical neuroscience related to Alan Turing's famous "Imitation Game" paper are also discussed.



STS 26th April 2018
11:00 to 12:00
Jacob Bien High-dimensional variable selection when features are sparse
It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which a large number of columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from biology (e.g., rare species) to natural language processing (e.g., rare words). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. An application to online hotel reviews demonstrates the gain in accuracy achievable by proper treatment of rare words. This is joint work with Xiaohan Yan.



STS 1st May 2018
11:00 to 12:00
Dino Sejdinovic Approximate kernel embeddings of distributions
Kernel embeddings of distributions and the Maximum Mean Discrepancy (MMD), the resulting probability metric, are useful tools for fully nonparametric hypothesis testing and for learning on distributional inputs; i.e., where labels are only observed at an aggregate level. I will give an overview of this framework and describe the use of large-scale approximations to kernel embeddings in the context of Bayesian approaches to learning on distributions and in the context of distributional covariate shift; e.g., where measurement noise on the training inputs differs from that on the testing inputs.
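For concreteness, a minimal sketch of the exact (unbiased) MMD^2 estimate with a Gaussian kernel is given below; large-scale approximations such as random Fourier features would replace the explicit kernel matrices, but are not shown here. The samples and bandwidth are invented for the example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of the squared MMD between the samples X and Y."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma); np.fill_diagonal(Kxx, 0.0)
    Kyy = rbf_kernel(Y, Y, gamma); np.fill_diagonal(Kyy, 0.0)
    Kxy = rbf_kernel(X, Y, gamma)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X  = rng.normal(0.0, 1.0, size=(200, 2))
X2 = rng.normal(0.0, 1.0, size=(200, 2))   # second sample, same distribution
Y  = rng.normal(0.5, 1.0, size=(200, 2))   # shifted distribution
print(mmd2_unbiased(X, X2, gamma=0.5))     # close to 0
print(mmd2_unbiased(X, Y, gamma=0.5))      # clearly positive
```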



STS 3rd May 2018
11:00 to 12:00
Sandipan Roy Multiple change-point estimation in high-dimensional Gaussian graphical models
We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time-series. Traditionally, estimation of Gaussian Graphical Models (GGM) is performed in an i.i.d. setting. More recently, such models have been extended to allow for changes in the distribution, but only where changepoints are known a priori. In this work, we study the Group-Fused Graphical Lasso (GFGL) which penalises partial-correlations with an L1 penalty while simultaneously inducing block-wise smoothness over time to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints and the structure of the graphical models in each segment. Several synthetic experiments and a real data application validate the performance of the proposed methodology.



STS 8th May 2018
11:00 to 12:00
Qiyang Han Least squares estimation: Beyond Gaussian regression models
We study the convergence rate of the least squares estimator (LSE) in a regression model with possibly heavy-tailed errors. Despite its importance in practical applications, theoretical understanding of this problem has been limited. We first show that from a worst-case perspective, the convergence rate of the LSE in a general non-parametric regression model is given by the maximum of the Gaussian regression rate and the noise rate induced by the errors. In the more difficult statistical model where the errors only have a second moment, we further show that the sizes of the 'localized envelopes' of the model give a sharp interpolation for the convergence rate of the LSE between the worst-case rate and the (optimal) parametric rate. These results indicate both certain positive and negative aspects of the LSE as an estimation procedure in a heavy-tailed regression setting. The key technical innovation is a new multiplier inequality that sharply controls the size of the multiplier empirical process associated with the LSE, which also finds applications in shape-restricted and sparse linear regression problems.



STS 10th May 2018
11:00 to 12:00
Tatyana Krivobokova Partial least squares for dependent data
We consider the linear and kernel partial least squares algorithms for dependent data and study the consequences of ignoring the dependence both theoretically and numerically. For the linear partial least squares estimator we derive convergence rates and show that ignoring non-stationary dependence structures can lead to inconsistent estimation. For the kernel partial least squares estimator we establish convergence rates under a source condition and an effective dimensionality condition. It is shown both theoretically and in simulations that long range dependence results in slower convergence rates. A protein dynamics example illustrates our results and shows the high predictive power of partial least squares.
This is joint work with Marco Singer, Axel Munk and Bert de Groot.




STS 15th May 2018
11:00 to 12:00
Jana Jankova Inference for eigenstructure of high-dimensional covariance matrices
Sparse principal component analysis (PCA) has become one of the most widely used techniques for dimensionality reduction in high-dimensional datasets. The main challenge underlying sparse PCA is to estimate the first vector of loadings of the population covariance matrix, provided that only a certain number of loadings are non-zero. A vast number of methods have been proposed in the literature for point estimation of the eigenstructure of the covariance matrix. In this work, we study uncertainty quantification and propose methodology for inference and hypothesis testing for individual loadings and for the largest eigenvalue of the covariance matrix. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. Our results provide theoretical guarantees for asymptotic normality of the new estimators and may be used for valid hypothesis testing and variable selection. 




STS 17th May 2018
11:00 to 12:00
Xinghao Qiao Regression with Dependent Functional Errors-in-Predictors
Functional regression is an important topic in functional data analysis. Traditionally, in functional regression, one often assumes that samples of the functional predictor are independent realizations of an underlying stochastic process, and are observed over a grid of points contaminated by independent and identically distributed measurement errors. However, in practice, the dynamic dependence across different curves may exist and the parametric assumption on the measurement error covariance structure could be unrealistic. In this paper, we consider functional linear regression with serially dependent functional predictors, when the contamination of predictors by measurement error is "genuinely functional" with fully nonparametric covariance structure. Inspired by the fact that the autocovariance operator of the observed functional predictor automatically filters out the impact from the unobservable measurement error, we propose a novel generalized-method-of-moments estimator of the slope function. The asymptotic properties of the resulting estimators under different scenarios are established. We also demonstrate that the proposed method significantly outperforms possible competitors through intensive simulation studies. Finally, the proposed method is applied to a public financial dataset, revealing some interesting findings. This is a joint work with Cheng Chen and Shaojun Guo.
STS 22nd May 2018
11:00 to 12:00
Marc Hallin Multivariate Distribution and Quantile Functions, Ranks and Signs: A measure transportation approach
Unlike the real line, the d-dimensional space R^d, for d > 1, is not canonically ordered. As a consequence, such fundamental and strongly order-related univariate concepts as quantile and distribution functions, and their empirical counterparts, involving ranks and signs, do not canonically extend to the multivariate context. Palliating that lack of a canonical ordering has remained an open problem for more than half a century, and has generated an abundant literature, motivating, among others, the development of statistical depth and copula-based methods. We show here that, unlike the many definitions that have been proposed in the literature, the measure transportation-based ones introduced in Chernozhukov et al. (2017) enjoy all the properties (distribution-freeness and preservation of semiparametric efficiency) that make univariate quantiles and ranks successful tools for semiparametric statistical inference. We therefore propose a new center-outward definition of multivariate distribution and quantile functions, along with their empirical counterparts, for which we establish a Glivenko-Cantelli result. Our approach, based on results by McCann (1995), is geometric rather than analytical and, contrary to the Monge-Kantorovich one in Chernozhukov et al. (2017) (which assumes compact supports or finite second-order moments), does not require any moment assumptions. The resulting ranks and signs are shown to be strictly distribution-free, and a maximal invariant under the action of transformations (namely, the gradients of convex functions, which thus play the role of order-preserving transformations) generating the family of absolutely continuous distributions; this, in view of a general result by Hallin and Werker (2003), implies preservation of semiparametric efficiency. As for the resulting quantiles, they are equivariant under the same transformations, which confirms the order-preserving nature of gradients of convex functions.



STS 29th May 2018
11:00 to 12:00
Victor Panaretos Amplitude and phase variation of point processes
The amplitude variation of a real random field X(t) consists in its random oscillations in its range space (the "y-axis"), typically encapsulated by its (co)variation around a mean level. In contrast, phase variation refers to fluctuations in its domain (the "x-axis"), often caused by random time changes or spatial deformations. We consider the problem of identifiably formalising similar notions for (potentially spatial) point processes, and of nonparametrically separating them based on realisations of i.i.d. copies of the phase-varying point process. The key element of our approach is the use of the theory of optimal transportation of measure, which is proven to be the natural formalism for the problem under the usual assumptions imposed. It is shown to allow the consistent separation of the two types of variation for point processes over Euclidean domains, under no parametric restrictions, including convergence rates, and even asymptotic distributions in some cases. (Based on joint work with Y. Zemel, Göttingen.)




STS 30th May 2018
10:00 to 11:00
Edward George The remarkable flexibility of BART
For the canonical regression setup where one wants to discover the relationship between Y and a p-dimensional vector x, BART (Bayesian Additive Regression Trees) approximates the conditional mean E[Y|x] with a sum of regression trees model, where each tree is constrained by a regularization prior to be a weak learner. Fitting and inference are accomplished via a scalable iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. To further illustrate the modeling flexibility of BART, we introduce two elaborations, MBART and HBART. Exploiting the potential monotonicity of E[Y|x] in components of x, MBART incorporates such monotonicity with a multivariate basis of monotone trees, thereby enabling estimation of the decomposition of E[Y|x] into its unique monotone components. To allow for the possibility of heteroscedasticity, HBART incorporates an additional product of regression trees model component for the conditional variance, thereby providing simultaneous inference about both E[Y|x] and Var[Y|x]. (This is joint research with Hugh Chipman, Matt Pratola, Rob McCulloch and Tom Shively.)




STS 31st May 2018
11:00 to 12:00
Yining Chen Narrowest-over-threshold detection of multiple change-points and change-point-like features
We propose a new, generic and flexible methodology for nonparametric function estimation, in which we first estimate the number and locations of any features that may be present in the function, and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include change-points in the piecewise-constant signal model, kinks in the piecewise-linear signal model, and other similar irregularities, which we also refer to as generalised change-points. Our methodology works with only minor modifications across a range of generalised change-point scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalised change-point detection device, termed Narrowest-Over-Threshold (NOT). NOT offers highly competitive performance when applied together with information criteria. For selected scenarios, we show the consistency and near-optimality of NOT in detecting the number and locations of generalised change-points. The computational complexity of the method, as well as its extensions, will also be discussed.

(Joint work with Rafal Baranowski and Piotr Fryzlewicz.)




STS 5th June 2018
11:00 to 12:00
Jennifer Wadsworth Spatial extremes: A conditional approach
The past decade has seen a huge effort in modelling the extremes of spatial processes. Significant challenges include the development of models with an appropriate asymptotic justification for the tail; ensuring model assumptions are compatible with the data; and the fitting of these models to (at least reasonably) high-dimensional datasets. I will review basic ideas of modelling spatial extremes, and introduce an approach based on the (multivariate) conditional extreme value model of Heffernan and Tawn (2004, JRSSB) and Heffernan and Resnick (2007, Ann. Appl. Prob.). Advantages of the conditional approach include its asymptotic motivation, flexibility, and comparatively simple inference, meaning that it can be fitted to reasonably large datasets. I will illustrate ideas by presenting parts of an application (in progress) to temperature extremes across Australia.




STS 7th June 2018
11:00 to 12:00
Timothy Cannings Classification with imperfect training labels
We study the effect of imperfect training data labels on the performance of classification methods. In a general setting, where the probability that an observation in the training dataset is mislabelled may depend on both the feature vector and the true label, we bound the excess risk of an arbitrary classifier trained with imperfect labels in terms of its excess risk for predicting a noisy label. This reveals conditions under which a classifier trained with imperfect labels remains consistent for classifying uncorrupted test data points. Furthermore, under stronger conditions, we derive detailed asymptotic properties for the popular k-nearest neighbour (k-nn), Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) classifiers. One consequence of these results is that the k-nn and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, it even turns out that in some cases, imperfect labels may improve the performance of these methods. On the other hand, the LDA classifier is shown to be typically inconsistent in the presence of label noise unless the prior probabilities of each class are equal.
STS 12th June 2018
11:00 to 12:00
Shahin Tavakoli Tests for separability in nonparametric covariance operators of random surfaces
The assumption of separability of the covariance operator for a random image or hypersurface can be of substantial use in applications, especially in situations where the accurate estimation of the full covariance structure is infeasible, either for computational reasons or due to a small sample size. However, inferential tools to verify this assumption are somewhat lacking in high-dimensional or functional settings where this assumption is most relevant. We propose here to test separability by focusing on K-dimensional projections of the difference between the covariance operator and its nonparametric separable approximation. The subspace we project onto is one generated by the eigenfunctions estimated under the separability hypothesis, negating the need to ever estimate the full non-separable covariance. We show that the rescaled difference of the sample covariance operator with its separable approximation is asymptotically Gaussian. As a by-product of this result, we derive asymptotically pivotal tests under Gaussian assumptions, and propose bootstrap methods for approximating the distribution of the test statistics when multiple eigendirections are taken into account. We probe the finite sample performance through simulation studies, and present an application to log-spectrogram images from a phonetic linguistics dataset. This is joint work with Davide Pigoli (KCL) and John Aston (Cambridge).
STS 14th June 2018
10:00 to 11:00
Patrick Rebeschini Multi-agent learning: Implicit regularization and order-optimal gossip
In distributed machine learning, data are stored and processed in multiple locations by different agents. Each agent is represented by a node in a graph, and communication is allowed between neighbours. In the decentralised setting typical of peer-to-peer networks, there is no central authority that can aggregate information from all the nodes. A typical setting involves agents cooperating with their peers to learn models that can perform better on new, unseen data. In this talk, we present the first results on the generalisation capabilities of distributed stochastic gradient descent methods. Using algorithmic stability, we derive upper bounds for the test error and provide a principled approach for implicit regularization, tuning the learning rate and the stopping time as a function of the graph topology. We also present a new Gossip protocol for the aggregation step in distributed methods that can yield order-optimal communication complexity. Based on non-reversible Markov chains, our protocol is local and does not require global routing, hence improving existing methods. (Joint work with Dominic Richards)
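As a building-block illustration (standard synchronous gossip averaging on a ring, not the order-optimal protocol of the talk), the sketch below shows local values converging to the network-wide average under a doubly stochastic mixing matrix; the topology and number of rounds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents = 20

# Ring topology: each agent averages with its two neighbours
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25           # doubly stochastic mixing matrix

local_values = rng.normal(size=n_agents)      # e.g. local gradients or estimates
true_mean = local_values.mean()

x = local_values.copy()
for _ in range(300):
    x = W @ x                                 # one synchronous gossip round
print("max deviation from the global mean:", np.max(np.abs(x - true_mean)))
```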




STS 19th June 2018
11:00 to 12:00
Yanyuan Ma A superpopulation treatment to case-control data analysis
We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) specified the rate of disease in the population (we refer to this as the true population), or (d) made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population, while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, while all other nonparametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies.




STS 22nd June 2018
10:00 to 11:00
Victor Brunel Concentration of the empirical contours of Tukey’s halfspace depth
Tukey's halfspace depth has attracted much interest in data analysis, because it is a natural way of measuring the notion of depth relative to a cloud of points or, more generally, to a probability measure. Given an i.i.d. sample, we investigate the concentration of upper level sets of the Tukey depth relative to that sample around their population version. We show that under some mild assumptions on the underlying probability measure, concentration occurs at a parametric rate and we deduce moment inequalities at that same rate. From a computational perspective, we study the concentration of a discretized version of the empirical upper level sets.
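For readers new to the notion, the following sketch approximates the empirical Tukey (halfspace) depth of a point by minimising over randomly drawn directions; this random-direction approximation is a standard device but is not the discretisation studied in the talk, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))             # the data cloud

def approx_tukey_depth(point, data, n_dirs=500, rng=rng):
    """Approximate halfspace depth: minimum, over random unit directions u,
    of the fraction of data points with <u, x> >= <u, point>."""
    dirs = rng.normal(size=(n_dirs, data.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_data = data @ dirs.T                 # shape (n, n_dirs)
    proj_point = point @ dirs.T               # shape (n_dirs,)
    return np.mean(proj_data >= proj_point, axis=0).min()

print(approx_tukey_depth(np.array([0.0, 0.0]), X))   # deep point: close to 1/2
print(approx_tukey_depth(np.array([3.0, 3.0]), X))   # outlying point: close to 0
```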




STS 22nd June 2018
16:00 to 17:00
Peter Bühlmann Rothschild Lecture: Causality, Invariance and Robustness
Is it a cause or an effect? This simple but fundamental question has a long history in science and society. Randomised studies serve as the gold standard for inferring causality but they are often expensive or even
impossible to do due to ethical reasons. Recent approaches try to "substitute in part" the randomised studies by models, algorithms and statistical methods. Perhaps surprisingly, heterogeneity in potentially
large-scale data can be beneficially exploited for causal inference and novel robustness, with wide-ranging prospects for various applications. The key idea relies on a notion of probabilistic invariance: it opens up new insights with connections to frameworks used in robust optimisation and economics.




STSW04 25th June 2018
11:00 to 11:45
Iain Johnstone Eigenstructure in high dimensional random effects models
The eigenstructure of i.i.d. samples from high dimensional data is known to show phenomena unexpected from experience with low dimensions -- eigenvalue spreading, bias and eigenvector inconsistency. Motivated by some problems in quantitative genetics, we now consider high dimensional data with more than one level of variation, and more specifically the spectral properties of estimators of the various components of variance matrices. We briefly describe bulk and edge eigenvalue distributions in this setting, and then focus more attention on properties and estimation of 'spiked' models. This is joint work with Zhou Fan and Yi Sun.
STSW04 25th June 2018
11:45 to 12:30
Philippe Rigollet Uncoupled isotonic regression via minimum Wasserstein deconvolution
Isotonic regression is a standard problem in shape constrained estimation where the goal is to estimate an unknown nondecreasing regression function $f$ from independent pairs $(x_i,y_i)$ where $\mathbb{E}[y_i]=f(x_i)$, $i=1, \ldots, n$. While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given uncoupled $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$. In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moment conditions on $y_i$ together with an efficient algorithm. Both upper and lower bounds are articulated around moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution. [Joint work with Jonathan Weed (MIT)]
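A naive baseline for the uncoupled problem (not the minimum Wasserstein deconvolution estimator of the talk) is one-dimensional quantile matching: since f is nondecreasing, pair the sorted x's with the sorted y's, which is exactly the monotone optimal transport coupling of the two empirical distributions. A sketch, with an invented f and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
f = lambda t: np.log1p(5 * t)               # unknown nondecreasing regression function

x = rng.uniform(0, 1, size=n)
y = f(x) + 0.1 * rng.normal(size=n)
y = rng.permutation(y)                      # the coupling between x and y is lost

# Quantile matching: the i-th smallest x is paired with the i-th smallest y
x_sorted = np.sort(x)
f_hat_at_sorted_x = np.sort(y)

print("mean absolute error:", np.mean(np.abs(f_hat_at_sorted_x - f(x_sorted))))
```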
STSW04 25th June 2018
14:00 to 14:45
Matthew Stephens On applications of Empirical Bayes approaches to the Normal Means problem
The normal means problem is very simple: given normally-distributed observations with known variances and unknown means, estimate the means. That is, given X_j \sim N(\theta_j, \sigma_j^2), estimate \theta_j. A key idea is that one can do better than the maximum likelihood estimates, \hat{\theta}_j = X_j, in particular by use of appropriate "shrinkage" estimators. One attractive way to perform shrinkage estimation in practice is to use Empirical Bayes methods. That is, to assume that \theta_j are independent and identically distributed from some distribution g that is to be estimated from the data. Then, given such an estimate \hat{g}, the posterior distributions of \theta_j can be computed to perform inference. We call this the "Empirical Bayes Normal Means" (EBNM) problem.

Despite its simplicity, solving the EBNM problem has a wide range of practical applications. Here we present some flexible non-parametric approaches we have recently developed for solving the EBNM problem, and describe their application to several different settings: false discovery rate (FDR) estimation, non-parametric smoothing, and sparse matrix factorization problems (i.e. sparse factor analysis and sparse principal components analysis).
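As a toy version of the EBNM recipe (with a parametric normal prior rather than the flexible non-parametric families discussed in the talk), the sketch below estimates the prior variance by marginal maximum likelihood and then shrinks each observation towards zero by its posterior mean; all constants are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 2000
s = rng.uniform(0.5, 1.5, size=n)            # known standard deviations sigma_j
theta = rng.normal(0.0, 0.8, size=n)         # unknown means (a normal prior really holds here)
X = theta + s * rng.normal(size=n)           # observations X_j ~ N(theta_j, sigma_j^2)

# Empirical Bayes step: fit the prior N(0, tau^2) by marginal maximum likelihood,
# using that marginally X_j ~ N(0, tau^2 + sigma_j^2).
def neg_marginal_loglik(log_tau2):
    v = np.exp(log_tau2) + s ** 2
    return 0.5 * np.sum(np.log(v) + X ** 2 / v)

tau2_hat = np.exp(minimize_scalar(neg_marginal_loglik, bounds=(-10, 10), method="bounded").x)

# Posterior means: shrink each X_j towards zero by a signal-to-noise factor
theta_hat = tau2_hat / (tau2_hat + s ** 2) * X

print("estimated prior variance:", round(tau2_hat, 3))
print("MSE of the MLE      :", round(np.mean((X - theta) ** 2), 3))
print("MSE of EB shrinkage  :", round(np.mean((theta_hat - theta) ** 2), 3))
```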

STSW04 25th June 2018
14:45 to 15:30
Jana Jankova Asymptotic Inference for Eigenstructure of Large Covariance Matrices
A vast number of methods have been proposed in the literature for point estimation of the eigenstructure of covariance matrices in high-dimensional settings. In this work, we study uncertainty quantification and propose methodology for inference and hypothesis testing for individual loadings of the covariance matrix. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. Our results provide theoretical guarantees on asymptotic normality of the new estimators and may be used for valid hypothesis testing and variable selection. These results are achieved under a sparsity condition relating the number of non-zero loadings, sample size, dimensionality of the covariance matrix and spectrum of the covariance matrix. This talk is based on joint work with Sara van de Geer.
STSW04 25th June 2018
16:00 to 16:45
Flori Bunea A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations. The model links all cell probabilities, collected in a p x n matrix, via the assumption that this matrix can be factorized as the product of two non-negative matrices A and W. Topic models have been originally developed in text mining, when one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in this talk. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to K-dimensional canonical basis vectors. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic set-up when K is unknown. In this talk, we take a different view on anchor word estimation, and on the estimation of the word-topic matrix A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, p, K and document length N, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We illustrate, via simulations, that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
STSW04 26th June 2018
09:00 to 09:45
Guy Bresler Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure
The prototypical high-dimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit a curious phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these gaps by reducing from conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the applicability of this approach. In this work we introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. These include tight lower bounds for Planted Independent Set, Planted Dense Subgraph, Sparse Spiked Wigner, Sparse PCA, as well as for new models that we introduce. Joint work with Matthew Brennan and Wasim Huleihel.
STSW04 26th June 2018
09:45 to 10:30
Chao Gao Reduced Isotonic Regression
Consider an $n$-dimensional vector $X$ with mean $\theta^*$. In this talk, we consider $\theta^*$ that is both nondecreasing and piecewise constant. We establish the exact minimax rate of estimating such monotone functions, and thus give a non-trivial answer to an open problem in the shape-constrained analysis literature. The minimax rate involves an interesting iterated logarithmic dependence on the dimension. We then develop a penalized least-squares procedure for estimating $\theta^*$ adaptively. This estimator is shown to achieve the derived minimax rate, in linear time, without knowledge of the number of pieces. We further allow the model to be misspecified and derive oracle inequalities with the optimal rates for the proposed estimator. This is a joint work with Fang Han and Cun-hui Zhang.
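The talk's penalized estimator is not spelled out in the abstract; as a baseline illustration, plain isotonic least squares (pool-adjacent-violators) on a piecewise-constant nondecreasing mean can be run as follows, with all sizes and noise levels chosen arbitrarily:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(2)
    n = 300
    theta = np.repeat([0.0, 1.0, 1.5, 3.0], n // 4)   # nondecreasing, 4 constant pieces
    x = theta + rng.normal(scale=0.5, size=n)

    # Isotonic least squares (pool-adjacent-violators); the talk's estimator adds a
    # penalty on the number of pieces to adapt to the piecewise-constant structure.
    fit = IsotonicRegression().fit_transform(np.arange(n), x)
    print(np.mean((fit - theta) ** 2))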
STSW04 26th June 2018
11:00 to 11:45
Ryan Tibshirani Dykstra’s Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions
We study connections between Dykstra’s algorithm for projecting onto an intersection of convex sets, the alternating direction method of multipliers (ADMM), and block coordinate descent. We prove that coordinate descent for a regularized regression problem, in which the penalty is a separable sum of support functions, is exactly equivalent to Dykstra’s algorithm applied to the dual problem. ADMM on the dual problem is also seen to be equivalent, in the special case of two sets, with one being a linear subspace. These connections, aside from being interesting in their own right, suggest new ways of analyzing and extending coordinate descent. For example, from existing convergence theory on Dykstra’s algorithm over polyhedra, we discern that coordinate descent for the lasso problem converges at an (asymptotically) linear rate. We also develop two parallel versions of coordinate descent, based on the Dykstra and ADMM connections. Finally, we discuss the implications of this work for backfitting in additive models.
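A minimal sketch of Dykstra’s algorithm itself, projecting onto the intersection of two convex sets (a box and a halfspace, chosen purely for illustration); the regression and duality connections discussed in the talk are not reproduced here:

    import numpy as np

    def proj_box(z, lo=-1.0, hi=1.0):          # projection onto a box
        return np.clip(z, lo, hi)

    def proj_halfspace(z, a, b):               # projection onto {x : a'x <= b}
        slack = a @ z - b
        return z if slack <= 0 else z - (slack / (a @ a)) * a

    def dykstra(y, a, b, iters=200):
        # Dykstra's alternating projections with correction variables p, q.
        x, p, q = y.copy(), np.zeros_like(y), np.zeros_like(y)
        for _ in range(iters):
            w = proj_box(x + p)
            p = x + p - w
            x = proj_halfspace(w + q, a, b)
            q = w + q - x
        return x

    y = np.array([2.0, -3.0, 0.5])
    a, b = np.array([1.0, 1.0, 1.0]), 1.0
    print(dykstra(y, a, b))                    # projection of y onto box ∩ halfspace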
STSW04 26th June 2018
11:45 to 12:30
Peter Bickel Two network scale challenges:Constructing and fitting hierarchical block models and fitting large block models using the mean field method
Work with S.Bhattacharyya,T.Li,E.Levina,S.Mukherjee,P.Sarkar Networks are a complex type of structure presenting itself in many applications . They are usually represented by a graph ,with possibly weighted edges plus additional covariates (such as directions).Block models have been studied for some time as basic approximations to ergodic stationary probability models for single graphs.A huge number of fitting methods have been developed for these models some of which we will touch on. The mean field method in which an increasing number of parameters must be fitted is used not only for multiple membership block models but also in applications such as LDA.if the graph is too large poor behaviour of the method can be seen.. We have developed what we call "patch methods " for fitting which help both computationally and inferentially in such situations bur much further analysis is needed. It is intuitively clear but mathematically unclear how knowledge of the model having nested scales helps in fitting large scale as opposed to small scale parameters.We will discuss this issue through an example,
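As a reference point for the models being fitted, a minimal sketch sampling a two-block stochastic block model; the block sizes and edge probabilities are illustrative only:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    z = rng.integers(0, 2, size=n)             # latent block labels
    B = np.array([[0.15, 0.03],                # within/between block edge probabilities
                  [0.03, 0.12]])

    P = B[z][:, z]                             # n x n matrix of edge probabilities
    A = rng.binomial(1, np.triu(P, 1))
    A = A + A.T                                # symmetric adjacency matrix, no self-loops
    print(A.sum() // 2, "edges")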
STSW04 26th June 2018
14:00 to 14:45
Marten Herman Wegkamp Adaptive estimation of the rank of the regression coefficient matrix
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion does not require estimation of the unknown variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. An extensive simulation study confirms our theoretical findings in both low and high dimensional settings. This is joint work with Xin Bing.
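The selection criterion itself is not given in the abstract; the sketch below only illustrates the reduced-rank fits among which a rank is selected, obtained by projecting the least-squares coefficient matrix onto the top singular directions of the fitted values (all dimensions and the noise level are placeholders):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, m, r = 200, 10, 8, 2                      # samples, covariates, responses, true rank
    C = rng.normal(size=(p, r)) @ rng.normal(size=(r, m))   # rank-r coefficient matrix
    X = rng.normal(size=(n, p))
    Y = X @ C + rng.normal(scale=0.5, size=(n, m))

    C_ols = np.linalg.lstsq(X, Y, rcond=None)[0]    # unrestricted least squares
    U, s, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)

    def reduced_rank_fit(k):                        # keep only the top-k singular directions
        Pk = Vt[:k].T @ Vt[:k]
        return C_ols @ Pk

    for k in range(1, 5):
        err = np.linalg.norm(X @ reduced_rank_fit(k) - X @ C) ** 2 / n
        print(k, round(err, 3))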
STSW04 26th June 2018
14:45 to 15:30
Maryam Fazel Competitive Online Algorithms for Budgeted Allocation with Application to Online Experiment Design
We consider an online resource allocation problem, where the goal is to maximize a function of a positive semidefinite (PSD) matrix with a scalar budget constraint. The problem data arrives online, and the algorithm needs to make an irrevocable decision at each step. Of particular interest are classic experiment design problems in the online setting, with the algorithm deciding whether to allocate budget to each experiment as new experiments become available sequentially. We analyze two classes of primal-dual algorithms and provide bounds on their competitive ratios. Our analysis relies on a smooth surrogate of the objective function that needs to satisfy a new diminishing-returns property. We then formulate a convex optimization problem to directly optimize this competitive ratio bound; this design problem can be solved offline before the data start arriving. We give examples of computing the smooth surrogate for D-optimal and A-optimal experiment design. This is joint work with Reza Eghbali and James Saunderson.
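A highly simplified sketch of the online setting: experiments arrive one at a time and an irrevocable accept/reject decision is made against a budget. The naive marginal log-det rule below is a placeholder for the primal-dual algorithms analyzed in the talk, and all parameters are illustrative:

    import numpy as np

    rng = np.random.default_rng(5)
    d, T, budget, eps = 5, 100, 20, 1e-3

    M = eps * np.eye(d)                 # information matrix (regularized so log det is finite)
    spent, chosen = 0, []
    for t in range(T):
        a = rng.normal(size=d)          # experiment arriving at time t
        gain = np.linalg.slogdet(M + np.outer(a, a))[1] - np.linalg.slogdet(M)[1]
        # naive online rule: accept if the marginal log-det gain clears a fixed threshold
        if spent < budget and gain > 0.5:
            M += np.outer(a, a)
            spent += 1
            chosen.append(t)

    print(len(chosen), "experiments selected, log det =", round(np.linalg.slogdet(M)[1], 2))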
STSW04 26th June 2018
16:00 to 16:45
Alex d'Aspremont An Approximate Shapley-Folkman Theorem.
with Thomas Kerdreux and Igor Colin. The Shapley-Folkman theorem shows that Minkowski averages of uniformly bounded sets tend to be convex when the number of terms in the sum becomes much larger than the ambient dimension. In optimization, Aubin and Ekeland (1976) show that this produces an a priori bound on the duality gap of separable nonconvex optimization problems involving finite sums. This bound is highly conservative and depends on unstable quantities, and we relax it in several directions to show that nonconvexity can have a much milder impact on finite sum minimization problems such as empirical risk minimization and multi-task classification. As a byproduct, we show a new version of Maurey's classical approximate Carathéodory lemma where we sample a significant fraction of the coefficients, without replacement.
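The convexification phenomenon behind the Shapley-Folkman theorem can already be seen in one dimension: the Minkowski average of n copies of the nonconvex set {0, 1} is the grid {0, 1/n, ..., 1}, whose largest gap relative to the convex hull [0, 1] shrinks as n grows. A toy check:

    import numpy as np

    # S = {0, 1} is a nonconvex two-point set; its convex hull is [0, 1].
    def minkowski_average(n):
        # all sums of n elements of {0, 1}, divided by n
        return np.arange(n + 1) / n

    for n in [1, 2, 5, 20]:
        pts = minkowski_average(n)
        gap = np.max(np.diff(pts))        # largest hole relative to conv(S) = [0, 1]
        print(n, "terms:", len(pts), "points, max gap:", gap)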
STSW04 27th June 2018
09:00 to 09:45
Urvashi Oswal Selection and Clustering of Correlated variables using OWL/GrOWL regularizers
In high-dimensional linear regression problems, it is likely that several covariates are highly correlated; e.g., in fMRI data, certain voxels or brain regions may have very correlated activation patterns. Using standard sparsity-inducing regularization (such as the Lasso) in such scenarios is known to be unsatisfactory, as it leads to the selection of only a subset of the highly correlated covariates. For engineering purposes and/or scientific interpretability, it is often desirable to explicitly identify all of the covariates that are relevant for modeling the data. In this talk, I will present clustering properties and error bounds of Ordered Weighted $\ell_1$ (OWL) regularization for linear regression, and a group variant for multi-task regression called Group OWL (GrOWL). I will present the application of OWL/GrOWL in three settings: (1) brain networks, (2) subspace clustering and (3) deep learning. I will demonstrate, in theory and experiments, how OWL/GrOWL deal with strongly correlated covariates by automatically clustering and averaging regression coefficients associated with those covariates. This is joint work with Robert Nowak, Mário Figueiredo, Tim Rogers and Chris Cox.
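For reference, the OWL penalty sorts the absolute coefficients and weights them with a nonincreasing sequence; a minimal evaluation with OSCAR-type weights (the weight choice and coefficient vector below are illustrative, not from the talk):

    import numpy as np

    def owl_penalty(beta, w):
        """Ordered weighted L1: sum_i w_i * |beta|_(i), with |beta|_(1) >= |beta|_(2) >= ..."""
        return np.sort(np.abs(beta))[::-1] @ w

    p = 5
    lam1, lam2 = 1.0, 0.5
    w = lam1 + lam2 * np.arange(p - 1, -1, -1)    # OSCAR weights: nonincreasing
    beta = np.array([3.0, -3.0, 0.1, 0.0, 2.0])
    print(owl_penalty(beta, w))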
STSW04 27th June 2018
09:45 to 10:30
Elizaveta Levina Matrix completion in network analysis
Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss three examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for cross-validation on network data, a long-standing problem in network analysis. Two other examples focus on reconstructing incompletely observed networks, with structured missingness resulting from network sampling mechanisms. One scenario we consider is egocentric sampling, where a set of nodes is selected first and then their connections to the entire network are observed. Another scenario focuses on data from surveys, where people are asked to name a given number of friends. We show that matrix completion, when used appropriately, can generally be very helpful in solving network problems.
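As an illustration of the generic primitive these network procedures build on (not the talk's network-specific constructions), a soft-impute style completion of a partially observed low-rank adjacency-type matrix; the sizes, missingness pattern and threshold are placeholders:

    import numpy as np

    rng = np.random.default_rng(6)
    n, k = 100, 2
    U = rng.normal(size=(n, k))
    P = 1 / (1 + np.exp(-U @ U.T))            # low-rank edge-probability matrix
    A = rng.binomial(1, P).astype(float)
    mask = rng.binomial(1, 0.5, size=(n, n)).astype(bool)   # entries actually observed

    def soft_impute(A, mask, lam=5.0, iters=50):
        Z = np.where(mask, A, 0.0)
        for _ in range(iters):
            U_, s, Vt = np.linalg.svd(np.where(mask, A, Z), full_matrices=False)
            Z = (U_ * np.maximum(s - lam, 0)) @ Vt    # singular value soft-thresholding
        return Z

    A_hat = soft_impute(A, mask)
    print("RMSE on unobserved entries:",
          round(np.sqrt(np.mean((A_hat[~mask] - P[~mask]) ** 2)), 3))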
STSW04 27th June 2018
11:00 to 11:45
Garvesh Raskutti Estimating sparse additive auto-regressive network models
Consider a multivariate time series, which may correspond to spike train responses for multiple neurons in a brain, crime event data across multiple regions, and many other settings. An important challenge associated with these time series models is to estimate an influence network between the d variables, especially when the number of variables d is large, meaning we are in the high-dimensional setting. Prior work has focused on parametric vector auto-regressive models. However, parametric approaches are somewhat restrictive in practice. In this paper, we use the non-parametric sparse additive model (SpAM) framework to address this challenge. Using a combination of beta and phi-mixing properties of Markov chains and empirical process techniques for reproducing kernel Hilbert spaces (RKHSs), we provide upper bounds on mean-squared error in terms of the sparsity s, the logarithm of the dimension $\log d$, the number of time points T, and the smoothness of the RKHSs. Our rates are sharp up to logarithmic factors in many cases. We also provide numerical experiments that support our theoretical results and display potential advantages of using our non-parametric SpAM framework for a Chicago crime dataset.
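A heavily simplified sketch of the estimation task: each coordinate is regressed on additive nonlinear features of the lagged series with an l1 penalty. The talk's estimator uses RKHS components and a mixing-based analysis, so the basis expansion and penalty below are illustrative stand-ins only:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    T, d = 500, 6
    X = np.zeros((T, d))
    for t in range(1, T):                         # node 0 influences node 1 nonlinearly
        X[t] = 0.2 * X[t - 1] + rng.normal(scale=0.5, size=d)
        X[t, 1] += np.tanh(2 * X[t - 1, 0])

    # Additive features of the lagged series (a crude stand-in for an RKHS expansion).
    lag = X[:-1]
    features = np.hstack([lag, lag ** 2, np.tanh(lag)])

    fit = Lasso(alpha=0.05).fit(features, X[1:, 1])   # which nodes drive node 1?
    influence = np.abs(fit.coef_).reshape(3, d).sum(axis=0)
    print(np.round(influence, 2))                     # mass should concentrate on nodes 0 and 1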
STSW04 27th June 2018
11:45 to 12:30
Peter Bartlett Representation, optimization and generalization properties of deep neural networks
Deep neural networks have improved the state-of-the-art performance for prediction problems across an impressive range of application areas. This talk describes some recent results in three directions. First, we investigate the impact of depth on representational properties of deep residual networks, which compute near-identity maps at each layer, showing how their representational power improves with depth and that the functional optimization landscape has the desirable property that stationary points are optimal. Second, we study the implications for optimization in deep linear networks, showing how the success of a family of gradient descent algorithms that regularize towards the identity function depends on a positivity condition of the regression function. Third, we consider how the performance of deep networks on training data compares to their predictive accuracy; we demonstrate deviation bounds that scale with a certain "spectral complexity," and we compare the behavior of these bounds with the observed performance of these networks in practical problems.

Joint work with Steve Evans, Dylan Foster, Dave Helmbold, Phil Long, and Matus Telgarsky.
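For the third direction, a minimal sketch of the leading term of the "spectral complexity" of a small random network, namely the product of per-layer spectral norms; the full quantity in the corresponding paper includes further correction factors, and the layer sizes below are arbitrary:

    import numpy as np

    rng = np.random.default_rng(8)
    layer_shapes = [(20, 10), (10, 10), (10, 1)]
    weights = [rng.normal(scale=0.3, size=s) for s in layer_shapes]

    spectral_norms = [np.linalg.norm(W, 2) for W in weights]   # largest singular values
    print("per-layer spectral norms:", np.round(spectral_norms, 2))
    print("product (leading term of spectral complexity):",
          round(float(np.prod(spectral_norms)), 3))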

STSW04 28th June 2018
09:00 to 09:45
Tong Zhang Candidates vs. Noises Estimation for Large Multi-Class Classification Problem
In practice, there has been significant interest in multi-class classification problems where the number of classes is large. Computationally, such applications require statistical methods with run time sublinear in the number of classes. A number of methods such as Noise-Contrastive Estimation (NCE) and variations have been proposed in recent years to address this problem. However, the existing methods are not statistically efficient compared to multi-class logistic regression, which is the maximum likelihood estimate. In this talk, I will describe a new method called Candidates vs. Noises Estimation (CANE) that selects a small subset of candidate classes and samples the remaining classes. We show that CANE is always consistent and computationally efficient. Moreover, the resulting estimator has low statistical variance approaching that of the maximum likelihood estimator, when the observed label belongs to the selected candidates with high probability. Extensive experimental results show that CANE achieves better prediction accuracy than a number of the state-of-the-art tree classifiers, while it gains significant speedup compared to standard multi-class logistic regression.
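A rough sketch of the candidates-plus-sampled-noise idea for a single example: the normalization is computed over a small candidate set containing the observed label plus a random sample of the remaining classes, so only a sublinear number of scores is evaluated. The reweighting and estimation details of CANE are not reproduced here, and all sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(9)
    num_classes, dim = 100_000, 32
    W = rng.normal(scale=0.1, size=(num_classes, dim))   # class weight matrix
    x = rng.normal(size=dim)                             # one example's features
    y = 42                                               # its observed label

    candidates = np.array([y, 17, 99, 1234, 58_000])     # small candidate set containing y
    noise = rng.choice(num_classes, size=50, replace=False)
    support = np.unique(np.concatenate([candidates, noise]))

    scores = W[support] @ x                              # only |support| dot products
    log_prob_y = scores[support == y][0] - np.log(np.exp(scores).sum())
    print("approximate negative log-likelihood:", round(-log_prob_y, 3))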
STSW04 28th June 2018
09:45 to 10:30
Aurore Delaigle Estimating a covariance function from fragments of functional data
Functional data are often observed only partially, in the form of fragments. In that case, the standard approaches for estimating the covariance function do not work because entire parts of the domain are completely unobserved. In previous work, Delaigle and Hall (2013, 2016) have suggested ways of estimating the covariance function, based for example on Markov assumptions. In this work we take a completely different approach which does not rely on such assumptions. We show that, using a tensor product approach, it is possible to reconstruct the covariance function using observations located only on the diagonal of its domain.
STSW04 28th June 2018
11:00 to 11:45
Francis Bach Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model. (Joint work with Loucas Pillaud-Vivien and Alessandro Rudi)
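A toy comparison of single- versus multi-pass SGD for least squares; the notion of problem hardness (eigenvalue decay and source conditions) that governs the theory is not controlled in this illustrative simulation:

    import numpy as np

    rng = np.random.default_rng(10)
    n, d = 500, 50
    X = rng.normal(size=(n, d)) * np.linspace(1.0, 0.05, d)   # decaying feature scales
    theta_star = rng.normal(size=d) / np.sqrt(d)
    y = X @ theta_star + rng.normal(scale=0.5, size=n)

    def sgd(passes, step=0.01):
        theta = np.zeros(d)
        for _ in range(passes):
            for i in rng.permutation(n):
                grad = (X[i] @ theta - y[i]) * X[i]
                theta -= step * grad
        # excess risk proxy against the true parameter
        return np.mean((X @ (theta - theta_star)) ** 2)

    for passes in [1, 5, 20]:
        print(passes, "passes:", round(sgd(passes), 4))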
STSW04 28th June 2018
11:45 to 12:30
Edward Ionides Monte Carlo adjusted profile likelihood, with applications to spatiotemporal and phylodynamic inference.
Partially observed nonlinear stochastic dynamic systems raise inference challenges. Sequential Monte Carlo (SMC) methods provide a route to accessing the likelihood function. However, despite the advantage of applicability to a wide class of nonlinear models, standard SMC methods have a limitation that they scale poorly to large systems. We present a profile likelihood approach, properly adjusted for Monte Carlo uncertainty, that enables likelihood-based inference in systems for which Monte Carlo error remains large despite stretching the limits of available computational resources. Together with state-of-the-art SMC algorithms, this technique permits effective inference on some scientific problems in panel time series analysis, spatiotemporal modeling, and inferring population dynamic models from genetic sequence data. The results presented are joint work with Carles Breto, Joonha Park, Alex Smith and Aaron King.
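A minimal bootstrap particle filter for a linear-Gaussian state-space model, returning the Monte Carlo estimate of the log-likelihood that a Monte Carlo adjusted profile is built from; the model, parameter grid and particle number are illustrative only:

    import numpy as np

    rng = np.random.default_rng(11)

    # Simulate x_t = a x_{t-1} + eps_t, y_t = x_t + eta_t.
    a, sig_x, sig_y, T = 0.8, 1.0, 0.5, 100
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.normal(scale=sig_x)
    y = x + rng.normal(scale=sig_y, size=T)

    def particle_loglik(a_val, J=500):
        """Bootstrap particle filter estimate of the log-likelihood at parameter a_val."""
        particles = np.zeros(J)
        loglik = 0.0
        for t in range(1, T):
            particles = a_val * particles + rng.normal(scale=sig_x, size=J)   # propagate
            w = np.exp(-0.5 * ((y[t] - particles) / sig_y) ** 2)              # weight
            loglik += np.log(w.mean() / (sig_y * np.sqrt(2 * np.pi)))
            particles = rng.choice(particles, size=J, p=w / w.sum())          # resample
        return loglik

    # A Monte Carlo profile evaluates this noisy log-likelihood over a grid of a_val.
    for a_val in [0.6, 0.8, 0.95]:
        print(a_val, round(particle_loglik(a_val), 1))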
STSW04 28th June 2018
14:00 to 14:45
Jinchi Lv Asymptotics of Eigenvectors and Eigenvalues for Large Structured Random Matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.
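A small simulation of the model class in question, a rank-one mean matrix plus Wigner-type noise, comparing the top sample eigenvector with the population spike; the dimension and signal strength are arbitrary:

    import numpy as np

    rng = np.random.default_rng(12)
    n, signal = 500, 30.0

    u = rng.normal(size=n)
    u /= np.linalg.norm(u)                             # population spike direction
    H = signal * np.outer(u, u)                        # rank-one mean matrix

    Z = rng.normal(size=(n, n))
    W = (Z + Z.T) / np.sqrt(2 * n)                     # Wigner-type noise
    X = H + W

    vals, vecs = np.linalg.eigh(X)
    u_hat = vecs[:, -1]                                # top sample eigenvector
    print("spiked eigenvalue:", round(vals[-1], 2),
          " |<u_hat, u>| =", round(abs(u_hat @ u), 3))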
STSW04 28th June 2018
14:45 to 15:30
Rui Castro Are there needles in a (moving) haystack? Adaptive sensing for detection and estimation of static and dynamically evolving signals
In this work we investigate the problem of testing for the presence of dynamically evolving sparse signals. This is motivated by settings where the signal of interest changes rapidly while measurements are being collected (e.g., monitoring of the electromagnetic spectrum, or detection of hot spots of a rapidly spreading disease). We model the signal as an n-dimensional vector that has s non-zero components, and at each time a fraction of the non-zero components may change their location. The statistical problem is to decide whether the signal is a zero vector or in fact has non-zero components. This decision is based on m noisy observations of individual signal components. We consider two different sensing paradigms, namely adaptive and non-adaptive sensing. For non-adaptive sensing the choice of components to measure has to be decided before the data collection process starts, while for adaptive sensing one can adjust the sensing process based on observations collected earlier. We characterize the difficulty of this detection problem in both sensing paradigms, with special interest in the speed of change of the active components. In addition we provide an adaptive sensing algorithm for this problem and contrast its performance to that of the best non-adaptive detection algorithms. Interestingly, when the number of measurements is small (on the order of n/s), there is a significant difference in performance between the two sensing paradigms, as non-adaptive sensing is unable to exploit the dynamic nature of the signal. (Based on joint work with Ervin Tánczos.)
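A minimal simulation of the signal model only (no sensing or testing): an s-sparse n-dimensional signal in which a fixed fraction of the active components changes location at every time step; all parameters are illustrative:

    import numpy as np

    rng = np.random.default_rng(13)
    n, s, T, frac_move = 1000, 20, 50, 0.2

    active = set(rng.choice(n, size=s, replace=False))
    history = []
    for t in range(T):
        # a fraction of the active components change location at each step
        movers = rng.choice(sorted(active), size=int(frac_move * s), replace=False)
        active -= set(movers)
        free = np.setdiff1d(np.arange(n), sorted(active))
        active |= set(rng.choice(free, size=len(movers), replace=False))
        signal = np.zeros(n)
        signal[sorted(active)] = 1.0
        history.append(signal)

    print("active set size at every step:", {int(h.sum()) for h in history})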
STSW04 28th June 2018
16:00 to 16:45
Pradeep Ravikumar Robust Estimation via Robust Gradient Estimation
A common assumption in the training of machine learning systems is that the data is sufficiently clean and well-behaved: there are very few or no outliers, or the distribution of the data does not have very long tails. As machine learning finds wider usage, these assumptions are increasingly indefensible. The key question then is how to perform estimation that is robust to departures from these assumptions; this question has been of classical interest, with seminal contributions due to Box, Tukey, Huber, Hampel, and several others. Loosely, there seemed to be a computation-robustness tradeoff: practical estimators lacked strong robustness guarantees, while estimators with strong robustness guarantees were computationally impractical. In our work, we provide a new class of computationally efficient estimators for risk minimization that are provably robust in a variety of settings, such as arbitrary oblivious contamination and heavy-tailed data, among others. Our workhorse is a novel robust variant of gradient descent, and we provide conditions under which our gradient descent variant provides accurate and robust estimators in any general convex risk minimization problem. These results provide some of the first computationally tractable and provably robust estimators for general statistical models. Joint work with Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan.
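A rough sketch of the robust-gradient idea for linear regression with contaminated responses, using a coordinate-wise trimmed mean of the per-sample gradients; this particular aggregator is an illustrative choice and need not coincide with the estimator analyzed in the talk:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(14)
    n, d = 500, 10
    theta_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ theta_star + rng.normal(scale=0.5, size=n)
    y[:25] += 50.0                                   # 5% of responses grossly corrupted

    def robust_gd(trim=0.1, steps=200, lr=0.1):
        theta = np.zeros(d)
        for _ in range(steps):
            grads = (X @ theta - y)[:, None] * X     # n x d matrix of per-sample gradients
            g = stats.trim_mean(grads, trim, axis=0) # coordinate-wise trimmed mean
            theta -= lr * g
        return theta

    for trim in [0.0, 0.1]:                          # trim=0.0 recovers plain gradient descent
        err = np.linalg.norm(robust_gd(trim) - theta_star)
        print("trim =", trim, " parameter error =", round(err, 3))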
STSW04 29th June 2018
09:00 to 09:45
Alexandre Tsybakov Does data interpolation contradict statistical optimality?
We exhibit estimators that interpolate the data, and yet achieve optimal rates of convergence for the problems of nonparametric regression and prediction with square loss. This curious observation goes against the usual intuition that a good statistical procedure should forego the exact fit to data in favor of a more smooth representation. A motivation for this work is the recent focus within the machine learning community on the out-of-sample performance of neural networks. These flexible models are typically trained to fit the data exactly (either in their sign or in the actual value), and yet they have good prediction behaviour. This talk is based on joint work with Misha Belkin and Sasha Rakhlin.
STSW04 29th June 2018
09:45 to 10:30
Vincent Vu Group invariance and computational sufficiency
Statistical sufficiency formalizes the notion of data reduction. In the decision theoretic interpretation, once a model is chosen all inferences should be based on a sufficient statistic. However, suppose we start with a set of methods that share a sufficient statistic rather than a specific model. Is it possible to reduce the data beyond the statistic and yet still be able to compute all of the methods? In this talk, I'll present some progress towards a theory of "computational sufficiency" and show that strong reductions _can_ be made for large classes of penalized M-estimators by exploiting hidden symmetries in the underlying optimization problems. These reductions can (1) enable efficient computation and (2) reveal hidden connections between seemingly disparate methods. As a main example, I'll show how the theory provides a surprising answer to the following question: "What do the Graphical Lasso, sparse PCA, single-linkage clustering, and L1 penalized Ising model selection all have in common?"
STSW04 29th June 2018
11:00 to 11:45
Richard Samworth Data perturbation for data science
When faced with a dataset and a problem of interest, should we propose a statistical model and use that to inform an appropriate algorithm, or dream up a potential algorithm and then seek to justify it? The former is the more traditional statistical approach, but the latter appears to be becoming more popular. I will discuss a class of algorithms that belong in the second category, namely those that involve data perturbation (e.g. subsampling, random projections, artificial noise, knockoffs,...). As examples, I will consider Complementary Pairs Stability Selection for variable selection and sparse PCA via random projections. This will involve joint work with Rajen Shah, Milana Gataric and Tengyao Wang.
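A minimal sketch of Complementary Pairs Stability Selection: the sample is repeatedly split into two disjoint halves, a base selector (here the lasso) is run on each half, and variables are ranked by selection frequency; the thresholds and error-control constants of the actual procedure are omitted, and all parameters are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(15)
    n, p = 200, 50
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = 2.0
    y = X @ beta + rng.normal(size=n)

    B = 50
    counts = np.zeros(p)
    for _ in range(B):
        perm = rng.permutation(n)
        for half in (perm[: n // 2], perm[n // 2 :]):        # complementary pair of halves
            coef = Lasso(alpha=0.2).fit(X[half], y[half]).coef_
            counts += coef != 0

    selection_frequency = counts / (2 * B)
    print(np.round(selection_frequency[:6], 2))              # true signals should be near 1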
STSW04 29th June 2018
11:45 to 12:30
Domagoj Ćevid, Peter Bühlmann Deconfounding using Spectral Transformations
High-dimensional regression methods which rely on the sparsity of the ground truth, such as the Lasso, might break down in the presence of confounding variables. If a latent variable affects both the response and the predictors, the correlation between them changes. This phenomenon can be represented as a linear model in which the sparse coefficient vector has been perturbed. We will present our work on this problem. We investigate and propose spectral transformations of the data which serve as input for the Lasso. We discuss assumptions for achieving the optimal error rate and illustrate the performance on a genomic dataset. The approach is easy to use and leads to convincing results. The talk is based on joint work with Nicolai Meinshausen.
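One example of such a spectral transformation is a trim-type transform that caps the singular values of X before running the Lasso on the transformed data; whether this matches the talk's exact proposal is an assumption, and all simulation parameters below are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(16)
    n, p, q = 300, 100, 3                        # samples, predictors, hidden confounders
    H = rng.normal(size=(n, q))                  # latent confounders
    X = H @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = 1.0
    y = X @ beta + 2.0 * (H @ rng.normal(size=q)) + rng.normal(size=n)

    def trim_transform(X):
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        scale = np.minimum(d, np.median(d)) / d  # cap singular values at their median
        return U @ np.diag(scale) @ U.T          # n x n transformation F

    F = trim_transform(X)
    naive = Lasso(alpha=0.1).fit(X, y).coef_
    deconf = Lasso(alpha=0.1).fit(F @ X, F @ y).coef_
    print("naive coefficient error:       ", round(np.linalg.norm(naive - beta), 2))
    print("deconfounded coefficient error:", round(np.linalg.norm(deconf - beta), 2))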