Videos and presentation materials from other INI events are also available.
Event | When | Speaker | Title | Presentation Material |
---|---|---|---|---|
STSW01 |
15th January 2018 10:00 to 10:45 |
Michael Jordan |
On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex
Many new theoretical challenges have arisen in the area of gradient-based
optimization for large-scale statistical data analysis, driven by the needs of
applications and the opportunities provided by new hardware and software platforms.
I discuss several related, recent results in this area: (1) a new framework
for understanding Nesterov acceleration, obtained by taking a continuous-time,
Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to
escape saddle points efficiently in nonconvex optimization, and (3) the
acceleration of Langevin diffusion.
|
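For readers unfamiliar with the continuous-time view mentioned in point (1), the following display records a standard result of Su, Boyd and Candès that the Lagrangian/Hamiltonian framework referred to above generalises; it is stated here for orientation rather than taken from the talk itself. Nesterov's accelerated gradient method can be seen as a discretisation of the ODE

$$
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\big(X(t)\big) = 0,
\qquad f\big(X(t)\big) - f(x^\star) = O\!\left(\tfrac{1}{t^2}\right),
$$

whose discretisations recover the accelerated $O(1/k^2)$ convergence rate.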
STSW01 |
15th January 2018 11:10 to 11:55 |
Rina Foygel Barber |
Robust inference with the knockoff filter
In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating a knockoff copy of each feature and using the algorithm's ability to distinguish true covariates from their knockoffs to control the number of false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, as we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL divergence type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candès and Richard Samworth.
|
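To make the selection step of the knockoff filter concrete, here is a minimal Python sketch of the data-dependent threshold applied to feature statistics $W_j$ (for example lasso coefficient differences between each feature and its knockoff copy). The function name, the toy statistics and the parameter choices are illustrative assumptions, not code from the talk.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Knockoff selection threshold for feature statistics W.

    W[j] > 0 suggests feature j is more important than its knockoff copy.
    offset=1 gives the knockoff+ variant; offset=0 gives the slightly
    more liberal variant.
    """
    ts = np.sort(np.abs(W[W != 0]))                # candidate thresholds
    for t in ts:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:                           # estimated FDP below target level
            return t
    return np.inf                                  # no threshold achieves the target

# toy usage: W could come from lasso coefficient-difference statistics
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3, 1, 10), rng.normal(0, 1, 90)])
t = knockoff_threshold(W, q=0.1)
selected = np.where(W >= t)[0]
```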
STSW01 |
15th January 2018 11:55 to 12:40 |
Jonas Peters |
The Hardness of Conditional Independence Testing
Testing for conditional independence between continuous random variables is considered to be a hard statistical problem. We provide a formalization of this statement and show that a test with correct size does not have power against any alternative. It is thus impossible to obtain any guarantee if the relationships between the random variables can be arbitrarily complex. We propose a practical test that achieves the correct size if the conditional expectations are smooth enough that they can be estimated from data. This is joint work with Rajen Shah.
|
STSW01 |
15th January 2018 14:10 to 14:55 |
Jianqing Fan |
Uniform perturbation analysis of eigenspaces and its applications to Community Detection, Ranking and Beyond
Co-authors: Emmanuel Abbe (Princeton University), Kaizheng Wang (Princeton University), Yiqiao Zhong (Princeton University) Spectral methods have been widely used for a large class of challenging problems, including top-K ranking via pairwise comparisons, community detection and factor analysis, among others. Analyses of these spectral methods require sup-norm perturbation analysis of top eigenvectors. This allows us to uniformly approximate elements in eigenvectors by linear functions of the observed random matrix that can be analyzed further. We first establish such an infinity-norm perturbation bound for top eigenvectors and apply the idea to several challenging problems such as top-K ranking, community detection, Z_2-synchronization and matrix completion. We show that the spectral methods are indeed optimal for these problems. We illustrate these methods via simulations. |
STSW01 |
15th January 2018 14:55 to 15:40 |
Nicolai Meinshausen |
Instruments, Images, and Anchors
Co-authors: Dominik Rothenhäusler, Peter Bühlmann, Christina Heinze (ETH Zurich), and Jonas Peters (Univ of Copenhagen)
I will present some ongoing work on two related projects. First, joint work with Jonas Peters, Dominik Rothenhäusler and Peter Bühlmann on anchor regression, which puts both standard regression and causal regression into a common framework and allows one to interpolate between the two in a continuous fashion. Anchor regression can be seen as a least-squares term with a causal, instrumental-variable-type penalty. The chosen penalty is shown to correspond to the expected future magnitude of interventions, and some risk bounds can be derived in the high-dimensional setting. Second, in joint work with Christina Heinze, I will show an application of related ideas to a (non-causal) setting in image classification.
|
|
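As a rough sketch of the estimator described above, in the notation of the anchor regression work of Rothenhäusler, Meinshausen, Bühlmann and Peters (a paraphrase rather than the talk's own statement): with anchor variables $A$, projection $P_A$ onto their column span and tuning parameter $\gamma \ge 0$,

$$
\hat{b}_\gamma \in \arg\min_b \; \big\| (I - P_A)(Y - Xb) \big\|_2^2 \;+\; \gamma\, \big\| P_A (Y - Xb) \big\|_2^2 ,
$$

so that $\gamma = 1$ recovers ordinary least squares while letting $\gamma \to \infty$ approaches the instrumental-variable solution, which is the continuous interpolation referred to in the abstract.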
STSW01 |
15th January 2018 16:10 to 16:55 |
Gareth Roberts |
The zig-zag and super-efficient sampling for Bayesian analysis of big data
Standard MCMC methods can scale poorly to big data settings due to the need to evaluate the likelihood at each iteration. There have been a number of approximate MCMC algorithms that use sub-sampling ideas to reduce this computational burden, but with the drawback that these algorithms no longer target the true posterior distribution. The talk will discuss a new family of Monte Carlo methods based upon a multi-dimensional version of the Zig-Zag process of Bierkens and Roberts (2016), a continuous-time piecewise deterministic Markov process. While traditional MCMC methods are reversible by construction, the Zig-Zag process offers a flexible non-reversible alternative. The dynamics of the Zig-Zag process correspond to a constant velocity model, with the velocity of the process switching at events from a point process. The rate of this point process can be related to the invariant distribution of the process. If we wish to target a given posterior distribution, then the rates need to be set equal to the gradient of the log of the posterior. Unlike traditional MCMC, the Zig-Zag process can be simulated without discretisation error, and we give conditions for the process to be ergodic. Most importantly, I will discuss two generalisations which have good scaling properties for big data: firstly a sub-sampling version of the Zig-Zag process that is an example of an exact approximate scheme; and secondly a control-variate variant of the sub-sampling idea to reduce the variance of our unbiased estimator. Very recent ergodic theory will also be described.
|
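As an illustration of the constant-velocity dynamics with point-process switching described above, here is a minimal one-dimensional Zig-Zag sketch for a standard normal target, so that $U(x) = x^2/2$ and the event times can be sampled in closed form. It is a toy illustration under those assumptions, not the multi-dimensional or sub-sampled schemes from the talk.

```python
import numpy as np

def zigzag_1d_gaussian(T=1000.0, x0=0.0, v0=1.0, seed=0):
    """Simulate the 1D Zig-Zag process targeting a standard normal, U(x) = x^2/2.

    The switching rate along a segment is lambda(s) = max(0, v * U'(x + v s))
    = max(0, v*x + s) since v^2 = 1, so the next event time is obtained by
    inverting the integrated rate against an Exp(1) draw.
    """
    rng = np.random.default_rng(seed)
    x, v, t = x0, v0, 0.0
    skeleton = [(t, x, v)]
    while t < T:
        e = rng.exponential(1.0)
        a = v * x                                   # rate at s=0 is max(0, a)
        if a >= 0:
            tau = -a + np.sqrt(a * a + 2.0 * e)     # solve a*tau + tau^2/2 = e
        else:
            tau = -a + np.sqrt(2.0 * e)             # rate is zero until s = -a
        x, v, t = x + v * tau, -v, t + tau          # move, then flip velocity
        skeleton.append((t, x, v))
    return skeleton
```

The returned switch points define a piecewise linear trajectory; time averages along that trajectory estimate expectations under the target.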
STSW01 |
16th January 2018 09:00 to 09:45 |
Andrea Montanari |
An Instability in Variational Methods for Learning Topic Models
Topic models are extremely useful to extract latent degrees of freedom from large unlabeled datasets. Variational Bayes algorithms are the approach most commonly used by practitioners to learn topic models. Their appeal lies in the promise of reducing the problem of variational inference to an optimization problem. I will show that, even within an idealized Bayesian scenario, variational methods display an instability that can lead to misleading results.
[Based on joint work with Behroz Ghorbani and Hamid Javadi]
|
STSW01 |
16th January 2018 09:45 to 10:30 |
Pradeep Ravikumar |
The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities
Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments in parametric graphical models to propose a class of non-parametric multivariate densities which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.
Joint work with Mladen Kolar, Arun Sai Suggala.
|
STSW01 |
16th January 2018 11:00 to 11:45 |
Yingying Fan |
RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Co-authors: Emre Demirkaya (University of Southern California), Gaorong Li (Beijing University of Technology), Jinchi Lv (University of Southern California) Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness for the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as sample size goes to infinity. When moving away from the ideal case, we suggest a modified model-free knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power for the knockoffs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure. |
|
STSW01 |
16th January 2018 11:45 to 12:30 |
Andreas Krause |
Learning and inference in probabilistic submodular models
I will present our work on inference and learning in discrete probabilistic models defined through submodular functions. These generalize pairwise graphical models and determinantal point processes, express natural notions such as attractiveness and repulsion, and allow one to capture richly parameterized, long-range, high-order dependencies. The key idea is to use sub- and supergradients of submodular functions, and exploit their combinatorial structure to efficiently optimize variational upper and lower bounds on the partition function. This approach allows us to perform efficient approximate inference in any probabilistic model that factorizes into log-submodular and log-supermodular potentials of arbitrary order. Our approximation is exact at the mode for log-supermodular distributions, and we provide bounds on the approximation quality of the log-partition function with respect to the curvature of the function. I will also discuss how to learn log-supermodular distributions via bi-level optimisation. In particular, we show how to compute gradients of the variational posterior, which allows integrating the models into modern deep architectures.
This talk is primarily based on joint work with Josip Djolonga
|
|
STSW01 |
16th January 2018 14:00 to 14:45 |
Quentin Berthet |
Optimal Link Prediction with Matrix Logistic Regression
We consider the problem of link prediction, based on partial observation of a large network and on covariates associated with its vertices. The generative model is formulated as matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any algorithm computable in polynomial time, under a computational complexity assumption. (Joint work with Nicolai Baldin)
|
STSW01 |
16th January 2018 14:45 to 15:30 |
Rainer von Sachs |
Selecting Groups of Variables for Prediction Problems in Chemometrics: Recent Regularization Approaches
This presentation addresses the problem of selecting important, potentially overlapping groups of predictor variables
in linear models such that the resulting model strikes a balance between interpretability and prediction performance.
This is motivated by data from the field of chemometrics where, due to correlation between predictors from different
groups (i.e. variable group “overlap”), identifying groups during model estimation is particularly challenging.
In particular, we will highlight some issues of existing methods when they are applied to high dimensional data with
overlapping groups of variables. This will be demonstrated through comparison of their optimization criteria and
their performance on simulated data.
This is joint work in progress with Rebecca Marion, ISBA, Université catholique de Louvain, Belgium.
|
|
STSW01 |
17th January 2018 09:00 to 09:45 |
Alex d'Aspremont |
Sharpness, Restart and Compressed Sensing Performance
Joint work with Vincent Roulet (University of Washington) and Nicolas Boumal (Princeton University).
We show that several classical quantities controlling compressed sensing performance directly match parameters controlling algorithmic complexity. We first describe linearly convergent restart schemes on first-order methods using a sharpness assumption. The Łojasiewicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost generically. Here, we show that sharpness directly controls the performance of restart schemes. For sparse recovery problems, sharpness at the optimum can be written as a condition number, given by the ratio between true signal sparsity and the largest signal size that can be recovered by the observation matrix. Overall, this means that in compressed sensing problems, a single parameter directly controls both computational complexity and recovery performance.
|
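For orientation, the sharpness (Łojasiewicz-type) condition alluded to above is usually written in the following generic form; the notation is generic rather than the talk's own:

$$
\mu \, d(x, X^\star)^{r} \;\le\; f(x) - f^\star \qquad \text{for all } x \text{ in a neighbourhood of the set of minimisers } X^\star,
$$

and linearly convergent restart schemes can be built around the (typically unknown) parameters $(\mu, r)$.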
STSW01 |
17th January 2018 09:45 to 10:30 |
Lorenzo Rosasco |
Optimal and efficient learning with random features
Random features approaches correspond to one hidden layer neural networks with random hidden units, and can be seen as approximate kernel methods. We study the statistical and computational properties of random features within a ridge regression scheme. We prove for the first time that a number of random features much smaller than the number of data points suffices for optimal statistical error, with a corresponding huge computational gain. We further analyze faster rates under refined conditions and the potential benefit of random features chosen according to adaptive sampling schemes.
|
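A minimal sketch of the random-features construction analysed in the talk, here with random Fourier features approximating a Gaussian kernel inside a ridge regression. Function names, the kernel choice and all parameter values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rff_ridge(X, y, Xtest, m=200, gamma=1.0, lam=1e-3, seed=0):
    """Ridge regression on m random Fourier features approximating the
    Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2).

    Using m << n features reduces the cost from O(n^3) for exact kernel ridge
    to O(n m^2); as in the talk, far fewer features than data points can
    already suffice for optimal statistical error.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))   # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=m)                   # random phases
    phi = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ W + b)     # feature map
    Phi = phi(X)
    w = np.linalg.solve(Phi.T @ Phi + lam * len(y) * np.eye(m), Phi.T @ y)
    return phi(Xtest) @ w
```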
STSW01 |
17th January 2018 11:00 to 11:45 |
Rebecca Willett |
Nonlinear Models for Matrix Completion
The past decade of research on matrix completion has shown it is possible to leverage linear dependencies to impute missing values in a low-rank matrix. However, the corresponding assumption that the data lies in or near a low-dimensional linear subspace is not always met in practice. Extending matrix completion theory and algorithms to exploit low-dimensional nonlinear structure in data will allow missing data imputation in a far richer class of problems. In this talk, I will describe several models of low-dimensional nonlinear structure and how these models can be used for matrix completion. In particular, we will explore matrix completion in the context of three different nonlinear models: single index models, in which a latent subspace model is transformed by a nonlinear mapping; unions of subspaces, in which data points lie in or near one of several subspaces; and nonlinear algebraic varieties, a polynomial generalization of classical linear subspaces. In these settings, we will explore novel and efficient algorithms for imputing missing values and new bounds on the amount of missing data that can be accurately imputed. The proposed algorithms are able to recover synthetically generated data up to predicted sample complexity bounds and outperform standard low-rank matrix completion in experiments with real recommender system and motion capture data.
|
STSW01 |
17th January 2018 11:45 to 12:30 |
Carola-Bibiane Schönlieb |
Learning from mistakes - learning optimally sparse image filters by quotient minimisation
Co-authors: Martin Benning (University of Cambridge), Guy Gilboa (Technion, Haifa), Joana Grah (University of Cambridge) Learning approaches have recently become very popular in the field of inverse problems. A large variety of methods has been established in recent years, ranging from bi-level learning to high-dimensional machine learning techniques. Most learning approaches, however, only aim at fitting parametrised models to favourable training data whilst ignoring misfit training data completely. In this talk, we follow up on the idea of learning parametrised regularisation functions by quotient minimisation. We consider one- and higher-dimensional filter functions to be learned and allow for fit- and misfit-training data consisting of multiple functions. We first present results resembling the behaviour of well-established derivative-based sparse regularisers like total variation or higher-order total variation in one dimension. Then, we introduce novel families of non-derivative-based regularisers. This is accomplished by learning favourable scales and geometric properties while at the same time avoiding unfavourable ones. |
STSW01 |
18th January 2018 09:00 to 09:45 |
Yi Yu |
Optimal Covariance Change Point Detection in High Dimension
Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University) In this paper, we study the covariance change point detection problem in high dimension. Specifically, we assume that the time series $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$ are independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and only change at certain time points. Our generic model setting allows $p$ to grow with $n$ and we do not place any additional structural assumptions on the covariance matrices. We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish the consistency results under suitable conditions. To improve the detection performance in high dimension, we propose Wild Binary Segmentation through Independent Projection (WBSIP). We show that WBSIP can optimally estimate the locations of the change points. Our analysis also reveals a phase transition effect based on our generic model assumption and to the best of our knowledge, this type of result has not been established elsewhere in the change point detection literature. |
STSW01 |
18th January 2018 09:45 to 10:30 |
Alexandre Tsybakov |
Adaptive estimation of functionals under sparsity
Adaptive estimation of functionals in the sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the L_2-norm of the target vector and of the variance of the noise. In the first problem, ignorance of the noise level changes the rates. Moreover, the form of the noise distribution also influences the optimal rate. For example, the rates of estimating the variance differ depending on whether the noise is Gaussian or sub-Gaussian without a precise knowledge of the distribution. Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, independently of how fast the polynomial decay is. Joint work with O. Collier and L. Comminges.
|
|
STSW01 |
18th January 2018 11:00 to 11:45 |
Claire Boyer |
On the gap between local recovery guarantees in structured compressed sensing and oracle estimates
This is an on-going work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby).
First we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees to ensure reconstruction of a structured sparse signal of interest while imposing structure in the acquisition (no Gaussian measurements here...).
We will study how far those CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements.
These results give an insight to design new optimal sampling strategies when realistic physical constraints are imposed in the acquisition.
|
STSW01 |
18th January 2018 11:45 to 12:30 |
Ioannis Kontoyiannis |
Small Big Data: Temporal structure in discrete time series
The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning. Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modeling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets. We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC. We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including on-line prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al’s VLMC, Ron et al’s PST, and Raftery’s MTD) will be discussed. |
STSW01 |
18th January 2018 14:00 to 14:45 |
Judith Rousseau |
Asymptotics for ABC algorithms
Approximate Bayesian Computation (ABC) algorithms are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution, a possibility is to sample new data from the model and check whether these new data are close in some sense to the true data. The output of this algorithm thus depends on how we define the notion of closeness, which is based on a choice of summary statistics and on a threshold. In this work we study the behaviour of the algorithm under the assumption that the summary statistics concentrate on some deterministic quantity, and characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated, where we show that surprising asymptotic behaviour appears.
|
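For readers new to ABC, a minimal rejection-sampling sketch of the algorithm described above (summary statistics, a distance and a threshold eps). The helper names and the toy normal-mean example are assumptions for illustration only.

```python
import numpy as np

def abc_rejection(data, prior_sampler, simulator, summary, eps, n_draws=10000, seed=0):
    """Basic rejection ABC: keep parameter draws whose simulated summaries
    fall within distance eps of the observed summaries."""
    rng = np.random.default_rng(seed)
    s_obs = summary(data)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        s_sim = summary(simulator(theta, rng))
        if np.linalg.norm(s_sim - s_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# toy example: infer a normal mean with the sample mean as summary statistic
data = np.random.default_rng(1).normal(2.0, 1.0, size=100)
post = abc_rejection(
    data,
    prior_sampler=lambda rng: rng.normal(0.0, 5.0),
    simulator=lambda th, rng: rng.normal(th, 1.0, size=100),
    summary=lambda x: np.atleast_1d(x.mean()),
    eps=0.1,
)
```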
STSW01 |
18th January 2018 14:45 to 15:30 |
Andreas Elsener |
Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators
Co-author: Sara van de Geer (ETH Zurich) Nonconvex loss functions are used in several areas of statistics and machine learning. They have several appealing properties as in the case of robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss. The penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms different from the l1-norm. |
|
STSW01 |
18th January 2018 16:00 to 16:45 |
Rajen Shah |
The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close pairs of points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for some $1 < u < 2$ depending on the strength of the interaction.
|
STSW01 |
19th January 2018 09:00 to 09:45 |
Alessandro Rudi |
Falkon: fast and optimal kernel method for large scale machine learning
Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures.
|
STSW01 |
19th January 2018 09:45 to 10:30 |
Philippe Rigollet |
Learning determinantal point processes
Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT) Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically - Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points for the expected log-likelihood. - Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity. - A fast combinatorial algorithm to implement the method of moments efficiently. The necessary background on DPPs will be given in the talk. Joint work with Victor-Emmanuel Brunel (M.I.T), Ankur Moitra (M.I.T) and John Urschel (M.I.T). |
STSW01 |
19th January 2018 11:00 to 11:45 |
Shahar Mendelson |
An optimal unrestricted learning procedure
The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence and the identity of a procedure that attains that optimal tradeoff have been studied extensively over the years. In this talk I present some of the ideas used in the recent solution of this problem: a procedure $\hat{f}$ that attains the best possible accuracy/confidence tradeoff for a triplet $(F,X,Y)$ consisting of an arbitrary class $F$, an unknown distribution $X$ and an unknown target $Y$. Specifically, I explain why under minimal assumptions on $(F,X,Y)$, there is a natural critical level $r^*$ that depends on the triplet and the sample size $N$, such that for any accuracy parameter $r>r^*$, one has $E((\hat{f}(X)-Y)^2|D) \leq \inf_{f \in F} E(f(X)-Y)^2+r^2$ with probability at least $1-2\exp(-cN \min\{1,r^2/\sigma^2\})$, where $\sigma^2=\inf_{f \in F} E(f(X)-Y)^2$. |
|
STSW01 |
19th January 2018 11:45 to 12:30 |
Garvesh Raskutti |
Variable selection and classification with large-scale presence only data
Co-author: Hyebin Song (University of Wisconsin-Madison) In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in the moderate $p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
|
STS |
23rd January 2018 11:00 to 12:00 |
Anru Zhang |
Singular Value Decomposition for High-dimensional High-order Data
High-dimensional
high-order data arise in many modern scientific applications including
genomics, brain imaging, and social science. In this talk, we consider the
methods, theories, and computations for tensor singular value decomposition
(tensor SVD), which aims to extract the hidden low-rank structure from
high-dimensional high-order data. First, comprehensive results are developed on
both the statistical and computational limits for tensor SVD under the general
scenario. This problem exhibits three different phases according to
signal-to-noise ratio (SNR), and the minimax-optimal statistical and/or
computational results are developed in each of the regimes. In addition,
we further consider the sparse tensor singular value decomposition which allows
more robust estimation under sparsity structural assumptions. A novel sparse
tensor alternating thresholding algorithm is proposed. Both the optimal
theoretical results and numerical analyses are provided to guarantee the
performance of the proposed procedure. |
|
STS |
25th January 2018 11:00 to 12:00 |
Nathalie Akakpo |
Multivariate intensity estimation via hyperbolic wavelet selection
We propose a new
statistical procedure that can in some sense overcome the curse of
dimensionality without structural assumptions on the function to estimate. It
relies on a least-squares type penalized criterion and a new collection of
models built from hyperbolic biorthogonal wavelet bases. We study its
properties in a unifying intensity estimation framework, where an oracle-type
inequality and adaptation to mixed dominating smoothness are shown to hold. We
also explain how to implement the estimator with an algorithm whose complexity
is manageable. (Reference: JMVA, Volume 161, p.32-57) |
|
STS |
30th January 2018 11:00 to 12:00 |
Chao Zheng |
Revisiting Huber’s M-Estimation: A Tuning-Free Approach
We
introduce a novel scheme to choose the scale or robustification parameter in
Huber’s method for mean estimation and linear regression in both low and high
dimensional settings, which is tuning-free. For robustly estimating the mean of
a univariate distribution, we first consider the adaptive Huber estimator with
the robustification parameter calibrated via the censored equation approach.
Our theoretical results provide finite sample guarantees for both the
estimation and calibration parts. To further reduce the computational
complexity, we next develop an alternating M-estimation procedure, which
simultaneously estimates the mean and variance in a sequential manner. This
idea can be naturally extended to regression problems in both low and high
dimensions. We provide simple and fast algorithms to implement this procedure
under various scenarios and study the numerical performance through
simulations. |
STS |
1st February 2018 11:00 to 12:00 |
Qiwei Yao |
Testing for High-dimensional White Noise
Testing for white
noise is a fundamental problem in statistical inference, as many testing
problems in linear modelling can be transformed into a white noise test.
While the celebrated Box-Pierce test and its variants are often
applied for model diagnosis, their relevance in the context
of high-dimensional modeling is not well understood, as the
asymptotic null distributions are established for fixed dimensions.
Furthermore, those tests typically lose power when the dimension of time
series is relatively large in relation to the sample size. In this talk,
we introduce two new omnibus tests for high-dimensional time series. The first method uses the maximum absolute autocorrelations and cross-correlations of the component series as the testing statistic. Based on an approximation by the L-infinity norm of a normal random vector, the critical value of the test can be evaluated by bootstrapping from a multivariate normal distribution. In contrast to the conventional white noise test, the new method is proved to be valid for testing departure from white noise that is not independent and identically distributed. The second test statistic is defined as the sum of squared singular values of the first q lagged sample autocovariance matrices. Therefore, it encapsulates all the serial correlations (up to the time lag q) within and across all component series. Using the tools from random matrix theory, we derive the normal limiting distributions when both the dimension and the sample size diverge to infinity. |
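A small sketch of the second test statistic described above, the sum of squared singular values of the first q lagged sample autocovariance matrices, computed here via the equivalent squared Frobenius norms. The function name and the normalisation by n are illustrative choices; the limiting distribution derived from random matrix theory is not reproduced here.

```python
import numpy as np

def lagged_autocov_stat(X, q):
    """Sum of squared singular values of the first q lagged sample
    autocovariance matrices of an (n x p) multivariate series X.

    Equal to sum_{k=1..q} ||Sigma_hat(k)||_F^2, which aggregates serial
    correlations up to lag q within and across all component series.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    stat = 0.0
    for k in range(1, q + 1):
        Sigma_k = Xc[k:].T @ Xc[:-k] / n     # lag-k sample autocovariance matrix
        stat += np.sum(Sigma_k ** 2)         # squared Frobenius norm
    return stat
```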
STS |
8th February 2018 11:00 to 12:00 |
Darren Wilkinson |
A compositional approach to scalable statistical modelling and computation
In statistics, and in life, we typically solve big problems by (recursively) breaking them down into smaller problems that we can solve more easily, and then compose the solutions of the smaller problems to provide a solution to the big problem that we are really interested in. This "divide and conquer" approach is necessary for the development of genuinely scalable models and algorithms. It is therefore unfortunate that statistical models and algorithms are not usually formulated in a composable way, and that the programming languages typically used for scientific and statistical computing also fail to naturally support composition of models, data and computation. The mathematical subject of category theory is in many ways the mathematical study of composition, and provides significant insight into the development of more compositional models of computation. Functional programming languages which are strongly influenced by category theory turn out to be much better suited to the development of scalable statistical algorithms than the imperative programming languages more commonly used. Expressing algorithms in a functional/categorical way is not only more elegant, concise and less error-prone, but provides numerous more tangible benefits, such as automatic parallelisation and distribution of algorithms. I will illustrate the concepts using examples such as the statistical analysis of streaming data, image analysis, numerical integration of PDEs, particle filtering, Gibbs sampling, and probabilistic programming, using concepts from category theory such as functors, monads and comonads. Illustrative code snippets will be given using the Scala programming language.
|
STS |
13th February 2018 11:00 to 12:00 |
Arlene Kim |
Adaptation in log-concave density estimation
The
log-concave maximum likelihood estimator of a density on R^d based on a sample of
size n is known to attain the minimax optimal rate of convergence up to a log
factor when d = 2 and d = 3. In this talk, I will review the univariate
adaptation result, and will present new results on adaptation properties in the
multivariate setting. This is based on joint work with Oliver Feng, Aditya
Guntuboyina and Richard Samworth |
STS |
15th February 2018 11:00 to 12:00 |
Clifford Lam |
Nonlinear Shrinkage Estimation in Quadratic Inference Function Analysis for Correlated Data
Quadratic inference
function (QIF) analysis is more efficient than the generalized estimating
equations (GEE) approach when the working covariance matrices for the data are
misspecified. Since QIF naturally requires a weighting matrix which is the
inverse of a sample covariance matrix of non-identically distributed data,
finite sample performance can be greatly affected when the number of independent
data points is not large enough, which is usually the case in cluster
randomized trials or many longitudinal studies. While nonlinear shrinkage is
very successful in regularizing the extreme eigenvalues of a sample covariance
matrix, the method is restricted to independent and identically
distributed data. We propose a novel nonlinear shrinkage approach for a sample
covariance matrix of non-identically distributed data, which improves finite
sample performance of QIF, and gives room for increasing the potential number
of working correlation structures for even better performance. Together with a
nonlinearly shrunk weighting matrix, we derive the asymptotic normality of the
parameter estimators, and give an estimator for the asymptotic covariance matrix.
We demonstrate the performance of the proposed method through simulation
experiments and a real data analysis. |
STS |
20th February 2018 11:00 to 12:00 |
Alastair Young |
A Hybrid Block Bootstrap For Sample Quantiles Under Weak Dependence
The subsampling
bootstrap and the moving blocks bootstrap provide effective methods for
nonparametric inference with weakly dependent data. Both are based on the
notion of resampling (overlapping) blocks of successive observations from a
data sample: in the former single blocks are sampled, while the latter splices
together random blocks to yield bootstrap series of the same length as the
original data sample. Here we discuss a general theory for block bootstrap
distribution estimation for sample quantiles, under mild strong mixing
assumptions. A hybrid between subsampling and the moving blocks bootstrap is
shown to give theoretical benefits, and startling improvements in accuracy in
distribution estimation in important practical settings. An intuitive procedure
for empirical selection of the optimal number of blocks and their length is
proposed. The conclusion that bootstrap samples should be of smaller size than
the original data sample has significant implications for computational
efficiency and scalability of bootstrap methodologies in dependent data
settings. This is joint work with Todd Kuffner and Stephen Lee and is described
at https://arxiv.org/abs/1710.02537. |
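A generic moving-blocks-bootstrap sketch for a sample quantile, with a bootstrap sample size m that may be smaller than n, in the spirit of the hybrid scheme discussed above. Names, defaults and the block handling are illustrative assumptions; the paper's empirical rule for choosing the number and length of blocks is not implemented here.

```python
import numpy as np

def moving_blocks_bootstrap_quantile(x, prob, block_len, m=None, n_boot=2000, seed=0):
    """Block bootstrap distribution of a sample quantile for a dependent series x.

    Overlapping blocks of length block_len are resampled with replacement and
    spliced together until the bootstrap series has length m. m = len(x)
    recovers the usual moving blocks bootstrap; m < len(x) gives a smaller
    bootstrap sample size, as in the subsampling/hybrid flavour.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    m = n if m is None else m
    starts_max = n - block_len + 1                 # number of admissible block starts
    n_blocks = int(np.ceil(m / block_len))
    out = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, starts_max, size=n_blocks)
        series = np.concatenate([x[s:s + block_len] for s in starts])[:m]
        out[b] = np.quantile(series, prob)
    return out
```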
STS |
22nd February 2018 11:00 to 12:00 |
Varun Jog |
Information-theoretic perspectives on learning algorithms
In statistical learning theory,
generalization error is used to quantify the degree to which a supervised
machine learning algorithm may overfit to training data. We overview some
recent work [Xu and Raginsky (2017)] that bounds generalization error of empirical
risk minimization based on the mutual information I(S;W) between the algorithm
input S and the algorithm output W. We leverage these results to derive
generalization error bounds for a broad class of iterative algorithms that are
characterized by bounded, noisy updates with Markovian structure, such as
stochastic gradient Langevin dynamics (SGLD). We describe certain shortcomings
of mutual information-based bounds, and propose alternate bounds that employ
the Wasserstein metric from optimal transport theory. We compare the
Wasserstein metric-based bounds with the mutual information-based bounds and
show that for a class of data generating distributions, the former leads to
stronger bounds on the generalization error.
|
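The mutual-information bound of Xu and Raginsky (2017) referred to above can be stated as follows, for an n-sample S, algorithm output W, and a loss that is sigma-sub-Gaussian under the data distribution for every fixed hypothesis:

$$
\big|\, \mathbb{E}\big[ L_\mu(W) - L_S(W) \big] \,\big| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)},
$$

where $L_\mu$ and $L_S$ denote population and empirical risk; the Wasserstein-based bounds mentioned in the abstract replace $I(S;W)$ by optimal-transport quantities.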
STS |
27th February 2018 11:00 to 12:00 |
Danning Li |
Testing covariance matrices in high dimensions
Testing covariance
structure is of significant interest in many areas of high-dimensional
inference. Using extreme-value form statistics to test against sparse
alternatives and using quadratic form statistics to test against dense alternatives
are two important testing procedures for high-dimensional independence.
However, quadratic form statistics suffer from low power against sparse
alternatives, and extreme-value form statistics suffer from low power against
dense alternatives with small disturbances. It would be important and appealing
to derive powerful testing procedures against general alternatives (either
dense or sparse), which is more realistic in real-world applications. Under the
ultra high-dimensional setting, we propose two novel testing procedures with
explicit limiting distributions to boost the power against general
alternatives. |
OFBW38 |
28th February 2018 10:00 to 10:10 |
Richard Samworth, Jane Leeks | Welcome and Introduction | |
OFBW38 |
28th February 2018 10:10 to 10:50 |
Sergio Bacallado | Making sense of Human Microbiome Data |
OFBW38 |
28th February 2018 10:50 to 11:30 |
Po-Ling Loh | Data Science for Networked Data |
OFBW38 |
28th February 2018 11:50 to 12:20 |
Chris Harbron | Delivering Better Treatments to the Right Patients Faster and More Safely : Opportunities for Big Data to Improve Pharmaceutical R&D |
OFBW38 |
28th February 2018 12:20 to 12:50 |
Nick Goldman | Genomics Research at EBI: Challenges in Statistical Scaling |
OFBW38 |
28th February 2018 12:50 to 13:00 |
Questions and Discussions | ||
OFBW38 |
28th February 2018 14:00 to 14:30 |
Andrew Richards | Future Challenges for Scaling Big Data at the National Grid |
OFBW38 |
28th February 2018 14:30 to 15:00 |
Sofia Olhede | Scalable Inference for Complex Environmental Data |
OFBW38 |
28th February 2018 15:00 to 15:10 |
Questions and Discussions | ||
OFBW38 |
28th February 2018 15:30 to 16:00 |
Dave Yearling | Big Data Challenges in the Performance of Large IP Networks |
OFBW38 |
28th February 2018 16:00 to 16:30 |
GCHQ Speaker | Taking Big Data Upstream: Trading Storage for Approximation | |
OFBW38 |
28th February 2018 16:30 to 16:50 |
Questions, Discussion and Wrap-up | ||
STS |
1st March 2018 11:00 to 12:00 |
Patrick Rubin-Delanchy |
Progress on the connection between spectral embedding and network models used by the probability, statistics and machine-learning communities
In this talk, I give theoretical and methodological results,
based on work spanning Johns Hopkins, the Heilbronn Institute for Mathematical
Research, Imperial and Bristol, regarding the connection between various graph
spectral methods and commonly used network models which are popular in the
probability, statistics and machine-learning communities. An attractive feature
of the results is that they lead to very simple take-home messages for network
data analysis: a) when using spectral embedding, consider eigenvectors from
both ends of the spectrum; b) when implementing spectral clustering, use
Gaussian mixture models, not k-means; c) when interpreting spectral embedding,
think of "mixtures of behaviour" rather than "distance".
Results are illustrated with cyber-security applications.
|
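A compact sketch matching take-home messages (a) and (b) above: adjacency spectral embedding using the eigenpairs of largest magnitude from both ends of the spectrum, followed by Gaussian mixture model clustering rather than k-means. The scaling by the square root of the absolute eigenvalues and the other defaults are common conventions assumed here, not details taken from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric adjacency matrix using the d eigenpairs of largest
    magnitude (from both ends of the spectrum), scaled by sqrt(|eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]          # largest |eigenvalues| first
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def spectral_cluster(A, d, k, seed=0):
    """Cluster nodes by fitting a Gaussian mixture model in the embedding space."""
    X = adjacency_spectral_embedding(A, d)
    return GaussianMixture(n_components=k, random_state=seed).fit_predict(X)
```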
STS |
6th March 2018 11:00 to 12:00 |
Rahul Mazumder |
New directions in solving structured nonconvex problems in multivariate statistics
Nonconvex problems arise
frequently in modern applied statistics and machine learning, posing
outstanding challenges from a computational and statistical viewpoint.
Continuous, especially convex, optimization has played a key role in our
computational understanding of (relaxations or approximations of) these
problems. However, some other well-grounded techniques in mathematical
optimization (for example, mixed integer optimization) have not been explored
to their fullest potential. When the underlying statistical problem becomes
difficult, simple convex relaxations and/or greedy methods have shortcomings.
Fortunately, many of these can be ameliorated by using estimators that can be
posed as solutions to structured discrete optimization problems. To this end, I
will demonstrate how techniques in modern computational mathematical
optimization (especially, discrete optimization) can be used to address the canonical
problem of best-subset selection and cousins. I will describe how recent
algorithms based on local combinatorial optimization can lead to high quality
solutions in times comparable to (or even faster than) the fastest algorithms
based on L1-regularization. I will also discuss the relatively less understood
low Signal to Noise ratio regime, where usual subset selection performs
unfavorably from a statistical viewpoint; and propose simple alternatives
that rely on nonconvex optimization. If time permits, I will outline problems
arising in the context of robust statistics (least median squares/least trimmed
squares), low-rank factor analysis and nonparametric function estimation where
these techniques seem to be promising.
|
STS |
8th March 2018 11:00 to 12:00 |
Tengyao Wang |
Isotonic regression in general dimensions
We study the least squares
regression function estimator over the class of real-valued functions on
[0,1]^d that are increasing in each coordinate. For uniformly bounded
signals and with a fixed, cubic lattice design, we establish that the estimator
achieves the minimax rate of order n^{-min(2/(d+2),1/d)} in the empirical L_2
loss, up to poly-logarithmic factors. Further, we prove a sharp oracle
inequality, which reveals in particular that when the true regression function
is piecewise constant on $k$ hyperrectangles, the least squares estimator
enjoys a faster, adaptive rate of convergence of (k/n)^{min(1,2/d)}, again up
to poly-logarithmic factors. Previous results are confined to the case
d≤2. Finally, we establish corresponding bounds (which are new even in
the case d=2) in the more challenging random design setting. There are
two surprising features of these results: first, they demonstrate that it is
possible for a global empirical risk minimisation procedure to be rate optimal
up to poly-logarithmic factors even when the corresponding entropy integral for
the function class diverges rapidly; second, they indicate that the adaptation
rate for shape-constrained estimators can be strictly worse than the parametric
rate.
|
|
STS |
13th March 2018 11:00 to 12:00 |
Regina Liu |
i-Fusion: Individualized Fusion Learning
Inferences from different data sources can often be fused together to
yield more powerful findings than those from individual sources alone. We
present a new fusion learning approach, called ‘i-Fusion’, for drawing
efficient individualized inference by fusing the learnings from relevant data
sources. i-Fusion is robust for handling heterogeneity arising from diverse
data sources, and is ideally suited for goal-directed applications such as
precision medicine. Specifically, i-Fusion summarizes individual inferences as
confidence distributions (CDs), adaptively forms a clique of individuals that
bear relevance to the target individual, and then suitably combines the CDs
from those relevant individuals in the clique to draw inference for the target
individual. In essence, i-Fusion strategically ‘borrows strength’ from relevant
individuals to improve efficiency while retaining its inference validity.
Computationally, i-Fusion is parallel in nature and scales up well in
comparison with competing approaches. The performance of the approach is
demonstrated by simulations and real data applications. This is joint work with Jieli Shen and Minge Xie, Rutgers University. |
|
STS |
15th March 2018 11:00 to 12:00 |
Mihaela van der Schaar |
Causal Inference for Treatment Effects: A Theory and Associated Learning Algorithms
We investigate the problem of estimating the causal effect of a
treatment on individual subjects from observational data; this is a
central problem in various application domains, including healthcare, social
sciences, and online advertising. We first develop a theoretical foundation of
causal inference for individualized treatment effects based on information
theory. Next, we use this theory to construct an information-optimal Bayesian
causal inference algorithm. This algorithm embeds the potential outcomes
in a vector-valued reproducing kernel Hilbert space and uses a multi-task
Gaussian process prior over that space to infer the individualized causal
effects. We show that our algorithm significantly outperforms the
state-of-the-art causal inference algorithms. The talk will conclude with a
discussion of the impact of this work on precision medicine and clinical
trials. |
STSW02 |
19th March 2018 10:00 to 11:00 |
Victor Panaretos |
Procrustes Analysis of Covariance Operators and Optimal Transport of Gaussian Processes
Covariance operators are fundamental in functional data analysis, providing the canonical means to analyse functional variation via the celebrated Karhunen-Loève expansion. These operators may themselves be subject to variation, for instance in contexts where multiple functional populations are to be compared. Statistical techniques to analyse such variation are intimately linked with the choice of metric on covariance operators, and the intrinsic infinite-dimensionality of these operators. We will describe the manifold-like geometry of the space of trace-class infinite-dimensional covariance operators and associated key statistical properties, under the recently proposed infinite-dimensional version of the Procrustes metric. In particular, we will identify this space with that of centred Gaussian processes equipped with the Wasserstein metric of optimal transportation. The identification allows us to provide a description of those aspects of the geometry that are important in terms of statistical inference, and establish key properties of the Fréchet mean of a random sample of covariances, as well as generative models that are canonical for such metrics. The latter will allow us to draw connections with the problem of registration of warped functional data. Based on joint work with V. Masarotto (EPFL) and Y. Zemel (Göttingen).
|
|
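For reference, the Procrustes metric discussed above coincides on covariance operators $\Sigma_1, \Sigma_2$ with the Bures-Wasserstein distance, which in the finite-dimensional or trace-class case can be written

$$
d^2(\Sigma_1, \Sigma_2) \;=\; \operatorname{tr}\Sigma_1 + \operatorname{tr}\Sigma_2 - 2\,\operatorname{tr}\Big( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \Big)^{1/2},
$$

i.e. the squared 2-Wasserstein distance between centred Gaussian measures with these covariances; this is the identification referred to in the abstract.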
STSW02 |
19th March 2018 11:30 to 12:30 |
Jim Ramsay |
Dynamic Smoothing Meets Gravity
authors: Jim Ramsay, Michelle Carey and Juan Li
institutions: McGill University, University College Dublin, and McGill University
Systems of differential equations are often used to model buffering processes that modulate a non-smooth high-energy input so as to produce an output that is smooth and that distributes the energy load over time and space. Handwriting is buffered in this way. We show that the smooth complex script that spells "statistics" in Chinese can be represented as a buffered version of a series of 46 equal-interval step inputs. The buffer consists of three undamped oscillating springs, one for each orthogonal coordinate. The periods of oscillation vary slightly over the three coordinates in a way that reflects the masses that are moved by muscle activations. Our analyses of data on juggling three balls and on lip motion during speech confirm that this model works for a wide variety of human motions.
We use the term "dynamic smoothing" for the estimation of a structured input functional object along with the buffer characteristics.
|
STSW02 |
19th March 2018 13:30 to 14:30 |
Bodhisattva Sen |
Adaptive Confidence Bands for Shape-Restricted Regression in Multi-dimension using Multiscale Tests
Co-author: Pratyay Datta (Columbia University) We consider a multidimensional (continuous) Gaussian white noise regression model. We define suitable multiscale tests in this scenario. Utilizing these tests we construct confidence bands for the underlying regression function with guaranteed coverage probability, assuming that the underlying function is isotonic or convex. These confidence bands are shown to be asymptotically optimal in an appropriate sense. Computational issues will also be discussed. |
STSW02 |
19th March 2018 14:30 to 15:30 |
Richard Samworth |
Isotonic regression in general dimensions
Co-authors: Qiyang Han (University of Washington), Tengyao Wang (University of Cambridge), Sabyasachi Chatterjee (University of Illinois) We study the least squares regression function estimator over the class of real-valued functions on $[0,1]^d$ that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order $n^{−min\{2/(d+2),1/d\}}$ in the empirical $L_2$ loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on $k$ hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of $(k/n)^{min(1,2/d)}$, again up to poly-logarithmic factors. Previous results are confined to the case $d\leq 2$. Finally, we establish corresponding bounds (which are new even in the case $d=2$) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate. |
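For the simplest case d = 1 of the estimator studied above, the isotonic least squares fit can be computed by pool-adjacent-violators; a small illustrative sketch follows, with toy data and the scikit-learn implementation assumed available.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)                  # increasing signal plus noise
fit = IsotonicRegression(increasing=True).fit_transform(x, y)    # pool-adjacent-violators fit
```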
STSW02 |
19th March 2018 16:00 to 17:00 |
Rainer von Sachs |
Intrinsic wavelet regression for curves and surfaces of Hermitian positive definite matrices
Co-author: Joris Chau (ISBA, UC Louvain) In multivariate time series analysis, non-degenerate autocovariance and spectral density matrices are necessarily Hermitian and positive definite. An estimation methodology which preserves these properties is developed based on intrinsic wavelet transforms being applied to nonparametric wavelet regression for curves in the non-Euclidean space of Hermitian positive definite matrices. Via intrinsic average-interpolation in a Riemannian manifold equipped with a natural invariant Riemannian metric, we derive the wavelet coefficient decay and linear wavelet thresholding convergence rates of intrinsically smooth curves. Applying this more specifically to nonparametric spectral density estimation, an important property of the intrinsic linear or nonlinear wavelet spectral estimator under the invariant Riemannian metric is that it is independent of the choice of coordinate system of the time series, in contrast to most existing approaches. As a generalisation of this one-dimensional denoising of matrix-valued curves in the Riemannian manifold we also present higher-dimensional intrinsic wavelet transforms, applied in particular to time-varying spectral estimation of non-stationary multivariate time series, i.e. surfaces of Hermitian positive definite matrices.
|
|
STSW02 |
20th March 2018 09:00 to 10:00 |
Robert Nowak |
Learning Low-Dimensional Metrics
This talk discusses the problem of learning a low-dimensional Euclidean metric from distance comparisons. Specifically, consider a set of n items with high-dimensional features and suppose we are given a set of (possibly noisy) distance comparisons of the form sign(dist(x,y) − dist(x,z)), where x, y, and z are the features associated with three such items. The goal is to learn the distance function that generates such comparisons. The talk focuses on several key issues pertaining to the theoretical foundations of metric learning: 1) optimization methods for learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) upper and lower (minimax) bounds on prediction error; 3) quantification of the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) bounds on the accuracy of the learned metric relative to the underlying true generative metric. Our results involve novel mathematical approaches to the metric learning problem and shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling).
This is joint work with Lalit Jain and Blake Mason.
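As a rough illustration of the triplet-comparison setup above, the following Python sketch generates comparisons sign(dist(x,y) - dist(x,z)) under a low-rank ground-truth metric and recovers a metric by projected gradient descent on a logistic loss. All dimensions, step sizes and noise levels are invented for the example; this is not the speaker's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 10, 2                     # items, feature dimension, true metric rank

# Ground-truth low-rank metric M* = L L^T and item features.
L_true = rng.normal(size=(p, r))
M_true = L_true @ L_true.T
X = rng.normal(size=(n, p))

def sq_dist(M, a, b):
    """Squared Mahalanobis-type distances under metric M, row by row."""
    d = a - b
    return np.einsum('ij,jk,ik->i', d, M, d)

# Noisy triplet comparisons: is item y or item z closer to item x under M*?
T = 5000
trip = rng.integers(0, n, size=(T, 3))
margin_true = (sq_dist(M_true, X[trip[:, 0]], X[trip[:, 2]])
               - sq_dist(M_true, X[trip[:, 0]], X[trip[:, 1]]))
labels = np.sign(margin_true + 0.1 * rng.normal(size=T))   # +1 means y closer than z

# Learn M by projected gradient descent on a logistic loss, projecting
# onto the positive semidefinite cone after every step.
M = np.eye(p)
step = 1e-3
dxz = X[trip[:, 0]] - X[trip[:, 2]]
dxy = X[trip[:, 0]] - X[trip[:, 1]]
for _ in range(300):
    margin = np.einsum('ij,jk,ik->i', dxz, M, dxz) - np.einsum('ij,jk,ik->i', dxy, M, dxy)
    w = -labels / (1.0 + np.exp(labels * margin))           # d(logistic loss)/d(margin)
    G = ((dxz * w[:, None]).T @ dxz - (dxy * w[:, None]).T @ dxy) / T
    M -= step * G
    vals, vecs = np.linalg.eigh(M)                          # PSD projection
    M = (vecs * np.clip(vals, 0, None)) @ vecs.T

final_margin = (sq_dist(M, X[trip[:, 0]], X[trip[:, 2]])
                - sq_dist(M, X[trip[:, 0]], X[trip[:, 1]]))
print("training comparison accuracy:", np.mean(np.sign(final_margin) == labels))
```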
|
![]() |
STSW02 |
20th March 2018 10:00 to 11:00 |
Matthew Reimherr |
Manifold Data Analysis with Applications to High-Resolution 3D Imaging
Many scientific areas are faced with the challenge of extracting information from large, complex, and highly structured data sets. A great deal of modern statistical work focuses on developing tools for handling such data. In this work we present a new subfield of functional data analysis, FDA, which we call Manifold Data Analysis, or MDA. MDA is concerned with the statistical analysis of samples where one or more variables measured on each unit is a manifold, thus resulting in as many manifolds as we have units. We propose a framework that converts manifolds into functional objects, an efficient 2-step functional principal component method, and a manifold-on-scalar regression model. This work is motivated by an anthropological application involving 3D facial imaging data, which is discussed extensively throughout. The proposed framework is used to understand how individual characteristics, such as age and genetic ancestry, influence the shape of the human face.
|
![]() |
STSW02 |
20th March 2018 11:30 to 12:30 |
Davide Pigoli |
Speech as object data: exploring cross-linguistic changes in Romance languages
Exploring phonetic change between languages is of particular importance in the understanding of the history and geographical spread of languages. While many studies have considered differences in textual form or in phonetic transcription, it is somewhat more difficult to analyse speech recordings in this manner, although this is usually the dominant mode of transmission.
Here, we propose a novel approach to explore phonetic changes, using log-spectrograms of speech recordings. After pre-processing the data to remove inherent individual differences, we identify time and frequency covariance functions as a feature of the language; in contrast, the mean depends mostly on the word that has been uttered. We use these means and covariances to postulate paths between languages, and we illustrate some preliminary results obtained when the model is applied to recordings of speakers of a few Romance languages.
This is part of a joint work with P.Z. Hadjipantelis, J.S. Coleman and J.A.D. Aston.
|
![]() |
STSW02 |
20th March 2018 13:30 to 14:30 |
Michelle Carey |
Uncertainty quantification for Geo-spatial process
Co-author: James Ramsay Geo-spatial data are observations of a process that are collected together with reference to their geographical location. This type of data is abundant in many scientific fields; some examples include: population census, social and demographic (health, justice, education), economic (business surveys, trade, transport, tourism, agriculture, etc.) and environmental (atmospheric and oceanographic) data. They are often distributed over irregularly shaped spatial domains with complex boundaries and interior holes. Modelling approaches must account for the spatial dependence over these irregular domains as well as describe their temporal evolution. Dynamic systems modelling has a huge potential in statistics, as evidenced by the amount of activity in functional data analysis. Many seemingly complex forms of functional variation can be more simply represented as a set of differential equations, either ordinary or partial. In this talk, I will present a class of semiparametric regression models with differential regularization in the form of PDEs. This methodology will be called Data2PDE, "Data to Partial Differential Equations". Data2PDE characterizes spatial processes that evolve over complex geometries in the presence of uncertain, incomplete and often noisy observations and prior knowledge regarding the physical principles of the process characterized by a PDE. |
![]() |
STSW02 |
20th March 2018 14:30 to 15:30 |
Ian Dryden |
Object Data Driven Discovery
Object data analysis is an important tool in the many disciplines where the data have much richer structure than the usual numbers or vectors. An initial question to ask is: what are the most basic data units, i.e. what are the atoms of the data? We give an introduction to this topic,
where the statistical analysis of object data has a wide variety of applications. An important aspect of the analysis is to reduce the dimension to a small number of key features while respecting the geometry of the manifold in which objects lie. Three case studies are given which exemplify the types of issues that are encountered: i) describing changes in variability in damaged DNA, ii) testing for geometrical differences in carotid arteries, where patients are at high or low risk of aneurysm, iii) clustering enzymes observed over time. In all three applications the structure of the data manifolds is important, in particular the manifold of covariance matrices, unlabelled size-and-shape space and shape space.
|
![]() |
STSW02 |
20th March 2018 16:00 to 17:00 |
Fang Yao |
Functional regression on manifold with contamination
We propose a new perspective on functional regression with a predictor process via the concept of manifold that is intrinsically finite-dimensional and embedded in an infinite-dimensional functional space, where the predictor is contaminated with discrete/noisy measurements. By a novel method of functional local linear manifold smoothing, we achieve a polynomial rate of convergence that adapts to the intrinsic manifold dimension and the level of sampling/noise contamination with a phase transition phenomenon depending on their interplay. This is in contrast to the logarithmic convergence rate in the literature of functional nonparametric regression. We demonstrate that the proposed method enjoys favorable finite sample performance relative to commonly used methods via simulated and real data examples.
(Joint with Zhenhua Lin)
|
|
STSW02 |
21st March 2018 09:00 to 10:00 |
Rebecca Willett |
Graph Total Variation for Inverse Problems with Highly Correlated Designs
Co-authors: Garvesh Raskutti (University of Wisconsin), Yuan Li (University of Wisconsin) Sparse high-dimensional linear regression and inverse problems have received substantial attention over the past two decades. Much of this work assumes that explanatory variables are only mildly correlated. However, in modern applications ranging from functional MRI to genome-wide association studies, we observe highly correlated explanatory variables and associated design matrices that do not exhibit key properties (such as the restricted eigenvalue condition). In this talk, I will describe novel methods for robust sparse linear regression in these settings. Using side information about the strength of correlations among explanatory variables, we form a graph with edge weights corresponding to pairwise correlations. This graph is used to define a graph total variation regularizer that promotes similar weights for correlated explanatory variables. I will show how the graph structure encapsulated by this regularizer interacts with correlated design matrices to yield provably accurate estimates. The proposed approach outperforms standard methods in a variety of experiments on simulated and real fMRI data. This is joint work with Yuan Li and Garvesh Raskutti. |
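A small sketch of the graph-total-variation idea described above, written with the generic convex solver cvxpy rather than the authors' algorithm; the simulated design, the correlation threshold of 0.7 and the penalty level lam are invented for the toy example and would need tuning in practice.

```python
import numpy as np
import cvxpy as cp   # generic convex solver, used here purely for illustration

rng = np.random.default_rng(1)
n, p = 100, 40

# Highly correlated design: features come in groups that are noisy copies
# of a few latent variables, so pairwise correlations within a group are high.
latent = rng.normal(size=(n, 4))
groups = np.repeat(np.arange(4), p // 4)
X = latent[:, groups] + 0.1 * rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[groups == 0] = 1.0          # one whole correlated group is active
y = X @ beta_true + rng.normal(size=n)

# Edge weights = pairwise correlations (side information); keep strong edges only.
C = np.corrcoef(X, rowvar=False)
edges = [(i, j, abs(C[i, j])) for i in range(p) for j in range(i + 1, p)
         if abs(C[i, j]) > 0.7]

# Graph-total-variation penalty: sum of w_ij |beta_i - beta_j| over strong edges.
D = np.zeros((len(edges), p))
for k, (i, j, w) in enumerate(edges):
    D[k, i], D[k, j] = w, -w

beta = cp.Variable(p)
lam = 2.0                              # assumed value; choose by cross-validation
objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.norm1(D @ beta))
cp.Problem(objective).solve()
print("estimation error:", np.linalg.norm(beta.value - beta_true))
```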
![]() |
STSW02 |
21st March 2018 10:00 to 11:00 |
Sofia Olhede |
Small and Large Scale Network Features
Comparing and contrasting networks is hindered by their
strongly non-Euclidean structure. I will discuss how one determines “optimal”
features to compare two different networks of different sparsity and size. As the topology of any complex system is key to
understanding its structure and function, the result will be developed from
topological ideas. Fundamentally, algebraic topology guarantees that any system
represented by a network can be understood through its closed paths. The length
of each path provides a notion of scale, which is vitally important in
characterizing dominant modes of system behavior. Here, by combining topology
with scale, we prove the existence of universal features which reveal the
dominant scales of any network. We use these features to compare several
canonical network types in the context of a social media discussion which
evolves through the sharing of rumors, leaks and other news. Our analysis
enables for the first time a universal understanding of the balance between
loops and tree-like structure across network scales, and an assessment of how
this balance interacts with the spreading of information online. Crucially, our
results allow networks to be quantified and compared in a purely model-free way
that is theoretically sound, fully automated, and inherently scalable.
|
|
STSW02 |
21st March 2018 11:30 to 12:30 |
Johannes Schmidt-Hieber |
Statistical theory for deep neural networks with ReLU activation function
The universal approximation theorem states that neural networks are capable of approximating any continuous function up to a small error that depends on the size of the network. The expressive power of a network does, however, not guarantee that deep networks perform well on data. For that, control of the statistical estimation risk is needed. In the talk, we derive statistical theory for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential parameters being much bigger than the sample size. Interestingly, the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.
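To make the estimator class concrete, here is a toy fit of a ReLU network to data from a composition-structured regression function, using scikit-learn's MLPRegressor. The architecture and penalty are ad hoc, and the explicit sparsity constraint from the theory is not imposed, so this is only a rough illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, d = 2000, 5

# Regression function with a composition structure: it depends on the inputs
# only through a few low-dimensional summaries, so the effective dimension is low.
X = rng.uniform(size=(n, d))
f = lambda x: np.cos(4 * (x[:, 0] + x[:, 1])) + (x[:, 2] - 0.5) ** 2
y = f(X) + 0.3 * rng.normal(size=n)

# A deep but narrow ReLU network fitted by penalised least squares (weight decay).
net = MLPRegressor(hidden_layer_sizes=(32, 32, 32), activation='relu',
                   alpha=1e-3, max_iter=5000, random_state=0)
net.fit(X, y)

X_test = rng.uniform(size=(5000, d))
mse = np.mean((net.predict(X_test) - f(X_test)) ** 2)
print(f"test MSE against the noiseless truth: {mse:.3f}")
```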
|
![]() |
STSW02 |
22nd March 2018 09:00 to 10:00 |
Jingjing Zou |
Mixed Effects Model on Functional Manifolds / Sampling Directed Networks
I would like to talk about two projects.
Co-authors of Mixed Effects Model on Functional Manifolds: John Aston (University of Cambridge), Lexin Li (UC Berkeley)
We propose a generalized mixed effects model to study effects of subject-specific covariates on geometric and functional features of the subjects' surfaces. Here the covariates include both time-invariant covariates which affect both the geometric and functional features, and time-varying covariates which result in longitudinal changes in the functional textures. In addition, we extend the usual mixed effects model to model the covariance between a subject's geometric deformation and functional textures on the surface.
Co-authors of Sampling Directed Networks: Richard Davis (Columbia University), Gennady Samorodnitsky (Cornell University), Zhi-Li Zhang (University of Minnesota).
We propose a sampling procedure for the nodes in a network with the goal of estimating uncommon population features of the entire network. Such features might include tail behavior of the in-degree and out-degree distributions as well as their joint distribution. Our procedure is based on selecting random initial nodes and then following the path of linked nodes in a structured fashion. In this procedure, targeted nodes with desired features, such as large in-degree, will have a larger probability of being retained. In order to construct nearly unbiased estimates of the quantities of interest, weights associated with the sampled nodes must be calculated. We will illustrate this procedure and compare it with a sampling scheme based on multiple random walks on several data sets including webpage network data and Google+ social network data.
|
![]() |
STSW02 |
22nd March 2018 10:00 to 11:00 |
Hao Chen |
New two-sample tests based on adjacency
Two-sample tests for multivariate data and non-Euclidean data are widely used in many fields. We study a nonparametric testing procedure that utilizes graphs representing the similarity among observations. It can be applied to any data types as long as an informative similarity measure on the sample space can be defined. Existing tests based on a similarity graph lack power either for location or for scale alternatives. A new test is proposed that utilizes a common pattern overlooked previously, and it works for both types of alternatives. The test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large data sets. Another new test statistic will also be discussed that addresses a problem that classic tests of this type encounter under unequal sample sizes. Both tests are illustrated on an application of comparing networks under different conditions.
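For orientation, the sketch below implements the classical edge-count statistic on a k-nearest-neighbour similarity graph with a permutation null, i.e. the kind of graph-based test the talk builds on, not the new statistic itself; the sample sizes, the location shift and the choice k = 5 are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
n, m, d = 60, 60, 10
X = rng.normal(size=(n, d))                     # sample 1
Y = 0.4 + rng.normal(size=(m, d))               # sample 2: location shift
Z = np.vstack([X, Y])
labels = np.array([0] * n + [1] * m)

# Similarity graph: undirected 5-nearest-neighbour graph on the pooled sample.
D = cdist(Z, Z)
np.fill_diagonal(D, np.inf)
nbrs = np.argsort(D, axis=1)[:, :5]
edges = {tuple(sorted((i, j))) for i in range(n + m) for j in nbrs[i]}
edges = np.array(list(edges))

def between_count(lab):
    """Edge-count statistic: number of graph edges joining the two samples."""
    return np.sum(lab[edges[:, 0]] != lab[edges[:, 1]])

obs = between_count(labels)
perm = np.array([between_count(rng.permutation(labels)) for _ in range(2000)])
# Few between-sample edges indicates separation, so the p-value is left-tailed.
print("observed between-sample edges:", obs,
      " permutation p-value:", np.mean(perm <= obs))
```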
|
![]() |
STSW02 |
22nd March 2018 11:30 to 12:30 |
Alexander Aue |
Limiting spectral distributions for a class of high-dimensional time series
This talk discusses extensions to the time series case of the Marcenko-Pastur law on limiting spectral distributions (LSDs) for the eigenvalues of high-dimensional sample covariance matrices. The main result will be on establishing a non-linear integral equation characterizing the LSD in terms of its Stieltjes transform. Intuition will be presented for the simple case of a first-order moving average time series and evidence will be provided, indicating the applicability of the result to problems involving the estimation of certain quadratic forms as they arise, for example, when dealing with the Markowitz portfolio problem. The talk is based on joint work with Haoyang Liu (Florida State) and Debashis Paul (UC Davis).
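A quick simulation in the spirit of the talk: eigenvalues of the sample covariance of a high-dimensional MA(1) time series compared with the i.i.d. (Marcenko-Pastur) benchmark. The dimensions, aspect ratio and MA coefficient are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 400, 800                      # dimension and sample size, p/n = 0.5
theta = 0.5                          # MA(1) coefficient

# p-dimensional MA(1) time series: X_t = Z_t + theta * Z_{t-1}, with Z_t iid N(0, I_p).
Z = rng.normal(size=(n + 1, p))
X = Z[1:] + theta * Z[:-1]

def esd(data):
    """Eigenvalues of the sample covariance matrix (empirical spectral distribution)."""
    S = data.T @ data / data.shape[0]
    return np.linalg.eigvalsh(S)

ev_ma = esd(X)
ev_iid = esd(rng.normal(size=(n, p)))   # i.i.d. benchmark: Marcenko-Pastur limit

# The MA(1) dependence stretches the spectrum relative to the i.i.d. case.
for q in (0.05, 0.5, 0.95):
    print(f"quantile {q}: MA(1) {np.quantile(ev_ma, q):.2f}, "
          f"iid {np.quantile(ev_iid, q):.2f}")
```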
|
|
STSW02 |
22nd March 2018 13:30 to 14:30 |
Regina Liu |
Fusion and Individualized Fusion Learning from Diverse Data Sources by Confidence Distribution
Inferences
from different data sources can often be fused together to yield more powerful
findings than those from individual sources alone. We present a new approach
for fusion learning by using the so-called confidence distributions (CD). We
further develop the individualized fusion learning, ‘iFusion’, for drawing
efficient individualized inference by fusing the learnings from relevant data sources.
This approach is robust for handling heterogeneity arising from diverse data
sources, and is ideally suited for goal-directed applications such as precision
medicine. In essence, iFusion strategically ‘borrows strength’ from relevant
individuals to improve efficiency while retaining its inference validity.
Computationally, the fusion approach here is parallel in nature and scales up
well in comparison with competing approaches. The performance of the approach
is demonstrated by simulation studies and risk valuation of aircraft
landing data.
|
|
STSW02 |
22nd March 2018 14:30 to 15:30 |
Debashis Paul |
Spectral estimation for a class of high-dimensional linear processes
We present results about the limiting behavior of the
empirical distribution of eigenvalues of weighted integrals of the sample
periodogram for a class of high-dimensional linear processes. The processes
under consideration are characterized by having simultaneously diagonalizable
coefficient matrices.
We make use of these asymptotic results, derived under
the setting where the dimension and sample size are comparable, to formulate an
estimation strategy for the distribution of eigenvalues of the coefficients of
the linear process. This approach generalizes existing works on estimation of
the spectrum of an unknown covariance matrix for high-dimensional i.i.d.
observations.
(Joint work with Jamshid Namdari and Alexander Aue) |
|
STSW02 |
22nd March 2018 16:00 to 17:00 |
Eardi Lila |
Statistical Analysis of Functions on Surfaces, with an application to Medical Imaging
Co-author: John Aston (University of Cambridge)
In Functional Data Analysis, data are commonly assumed to be smooth functions on a fixed interval of the real line. In this work, we introduce a comprehensive framework for the analysis of functional data, whose domain is a two-dimensional manifold and the domain itself is subject to variability from sample to sample. We formulate a statistical model for such data, that we call Functions on Surfaces, which enables a joint representation of the geometric and functional aspects, and propose an associated estimation framework. We apply the proposed framework to the analysis of neuroimaging data of cortical thickness, acquired from the brains of different subjects, and thus lying on domains with different geometries. |
|
STSW02 |
23rd March 2018 09:00 to 10:00 |
Sara Anna van de Geer |
A concentration interval for the Lasso
We consider the linear model
and the Lasso estimator. Our goal is to provide upper and lower bounds
for the prediction error that are close to each other.
We assume that the active components
of the vector of regression coefficients
are sufficiently large in absolute value (in a sense that will be specified)
and that the tuning parameter is suitably chosen.
The bounds depend on
so-called compatibility constants.
We will present the definition of the compatibility constants and discuss their relation with
restricted eigenvalues.
As an example, we consider the
least squares estimator with total variation penalty
and present bounds with small gap.
For the case of random design, we assume that the rows of the design matrix are i.i.d. copies
of a Gaussian random vector. We assume that the largest
eigenvalue of the covariance matrix remains bounded and establish under some mild compatibility conditions upper and lower bounds with ratio tending to one.
|
![]() |
STSW02 |
23rd March 2018 10:00 to 11:00 |
Miguel del Alamo |
Multiscale Bounded Variation Regularization
Co-authors: Housen Li (University of Goettingen), Axel Munk (University of Goettingen) In nonparametric regression and inverse problems, variational methods based on bounded variation (BV) penalties are a well-known and established tool for yielding edge-preserving reconstructions, which is a desirable feature in many applications. Despite its practical success, the theory behind BV-regularization is poorly understood: most importantly, there is a lack of convergence guarantees in spatial dimension $d \geq 2$. In this talk we present a variational estimator that combines a BV penalty and a multiscale constraint, and prove that it converges to the truth at the optimal rate. Our theoretical analysis relies on a proper analysis of the multiscale constraint, which is motivated by the statistical properties of the noise, and relates in a natural way to certain Besov spaces of negative smoothness. Further, the main novelty of our approach is the use of refined interpolation inequalities between function spaces. We also illustrate the performance of these variational estimators in simulations on signals and images. |
![]() |
STSW02 |
23rd March 2018 11:30 to 12:30 |
Piotr Fryzlewicz |
Multiscale methods and recursion in data science
The talk starts on a general note: we first attempt to define a "multiscale" method / algorithm as a recursive program acting on a dataset in a suitable way. Wavelet transformations, unbalanced wavelet transformations and binary segmentation are all examples of multiscale methods in this sense. Using the example of binary segmentation, we illustrate the benefits of the recursive formulation of multiscale algorithms from the software implementation and theoretical tractability viewpoints. We then turn more specific and study the canonical problem of a-posteriori detection of multiple change-points in the mean of a piecewise-constant signal observed with noise. Even in this simple set-up, many publicly available state-of-the-art methods struggle for certain classes of signals. In particular, this poor performance is observed in methods that work by minimising a "fit to the data plus a penalty" criterion, the reason being that it is challenging to think of a penalty that works well over a wide range of signal classes. To overcome this issue, we propose a new approach whereby methods learn from the data as they proceed, and, as a result, operate differently for different signal classes. As an example of this approach, we revisit our earlier change-point detection algorithm, Wild Binary Segmentation, and make it data-adaptive by equipping it with a recursive mechanism for deciding "on the fly" how many sub-samples of the input data to draw, and where to draw them. This is in contrast to the original Wild Binary Segmentation, which is not recursive. We show that this significantly improves the algorithm particularly for signals with frequent change-points.
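The recursive flavour of binary segmentation mentioned above can be seen in a few lines. The sketch below uses a plain CUSUM statistic and a rough universal threshold on a simulated signal; it is standard binary segmentation, not Wild Binary Segmentation or its data-adaptive variant.

```python
import numpy as np

def cusum(x):
    """CUSUM statistics for a single change in mean at every candidate split."""
    n = len(x)
    s = np.cumsum(x)
    k = np.arange(1, n)
    return np.abs(s[:-1] / k - (s[-1] - s[:-1]) / (n - k)) * np.sqrt(k * (n - k) / n)

def binary_segmentation(x, threshold, offset=0):
    """Recursive (hence 'multiscale') search: split at the best CUSUM point,
    then recurse on the two halves until no statistic exceeds the threshold."""
    if len(x) < 2:
        return []
    stat = cusum(x)
    b = int(np.argmax(stat))
    if stat[b] < threshold:
        return []
    cp = offset + b + 1
    return (binary_segmentation(x[:b + 1], threshold, offset)
            + [cp]
            + binary_segmentation(x[b + 1:], threshold, cp))

rng = np.random.default_rng(5)
signal = np.repeat([0.0, 2.0, 0.5, 3.0], 100)          # piecewise-constant mean
x = signal + rng.normal(size=signal.size)
thr = 1.3 * np.sqrt(2 * np.log(len(x)))                 # rough universal threshold
print("estimated change-points:", sorted(binary_segmentation(x, thr)))
```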
|
![]() |
STS |
27th March 2018 11:00 to 12:00 |
Adam Sykulski |
Spatiotemporal modelling and parameter estimation of anisotropic particle trajectories
Trajectories of moving objects are collected everywhere and in massive
volumes. In the first part of this talk I will present a framework for
stochastically modelling such trajectories in time over two-dimensional
space. I will show an application to modelling fluid particles in ocean
turbulence, where trajectories are typically anisotropic due to spherical
dynamics. In the second part, I will discuss computationally-efficient
parameter estimation for massive datasets, where we propose an important
modification to the Whittle likelihood to significantly reduce bias at no
additional computational cost. We extend these estimation methods to
spatiotemporal trajectories, such that we can estimate parameters that reveal
the nature of the anisotropic structure. |
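For context, the sketch below fits an AR(1) model to a simulated series by the standard (uncorrected) Whittle likelihood, built from the periodogram at the Fourier frequencies. It illustrates the object being modified in the talk, not the debiased estimator or the anisotropic spatiotemporal models themselves; the series length and parameters are invented.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, phi_true, sigma = 2000, 0.7, 1.0

# Simulate an AR(1) trajectory (a simple stand-in for a particle-velocity series).
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + sigma * rng.normal()

# Periodogram at the Fourier frequencies, excluding frequency zero.
freqs = 2 * np.pi * np.arange(1, n // 2) / n
I = np.abs(np.fft.fft(x)[1:n // 2]) ** 2 / (2 * np.pi * n)

def ar1_spectrum(params, w):
    """Spectral density of an AR(1) process with parameters (phi, sigma^2)."""
    phi, s2 = params
    return s2 / (2 * np.pi) / np.abs(1 - phi * np.exp(-1j * w)) ** 2

def neg_whittle(params):
    """Negative Whittle log-likelihood: sum of log f + I/f over Fourier frequencies."""
    f = ar1_spectrum(params, freqs)
    return np.sum(np.log(f) + I / f)

res = minimize(neg_whittle, x0=[0.0, 0.5], bounds=[(-0.99, 0.99), (1e-4, None)])
print("Whittle estimates (phi, sigma^2):", res.x)
```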
![]() |
STS |
29th March 2018 11:00 to 12:00 |
Steffen Grunewalder |
Compressed Empirical Measures
I will present results on compressed representations of
expectation operators with a particular emphasis on expectations with respect
to empirical measures. Such expectations are a cornerstone of non-parametric
statistics and compressed representations are of great value when dealing with
large sample sizes and computationally expensive methods. I will focus on a
conditional gradient like algorithm to generate such representations in infinite
dimensional function spaces. In particular, I will discuss extensions of
classical convergence results to uniformly smooth Banach spaces (think Lp, 1 < p < ∞).
|
![]() |
STS |
3rd April 2018 11:00 to 12:00 |
Jon August Wellner |
Inference for the mode of a log-concave density: a likelihood ratio test and confidence intervals
I will discuss a likelihood ratio test for the mode of a
log-concave density. The new test is based on comparison of the log-likelihoods
corresponding to the unconstrained maximum likelihood estimator of a
log-concave density and the constrained maximum likelihood estimator, where the
constraint is that the mode of the density is fixed, say at m. The constrained
estimators have many properties in common with the unconstrained estimators
discussed by Walther (2001), Pal, Woodroofe, and Meyer (2007), Dümbgen and
Rufibach (2009), and Balabdaoui, Rufibach and Wellner (2010), but they differ
from the unconstrained estimator under the null hypothesis on $n^{-1/5}$
neighborhoods of the mode m. Using joint limiting properties of the
unconstrained and constrained estimators we show that under the null hypothesis
(and strict curvature of - log f at the mode), the likelihood ratio statistic
is asymptotically pivotal: that is, it converges in distribution to a limiting
distribution which is free of nuisance parameters, thus playing the role of the
chi-squared distribution in classical parametric statistical problems. By
inverting this family of tests, we obtain new (likelihood ratio based)
confidence intervals for the mode of a log-concave density f. These new
intervals do not depend on any smoothing parameters. We study the new
confidence intervals via Monte Carlo studies and illustrate them with several
real data sets. The new confidence intervals seem to have several advantages
over existing procedures.
This talk is based on joint work with Charles Doss.
|
|
STS |
5th April 2018 11:00 to 12:00 |
Tobias Kley |
Sequential detection of structural changes in irregularly observed data
Online surveillance of time series is traditionally done
with the aim to identify changes in the marginal distribution under the
assumption that the data between change-points is stationary and that new data
is observed at constant frequency. In many situations of interest to data
analysts, the classical approach is too restrictive to be used unmodified. We
propose a unified system for the monitoring of structural changes in streams of
data where we use generalised likelihood ratio-type statistics in the sequential
testing problem, obtaining the flexibility to account for the various types of
changes that are practically relevant (such as, for example, changes in the
trend of the mean). The method is applicable to sequences where new
observations are allowed to arrive irregularly. Early identification of changes
in the trend of financial data can help to make trading more profitable. In
an empirical illustration we apply the procedure to intra-day prices of
components of the NASDAQ-100 stock market index. This project is joint work
with Piotr Fryzlewicz.
|
![]() |
STS |
10th April 2018 11:00 to 12:00 |
Heather Battey |
Large numbers of explanatory variables
The lasso and its variants are powerful methods
for regression analysis when there are a small number of study individuals and
a large number of potential explanatory variables. There results a single
model, while there may be several simple representations equally compatible
with the data. I will outline a different approach, whose aim is essentially a
confidence set of models. A probabilistic assessment of the method will be
given. The talk is based on joint work with David R Cox. |
|
STS |
12th April 2018 11:00 to 12:00 |
Georg Hahn |
Recent advances in quantum annealing and outlook on its potential in statistics
Since the 1970s, the potential
of quantum computing has been a field of extensive research, particularly its
advantages and disadvantages over classical computing. This research, however,
was theoretical since physical quantum devices were unavailable. With the
recent availability of the first adiabatic computers (or quantum annealers),
computational mathematics and statistics (as all other computational sciences)
are provided with a new means of great potential. This talk will begin with an
introduction to quantum annealing and proceed with a presentation of recent
advances in the field. Special focus will be given to two topics: Solving the
NP-hard problem of finding cliques in a graph and the reduction of binary
quadratic forms for scalable quantum annealing. Further relevant works will be
discussed, especially those exploring the statistical properties of quantum
annealing. To stimulate discussion, the talk will highlight future directions
of research, in particular the statistical analysis of the (empirical)
distribution of annealing solutions, the characterisation of classes of
statistical methods allowing formulations suitable for quantum computation (and
hence almost instant solutions), and the exploitation of the inherent
randomness in adiabatic computing for statistical purposes. |
|
STS |
24th April 2018 11:00 to 12:00 |
Ian McKeague |
Causal trees and Wigner's semicircle law
Various aspects
of standard model particle physics might be explained by a suitably rich
algebra acting on itself, as suggested recently by Furey (2015). This talk
discusses the statistical behavior of large causal tree diagrams that combine
freely independent elements in such an algebra. It is shown that some of the
familiar limiting distributions in random matrix theory (namely the
Marchenko-Pastur law and Wigner's semicircle law) emerge in this setting as
limits of normalized sums-over-paths of non-negative elements of the algebra
assigned to the edges of the tree. These results are established in the setting
of non-commutative probability. Trees with classically independent positive
edge weights (random multiplicative cascades) were originally proposed by
Mandelbrot as a model displaying the fractal features of turbulence. The
novelty of the present approach is the use of non-commutative (free) probability
to allow the edge weights to take values in an algebra. Potential applications
in theoretical neuroscience related to Alan Turing's famous "Imitation
Game" paper are also discussed.
|
![]() |
STS |
26th April 2018 11:00 to 12:00 |
Jacob Bien |
High-dimensional variable selection when features are sparse
It is common in modern prediction
problems for many predictor variables to be counts of rarely occurring events.
This leads to design matrices in which a large number of columns are highly
sparse. The challenge posed by such "rare features" has received little
attention despite its prevalence in diverse areas, ranging from biology (e.g.,
rare species) to natural language processing (e.g., rare words). We show, both
theoretically and empirically, that not explicitly accounting for the rareness
of features can greatly reduce the effectiveness of an analysis. We next
propose a framework for aggregating rare features into denser features in a
flexible manner that creates better predictors of the response. An application
to online hotel reviews demonstrates the gain in accuracy achievable by proper
treatment of rare words. This is joint work with Xiaohan Yan.
|
|
STS |
1st May 2018 11:00 to 12:00 |
Dino Sejdinovic |
Approximate kernel embeddings of distributions
Kernel embeddings of distributions and the
Maximum Mean Discrepancy (MMD), the resulting probability metric, are useful
tools for fully nonparametric hypothesis testing and for learning on
distributional inputs; i.e., where labels are only observed at an aggregate
level. I will give an overview of this framework and describe the use of
large-scale approximations to kernel embeddings in the context of Bayesian
approaches to learning on distributions and in the context of distributional
covariate shift; e.g., where measurement noise on the training inputs differs
from that on the testing inputs.
|
|
STS |
3rd May 2018 11:00 to 12:00 |
Sandipan Roy |
Multiple change-point estimation in high-dimensional Gaussian graphical models
We consider the
consistency properties of a regularised estimator for the simultaneous
identification of both changepoints and graphical dependency structure in
multivariate time-series. Traditionally, estimation of Gaussian Graphical
Models (GGM) is performed in an i.i.d setting. More recently, such models have
been extended to allow for changes in the distribution, but only where
changepoints are known a-priori. In this work, we study the Group-Fused
Graphical Lasso (GFGL) which penalises partial-correlations with an L1 penalty
while simultaneously inducing block-wise smoothness over time to detect
multiple changepoints. We present a proof of consistency for the estimator,
both in terms of changepoints, and the structure of the graphical models in
each segment. Several synthetic experiments and a real data application
validate the performance of the proposed methodology. |
![]() |
STS |
8th May 2018 11:00 to 12:00 |
Qiyang Han |
Least squares estimation: Beyond Gaussian regression models
We study the convergence rate of
the least squares estimator (LSE) in a regression model with possibly
heavy-tailed errors. Despite its importance in practical applications,
theoretical understanding of this problem has been limited. We first show that
from a worst-case perspective, the convergence rate of the LSE in a general
non-parametric regression model is given by the maximum of the Gaussian
regression rate and the noise rate induced by the errors. In the more difficult
statistical model where the errors only have a second moment, we further show
that the sizes of the 'localized envelopes' of the model give a sharp
interpolation for the convergence rate of the LSE between the worst-case rate
and the (optimal) parametric rate. These results indicate both certain positive
and negative aspects of the LSE as an estimation procedure in a heavy-tailed
regression setting. The key technical innovation is a new multiplier
inequality that sharply controls the size of the multiplier empirical process
associated with the LSE, which also finds applications in shape-restricted and
sparse linear regression problems.
|
|
STS |
10th May 2018 11:00 to 12:00 |
Tatyana Krivobokova |
Partial least squares for dependent data
We consider the
linear and kernel partial least squares algorithms for dependent data and study
the consequences of ignoring the dependence both theoretically and numerically.
For linear partial least squares estimator we derive convergence rates and show
that ignoring non-stationary dependence structures can lead to inconsistent
estimation. For kernel partial least squares estimator we establish convergence
rates under a source and an effective dimensionality conditions. It is shown
both theoretically and in simulations that long range dependence results in
slower convergence rates. A protein dynamics example illustrates our results
and shows high predictive power of partial least squares. This is joint work with Marco Singer, Axel Munk and Bert de Groot. |
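A minimal illustration of fitting partial least squares to serially dependent predictors with scikit-learn; the AR(1) design, the sparsity pattern and the component numbers are invented. The example simply fits PLS while ignoring the dependence, which is the situation whose consequences the talk analyses.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
n, p = 300, 50

# Dependent predictors: each column is an AR(1) series in time (rows = time points).
X = np.zeros((n, p))
for t in range(1, n):
    X[t] = 0.8 * X[t - 1] + rng.normal(size=p)
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)

# Fit linear PLS with an increasing number of latent components.
for k in (1, 2, 5, 10):
    pls = PLSRegression(n_components=k).fit(X, y)
    print(f"{k} components, in-sample R^2: {pls.score(X, y):.3f}")
```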
|
STS |
15th May 2018 11:00 to 12:00 |
Jana Jankova |
Inference for eigenstructure of high-dimensional covariance matrices
Sparse principal component analysis (PCA) has
become one of the most widely used techniques for dimensionality reduction in
high-dimensional datasets. The main challenge underlying sparse PCA is to
estimate the first vector of loadings of the population covariance matrix,
provided that only a certain number of loadings are non-zero. A vast number of methods have been proposed in literature for point estimation of eigenstructure of the covariance matrix. In this work, we study uncertainty quantification and propose methodology for inference and hypothesis testing for individual loadings and for the largest eigenvalue of the covariance matrix. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. Our results provide theoretical guarantees for asymptotic normality of the new estimators and may be used for valid hypothesis testing and variable selection. |
|
STS |
17th May 2018 11:00 to 12:00 |
Xinghao Qiao |
Regression with Dependent Functional Errors-in-Predictors
Functional regression is an important topic in functional data analysis. Traditionally, in functional regression, one often assumes that samples of the functional predictor are independent realizations of an underlying stochastic process, and are observed over a grid of points contaminated by independent and identically distributed measurement errors. However, in practice, the dynamic dependence across different curves may exist and the parametric assumption on the measurement error covariance structure could be unrealistic. In this paper, we consider
functional linear regression with serially dependent functional predictors, when the contamination of predictors by measurement error is "genuinely functional" with fully nonparametric covariance structure. Inspired by the fact that the autocovariance operator of the observed functional predictor automatically filters out the impact from the unobservable measurement error, we propose a novel generalized-method-of-moments estimator of the slope function. The asymptotic properties of the resulting estimators under different scenarios are established. We also demonstrate that the proposed method significantly outperforms possible competitors through intensive simulation studies. Finally, the proposed method is applied to a public
financial dataset, revealing some interesting findings. This is a joint work with Cheng Chen and Shaojun Guo.
|
![]() |
STS |
22nd May 2018 11:00 to 12:00 |
Marc Hallin |
Multivariate Distribution and Quantile Functions, Ranks and Signs: A measure transportation approach
Unlike the real line,
the d-dimensional space R^d, for d > 1, is not canonically ordered. As a
consequence, such fundamental and strongly order-related univariate concepts as
quantile and distribution functions, and their empirical counterparts,
involving ranks and signs, do not canonically extend to the multivariate
context. Palliating that lack of a canonical ordering has remained an open problem
for more than half a century, and has generated an abundant literature,
motivating, among others, the development of statistical depth and copula-based
methods. We show here that, unlike the many definitions that have been proposed
in the literature, the measure transportation-based ones introduced
in Chernozhukov et al. (2017) enjoy all the properties (distribution-freeness
and preservation of semiparametric efficiency) that make univariate quantiles
and ranks successful tools for semiparametric statistical inference. We
therefore propose a new center-outward definition of multivariate distribution
and quantile functions, along with their empirical counterparts, for which we
establish a Glivenko-Cantelli result. Our approach, based on results by McCann
(1995), is geometric rather than analytical and, contrary to the
Monge-Kantorovich one in Chernozhukov et al. (2017) (which assumes compact
supports or finite second-order moments), does not require any moment
assumptions. The resulting ranks and signs are shown to be strictly
distribution-free, and maximal invariant under the action of transformations
(namely, the gradients of convex functions, which thus are playing the role of
order-preserving transformations) generating the family of absolutely continuous
distributions; this, in view of a general result by Hallin and Werker (2003),
implies preservation of semiparametric efficiency. As for the resulting
quantiles, they are equivariant under the same transformations, which confirms
the order-preserving nature of gradients of convex functions. |
![]() |
STS |
29th May 2018 11:00 to 12:00 |
Victor Panaretos |
Amplitude and phase variation of point processes
The amplitude variation of a real
random field X(t) consists in its random oscillations in its range space (the
"y-axis"), typically encapsulated by its (co)variation around a mean
level. In contrast, phase variation refers to fluctuations in its domain (the
"x-axis"), often caused by random time changes or spatial
deformations. We consider the problem of identifiably formalising similar
notions for (potentially spatial) point processes, and of nonparametrically
separating them based on realisations of i.i.d. copies of the phase-varying
point process. The key element of our approach is the use of the theory of
optimal transportation of measure, which is proven to be the natural formalism
for the problem under the usual assumptions imposed. It is shown to allow the
consistent separation of the two types of variation for point processes over
Euclidean domains, under no parametric restrictions, including convergence
rates, and even asymptotic distributions in some cases. (Based on joint work with Y. Zemel, Göttingen.)
|
|
STS |
30th May 2018 10:00 to 11:00 |
Edward George |
The remarkable flexibility of BART
For the canonical regression setup where one wants to
discover the relationship between Y and a p-dimensional vector x, BART (Bayesian
Additive Regression Trees) approximates the conditional mean E[Y|x] with a sum
of regression trees model, where each tree is constrained by a regularization
prior to be a weak learner. Fitting and inference are accomplished via a
scalable iterative Bayesian backfitting MCMC algorithm that generates samples
from a posterior. Effectively, BART is a nonparametric Bayesian regression
approach which uses dimensionally adaptive random basis elements. Motivated by
ensemble methods in general, and boosting algorithms in particular, BART is
defined by a statistical model: a prior and a likelihood. This approach enables
full posterior inference including point and interval estimates of the unknown
regression function as well as the marginal effects of potential predictors. By
keeping track of predictor inclusion frequencies, BART can also be used for
model-free variable selection. To further illustrate the modeling flexibility
of BART, we introduce two elaborations, MBART and HBART. Exploiting the
potential monotonicity of E[Y|x] in components of x, MBART incorporates such
monotonicity with a multivariate basis of monotone trees, thereby enabling
estimation of the decomposition of E[Y|x] into its unique monotone components.
To allow for the possibility of heteroscedasticity, HBART incorporates an
additional product of regression trees model component for the conditional
variance, thereby providing simultaneous inference about both E[Y|x] and
Var[Y|x]. (This is joint research with Hugh Chipman, Matt Pratola, Rob McCulloch
and Tom Shively.)
|
![]() |
STS |
31st May 2018 11:00 to 12:00 |
Yining Chen |
Narrowest-over-threshold detection of multiple change-points and change-point-like features
We propose a new,
generic and flexible methodology for nonparametric function estimation, in
which we first estimate the number and locations of any features that may be
present in the function, and then estimate the function parametrically between
each pair of neighbouring detected features. Examples of features handled by
our methodology include change-points in the piecewise-constant signal model,
kinks in the piecewise-linear signal model, and other similar irregularities,
which we also refer to as generalised change-points. Our methodology works with
only minor modifications across a range of generalised change-point scenarios,
and we achieve such a high degree of generality by proposing and using a new
multiple generalised change-point detection device, termed
Narrowest-Over-Threshold (NOT). NOT offers highly competitive performance when
applied together with some information criteria. For selected scenarios,
we show the consistency and near-optimality of NOT in detecting the number and
locations of generalised change-points. The computational complexity as well as
its extensions will also be discussed. (Joint work with Rafal Baranowski and Piotr Fryzlewicz.) |
|
STS |
5th June 2018 11:00 to 12:00 |
Jennifer Wadsworth |
Spatial extremes: A conditional approach
The past decade has seen a huge effort in
modelling the extremes of spatial processes. Significant challenges include the
development of models with an appropriate asymptotic justification for the
tail; ensuring model assumptions are compatible with the data; and the fitting
of these models to (at least reasonably) high-dimensional datasets. I will
review basic ideas of modelling spatial extremes, and introduce an approach
based on the (multivariate) conditional extreme value model of Heffernan and
Tawn (2004, JRSSB) and Heffernan and Resnick (2007, Ann. Appl. Prob.).
Advantages of the conditional approach include its asymptotic motivation,
flexibility, and comparatively simple inference, meaning that it can be fitted
to reasonably large datasets. I will illustrate ideas by presenting parts of an
application (in progress) to temperature extremes across Australia.
|
|
STS |
7th June 2018 11:00 to 12:00 |
Timothy Cannings |
Classification with imperfect training labels
We study the effect of imperfect training data labels on the performance of
classification methods. In a general setting, where the probability that an observation
in the training dataset is mislabelled may depend on both the feature vector and the true
label, we bound the excess risk of an arbitrary classifier trained with
imperfect labels in terms of its excess risk for predicting a noisy label. This
reveals conditions under which a classifier trained with imperfect labels
remains consistent for classifying uncorrupted test data points. Furthermore,
under stronger conditions, we derive detailed asymptotic properties for the
popular k-nearest neighbour (k-nn), Support Vector Machine (SVM) and Linear
Discriminant Analysis (LDA) classifiers. One consequence of these results is
that the k-nn and SVM classifiers are robust to imperfect training labels, in
the sense that the rate of convergence of the excess risks of these classifiers
remains unchanged; in fact, it even turns out that in some cases, imperfect
labels may improve the performance of these methods. On the other hand,
the LDA classifier is shown to be typically inconsistent in the presence of
label noise unless the prior probabilities of each class are equal.
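A toy simulation in the spirit of these results, comparing k-NN and LDA trained on clean versus noisily labelled data when the class priors are unequal; the Gaussian mixture, the flip rate of 0.3 and the choice of k are arbitrary, and the talk's precise conditions are not verified here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(8)

def sample(n):
    """Two Gaussian classes with unequal priors (0.7 / 0.3) and a mean shift."""
    y = (rng.uniform(size=n) < 0.3).astype(int)
    X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
    return X, y

X, y = sample(2000)
X_test, y_test = sample(20000)

# Homogeneous label noise: every training label is flipped with probability 0.3.
y_noisy = np.where(rng.uniform(size=len(y)) < 0.3, 1 - y, y)

for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=25)),
                  ("LDA", LinearDiscriminantAnalysis())]:
    clean = clf.fit(X, y).score(X_test, y_test)
    noisy = clf.fit(X, y_noisy).score(X_test, y_test)
    print(f"{name}: test accuracy clean {clean:.3f} / noisy labels {noisy:.3f}")
```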
|
|
STS |
12th June 2018 11:00 to 12:00 |
Shahin Tavakoli |
Tests for separability in nonparametric covariance operators of random surfaces
The assumption of separability of the covariance operator for a random image or hypersurface can be of
substantial use in applications, especially in situations where the accurate estimation of the full covariance
structure is unfeasible, either for computational reasons or due to a small sample size. However, inferential
tools to verify this assumption are somewhat lacking in high-dimensional or functional settings where this
assumption is most relevant. We propose here to test separability by focusing on K-dimensional projections of
the difference between the covariance operator and its nonparametric separable approximation. The subspace we
project onto is one generated by the eigenfunctions estimated under the separability hypothesis, negating the
need to ever estimate the full non-separable covariance. We show that the rescaled difference of the sample
covariance operator with its separable approximation is asymptotically Gaussian. As a by-product of this
result, we derive asymptotically pivotal tests under Gaussian assumptions, and propose bootstrap methods for
approximating the distribution of the test statistics when multiple eigendirections are taken into account. We
probe the finite sample performance through simulation studies, and present an application to log-spectrogram
images from a phonetic linguistics dataset.
This is joint work with Davide Pigoli (KCL) and John Aston (Cambridge)
|
![]() |
STS |
14th June 2018 10:00 to 11:00 |
Patrick Rebeschini |
Multi-agent learning: Implicit regularization and order-optimal gossip
In distributed machine learning, data are
stored and processed in multiple locations by different agents. Each agent is
represented by a node in a graph, and communication is allowed between
neighbours. In the decentralised setting typical of peer-to-peer networks,
there is no central authority that can aggregate information from all the
nodes. A typical setting involves agents cooperating with their peers to learn
models that can perform better on new, unseen data. In this talk, we
present the first results on the generalisation capabilities of distributed
stochastic gradient descent methods. Using algorithmic stability, we derive
upper bounds for the test error and provide a principled approach for implicit
regularization, tuning the learning rate and the stopping time as a function of
the graph topology. We also present a new Gossip protocol for the aggregation
step in distributed methods that can yield order-optimal communication
complexity. Based on non-reversible Markov chains, our protocol is local and
does not require global routing, hence improving existing methods. (Joint work
with Dominic Richards)
|
|
STS |
19th June 2018 11:00 to 12:00 |
Yanyuan Ma |
A superpopulation treatment to case-control data analysis
We study the regression relationship among
covariates in case-control data, an area known as the secondary analysis of
case-control studies. The context is such that only the form of the regression
mean is specified, so that we allow an arbitrary regression error distribution,
which can depend on the covariates and thus can be heteroscedastic. Under mild
regularity conditions we establish the theoretical identifiability of such
models. Previous work in this context has either (a) specified a fully
parametric distribution for the regression errors, (b) specified a
homoscedastic distribution for the regression errors, (c) has specified the
rate of disease in the population (we refer to this as the true population), or (d)
has made a rare disease approximation. We construct a class of semiparametric
estimation procedures that rely on none of these. The estimators differ from
the usual semiparametric ones in that they draw conclusions about the true
population, while technically operating in a hypothetic superpopulation. We
also construct estimators with a unique feature, in that they are robust
against the misspecification of the regression error distribution in terms of
variance structure, while all other nonparametric effects are estimated despite
the biased samples. We establish the asymptotic properties of the estimators
and illustrate their finite sample performance through simulation
studies.
|
|
STS |
22nd June 2018 10:00 to 11:00 |
Victor Brunel |
Concentration of the empirical contours of Tukey’s halfspace depth
Tukey's halfspace
depth has attracted much interest in data analysis, because it is a natural way
of measuring the notion of depth relative to a cloud of points or, more
generally, to a probability measure. Given an i.i.d. sample, we investigate the
concentration of upper level sets of the Tukey depth relative to that sample
around their population version. We show that under some mild assumptions on
the underlying probability measure, concentration occurs at a parametric rate
and we deduce moment inequalities at that same rate. In a computational
perspective, we study the concentration of a discretized version of the
empirical upper level sets. |
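For concreteness, a small sketch approximating the Tukey (halfspace) depth of a few points by minimising over random directions; the sample, the probed points and the number of directions are arbitrary, and the discretized upper level sets studied in the talk are not constructed here.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 2
X = rng.normal(size=(n, d))                      # i.i.d. sample

def tukey_depth(theta, X, n_dir=2000):
    """Approximate halfspace depth: minimise, over random unit directions u,
    the fraction of sample points in the halfspace {x : u.(x - theta) >= 0}."""
    U = rng.normal(size=(n_dir, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = (X - theta) @ U.T                     # shape (n, n_dir)
    return np.min(np.mean(proj >= 0, axis=0))

# Depth is highest near the centre and decays towards the tails;
# the upper level sets {theta : depth >= alpha} are the empirical depth contours.
for theta in (np.zeros(d), np.array([1.0, 1.0]), np.array([3.0, 0.0])):
    print(theta, f"approx. depth {tukey_depth(theta, X):.3f}")
```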
![]() |
STS |
22nd June 2018 16:00 to 17:00 |
Peter Bühlmann |
Rothschild Lecture: Causality, Invariance and Robustness
Is it a cause or an effect? This simple but fundamental question
has a long history in science and society. Randomised studies serve as the gold standard for inferring causality but they are often expensive or even impossible to do due to ethical reasons. Recent approaches try to "substitute in part" the randomised studies by models, algorithms and statistical methods. Perhaps surprisingly, heterogeneity in potentially large-scale data can be beneficially exploited for causal inference and novel robustness, with wide-ranging prospects for various applications. The key idea relies on a notion of probabilistic invariance: it opens up new insights with connections to frameworks used in robust optimisation and economics. |
![]() |
STSW04 |
25th June 2018 11:00 to 11:45 |
Iain Johnstone |
Eigenstructure in high dimensional random effects models
The eigenstructure of i.i.d. samples from high dimensional data is
known to show phenomena unexpected from experience with low dimensions
-- eigenvalue spreading, bias and eigenvector inconsistency.
Motivated by some problems in quantitative genetics, we now consider
high dimensional data with more than one level of variation, and more
specifically the spectral properties of estimators of the various
components of variance matrices. We briefly describe bulk and edge
eigenvalue distributions in this setting, and then focus more
attention on properties and estimation of 'spiked' models. This is
joint work with Zhou Fan and Yi Sun.
|
|
STSW04 |
25th June 2018 11:45 to 12:30 |
Philippe Rigollet |
Uncoupled isotonic regression via minimum Wasserstein deconvolution
Isotonic regression is a standard problem in shape constrained estimation where the goal is to estimate an unknown nondecreasing regression function $f$ from independent pairs $(x_i,y_i)$ where $\mathbb{E}[y_i]=f(x_i)$, $i=1, \ldots, n$. While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given uncoupled $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$. In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moment conditions on $y_i$ together with an efficient algorithm. Both upper and lower bounds are articulated around moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution. [Joint work with Jonathan Weed (MIT)]
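To see the difference between the coupled and uncoupled problems, the sketch below contrasts standard isotonic regression (scikit-learn) with a naive quantile-matching estimator for the uncoupled case. The latter ignores the noise distribution, which is exactly what the minimum-Wasserstein deconvolution approach of the talk accounts for; all data are simulated for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(10)
n = 500
x = np.sort(rng.uniform(size=n))
f = lambda t: t ** 2 + 0.5 * t                   # unknown nondecreasing function
y = f(x) + 0.05 * rng.normal(size=n)

# Coupled case: the pairs (x_i, y_i) are observed, so isotonic least squares applies.
iso = IsotonicRegression().fit(x, y)

# Uncoupled case: only the two unordered sets {x_i} and {y_i} are available.
# Since f is nondecreasing, matching empirical quantiles (sorting both samples)
# gives a natural first estimator, but it does not deconvolve the noise.
x_sorted, y_sorted = np.sort(x), np.sort(y)
uncoupled_fit = np.interp(x, x_sorted, y_sorted)

truth = f(x)
print("coupled isotonic MSE:   ", np.mean((iso.predict(x) - truth) ** 2))
print("uncoupled (sorting) MSE:", np.mean((uncoupled_fit - truth) ** 2))
```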
|
![]() |
STSW04 |
25th June 2018 14:00 to 14:45 |
Matthew Stephens |
On applications of Empirical Bayes approaches to the Normal Means problem
The normal means problem is very simple: given normally-distributed observations with known variances and unknown means, estimate the means. That is, given $X_j \sim N(\theta_j, \sigma_j^2)$, estimate $\theta_j$. A key idea is that one can do better than the maximum likelihood estimates, $\hat{\theta}_j = X_j$, in particular by use of appropriate "shrinkage" estimators. One attractive way to perform shrinkage estimation in practice is to use Empirical Bayes methods. That is, to assume that $\theta_j$ are independent and identically distributed from some distribution $g$ that is to be estimated from the data. Then, given such an estimate $\hat{g}$, the posterior distributions of $\theta_j$ can be computed to perform inference. We call this the "Empirical Bayes Normal Means" (EBNM) problem. Despite its simplicity, solving the EBNM problem has a wide range of practical applications. Here we present some flexible non-parametric approaches we have recently developed for solving the EBNM problem, and describe their application to several different settings: false discovery rate (FDR) estimation, non-parametric smoothing, and sparse matrix factorization problems (i.e. sparse factor analysis and sparse principal components analysis). |
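A minimal empirical Bayes normal means example with a simple N(0, tau^2) prior estimated by marginal maximum likelihood. This is far more restrictive than the flexible nonparametric priors discussed in the talk, but it shows the shrinkage mechanism; the simulation parameters are invented.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(11)
n = 1000
sigma = np.full(n, 1.0)                           # known observation SDs
theta = np.where(rng.uniform(size=n) < 0.8, 0.0, rng.normal(scale=3.0, size=n))
X = theta + sigma * rng.normal(size=n)            # X_j ~ N(theta_j, sigma_j^2)

# Empirical Bayes with a N(0, tau^2) prior: estimate tau^2 by marginal maximum
# likelihood, using the fact that marginally X_j ~ N(0, tau^2 + sigma_j^2).
def neg_marginal_loglik(log_tau2):
    v = np.exp(log_tau2) + sigma ** 2
    return 0.5 * np.sum(np.log(v) + X ** 2 / v)

opt = minimize_scalar(neg_marginal_loglik, bounds=(-10, 10), method="bounded")
tau2 = np.exp(opt.x)

# Posterior means shrink each observation towards zero.
theta_hat = X * tau2 / (tau2 + sigma ** 2)
print("MLE risk:      ", np.mean((X - theta) ** 2))
print("shrinkage risk:", np.mean((theta_hat - theta) ** 2))
```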
![]() |
STSW04 |
25th June 2018 14:45 to 15:30 |
Jana Jankova |
Asymptotic Inference for Eigenstructure of Large Covariance Matrices
A vast number of methods have been proposed in literature for point estimation of eigenstructure of covariance matrices in high-dimensional settings. In this work, we study uncertainty quantification and propose methodology for inference and hypothesis testing for individual loadings of the covariance matrix. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. Our results provide theoretical guarantees on asymptotic normality of the new estimators and may be used for valid hypothesis testing and variable selection. These results are achieved under a sparsity condition relating the number of non-zero loadings, sample size, dimensionality of the covariance matrix and spectrum of the covariance matrix. This talk is based on joint work with Sara van de Geer.
|
|
STSW04 |
25th June 2018 16:00 to 16:45 |
Flori Bunea |
A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations. The model links all cell probabilities, collected in a p x n matrix, via the assumption that this matrix can be factorized as the product of two non-negative matrices A and W. Topic models have been originally developed in text mining, when one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in this talk. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to K-dimensional canonical basis vectors. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic set-up when K is unknown.
In this talk, we take a different view on anchor word estimation, and on the estimation of the word-topic matrix A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, p, K and document length N, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We illustrate, via simulations, that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
|
![]() |
STSW04 |
26th June 2018 09:00 to 09:45 |
Guy Bresler |
Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure
The prototypical high-dimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit a curious phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these gaps by reducing from conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the applicability of this approach. In this work we introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. These include tight lower bounds for Planted Independent Set, Planted Dense Subgraph, Sparse Spiked Wigner, Sparse PCA, as well as for new models that we introduce. Joint work with Matthew Brennan and Wasim Huleihel.
|
![]() |
STSW04 |
26th June 2018 09:45 to 10:30 |
Chao Gao |
Reduced Isotonic Regression
Consider an $n$-dimensional vector $X$ with mean $\theta^*$. In this talk, we consider a $\theta^*$ that is both nondecreasing and has a piecewise constant structure. We establish the exact minimax rate of estimating such monotone functions, and thus give a non-trivial answer to an open problem in the shape-constrained analysis literature. The minimax rate involves an interesting iterated logarithmic dependence on the dimension. We then develop a penalized least-squares procedure for estimating $\theta^*$ adaptively. This estimator is shown to achieve the derived minimax rate in linear time, without knowledge of the number of pieces. We further allow the model to be misspecified and derive oracle inequalities with the optimal rates for the proposed estimator. This is a joint work with Fang Han and Cun-hui Zhang.
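As a baseline for the setting described above, the sketch below runs ordinary isotonic least squares (the pool-adjacent-violators algorithm via scikit-learn) on a noisy nondecreasing, piecewise-constant signal; it is not the penalized reduced-isotonic estimator analyzed in the talk, which additionally adapts to the number of pieces.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 200
# Nondecreasing, piecewise-constant mean vector with four pieces.
theta = np.repeat([0.0, 1.0, 1.5, 3.0], n // 4)
x = theta + rng.normal(scale=0.5, size=n)

# Plain isotonic regression (PAVA); the talk's estimator penalizes the
# number of constant pieces as well.
iso = IsotonicRegression(increasing=True)
theta_hat = iso.fit_transform(np.arange(n), x)
print("constant pieces found:", len(np.unique(np.round(theta_hat, 8))))
print("MSE:", np.mean((theta_hat - theta) ** 2))
```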
|
![]() |
STSW04 |
26th June 2018 11:00 to 11:45 |
Ryan Tibshirani |
Dykstra’s Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions
We study connections between Dykstra’s algorithm for projecting onto an intersection of convex sets, the augmented Lagrangian method of multipliers or ADMM, and block coordinate descent. We prove that coordinate descent for a regularized regression problem, in which the penalty is a separable sum of support functions, is exactly equivalent to Dykstra’s algorithm applied to the dual problem. ADMM on the dual problem is also seen to be equivalent, in the special case of two sets, with one being a linear subspace. These connections, aside from being interesting in their own right, suggest new ways of analyzing and extending coordinate descent. For example, from existing convergence theory on Dykstra’s algorithm over polyhedra, we discern that coordinate descent for the lasso problem converges at an (asymptotically) linear rate. We also develop two parallel versions of coordinate descent, based on the Dykstra and ADMM connections. Finally, we discuss the implications of this work for backfitting in additive models.
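A minimal sketch of Dykstra's algorithm itself, the projection scheme the talk connects to coordinate descent, is given below for the intersection of two convex sets (a halfspace and the nonnegative orthant). It illustrates the alternating-projection mechanics only, not the regression equivalence proved in the talk.

```python
import numpy as np

def proj_halfspace(x, a, b):
    """Project x onto {z : a @ z <= b}."""
    slack = a @ x - b
    return x if slack <= 0 else x - (slack / (a @ a)) * a

def proj_orthant(x):
    """Project x onto the nonnegative orthant."""
    return np.maximum(x, 0.0)

def dykstra(y, projections, n_iter=100):
    """Dykstra's alternating projections onto an intersection of convex sets."""
    x = y.copy()
    increments = [np.zeros_like(y) for _ in projections]
    for _ in range(n_iter):
        for i, proj in enumerate(projections):
            z = proj(x + increments[i])
            increments[i] = x + increments[i] - z
            x = z
    return x

y = np.array([2.0, -1.0, 0.5])
a, b = np.array([1.0, 1.0, 1.0]), 1.0
x_star = dykstra(y, [lambda v: proj_halfspace(v, a, b), proj_orthant])
print(x_star, a @ x_star)   # feasible point closest to y
```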
|
![]() |
STSW04 |
26th June 2018 11:45 to 12:30 |
Peter Bickel |
Two network scale challenges: Constructing and fitting hierarchical block models and fitting large block models using the mean field method
Work with S. Bhattacharyya, T. Li, E. Levina, S. Mukherjee, P. Sarkar
Networks are a complex type of structure presenting itself in many applications. They are usually represented by a graph, with possibly weighted edges plus additional covariates (such as directions). Block models have been studied for some time as basic approximations to ergodic stationary probability models for single graphs. A huge number of fitting methods have been developed for these models, some of which we will touch on.
The mean field method, in which an increasing number of parameters must be fitted, is used not only for multiple membership block models but also in applications such as LDA. If the graph is too large, poor behaviour of the method can be seen. We have developed what we call "patch methods" for fitting, which help both computationally and inferentially in such situations, but much further analysis is needed.
It is intuitively clear but mathematically unclear how knowledge that the model has nested scales helps in fitting large scale as opposed to small scale parameters. We will discuss this issue through an example.
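As a concrete point of reference (not the mean-field or patch methods from the talk), one standard way to fit a K-block stochastic block model is spectral clustering of the adjacency matrix followed by plug-in estimation of the block connectivity probabilities; a sketch:

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def fit_sbm_spectral(A, K):
    """Baseline block-model fit: spectral clustering + plug-in block probabilities."""
    # Leading eigenvectors of the symmetric adjacency matrix.
    vals, vecs = eigsh(A.astype(float), k=K, which="LA")
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(vecs)
    # Plug-in estimate of the K x K connectivity matrix.
    B = np.zeros((K, K))
    for a in range(K):
        for b in range(K):
            block = A[np.ix_(labels == a, labels == b)]
            if a == b:
                n_a = block.shape[0]
                B[a, b] = block.sum() / max(n_a * (n_a - 1), 1)  # no self-loops
            else:
                B[a, b] = block.mean() if block.size else 0.0
    return labels, B

# Toy two-block network.
rng = np.random.default_rng(0)
n, K = 200, 2
z = rng.integers(0, K, size=n)
P = np.where(z[:, None] == z[None, :], 0.15, 0.03)
A = rng.binomial(1, np.triu(P, 1)); A = A + A.T
labels, B_hat = fit_sbm_spectral(A, K)
print(B_hat)
```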
|
![]() |
STSW04 |
26th June 2018 14:00 to 14:45 |
Marten Herman Wegkamp |
Adaptive estimation of the rank of the regression coefficient matrix
We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion does not require estimation of the unknown variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and of covariates grow much faster than the number of observations. An extensive simulation study confirms our theoretical findings in both low and high dimensional settings.
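To make the setting concrete, the sketch below fits a reduced-rank regression by truncating the SVD of the fitted values; the rank is chosen here by a simple singular-value threshold, which is only a stand-in for the self-tuning, fully data-driven criterion described in the talk.

```python
import numpy as np

def reduced_rank_regression(X, Y, threshold):
    """Fit the OLS coefficient matrix, then truncate the SVD of the fitted
    values at the singular values exceeding `threshold` (a crude rank choice)."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U, s, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    r = int(np.sum(s > threshold))
    P = Vt[:r].T @ Vt[:r]             # project responses onto the rank-r subspace
    return B_ols @ P, r

rng = np.random.default_rng(0)
n, p, q, r_true = 200, 10, 8, 2
B_true = rng.normal(size=(p, r_true)) @ rng.normal(size=(r_true, q))
X = rng.normal(size=(n, p))
Y = X @ B_true + rng.normal(size=(n, q))
# Heuristic threshold on the noise scale; purely illustrative.
B_hat, r_hat = reduced_rank_regression(X, Y, threshold=np.sqrt(n) + np.sqrt(q))
print("estimated rank:", r_hat)
```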
This is joint work with Xin Bing.
|
![]() |
STSW04 |
26th June 2018 14:45 to 15:30 |
Maryam Fazel |
Competitive Online Algorithms for Budgeted Allocation with Application to Online Experiment Design
We consider an online resource allocation problem, where the goal is to maximize a function of a positive semidefinite (PSD) matrix with a scalar budget constraint. The problem data arrives online, and the algorithm needs to make an irrevocable decision at each step. Of particular interest are classic experiment design problems in the online setting, with the algorithm deciding whether to allocate budget to each experiment as new experiments become available sequentially.
We analyze two classes of primal-dual algorithms and provide bounds on their competitive ratios. Our analysis relies on a smooth surrogate of the objective function that needs to satisfy a new diminishing-returns property. We then formulate a convex optimization problem to directly optimize this competitive ratio bound; this design problem can be solved offline before the data start arriving. We give examples of computing the smooth surrogate for D-optimal and A-optimal experiment design.
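For intuition about the online D-optimal setting, the sketch below implements a naive greedy rule (accept an arriving experiment if its marginal log-determinant gain per unit cost exceeds a fixed threshold, while the budget lasts); the threshold and costs are illustrative, and this is not the primal-dual algorithm or the optimized surrogate analyzed in the talk.

```python
import numpy as np

def greedy_online_d_optimal(experiments, costs, budget, eps=1e-3, threshold=0.5):
    """Naive online rule for budgeted D-optimal selection."""
    d = experiments.shape[1]
    M = eps * np.eye(d)                               # regularized information matrix
    spent, chosen = 0.0, []
    for t, (x, c) in enumerate(zip(experiments, costs)):
        # Matrix determinant lemma: log det(M + x x^T) - log det(M).
        gain = np.log1p(x @ np.linalg.solve(M, x))
        if spent + c <= budget and gain / c >= threshold:
            M += np.outer(x, x)
            spent += c
            chosen.append(t)
    return chosen, M

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                         # experiments arriving online
costs = rng.uniform(0.5, 2.0, size=500)
chosen, M = greedy_online_d_optimal(X, costs, budget=30.0)
print(len(chosen), "experiments chosen, log det:", round(np.linalg.slogdet(M)[1], 2))
```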
This is joint work with Reza Eghbali and James Saunderson.
|
![]() |
STSW04 |
26th June 2018 16:00 to 16:45 |
Alex d'Aspremont |
An Approximate Shapley-Folkman Theorem.
with Thomas Kerdreux and Igor Colin.
The Shapley-Folkman theorem shows that Minkowski averages of uniformly bounded sets tend to be convex when the number of terms in the sum becomes much larger than the ambient dimension. In optimization, Aubin and Ekeland (1976) show that this produces an a priori bound on the duality gap of separable nonconvex optimization problems involving finite sums. This bound is highly conservative and depends on unstable quantities, and we relax it in several directions to show that nonconvexity can have a much milder impact on finite sum minimization problems such as empirical risk minimization and multi-task classification. As a byproduct, we show a new version of Maurey's classical approximate Carathéodory lemma where we sample a significant fraction of the coefficients, without replacement.
|
![]() |
STSW04 |
27th June 2018 09:00 to 09:45 |
Urvashi Oswal |
Selection and Clustering of Correlated variables using OWL/GrOWL regularizers
In high-dimensional linear regression problems, it is likely that several covariates are highly correlated e.g. in fMRI data, certain voxels or brain regions may have very correlated activation patterns. Using standard sparsity-inducing regularization (such as Lasso) in such scenarios is known to be unsatisfactory, as it leads to the selection of a subset of the highly correlated covariates. For engineering purposes and/or scientific interpretability, it is often desirable to explicitly identify all of the covariates that are relevant for modeling the data. In this talk, I will present clustering properties and error bounds of Ordered Weighted $\ell_1$ (OWL) regularization for linear regression, and a group variant for multi-task regression called Group OWL (GrOWL). I will present the application of OWL/GrOWL in three settings: (1) Brain networks (2) Subspace clustering and (3) Deep Learning. I will demonstrate, in theory and experiments, how OWL/GrOWL deal with strongly correlated covariates by automatically clustering and averaging regression coefficients associated with those covariates.
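The computational primitive behind OWL/SLOPE-type penalties is their proximal operator, which reduces to sorting plus an isotonic projection. The sketch below implements that standard construction (the weight sequence is a hypothetical choice, and this is not the full OWL/GrOWL regression solver from the talk); note how it ties the coefficients of two nearly identical entries.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def prox_owl(v, w):
    """Proximal operator of the OWL/SLOPE penalty sum_i w_i |v|_(i), w nonincreasing.

    Sort |v| decreasingly, subtract the weights, project onto the nonincreasing
    cone (isotonic regression), clip at zero, then undo the sort and signs.
    """
    v = np.asarray(v, float)
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - w
    z = IsotonicRegression(increasing=False).fit_transform(np.arange(len(z)), z)
    z = np.maximum(z, 0.0)
    out = np.zeros_like(v)
    out[order] = z
    return np.sign(v) * out

# Two strongly correlated coefficients get clustered (tied) by the OWL prox.
v = np.array([3.05, 3.0, 0.2, -0.1])
w = np.linspace(2.0, 0.5, num=4)      # hypothetical nonincreasing weight sequence
print(prox_owl(v, w))
```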
This is joint work with Robert Nowak, Mário Figueiredo, Tim Rogers and Chris Cox.
|
![]() |
STSW04 |
27th June 2018 09:45 to 10:30 |
Elizaveta Levina |
Matrix completion in network analysis
Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss three examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for cross-validation on network data, a long-standing problem in network analysis. Two other examples focus on reconstructing incompletely observed networks, with structured missingness resulting from network sampling mechanisms. One scenario we consider is egocentric sampling, where a set of nodes is selected first and then their connections to the entire network are observed. Another scenario focuses on data from surveys, where people are asked to name a given number of friends. We show that matrix completion, when used appropriately, can generally be very helpful in solving network problems.
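For readers unfamiliar with the basic tool, the sketch below runs a generic soft-impute style low-rank completion on a partially observed adjacency matrix; it is a plain matrix-completion baseline with an illustrative tuning value, not the network-specific procedures (cross-validation, egocentric sampling, survey designs) discussed in the talk.

```python
import numpy as np

def soft_impute(A_obs, mask, lam=2.0, n_iter=100):
    """Iterative soft-thresholded SVD (soft-impute) for matrix completion.

    A_obs : observed entries (values where mask == 0 are ignored)
    mask  : 1 where the entry is observed, 0 where it is missing
    lam   : soft-threshold level (hypothetical choice here)
    """
    Z = np.zeros_like(A_obs, dtype=float)
    for _ in range(n_iter):
        filled = mask * A_obs + (1 - mask) * Z       # impute missing entries
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt      # soft-threshold singular values
    return Z

# Toy network with a planted two-block structure, observed on a random subset.
rng = np.random.default_rng(0)
n = 100
z = rng.integers(0, 2, size=n)
P = np.where(z[:, None] == z[None, :], 0.4, 0.1)
A = rng.binomial(1, np.triu(P, 1)); A = A + A.T
mask = rng.binomial(1, 0.5, size=(n, n))
mask = np.triu(mask, 1); mask = mask + mask.T        # symmetric observation pattern
P_hat = soft_impute(A, mask, lam=2.0)
print("mean absolute error vs true P:", round(np.abs(P_hat - P).mean(), 3))
```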
|
![]() |
STSW04 |
27th June 2018 11:00 to 11:45 |
Garvesh Raskutti |
Estimating sparse additive auto-regressive network models
Consider a multivariate time series, which may correspond to spike train responses for multiple neurons in a brain, crime event data across multiple regions, and many other settings. An important challenge associated with these time series models is to estimate an influence network between the d variables, especially when the number of variables d is large, meaning we are in the high-dimensional setting. Prior work has focused on parametric vector auto-regressive models. However, parametric approaches are somewhat restrictive in practice. In this paper, we use the non-parametric sparse additive model (SpAM) framework to address this challenge. Using a combination of beta- and phi-mixing properties of Markov chains and empirical process techniques for reproducing kernel Hilbert spaces (RKHSs), we provide upper bounds on mean-squared error in terms of the sparsity s, the logarithm of the dimension log d, the number of time points T, and the smoothness of the RKHSs. Our rates are sharp up to logarithmic factors in many cases. We also provide numerical experiments that support our theoretical results and display potential advantages of using our non-parametric SpAM framework for a Chicago crime dataset.
|
![]() |
STSW04 |
27th June 2018 11:45 to 12:30 |
Peter Bartlett |
Representation, optimization and generalization properties of deep neural networks
Deep neural networks have improved the state-of-the-art performance for prediction problems across an impressive range of application areas. This talk describes some recent results in three directions. First, we investigate the impact of depth on representational properties of deep residual networks, which compute near-identity maps at each layer, showing how their representational power improves with depth and that the functional optimization landscape has the desirable property that stationary points are optimal. Second, we study the implications for optimization in deep linear networks, showing how the success of a family of gradient descent algorithms that regularize towards the identity function depends on a positivity condition of the regression function. Third, we consider how the performance of deep networks on training data compares to their predictive accuracy, we demonstrate deviation bounds that scale with a certain "spectral complexity," and we compare the behavior of these bounds with the observed performance of these networks in practical problems. Joint work with Steve Evans, Dylan Foster, Dave Helmbold, Phil Long, and Matus Telgarsky. |
![]() |
STSW04 |
28th June 2018 09:00 to 09:45 |
Tong Zhang |
Candidates vs. Noises Estimation for Large Multi-Class Classification Problem
In practice, there has been significant interest in multi-class classification problems where the number of classes is large. Computationally, such applications require statistical methods with run time sublinear in the number of classes. A number of methods such as Noise-Contrastive Estimation (NCE) and variations have been proposed in recent years to address this problem. However, the existing methods are not statistically efficient compared to multi-class logistic regression, which is the maximum likelihood estimate. In this talk, I will describe a new method called Candidates vs. Noises Estimation (CANE) that selects a small subset of candidate classes and samples the remaining classes. We show that CANE is always consistent and computationally efficient. Moreover, the resulting estimator has low statistical variance approaching that of the maximum likelihood estimator when the observed label belongs to the selected candidates with high probability. Extensive experimental results show that CANE achieves better prediction accuracy over a number of state-of-the-art tree classifiers, while it gains a significant speedup compared to standard multi-class logistic regression.
|
![]() |
STSW04 |
28th June 2018 09:45 to 10:30 |
Aurore Delaigle |
Estimating a covariance function from fragments of functional data
Functional data are often observed only partially, in the form of fragments. In that case, the standard approaches for estimating the covariance function do not work because entire parts of the domain are completely unobserved. In previous work, Delaigle and Hall (2013, 2016) have suggested ways of estimating the covariance function, based for example on Markov assumptions. In this work we take a completely different approach which does not rely on such assumptions. We show that, using a tensor product approach, it is possible to reconstruct the covariance function using observations located only on the diagonal of its domain.
|
|
STSW04 |
28th June 2018 11:00 to 11:45 |
Francis Bach |
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model. (Joint work with Loucas Pillaud-Vivien and Alessandro Rudi)
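For illustration only, the toy comparison below runs plain single-sample SGD for least squares with one versus several passes over the data; the step size and the covariance decay are arbitrary choices, and the spectral notions of hardness from the talk are not reproduced here.

```python
import numpy as np

def sgd_least_squares(X, y, n_passes, step=0.02, seed=0):
    """Plain single-sample SGD for least squares, cycling over the data."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= step * grad
    return w

rng = np.random.default_rng(1)
n, d = 2000, 50
X = rng.normal(size=(n, d)) * (1.0 / np.arange(1, d + 1))   # decaying feature scales
w_star = 1.0 / np.arange(1, d + 1)
y = X @ w_star + 0.1 * rng.normal(size=n)

for n_passes in (1, 5, 20):
    w_hat = sgd_least_squares(X, y, n_passes)
    print(n_passes, "passes, excess risk:",
          round(float(np.mean((X @ (w_hat - w_star)) ** 2)), 5))
```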
|
![]() |
STSW04 |
28th June 2018 11:45 to 12:30 |
Edward Ionides |
Monte Carlo adjusted profile likelihood, with applications to spatiotemporal and phylodynamic inference.
Partially observed nonlinear stochastic dynamic systems raise inference challenges. Sequential Monte Carlo (SMC) methods provide a route to accessing the likelihood function. However, despite the advantage of applicability to a wide class of nonlinear models, standard SMC methods have a limitation that they scale poorly to large systems. We present a profile likelihood approach, properly adjusted for Monte Carlo uncertainty, that enables likelihood-based inference in systems for which Monte Carlo error remains large despite stretching the limits of available computational resources. Together with state-of-the-art SMC algorithms, this technique permits effective inference on some scientific problems in panel time series analysis, spatiotemporal modeling, and inferring population dynamic models from genetic sequence data.
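The basic ingredient here, an SMC estimate of the likelihood, can be illustrated with a bootstrap particle filter on a toy linear-Gaussian state-space model; the sketch below evaluates the (Monte Carlo noisy) likelihood along one parameter slice, mimicking a crude profile, and is a generic textbook filter rather than the authors' algorithms.

```python
import numpy as np

def bootstrap_filter_loglik(y, n_particles=500, phi=0.9, sigma=1.0, tau=0.5, seed=0):
    """Bootstrap particle filter log-likelihood estimate for
    x_t = phi * x_{t-1} + sigma * eps_t,  y_t = x_t + tau * nu_t."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), size=n_particles)  # stationary init
    loglik = 0.0
    for y_t in y:
        x = phi * x + sigma * rng.normal(size=n_particles)              # propagate
        logw = -0.5 * ((y_t - x) / tau) ** 2 - np.log(tau * np.sqrt(2 * np.pi))
        c = logw.max()
        w = np.exp(logw - c)
        loglik += c + np.log(w.mean())                                  # likelihood increment
        x = rng.choice(x, size=n_particles, p=w / w.sum())              # resample
    return loglik

# Simulate data, then scan the likelihood over phi.
rng = np.random.default_rng(1)
T, phi_true = 200, 0.9
x, ys = 0.0, []
for _ in range(T):
    x = phi_true * x + rng.normal()
    ys.append(x + 0.5 * rng.normal())
for phi in (0.7, 0.8, 0.9, 0.95):
    print(phi, round(bootstrap_filter_loglik(np.array(ys), phi=phi), 2))
```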
The results presented are joint work with Carles Breto, Joonha Park, Alex Smith and Aaron King.
|
![]() |
STSW04 |
28th June 2018 14:00 to 14:45 |
Jinchi Lv |
Asymptotics of Eigenvectors and Eigenvalues for Large Structured Random Matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.
|
![]() |
STSW04 |
28th June 2018 14:45 to 15:30 |
Rui Castro |
Are there needles in a (moving) haystack? Adaptive sensing for detection and estimation of static and dynamically evolving signals
In this work we investigate the problem of testing for the presence of dynamically evolving sparse signals. This is motivated by settings where the signal of interest changes rapidly while measurements are being collected (e.g., monitoring of the electromagnetic spectrum, or detection of hot spots of a rapidly spreading disease). We model the signal as an n-dimensional vector that has s non-zero components, and at each time a fraction of the non-zero components may change their location. The statistical problem is to decide whether the signal is the zero vector or whether it in fact has non-zero components. This decision is based on m noisy observations of individual signal components. We consider two different sensing paradigms, namely adaptive and non-adaptive sensing. For non-adaptive sensing, the choice of components to measure has to be decided before the data collection process starts, while for adaptive sensing one can adjust the sensing process based on observations collected earlier. We characterize the difficulty of this detection problem in both sensing paradigms, with special attention to the speed of change of the active components. In addition, we provide an adaptive sensing algorithm for this problem and contrast its performance with that of the best non-adaptive detection algorithms. Interestingly, when the number of measurements is small (on the order of n/s), there is a significant difference in performance between the two sensing paradigms, as non-adaptive sensing is unable to exploit the dynamic nature of the signal. (Based on joint work with Ervin Tánczos.)
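To convey the flavour of adaptive sensing for a static signal, the sketch below implements a classical two-stage heuristic in the spirit of distilled sensing: it is not the authors' algorithm, does not handle the dynamically evolving case, and the budget and threshold choices are illustrative.

```python
import numpy as np

def two_stage_adaptive_detect(theta, total_budget, seed=0):
    """Two-stage adaptive sensing heuristic: measure everything once, keep
    components with positive observations, spend the remaining budget on the
    survivors, then threshold the maximum standardized survivor."""
    rng = np.random.default_rng(seed)
    n = theta.size
    # Stage 1: one noisy measurement per component.
    y1 = theta + rng.normal(size=n)
    survivors = np.where(y1 > 0)[0]
    # Stage 2: average repeated measurements over the surviving components.
    reps = max((total_budget - n) // max(len(survivors), 1), 1)
    y2 = theta[survivors] + rng.normal(size=len(survivors)) / np.sqrt(reps)
    # Bonferroni-style detection threshold.
    return float(np.max(np.sqrt(reps) * y2)) > np.sqrt(2 * np.log(n))

rng = np.random.default_rng(1)
n, s, mu = 10_000, 10, 3.0
theta_null = np.zeros(n)
theta_alt = np.zeros(n)
theta_alt[rng.choice(n, s, replace=False)] = mu
print("null signal detected: ", two_stage_adaptive_detect(theta_null, total_budget=2 * n))
print("sparse signal detected:", two_stage_adaptive_detect(theta_alt, total_budget=2 * n))
```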
|
![]() |
STSW04 |
28th June 2018 16:00 to 16:45 |
Pradeep Ravikumar |
Robust Estimation via Robust Gradient Estimation
A common assumption in the training of machine learning systems is that the data is sufficiently clean and well-behaved: there are very few or no outliers, and the distribution of the data does not have very long tails. As machine learning finds wider usage, these assumptions are increasingly indefensible. The key question then is how to perform estimation that is robust to departures from these assumptions; this question is of classical interest, with seminal contributions due to Box, Tukey, Huber, Hampel, and several others. Loosely, there seemed to be a computation-robustness tradeoff: practical estimators did not have strong robustness guarantees, while estimators with strong robustness guarantees were computationally impractical.
In our work, we provide a new class of computationally efficient estimators for risk minimization that are provably robust in a variety of settings, including arbitrary oblivious contamination and heavy-tailed data, among others. Our workhorse is a novel robust variant of gradient descent, and we provide conditions under which our gradient descent variant provides accurate and robust estimators in any general convex risk minimization problem. These results provide some of the first computationally tractable and provably robust estimators for general statistical models.
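The overall recipe can be sketched as gradient descent in which the sample-mean gradient is replaced by a robust aggregate; below, a coordinate-wise trimmed mean stands in for the paper's more refined robust gradient estimators, applied to linear regression with a few gross outliers.

```python
import numpy as np
from scipy.stats import trim_mean

def robust_gd_linear_regression(X, y, step=0.1, n_iter=200, trim=0.1):
    """Gradient descent for least squares with a coordinate-wise trimmed-mean
    gradient estimate standing in for a robust gradient oracle."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        per_sample_grads = (X @ w - y)[:, None] * X     # n x d matrix of gradients
        g = trim_mean(per_sample_grads, trim, axis=0)   # robust aggregation
        w -= step * g
    return w

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
w_star = np.ones(d)
y = X @ w_star + 0.1 * rng.normal(size=n)
y[:25] += 50.0                                          # 5% gross outliers
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_rob = robust_gd_linear_regression(X, y)
print("OLS error:", round(float(np.linalg.norm(w_ols - w_star)), 3),
      "robust GD error:", round(float(np.linalg.norm(w_rob - w_star)), 3))
```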
Joint work with Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan.
|
![]() |
STSW04 |
29th June 2018 09:00 to 09:45 |
Alexandre Tsybakov |
Does data interpolation contradict statistical optimality?
We exhibit estimators that interpolate the data and yet achieve optimal rates of convergence for the problems of nonparametric regression and prediction with square loss. This curious observation goes against the usual intuition that a good statistical procedure should forgo the exact fit to data in favor of a smoother representation.
A motivation for this work is the recent focus within the machine learning community on the out-of-sample performance of neural networks. These flexible models are typically trained to fit the data exactly (either in their sign or in the actual value), and yet they have good prediction behaviour. This talk is based on joint work with Misha Belkin and Sasha Rakhlin.
|
|
STSW04 |
29th June 2018 09:45 to 10:30 |
Vincent Vu |
Group invariance and computational sufficiency
Statistical sufficiency formalizes the notion of data reduction. In the decision theoretic interpretation, once a model is chosen all inferences should be based on a sufficient statistic. However, suppose we start with a set of methods that share a sufficient statistic rather than a specific model. Is it possible to reduce the data beyond the statistic and yet still be able to compute all of the methods? In this talk, I'll present some progress towards a theory of "computational sufficiency" and show that strong reductions _can_ be made for large classes of penalized M-estimators by exploiting hidden symmetries in the underlying optimization problems. These reductions can (1) enable efficient computation and (2) reveal hidden connections between seemingly disparate methods. As a main example, I'll show how the theory provides a surprising answer to the following question: "What do the Graphical Lasso, sparse PCA, single-linkage clustering, and L1 penalized Ising model selection all have in common?"
|
![]() |
STSW04 |
29th June 2018 11:00 to 11:45 |
Richard Samworth |
Data perturbation for data science
When faced with a dataset and a problem of interest, should we propose a statistical model and use that to inform an appropriate algorithm, or dream up a potential algorithm and then seek to justify it? The former is the more traditional statistical approach, but the latter appears to be becoming more popular. I will discuss a class of algorithms that belong in the second category, namely those that involve data perturbation (e.g. subsampling, random projections, artificial noise, knockoffs,...). As examples, I will consider Complementary Pairs Stability Selection for variable selection and sparse PCA via random projections.
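As one concrete data-perturbation example, the sketch below computes selection proportions for complementary pairs stability selection with the lasso as the base selector; the regularization level, number of pairs, and selection threshold are illustrative rather than the tuned quantities from the talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

def complementary_pairs_stability_selection(X, y, alpha=0.05, n_pairs=50, seed=0):
    """Selection proportions over complementary half-sample pairs (CPSS-style)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_pairs):
        perm = rng.permutation(n)
        for half in (perm[: n // 2], perm[n // 2 : 2 * (n // 2)]):
            coef = Lasso(alpha=alpha).fit(X[half], y[half]).coef_
            counts += (coef != 0)
    return counts / (2 * n_pairs)

rng = np.random.default_rng(1)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)
freq = complementary_pairs_stability_selection(X, y)
print("variables selected at threshold 0.9:", np.where(freq >= 0.9)[0])
```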
This will involve joint work with Rajen Shah, Milana Gataric and Tengyao Wang.
|
![]() |
STSW04 |
29th June 2018 11:45 to 12:30 |
Domagoj Ćevid, Peter Bühlmann |
Deconfounding using Spectral Transformations
High-dimensional regression methods which rely on the sparsity of the ground truth, such as the Lasso, might break down in the presence of confounding variables. If a latent variable affects both the response and the predictors, the correlation between them changes. This phenomenon can be represented as a linear model where the sparse coefficient vector has been perturbed. We will present our work on this problem. We investigate and propose some spectral transformations for the data which serve as input for the Lasso. We discuss assumptions for achieving the optimal error rate and illustrate the performance on a genomic dataset. The approach is easy to use and leads to convincing results. The talk is based on joint work with Nicolai Meinshausen.
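One simple spectral transformation in this spirit caps the large singular values of the design matrix before running the Lasso; the sketch below is a rough "trim"-type illustration on a toy confounded model, not necessarily the exact transformations or guarantees from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def trim_transform(X, y):
    """Cap the singular values of X at their median and apply the same
    linear map to y before running a sparse regression."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_cap = np.minimum(s, np.median(s))
    F = U @ np.diag(s_cap / s) @ U.T          # spectral transformation
    return F @ X, F @ y

# Toy confounded model: a latent variable H affects both X and y.
rng = np.random.default_rng(0)
n, p = 300, 200
H = rng.normal(size=(n, 2))
X = rng.normal(size=(n, p)) + H @ rng.normal(size=(2, p))
beta = np.zeros(p); beta[:5] = 1.0
y = X @ beta + H @ np.array([3.0, -3.0]) + rng.normal(size=n)

X_t, y_t = trim_transform(X, y)
b_plain = Lasso(alpha=0.1).fit(X, y).coef_
b_deconf = Lasso(alpha=0.1).fit(X_t, y_t).coef_
print("plain lasso error:", round(float(np.linalg.norm(b_plain - beta)), 2),
      "after transform:", round(float(np.linalg.norm(b_deconf - beta)), 2))
```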
|
![]() |