09:00 to 09:45 Yi Yu: Optimal Covariance Change Point Detection in High Dimension
Co-authors: Daren Wang (Carnegie Mellon University), Alessandro Rinaldo (Carnegie Mellon University)
In this paper, we study the covariance change point detection problem in high dimension. Specifically, we assume that the time series $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, consists of independent $p$-dimensional sub-Gaussian random vectors and that the corresponding covariance matrices $\{\Sigma_i\}_{i=1}^n$ are stationary within segments and change only at certain time points. Our generic model setting allows $p$ to grow with $n$, and we place no additional structural assumptions on the covariance matrices. We introduce algorithms based on binary segmentation (e.g. Vostrikova, 1981) and wild binary segmentation (Fryzlewicz, 2014) and establish consistency results under suitable conditions. To improve detection performance in high dimension, we propose Wild Binary Segmentation through Independent Projection (WBSIP). We show that WBSIP can optimally estimate the locations of the change points. Our analysis also reveals a phase transition effect under our generic model assumptions; to the best of our knowledge, results of this type have not been established elsewhere in the change point detection literature.
INI 1

09:45 to 10:30 Alexandre Tsybakov: Adaptive estimation of functionals under sparsity
Adaptive estimation of functionals in the sparse mean model and in sparse regression exhibits some interesting effects. This talk focuses on estimation of the $L_2$-norm of the target vector and of the variance of the noise. In the first problem, ignorance of the noise level changes the rates. Moreover, the form of the noise distribution also influences the optimal rate. For example, the rates of estimating the variance differ depending on whether the noise is Gaussian or merely sub-Gaussian without precise knowledge of the distribution.
Finally, for the sparse mean model, the sub-Gaussian rate cannot be attained adaptively to the noise level on classes of noise distributions with polynomial tails, regardless of how fast the polynomial decay is. Joint work with O. Collier and L. Comminges.
INI 1

10:30 to 11:00 Morning Coffee

11:00 to 11:45 Claire Boyer: On the gap between local recovery guarantees in structured compressed sensing and oracle estimates
This is ongoing work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby). First we will introduce a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees ensuring reconstruction of a structured sparse signal of interest while imposing structure on the acquisition (no Gaussian measurements here...). We will study how far these CS results are from oracle-type guarantees, and we will show that they are similar in terms of the required number of measurements. These results give insight into the design of new optimal sampling strategies when realistic physical constraints are imposed on the acquisition.
INI 1

11:45 to 12:30 Ioannis Kontoyiannis: Small Big Data: Temporal structure in discrete time series
The identification of useful temporal structure in discrete time series is an important component of algorithms used for many tasks in statistical inference and machine learning. Most of the early approaches developed were ineffective in practice, because the amount of data required for reliable modelling grew exponentially with the memory length. On the other hand, many of the more modern methodological approaches that use more flexible and parsimonious models result in algorithms that do not scale well and are computationally ineffective for larger data sets. We will discuss a class of novel methodological tools for effective Bayesian inference and model selection for general discrete time series, which offer promising results on both small and big data.
Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the resulting posterior and predictive distributions without resorting to MCMC. We have applied the resulting algorithms and methodological tools to numerous application-specific tasks (including online prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data sets from different areas of application, including data compression, neuroscience, finance, genetics, and animal communication. Results on both simulated and real data will be presented, and brief comparisons with other approaches (including Bühlmann et al.'s VLMC, Ron et al.'s PST, and Raftery's MTD) will be discussed.
INI 1

12:30 to 14:00 Lunch @ Churchill College

14:00 to 14:45 Judith Rousseau: Asymptotics for ABC algorithms
Approximate Bayesian Computation (ABC) algorithms are used in cases where the likelihood is intractable. To simulate from the (approximate) posterior distribution, one possibility is to sample new data from the model and check whether these new data are close, in some sense, to the observed data. The output of the algorithm thus depends on how we define the notion of closeness, which is based on a choice of summary statistics and on a threshold. In this work we study the behaviour of the algorithm under the assumption that the summary statistics concentrate on some deterministic quantity, and we characterize the asymptotic behaviour of the resulting approximate posterior distribution in terms of the threshold and the rate of concentration of the summary statistics. The case of misspecified models is also treated; there we show that surprising asymptotic behaviour appears.
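The accept/reject scheme described in the ABC abstract can be sketched in a few lines. The example below is purely illustrative, not from the talk: it assumes a toy Gaussian mean model, a flat uniform prior, the sample mean as the summary statistic, and a Euclidean closeness threshold `eps`; the function name `abc_rejection` and all numerical choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(obs, n_draws=20000, eps=0.05):
    """Rejection ABC for a toy model: data ~ N(theta, 1), flat prior on theta."""
    s_obs = obs.mean()                                 # summary statistic of observed data
    theta = rng.uniform(-5, 5, n_draws)                # draws from the prior
    sims = rng.normal(theta[:, None], 1.0, (n_draws, obs.size))
    # accept a draw if its simulated summary is within eps of the observed one
    keep = np.abs(sims.mean(axis=1) - s_obs) <= eps
    return theta[keep]                                 # approximate posterior sample

obs = rng.normal(1.5, 1.0, 200)                        # synthetic data, true theta = 1.5
post = abc_rejection(obs)
```

As the abstract indicates, the output depends entirely on the choice of summary statistic and threshold: shrinking `eps` sharpens the approximate posterior around the value on which the summary statistic concentrates, at the cost of a lower acceptance rate.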
INI 1

14:45 to 15:30 Andreas Elsener: Sharp oracle inequalities for stationary points of nonconvex penalised M-estimators
Co-author: Sara van de Geer (ETH Zurich)
Nonconvex loss functions are used in several areas of statistics and machine learning, and they have several appealing properties, for instance in robust regression. We propose a general framework to derive sharp oracle inequalities for stationary points of penalised M-estimators with nonconvex loss. The penalisation term is assumed to be a weakly decomposable norm. We apply the general framework to sparse (additive) corrected linear regression, sparse PCA, and sparse robust regression. Finally, a new estimator called "robust SLOPE" is proposed to illustrate how to apply our framework to norms other than the $\ell_1$-norm.
INI 1

15:30 to 16:00 Afternoon Tea

16:00 to 16:45 Rajen Shah: The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main effects and include interactions as products between individual variables. However, if the number of variables $p$ is large, as is common in genomic datasets, the computational cost of searching through $O(p^2)$ interactions can be prohibitive. In this talk I will introduce a new randomised algorithm called xyz that is able to discover interactions with high probability and, under mild conditions, has a runtime that is subquadratic in $p$. The underlying idea is to transform interaction search into a much simpler close-pairs-of-points problem. We will see how strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $O(p^u)$ operations for some $1 < u < 2$.
INI 1

09:00 to 09:45 Alessandro Rudi: Falkon: fast and optimal kernel method for large scale machine learning
Kernel methods provide a principled way to perform nonlinear, nonparametric learning.
They rely on solid functional-analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large-scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large-scale datasets shows that, even on a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures.
INI 1

09:45 to 10:30 Philippe Rigollet: Learning determinantal point processes
Co-authors: Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT), John Urschel (MIT)
Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behaviour and lend themselves naturally to many tasks in machine learning (such as recommendation systems) where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. In this talk, I will present recent results related to this problem, specifically:
- Rates of convergence for the maximum likelihood estimator: by studying the local and global geometry of the expected log-likelihood function, we are able to establish rates of convergence for the MLE and give a complete characterization of the cases where these are parametric. We also give a partial description of the critical points of the expected log-likelihood.
- Optimal rates of convergence for this problem: these are achievable by the method of moments and are governed by a combinatorial parameter, which we call the cycle sparsity.
- A fast combinatorial algorithm to implement the method of moments efficiently.
The necessary background on DPPs will be given in the talk. Joint work with Victor-Emmanuel Brunel (MIT), Ankur Moitra (MIT) and John Urschel (MIT).
INI 1

10:30 to 11:00 Morning Coffee

11:00 to 11:45 Shahar Mendelson: An optimal unrestricted learning procedure
The question of prediction is of central importance in Statistical Learning Theory. The optimal tradeoff between accuracy and confidence, and the identity of a procedure that attains that optimal tradeoff, have been studied extensively over the years. In this talk I present some of the ideas used in the recent solution of this problem: a procedure $\hat{f}$ that attains the best possible accuracy/confidence tradeoff for a triplet $(F,X,Y)$ consisting of an arbitrary class $F$, an unknown distribution $X$ and an unknown target $Y$. Specifically, I explain why, under minimal assumptions on $(F,X,Y)$, there is a natural critical level $r^*$ that depends on the triplet and the sample size $N$, such that for any accuracy parameter $r > r^*$, one has $E((\hat{f}(X)-Y)^2 \mid D) \leq \inf_{f \in F} E(f(X)-Y)^2 + r^2$ with probability at least $1 - 2\exp(-cN \min\{1, r^2/\sigma^2\})$, where $\sigma^2 = \inf_{f \in F} E(f(X)-Y)^2$.
INI 1

11:45 to 12:30 Garvesh Raskutti: Variable selection and classification with large-scale presence only data
Co-author: Hyebin Song (University of Wisconsin-Madison)
In various real-world problems, we are presented with positive and unlabelled data, referred to as presence-only responses, where the number of covariates $p$ is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges.
In this paper, we develop the \emph{PUlasso} algorithm for variable selection and classification with positive and unlabelled responses. Our algorithm uses the majorization-minimization (MM) framework, a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide theoretical guarantees: we first show that our algorithm is guaranteed to converge to a stationary point, and then prove that any stationary point achieves the minimax optimal mean-squared error of $\frac{s \log p}{n}$, where $s$ is the sparsity of the true parameter. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate-$p$ settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example.
Related links: https://arxiv.org/abs/1711.08129 (arXiv paper)
INI 1

12:30 to 14:00 Lunch @ Churchill College
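The majorization-minimization idea underlying PUlasso can be illustrated on a small, self-contained problem. The sketch below is emphatically not the PUlasso surrogate: it applies the same MM template to a hypothetical nonconvex Cauchy-loss regression, where the tangent majorizer of the concave function $\log(1+u)$ turns each MM step into a weighted least-squares solve (the function names and the toy data are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)

def mm_cauchy_regression(X, y, n_iter=50):
    """Minimize the nonconvex Cauchy loss sum_i log(1 + r_i^2) by MM.

    Since log(1+u) <= log(1+u_t) + (u - u_t)/(1 + u_t) (concavity),
    each MM step minimizes a quadratic surrogate, i.e. a weighted
    least-squares problem with weights w_i = 1 / (1 + r_i^2).
    """
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS starting point
    for _ in range(n_iter):
        r = y - X @ b
        w = 1.0 / (1.0 + r ** 2)                      # majorizer weights at b_t
        sw = np.sqrt(w)
        b = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return b

# toy data with heavy-tailed noise, true coefficients (1, 2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=2, size=n)
b_hat = mm_cauchy_regression(X, y)
```

The key MM property the abstract relies on holds here too: because the surrogate touches the objective at the current iterate and lies above it everywhere, each step can only decrease the nonconvex loss, so the iterates converge to a stationary point.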