# Workshop Programme

## for period 23 - 27 June 2008

### Future Directions in High-Dimensional Data Analysis


Timetable

#### Monday 23 June

**08:30-09:55** Registration

**08:30-17:00** Poster session

**09:55-10:00** David Wallace: Welcome

*Chair: Iain Johnstone*

**10:00-11:00** Hall, PG (Melbourne): *Variable selection in very high dimensional regression and classification* (Sem 1)

**11:00-11:30** Coffee

**11:30-12:30** Gather, U (Dortmund): *Dimension reduction* (Sem 1)

Joint work with Charlotte Guddat. Progress in computer science over the last decades has led to "floods of data" which can be stored and must be handled to extract the information of interest. As an example, consider data from the field of genetics, where the dimension may run into the thousands. Classical statistical tools cannot cope with this situation, so a number of dimension reduction procedures have been developed for use with nonparametric regression. The aim is to find a subspace of the predictor space which is of much lower dimension but still contains the important information on the relation between response and predictors. We will review a number of procedures for dimension reduction (e.g. SIR, SAVE) in multiple regression and consider them under robustness aspects as well. As a special case we include methods for variable selection (e.g. EARTH, SIS) and introduce a new robust approach for the case when n is much smaller than p. [A schematic sketch of classical SIR follows this day's programme.]

**12:30-13:30** Lunch at Wolfson Court

*Chair: Richard Samworth*

**14:00-15:00** Meinshausen, N (UC Berkeley): *Stability-based regularisation* (Sem 1)

The properties of L1-penalised regression have been examined in detail in recent years. I will review some of the developments for sparse high-dimensional data, where the number of variables p is potentially very much larger than the sample size n. The necessary conditions for convergence are less restrictive when looking for convergence in the L2-norm than in the L0-quasi-norm, and I will discuss some implications of these results. These promising theoretical developments notwithstanding, it is often observed in practice that solutions are highly unstable: running the same model selection procedure on a new set of samples, or indeed a subsample, can change the results drastically. The choice of a proper regularisation parameter is also not obvious in practice, especially if one is primarily interested in structure estimation and only secondarily in prediction. Preliminary results suggest, though, that the stability or instability of results is informative when looking for suitable data-adaptive regularisation. [A minimal subsampling-stability sketch follows this day's programme.]

**15:00-15:30** Tea

**15:30-16:30** Cai, T (Pennsylvania): *Large-scale multiple testing: finding needles in a haystack* (Sem 1)

Due to advances in technology, it has become increasingly common in scientific investigations to collect vast amounts of data with complex structures; examples include microarray studies, fMRI analysis, and astronomical surveys. The analysis of these data sets poses many statistical challenges not present in smaller-scale studies, as thousands or even millions of hypotheses must often be tested simultaneously. Conventional multiple testing procedures are based on thresholding the ordered p-values. In this talk, we consider large-scale multiple testing from a compound decision-theoretic point of view, treating it as a constrained optimisation problem. The solution to this optimisation problem yields an oracle procedure. A data-driven procedure is then constructed to mimic the performance of the oracle and is shown to be asymptotically optimal. In particular, the results show that, although the p-value is appropriate for testing a single hypothesis, it fails to serve as the fundamental building block in large-scale multiple testing. Time permitting, I will also discuss simultaneous testing of grouped hypotheses. This is joint work with Wenguang Sun (University of Pennsylvania). [A sketch of conventional ordered-p-value thresholding follows this day's programme.]

Related links: http://www-stat.wharton.upenn.edu/~tcai/ (download papers here)

**16:30-17:30** Gong Show

**17:30-18:30** Welcome Wine Reception and Poster Session

**18:45-19:30** Dinner at Wolfson Court (Residents only)
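As context for Gather's review, here is a minimal numpy sketch of classical (non-robust) sliced inverse regression. The slice count and number of directions are illustrative defaults, and the robust variants and variable-selection methods discussed in the talk are not shown.

```python
import numpy as np

def sir(X, y, n_slices=10, n_components=2):
    """Classical sliced inverse regression (Li, 1991): estimate the
    effective dimension-reduction subspace from slice means of X given y."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Whiten the predictors via the inverse symmetric square root of Sigma
    # (assumes Sigma is well-conditioned; a ridge term could be added)
    evals, evecs = np.linalg.eigh(Sigma)
    root_inv = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ root_inv
    # Slice on the ordered response and average the whitened X within slices
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M span the whitened e.d.r. subspace;
    # back-transform to the original predictor scale
    _, evecs_M = np.linalg.eigh(M)
    return root_inv @ evecs_M[:, ::-1][:, :n_components]
```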
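Meinshausen's abstract notes that rerunning the same selection procedure on subsamples can change results drastically. Below is a minimal illustration of that instability check using scikit-learn's Lasso on random subsamples; the penalty level, subsample fraction, and repetition count are arbitrary placeholders, not the talk's proposal for data-adaptive regularisation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.1, n_subsamples=100, frac=0.5, seed=0):
    """Refit an L1-penalised regression on random subsamples and record
    how often each variable is selected (non-zero coefficient)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += model.coef_ != 0
    # Frequencies near 1 indicate stably selected variables;
    # frequencies scattered around 0.5 signal the instability discussed above
    return counts / n_subsamples
```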
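The "conventional" procedure Cai's abstract contrasts with the oracle is thresholding of the ordered p-values. For reference, a standard Benjamini-Hochberg step-up sketch is given below; this is not the data-driven optimal procedure of the talk.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Step-up FDR control by thresholding the ordered p-values."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= k*q/m and reject hypotheses 1..k
    below = ranked <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```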
#### Thursday 26 June

*Chair: Sasha Tsybakov*

**09:00-10:00** Wang, J-L (UC Davis): *Covariate adjusted functional principal component analysis for longitudinal data* (Sem 1)

Classical multivariate principal component analysis has been extended to functional data, where it is termed functional principal component analysis (FPCA). Much progress has been made, but most existing FPCA approaches do not accommodate covariate information, and it is the goal of this talk to develop alternative approaches that incorporate covariate information in FPCA, especially for irregular or sparse functional data. Two approaches are studied: the first incorporates covariate effects only through the mean response function, while the second adjusts the covariate effects for both the mean and covariance functions of the response. Both new approaches can accommodate measurement errors and allow data to be sampled on regular or irregular time grids. Asymptotic results are developed and numerical support provided through simulations and a data example. A comparison of the two approaches will also be discussed. [A sketch of basic, covariate-free FPCA follows this day's programme.]

**10:00-11:00** Koltchinskii, V (Georgia Institute of Technology): *Penalized empirical risk minimization and sparse recovery problems* (Sem 1)

A number of problems in regression and classification can be stated as penalized empirical risk minimization over a linear span or a convex hull of a given dictionary, with a convex loss and a convex complexity penalty such as, for instance, the $\ell_1$-norm. We will discuss several oracle inequalities showing how the error of the solution of such problems depends on the "sparsity" of the problem and the "geometry" of the dictionary.

**11:00-11:30** Coffee

**11:30-12:30** Wolfe, P (Harvard): *The Nystrom extension and spectral methods in learning: low-rank approximation of quadratic forms and products* (Sem 1)

Spectral methods are of fundamental importance in statistics and machine learning, as they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. Motivated by such applications, we present here two new algorithms for the approximation of positive semi-definite kernels, together with error bounds that improve upon known results. The first of these, based on sampling, leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach, based on sorting, provides for the selection of a partition in a deterministic way. After detailing their numerical implementation and verifying performance via simulation results for representative problems in statistical data analysis, we conclude with an extension of these results to the sparse representation of linear operators and the efficient approximation of matrix products. [A sketch of the classical Nystrom extension follows this day's programme.]

**12:30-13:30** Lunch at Wolfson Court

*Chair: Doug Nychka*

**14:00-14:20** Pan, G (Eurandom): *Limiting theorems for large dimensional sample means, sample covariance matrices and Hotelling's $T^2$ statistic* (Sem 1)

It is well known that sample means and sample covariance matrices are independent if the samples are i.i.d. from the Gaussian distribution. In this talk, by investigating random quadratic forms involving sample means and sample covariance matrices, we suggest the conjecture that sample means and sample covariance matrices under general distribution functions are asymptotically independent in the large dimensional case, when the dimension of the vectors and the sample size both go to infinity with their ratio being a positive constant. As a byproduct, the central limit theorem for Hotelling's $T^2$ statistic in the large dimensional case is established. [The classical fixed-dimension $T^2$ statistic is sketched after this day's programme.]

**14:20-14:40** Shi, JQ (Newcastle): *Generalised Gaussian process functional regression model* (Sem 1)

In this talk, I will discuss a functional regression problem with a non-Gaussian functional (longitudinal) response and functional predictors. This type of problem includes, for example, binomial and Poisson response data, occurring in many biomedical and engineering experiments. We propose a generalised Gaussian process functional regression model for such regression situations. We suppose that there exists an underlying latent process between the inputs and the response. The latent process is defined by a Gaussian process functional regression model, which is connected with the stepwise response data by means of a link function.

**14:40-15:00** Wang, Y (NSF): *Estimation of large volatility matrix for high-frequency financial data* (Sem 1)

Statistical theory for estimating large covariance matrices shows that, even for noiseless synchronised high-frequency financial data, the existing realised-volatility-based estimators of the integrated volatility matrix of p assets are inconsistent for large p (the number of assets) and large n (the sample size for high-frequency data). This paper proposes new types of estimators of the integrated volatility matrix for noisy, non-synchronised high-frequency data. We show that when both n and p go to infinity with p/n approaching a constant, the proposed estimators are consistent with good convergence rates. Our simulations demonstrate the excellent performance of the proposed estimators under complex stochastic volatility matrices. We have applied the methods to high-frequency data with over 600 stocks.

**15:00-15:30** Tea

**15:30-16:30** Barber, D (UC London): *Graph decomposition for community identification and covariance constraints* (Sem 1)

A common task in large databases is to find well-connected clusters of nodes in an undirected graph in which a link represents interaction between objects: finding tight-knit communities in social networks, identifying related product clusters in collaborative filtering, or finding genes which collaborate in different biological functions. Unlike graph partitioning, in this scenario an object may belong to more than one community; for example, a person might belong to more than one group of friends, or a gene may be active in more than one gene network. I'll discuss an approach to identifying such overlapping communities based on extending the incidence matrix decomposition of a graph to a clique decomposition. Clusters are then identified by approximate variational (mean-field) inference in a related probabilistic model. The resulting decomposition has the side effect of enabling a parameterisation of positive definite matrices under zero constraints on entries in the matrix. Provided the graph corresponding to the constraints is decomposable, all such matrices are reachable by this parameterisation. In the non-decomposable case, we show how the method forms an approximation of the space and relate it to more standard latent variable parameterisations of zero-constrained covariances.
**16:30-17:30** Levina, E (Michigan): *Permutation-invariant covariance regularisation in high dimensions* (Sem 1)

Estimation of covariance matrices has a number of applications, including principal component analysis, classification by discriminant analysis, and inferring independence and conditional independence between variables; yet the sample covariance matrix has many undesirable features in high dimensions unless regularised. Recent research has mostly focused on regularisation in situations where the variables have a natural ordering. When no such ordering exists, regularisation must be performed in a way that is invariant under variable permutations. This talk will discuss several new sparse covariance estimators that are invariant to variable permutations. We obtain convergence rates that make explicit the trade-offs between the dimension, the sample size, and the sparsity of the true model, and illustrate the methods on simulations and real data. We will also discuss a method for finding a "good" ordering of the variables when it is not provided, based on Isomap, a manifold projection algorithm. The talk includes joint work with Adam Rothman, Amy Wagaman, Ji Zhu (University of Michigan) and Peter Bickel (UC Berkeley). [A hard-thresholding sketch follows this day's programme.]

**19:30-23:00** Conference Dinner - Lucy Cavendish College (Hall)
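For orientation on Wang's talk, here is a textbook FPCA sketch for densely observed curves on a common grid. It includes no covariate adjustment, measurement-error handling, or sparse-data machinery; those are the subject of the talk, not of this snippet.

```python
import numpy as np

def fpca(curves, n_components=3):
    """Basic FPCA for curves observed on a shared dense grid:
    eigendecompose the sample covariance of the centred curves."""
    # curves: array of shape (n_subjects, n_timepoints)
    mean_curve = curves.mean(axis=0)
    centred = curves - mean_curve
    cov = centred.T @ centred / (curves.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    # Reverse to descending eigenvalue order; columns are eigenfunctions
    components = evecs[:, ::-1][:, :n_components]
    scores = centred @ components  # principal component scores per subject
    return mean_curve, components, scores
```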
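As background for Wolfe's talk: the classical Nystrom extension reconstructs a positive semi-definite kernel from a subset of its columns. The sketch below takes the landmark indices as given and does not implement the talk's two new partition-selection algorithms or their error bounds.

```python
import numpy as np

def nystrom(K, landmarks):
    """Nystrom low-rank approximation of a PSD kernel matrix:
    K is approximated by C @ pinv(W) @ C.T, where C holds the sampled
    columns and W is their intersection block."""
    C = K[:, landmarks]
    W = K[np.ix_(landmarks, landmarks)]
    return C @ np.linalg.pinv(W) @ C.T
```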
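The object of Pan's central limit theorem is Hotelling's $T^2$; its classical fixed-dimension form is sketched below. When p is comparable to n the sample covariance may be ill-conditioned or singular, which is exactly the large-dimensional regime the talk addresses.

```python
import numpy as np

def hotelling_t2(X, mu0):
    """Classical Hotelling T^2 statistic for testing the mean vector mu0:
    T^2 = n * (xbar - mu0)' S^{-1} (xbar - mu0)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)  # sample covariance (assumed invertible)
    diff = xbar - mu0
    return n * diff @ np.linalg.solve(S, diff)
```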
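One simple permutation-invariant sparse covariance estimator of the kind Levina's talk studies is hard thresholding of the sample covariance. The sketch below is generic: the threshold t is an arbitrary input rather than the theoretically calibrated choices from the talk.

```python
import numpy as np

def threshold_covariance(X, t):
    """Hard-thresholded sample covariance: zero out off-diagonal entries
    below t in magnitude. Invariant under variable permutations."""
    S = np.cov(X, rowvar=False)
    T = np.where(np.abs(S) >= t, S, 0.0)
    # Always keep the diagonal (variances), regardless of the threshold
    np.fill_diagonal(T, np.diag(S))
    return T
```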