Workshop Programme
for period 7  11 January 2008
Contemporary Frontiers in HighDimensional Statistical Data Analysis
7  11 January 2008
Timetable
Monday 7 January  
08:3009:55  Registration  
09:5510:00  David Wallace  Welcome  
Chair: Mike Titterington  
10:0011:00  Donoho, D (Stanford)  
Breakdown point of model selection when the number of variables exceeds the number of observations  Sem 1  
11:0011:30  Coffee  
11:3012:30  van de Geer, S (Zurich)  
The deterministic lasso  Sem 1  
We study highdimensional generalized linear models. and risk minimization using the Lasso. The risk is taken under a random probability measure P' and the target is an overall minimizer of the risk under some other nonrandom probability measure P. We restrict ourselves to a set S where P' and P are close to each other, and present an oracle inequality under a socalled compatibility condition between the L_2 norm and l_1 norm. 

12:3013:30  Lunch at Wolfson Court  
Chair: Sara van de Geer  
14:0015:00  Wegman, E (George Mason)  
Methods for visualizing high dimensional data  Sem 1  
In this presentation, we review some fundamentals of visualization and then proceed to describe methods and combinations of methods useful for visualizing high dimensional data. Some methods include parallel coordinates, smooth interpolations of parallel coordinates, grand tours including wrapping tours, fractal tours, pseudogrand tours, and pixel tours. 

15:0015:30  Tea  
15:3016:30  Young, A (Imperial)  
Bootstrap and parametric inference: successes and challenges  Sem 1  
We review parametric frequentist inference as it has developed over the last 25 years or so. Two main strands have emerged: analytic procedures based on smallsample asymptotics and simulation (bootstrap) approaches. We argue that the latter yield, with appropriate handling of nuisance parameters, a simple and flexible methodology, yet one which nevertheless retains the finer inferential components of parametric theory in an automatic fashion. Performance of the bootstrap methods, even in problems with highdimensional parameters but small data sample sizes, points in favour of their being the method of choice in complex settings, such as those motivating this programme. Related Links


16:3017:30  Short advertisements by poster presenters "Gong Show"  
17:3018:30  Welcome Wine Reception and Poster session  
18:4519:30  Dinner at Wolfson Court (Residents only) 
Tuesday 8 January  
Chair: Di Cook  
09:0010:00  Wainwright, M (Berkeley)  
Practical and informationtheoretic limitations in highdimensional inference  Sem 1  
This talk considers questions of two types concerning highdimensional inference. First, given a practical (polynomialtime) algorithm, what are the limits of its performance? Second, how do such practical limitations compare to informationtheoretic bounds, which apply to the performance of any algorithm regardless of computational complexity? We analyze these issues in highdimensional versions of two canonical inference problems: (a) support recovery in sparse regression; and (b) the sparse PCA or eigenvector problem. For the sparse regression problem, we describe a sharp threshold on the sample size n that controls success/failure of \ell_1 constrained quadratic programming (the Lasso), as function of the problem size p, and sparsity index k (number of nonzero entries). Using informationtheoretic methods, we prove that the Lasso is orderoptimal for sublinear sparsity (vanishing k/p), but suboptimal for linear sparsity (k/p bounded away from zero). For the sparse eigenvector problem, we analyze a semidefinite programming relaxation due to Aspremont et al., and establish a similar transition in failure/success for triplets (n,p,k) tending to infinity. Based on joint works with Arash Amini, John Lafferty, and Pradeep Ravikumar. 

10:0011:00  Samworth, R (Cambridge)  
Some thoughts on nonparametric classification: nearest neighbours, bagging and max likelihood estimation of shapeconstrained densities  Sem 1  
The $k$nearest neighbour rule is arguably the simplest and most intuitively appealing nonparametric classifier. We will discuss recent results on the optimal choice of $k$ in situations where the underlying populations have densities with a certain smoothness in $\mathbb{R}^d$. Extensions to the bagged nearest neighbour classifier, which can be regarded as a weighted $k$nearest neighbour classifier, are also possible, and yield a somewhat suprising comparsion with the unweighted case. Another possibility for nonparametric classification is based on estimating the underlying densities explicitly. An attractive alternative to kernel methods is based on the maximum likelihood estimator, which can be shown to exist if the densities satisfy certain shape constraints, such as logconcavity. We will also discuss an algorithm for computing the estimator in this case, which results in a classifier that is fully automatic yet still nonparametric. Related Links


11:0011:30  Coffee  
11:3012:30  Cook, RD (Minnesota)  
Modelbased sufficient dimension reduction for regression  Sem 1  
Dimension reduction in regression, represented primarily by principal components, is ubiquitous in the applied sciences. This is an old idea that has moved to a position of prominence in recent years because technological advances now allow scientists to routinely formulate regressions in which the number p of predictors is considerably larger than in the past. Although "large" p regressions are perhaps mainly responsible for renewed interest, dimension reduction methodology can be useful regardless of the size of p. Starting with a little history and a definition of "sufficient reductions", we will consider a variety of models for dimension reduction in regression. The models start from one in which maximum likelihood estimation produces principal components, step along a few incremental expansions, and end with forms that have the potential to improve on some standard methodology. This development provides remedies for two concerns that have dogged principal components in regression: principal components are typically computed from the predictors alone and then do not make apparent use of the response, and they are not equivariant under full rank linear transformation of the predictors. Related Links 

12:3013:30  Lunch at Wolfson Court  
Chair: Dennis Cook  
14:0015:00  Jordan, M (Berkeley)  
Kernelbased contrast functions for sufficient dimension reduction  Sem 1  
We present a new methodology for sufficient dimension reduction (the problem of finding a subspace $S$ such that the projection of the covariate vector $X$ onto $S$ captures the statistical dependency of the response $Y$ on $X$). Our methodology derives directly from a formulation of sufficient dimension reduction in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace (cf. Li, 1991; Cook, 1998). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an Mestimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice. Related Links 

15:0015:30  Tea  
15:3016:30  Fan, J (Princeton)  
Challenge of dimensionality in model selection and classification  Sem 1  
Model selection and classification using highdimensional features arise frequently in many contemporary statistical studies such as tumor classification using microarray or other highthroughput data. The impact of dimensionality on classifications is largely poorly understood. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as the random guessing due to noise accumulation in estimating population centroids in highdimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as bad as the random guessing. Thus, it is paramountly important to select a subset of important features for highdimensional classification, resulting in Features Annealed Independence Rules (FAIR). The connections with the sure independent screeing (SIS) and iterative SIS(ISIS) of Fan and Lv (2007) in model selection will be elucidated and extended. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics are proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure. Related Links


16:3017:30  Bickel, P (Berkeley)  
Regularised estimation of high dimensional covariance matrices  Sem 1  
Abstract: We review ,with examples, various important parameters depending on the population covariance matrix such as inverses and eigenstructures , and the uses they are put to.We give a brief discussion of well known pathologies of the empirical covariance matrix in various applications when the data is high dimensional which imply inconsistency of "plugin"estimates of the parameters mentioned. We introduce different notions of sparsity of such matrices and show how some of these are intimately related. We then review a number of methods taking advantage of such sparsity in the population matrices .In particular we state results with various collaborators, particularly E. Levina establishing rates of convergence of our estimates of parameters as above ,as dimension and sample size tend to oo, that are uniform over large classes of sparse population covariance matrices . We conclude with some simulations , a data analysis supporting the asymptotics, and a discussion of future directions. Related Links


18:4519:30  Dinner at Wolfson Court (Residents only) 
Wednesday 9 January  
Chair: Jianqing Fan  
09:0010:00  Murtagh, F (Royal Holloway)  
The ultrametric topology perspective on analysis of massive, very high dimensional data stores  Sem 1  
An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling. Other applications will be described, in particular in the area of textual data mining. References [1] F. Murtagh, On ultrametricity, data coding, and computation, Journal of Classification, 21, 167184, 2004. [2] F. Murtagh, G. Downs and P. Contreras, "Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding", SIAM Journal on Scientific Computing, in press, 2007. [3] F. Murtagh, The remarkable simplicity of very high dimensional data: application of modelbased clustering, submitted, 2007. [4] F. Murtagh, Symmetry in data mining and analysis: a unifying view based on hierarchy, submitted, 2007. Related Links


10:0011:00  Duembgen, L (Bern)  
Pvalues for computerintensive classifiers  Sem 1  
In the first part of the talk presents pvalues for classification in general. The latter are an interesting alternative to classifiers or posterior distributions of class labels. Their purpose is to quantify uncertainty when classifying a single observation, even if we don't have information on the prior distribution of class labels. After illustrating this concept with some examples and procedures, we focus on computational issues and discuss pvalues involving regularization, in particular, LASSO type penalties, to cope with highdimensional data. (Part of this talk is based on joint work with Axel Munk, Goettingen, and BerndWolfgang Igl, Luebeck.) Related Links


11:0011:30  Coffee  
11:3012:30  Stuetzle, W (Washington)  
Nonparametric cluster analysis: estimating the cluster tree of a density  Sem 1  
The general goal of clustering is to identify distinct groups in a collection of objects. To cast clustering as a statistical problem we regard the feature vectors characterizing the objects as a sample from some unknown probability density. The premise of nonparametric clustering is that groups correspond to modes of this density. The cluster tree summarizes the connectivity structure of the level sets of a density; leaves of the tree correspond to modes of the density. I will define the cluster tree, present methods for its estimating, show examples, and discuss some open problems. Related Links 

12:3013:30  Lunch at Wolfson Court  
Chair: Peter Bickel  
14:0015:00  West, M (Duke)  
Sparsity modelling in largescale dynamic models for portfolio analysis  Sem 1  
I will discuss some of our recent work in dynamic modelling for multivariate time series that combines stochastic volatility and graphical modelling ideas. I will describe the modelling ideas and resulting matrixvariate, dynamic graphical models, and aspects of Bayesian methodology and computation for model fitting and structure search. Practical implications of the framework when applied to financial time series for predictive portfolio analysis will highlight some of the reasons for interest in sparsely structured, conditional independence models of volatility matrices. 

15:0015:30  Tea  
15:3016:30  Candes, E (Caltech)  
Computationally tractable statistical estimation when there are more variables than observations  Sem 1  
We consider the fundamental problem of estimating the mean of a vector y = X beta + z, where X is an n by p design matrix in which one can have far more variables than observations and z is a stochastic error termthe socalled `p > n' setup. When \beta is sparse, or more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean response, we ask whether or not it is possible to accurately estimate the mean using a computationally tractable algorithm. We show that in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which variables should be included in the model and which variables should not. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in an overwhelming majority of cases where X\beta is a sparse or nearly sparse superposition of variables, but not in all cases. Our results are sharp, nonasymptotic and widely applicable since they simply require that pairs of predictor variables be not overly collinear. 

16:3017:30  Nadler, B (Weizmann)  
Learning in high dimensions, noise, sparsity and treelets  Sem 1  
In recent years there is growing practical need to perform learning (classification,regression, etc) in high dimensional settings where p>>n. Consequently instead of the standard limit $n\to\infty$, learning algorithms are typically analyzed in the joint limit $p,n\to\infty$. In this talk we present a different approach, that keeps $p,n$ fixed, but considers noise as a small parameter. This resulting perturbation analysis reveals the importance of a robust low dimensional representation of the noisefree signals, the possible failure of simple variable selection methods and the key role of sparsity for the success of learning in high dimensions. We also discuss sparsity in apriori unknown basis and a possible datadriven adaptive construction of such basis, called treelets. We present a few applications of our analysis, mainly to errorinvariables linear regression problems, principal component analysis, and rank determination. Related Links 

19:3023:00  Conference Dinner  Emmanuel College (The Old Library) 
Thursday 10 January  
Chair: Alastair Young  
09:0010:00  van der Vaart, AW (Amsterdam)  
Estimating a response parameter in missing data models with highdimensional covariates  Sem 1  
We discuss a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on estimating equations that are $U$statistics in the observations. The $U$statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the matching cannot be perfect the method leads to a biasvariance tradeoff, and results in estimators that converge at a slower than rootnrate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at rootnrate. 

10:0011:00  Wellner, J (Seattle)  
Persistence: alternative proofs of some results of Greenshtein and Ritov  Sem 1  
11:0011:30  Coffee  
11:3012:30  Cook, D (Iowa State)  
Looking at models in highdimensional data spaces  Sem 1  
What do the fishing net models of selforganizing maps look like in the data space? How do the estimated mean vectors and variancecovariance ellipses from modelbased clustering fit to the clusters? How does small n, large p affect the variability in the esimates of the separating hyperplane from support vector machine models? These are a few of the things that we may discuss in this talk. The goal is to calibrate participants' eyes to viewing highdimensional spaces and stimulate thought about what types of plots might accompany highdimensional statistical analysis 

12:3013:30  Lunch at Wolfson Court  
Chair: Enno Mammen  
14:0015:00  Tanner, J (Edinburgh)  
The surprising structure of Gaussian point clouds and its implications for signal processing  Sem 1  
We will explore connections between the structure of highdimensional convex polytopes and information acquisition for compressible signals. A classical result in the field of convex polytopes is that if N points are distributed Gaussian i.i.d. at random in dimension n<<N, then only order (log N)^n of the points are vertices of their convex hull. Recent results show that provided n grows slowly with N, then with high probability all of the points are vertices of its convex hull. More surprisingly, a rich "neighborliness" structure emerges in the faces of the convex hull. One implication of this phenomenon is that an Nvector with k nonzeros can be recovered computationally efficiently from only n random projections with n=2e k log(N/n). Alternatively, the best kterm approximation of a signal in any basis can be recovered from 2e k log(N/n) nonadaptive measurements, which is within a log factor of the optimal rate achievable for adaptive sampling. Additional implications for randomized error correcting codes will be presented. Related Links


15:0015:30  Tea  
15:3016:30  Lee, A (CarnegieMellon)  
Finding lowdimensional structure in highdimensional data  Sem 1  
In highdimensional data analysis, one is often faced with the problem that real data is noisy and in many cases given in coordinates that are not informative for understanding the data structure itself or for performing later tasks, such as clustering, classification and regression. The combination of noise and high dimensions (>1001000) presents challenges for data analysis and calls for efficient dimensionality reduction tools that take the inherent geometry of natural data into account. In this talk, I will first describe treelets – an adaptive multiscale basis inspired by wavelets and hierarchical trees. I will then, in the second half of my talk, describe diffusion maps  a general framework for dimensionality reduction, data set parameterization and clustering that combines ideas from eigenmaps, spectral graph theory and harmonic analysis. Our construction is based on a Markov random walk on the data, and allows one to define a system of coordinates that is robust to noise, and that reflects the intrinsic geometry or connectivity of the data points in a diffusion process. I will outline where we stand and what problems still remain. (Part of this work is joint with R.R. Coifman, S. Lafon, B. Nadler and L. Wasserman) 

16:3017:30  Niyogi, P (Chicago)  
A geometric perspective on learning theory and algorithms  Sem 1  
Increasingly, we face machine learning problems in very high dimensional spaces. We proceed with the intuition that although natural data lives in very high dimensions, they have relatively few degrees of freedom. One way to formalize this intuition is to model the data as lying on or near a low dimensional manifold embedded in the high dimensional space. This point of view leads to a new class of algorithms that are "manifold motivated" and a new set of theoretical questions that surround their analysis. A central construction in these algorithms is a graph or simplicial complex that is dataderived and we will relate the geometry of these to the geometry of the underlying manifold. Applications to embedding, clustering, classification, and semisupervised learning will be considered. 

18:4519:30  Dinner at Wolfson Court (Residents only) 
Friday 11 January  
Chair: Mike West  
09:0010:00  Buehlmann, P (Zurich)  
Highdimensional variable selection and graphs: sparsity, faithfulness and stability  Sem 1  
Over the last few years, substantial progress has been achieved on highdimensional variable selection (and graphical modeling) using L1penalization methods. Diametrically opposed to penaltybased schemes is the PCalgorithm, a special hierarchical multiple testing procedure, which exploits the socalled faithfulness assumption from graphical modeling. For asymptotic consistency in highdimensional settings, the different approaches require very different "coherence" conditions, say for the design matrix in a linear model. From a conceptual aspect, the PCalgorithm allows to identify not only regressiontype associations but also directed edges in a graph and causal effects (in the sense of Pearl's intervention operator). Thereby, sparsity, faithfulness and stability play a crucial role. We will discuss potential and limitations from a theory and practical point of view. Related Links 

10:0011:00  Mammen, E (Mannheim)  
Time series regression with semiparametric factor dynamics  Sem 1  
Highdimensional regression problems which reveal dynamic behavior are typically analyzed by time propagation of a few number of factors. The inference on the whole system is then based on the lowdimensional time series analysis. Such highdimensional problems occur frequently in many different fields of science. In this paper we address the problem of inference when the factors and factor loadings are estimated by semiparametric methods. This more flexible modelling approach poses an important question: Is it justified, from inferential point of view, to base statistical inference on the estimated times series factors? We show that the difference of the inference based on the estimated time series and `true' unobserved time series is asymptotically negligible. Our results justify fitting vector autoregressive processes to the estimated factors, which allows one to study the dynamics of the whole highdimensional system with a lowdimensional representation. The talk reports on joint projects with Szymon Borak, Wolfgang H\"ardle, Jens Perch Nielsen and Byeong U. Park 

11:0011:30  Coffee  
11:3012:30  Yu, B Y (Berkeley)  
Using side information for prediction  Sem 1  
Extracting useful information from highdimensional data is the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the vitues of both regularization and sparsity, the L1penalized L2 minimization method Lasso has been popular. However, Lasso is often seen as not having enough regularization in the large p case. In this talk, we propose two methods that take into account side information in the penalized L2 framework, in order to bring the needed extra regularization in the large p case. First, we combine different norms including L1 to to introduce the Composite Absolute Penalties (CAP) family. CAP allows the grouping and hierarchical relationships between the predictors to be expressed. It covers and goes beyond existing works including grouped lasso and elastic nets. Path following algorithms and simulation results will be presented to compare with Lasso in terms of prediction and sparsity. Second, motivated by the problem of predicting fMRI signals from input natural images, we investigate a method that uses side information in the unlabeled data for prediction. We present a theoretical result in the case of p/n > constant and apply the method to the fMRI data problem. (It is noted that the second part is a report on ongoing research.) ~ Related Links


12:3013:30  Lunch at Wolfson Court  
Chair: David Banks  
14:0015:00  Hoyle, D (Manchester)  
A physicist's approach to highdimensional inference  Sem 1  
15:0015:30  Tea  
15:3016:30  Clarke, B (British Columbia)  
Models, model lists, model spaces and predictive optimality  Sem 1  
Sources of uncertainty related to model specification are often the single biggest factors in inference. In the predictive context, we demonstrate the effect of varying the model list used for averaging and varying the averaging strategy in computational examples. In addition, by varying the model space while using similar lists and averaging strategies, we demonstrate that the effect of the space itself computationally. Thus, it is reasonable to associate a concept of variance and bias not just to individual models but to other aspects of an overall modeling strategy. Moreover, although difficult to formalize, good prediction is seen to be associated with a sort of complexity matching between the space and the unknown function, and robustness. In some cases, the relationship among complexity, variancebias, robustness and averaging strategy seems to be dependent on sample size. Taken together, these considerations can be formalized into an overview that may serve as a framework for more general inferential problems in Statistics 

16:3017:30  Discussion of INI programme directions  
18:4519:30  Dinner at Wolfson Court (Residents only) 