# Seminars (SCHW01)


SCHW01 7th January 2008
10:00 to 11:00
Breakdown point of model selection when the number of variables exceeds the number of observations
SCHW01 7th January 2008
11:30 to 12:30
The deterministic lasso

We study high-dimensional generalized linear models and risk minimization using the Lasso. The risk is taken under a random probability measure P' and the target is an overall minimizer of the risk under some other, nonrandom probability measure P. We restrict ourselves to a set S where P' and P are close to each other, and present an oracle inequality under a so-called compatibility condition between the $L_2$ norm and the $\ell_1$ norm.
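As a rough illustration of the Lasso estimator discussed here (not the speaker's code, and with purely synthetic data), a minimal proximal-gradient (ISTA) solver can be sketched as follows; the regularization level `lam` and step size are illustrative choices:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by iterative
    soft-thresholding (ISTA)."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Sparse toy problem: only the first 3 of 20 coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = lasso_ista(X, y, lam=0.1)
```

On this well-conditioned example the Lasso keeps the three true coefficients large and shrinks the noise coordinates toward zero, the behavior the oracle inequality quantifies.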

SCHW01 7th January 2008
14:00 to 15:00
Methods for visualizing high dimensional data

In this presentation, we review some fundamentals of visualization and then proceed to describe methods and combinations of methods useful for visualizing high dimensional data. Some methods include parallel coordinates, smooth interpolations of parallel coordinates, grand tours including wrapping tours, fractal tours, pseudo-grand tours, and pixel tours.

SCHW01 7th January 2008
15:30 to 16:30
A Young Bootstrap and parametric inference: successes and challenges

We review parametric frequentist inference as it has developed over the last 25 years or so. Two main strands have emerged: analytic procedures based on small-sample asymptotics and simulation (bootstrap) approaches. We argue that the latter yield, with appropriate handling of nuisance parameters, a simple and flexible methodology, yet one which nevertheless retains the finer inferential components of parametric theory in an automatic fashion. Performance of the bootstrap methods, even in problems with high-dimensional parameters but small data sample sizes, points in favour of their being the method of choice in complex settings, such as those motivating this programme.

SCHW01 8th January 2008
09:00 to 10:00
Practical and information-theoretic limitations in high-dimensional inference

This talk considers questions of two types concerning high-dimensional inference. First, given a practical (polynomial-time) algorithm, what are the limits of its performance? Second, how do such practical limitations compare to information-theoretic bounds, which apply to the performance of any algorithm regardless of computational complexity?

We analyze these issues in high-dimensional versions of two canonical inference problems: (a) support recovery in sparse regression; and (b) the sparse PCA or eigenvector problem. For the sparse regression problem, we describe a sharp threshold on the sample size n that controls success/failure of $\ell_1$-constrained quadratic programming (the Lasso), as a function of the problem size p and the sparsity index k (number of non-zero entries). Using information-theoretic methods, we prove that the Lasso is order-optimal for sublinear sparsity (vanishing k/p), but sub-optimal for linear sparsity (k/p bounded away from zero). For the sparse eigenvector problem, we analyze a semidefinite programming relaxation due to d'Aspremont et al., and establish a similar transition in failure/success for triplets (n,p,k) tending to infinity.

Based on joint works with Arash Amini, John Lafferty, and Pradeep Ravikumar.

SCHW01 8th January 2008
10:00 to 11:00
Some thoughts on nonparametric classification: nearest neighbours, bagging and max likelihood estimation of shape-constrained densities

The $k$-nearest neighbour rule is arguably the simplest and most intuitively appealing nonparametric classifier. We will discuss recent results on the optimal choice of $k$ in situations where the underlying populations have densities with a certain smoothness in $\mathbb{R}^d$. Extensions to the bagged nearest neighbour classifier, which can be regarded as a weighted $k$-nearest neighbour classifier, are also possible, and yield a somewhat surprising comparison with the unweighted case.
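To make the connection concrete, here is a minimal sketch (illustrative only, with toy data) of the plain $k$-NN rule next to a weighted nearest-neighbour rule of the kind the bagged classifier reduces to; the uniform weights below are an assumption that recovers ordinary $k$-NN:

```python
import numpy as np

def knn_classify(X_train, y_train, x, k):
    """Plain k-nearest-neighbour rule: majority vote among the k closest points."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return int(round(y_train[nearest].mean()))  # binary labels 0/1

def weighted_nn_classify(X_train, y_train, x, weights):
    """Weighted NN rule: the i-th nearest point gets weight w_i (summing to 1).
    The bagged nearest-neighbour classifier is of this form, with decaying
    weights induced by the resampling."""
    d = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(d)
    score = np.dot(weights, y_train[order][: len(weights)])
    return int(score > 0.5)

# Toy two-class data in R^2.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((50, 2)) + [-2, 0]
X1 = rng.standard_normal((50, 2)) + [2, 0]
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
pred_knn = knn_classify(X, y, np.array([2.0, 0.0]), k=5)
w = np.full(5, 0.2)  # uniform weights: reduces to k-NN with k=5
pred_wnn = weighted_nn_classify(X, y, np.array([2.0, 0.0]), w)
```

The optimal-$k$ and optimal-weight results discussed in the talk concern exactly how `k` and `weights` should scale with the sample size and dimension.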

Another possibility for nonparametric classification is based on estimating the underlying densities explicitly. An attractive alternative to kernel methods is based on the maximum likelihood estimator, which can be shown to exist if the densities satisfy certain shape constraints, such as log-concavity. We will also discuss an algorithm for computing the estimator in this case, which results in a classifier that is fully automatic yet still nonparametric.

SCHW01 8th January 2008
11:30 to 12:30
RD Cook Model-based sufficient dimension reduction for regression

Dimension reduction in regression, represented primarily by principal components, is ubiquitous in the applied sciences. This is an old idea that has moved to a position of prominence in recent years because technological advances now allow scientists to routinely formulate regressions in which the number p of predictors is considerably larger than in the past. Although "large" p regressions are perhaps mainly responsible for renewed interest, dimension reduction methodology can be useful regardless of the size of p.

Starting with a little history and a definition of "sufficient reductions", we will consider a variety of models for dimension reduction in regression. The models start from one in which maximum likelihood estimation produces principal components, step along a few incremental expansions, and end with forms that have the potential to improve on some standard methodology. This development provides remedies for two concerns that have dogged principal components in regression: principal components are typically computed from the predictors alone and then do not make apparent use of the response, and they are not equivariant under full rank linear transformation of the predictors.

SCHW01 8th January 2008
14:00 to 15:00
Kernel-based contrast functions for sufficient dimension reduction

We present a new methodology for sufficient dimension reduction (the problem of finding a subspace $S$ such that the projection of the covariate vector $X$ onto $S$ captures the statistical dependency of the response $Y$ on $X$). Our methodology derives directly from a formulation of sufficient dimension reduction in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace (cf. Li, 1991; Cook, 1998). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an M-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.

SCHW01 8th January 2008
15:30 to 16:30
J Fan Challenge of dimensionality in model selection and classification

Model selection and classification using high-dimensional features arise frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. We first demonstrate that, even for the independence classification rule, classification using all the features can be as bad as random guessing, due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as badly as random guessing. Thus it is of paramount importance to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The connections with the sure independence screening (SIS) and iterative SIS (ISIS) of Fan and Lv (2007) in model selection will be elucidated and extended. The choice of the optimal number of features, or equivalently the threshold value of the test statistics, is based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
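The core idea of an independence rule with feature selection can be sketched as follows. This is a hedged illustration in the spirit of FAIR, not the authors' procedure: features are ranked by two-sample t-statistics and only the top few are used to compare a new point to the class centroids; all data and the choice of `n_features` are made up:

```python
import numpy as np

def independence_rule(X0, X1, x, n_features):
    """Independence (naive-Bayes-like) rule restricted to the n_features
    coordinates with the largest two-sample t-statistics."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    s = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) + X1.var(axis=0, ddof=1) / len(X1))
    t = np.abs(m0 - m1) / s
    keep = np.argsort(t)[::-1][:n_features]  # screen: top features by |t|
    # Classify by the closer centroid, using the selected coordinates only.
    d0 = np.sum((x[keep] - m0[keep]) ** 2)
    d1 = np.sum((x[keep] - m1[keep]) ** 2)
    return 0 if d0 < d1 else 1

# p = 500 features, but only the first 5 carry any signal.
rng = np.random.default_rng(2)
p = 500
mu = np.zeros(p)
mu[:5] = 2.0
X0 = rng.standard_normal((30, p))       # class 0: centered at 0
X1 = rng.standard_normal((30, p)) + mu  # class 1: shifted in 5 coordinates
label = independence_rule(X0, X1, mu, n_features=10)
```

Using all 500 features instead of the top 10 is exactly the regime where the talk shows noise accumulation can drive the rule toward random guessing.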

SCHW01 8th January 2008
16:30 to 17:30
P Bickel Regularised estimation of high dimensional covariance matrices

We review, with examples, various important parameters that depend on the population covariance matrix, such as inverses and eigenstructures, and the uses they are put to. We give a brief discussion of well known pathologies of the empirical covariance matrix in various applications when the data is high dimensional, which imply inconsistency of "plug-in" estimates of the parameters mentioned. We introduce different notions of sparsity of such matrices and show how some of these are intimately related. We then review a number of methods taking advantage of such sparsity in the population matrices. In particular, we state results with various collaborators, particularly E. Levina, establishing rates of convergence of our estimates of parameters as above, as dimension and sample size tend to infinity, that are uniform over large classes of sparse population covariance matrices. We conclude with some simulations, a data analysis supporting the asymptotics, and a discussion of future directions.
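One of the simplest regularized estimators of this kind is entrywise hard thresholding of the sample covariance. The sketch below is illustrative (a generic thresholding estimator, with a made-up tridiagonal truth and an arbitrary threshold `t`), not the specific estimators of the talk:

```python
import numpy as np

def threshold_covariance(X, t):
    """Hard-thresholding covariance estimator: entries of the sample
    covariance smaller than t in absolute value are set to zero
    (the diagonal is kept)."""
    S = np.cov(X, rowvar=False)
    T = np.where(np.abs(S) >= t, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

# p = 50-dimensional data whose true covariance is tridiagonal (sparse).
rng = np.random.default_rng(3)
p, n = 50, 200
Sigma = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S_hat = threshold_covariance(X, t=0.2)
```

The raw sample covariance has many small spurious off-diagonal entries of size about $1/\sqrt{n}$; thresholding zeroes most of them while retaining the true band, which is the mechanism behind the uniform rates of convergence mentioned above.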

SCHW01 9th January 2008
09:00 to 10:00
F Murtagh The ultrametric topology perspective on analysis of massive, very high dimensional data stores

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss applications to very-high-frequency time series segmentation and modeling. Other applications will be described, in particular in the area of textual data mining.
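In an ultrametric space every triangle is isosceles with the two largest sides equal, so one simple (illustrative, and not Murtagh's exact coefficient) way to quantify the degree of ultrametricity is the fraction of point triples whose two largest pairwise distances nearly agree:

```python
import numpy as np
from itertools import combinations

def ultrametricity(X, tol=0.1):
    """Fraction of triples whose two largest pairwise distances agree to
    within relative tolerance tol -- a crude degree-of-ultrametricity score."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    hits = total = 0
    for i, j, k in combinations(range(len(X)), 3):
        a, b, c = sorted([D[i, j], D[i, k], D[j, k]])
        hits += (c - b) <= tol * c  # near-isosceles with small base
        total += 1
    return hits / total

rng = np.random.default_rng(4)
low = ultrametricity(rng.standard_normal((30, 2)))     # low-dimensional cloud
high = ultrametricity(rng.standard_normal((30, 200)))  # high-dimensional cloud
```

Because pairwise distances concentrate as dimension grows, the high-dimensional cloud scores markedly higher, in line with the claim that ultrametricity becomes pervasive with increasing dimensionality.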

References

[1] F. Murtagh, On ultrametricity, data coding, and computation, Journal of Classification, 21, 167-184, 2004.

[2] F. Murtagh, G. Downs and P. Contreras, "Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding", SIAM Journal on Scientific Computing, in press, 2007.

[3] F. Murtagh, The remarkable simplicity of very high dimensional data: application of model-based clustering, submitted, 2007.

[4] F. Murtagh, Symmetry in data mining and analysis: a unifying view based on hierarchy, submitted, 2007.

SCHW01 9th January 2008
10:00 to 11:00
L Duembgen P-values for computer-intensive classifiers

The first part of the talk presents p-values for classification in general. These are an interesting alternative to classifiers or posterior distributions of class labels. Their purpose is to quantify the uncertainty when classifying a single observation, even if we don't have information on the prior distribution of class labels.

After illustrating this concept with some examples and procedures, we focus on computational issues and discuss p-values involving regularization, in particular, LASSO type penalties, to cope with high-dimensional data.

(Part of this talk is based on joint work with Axel Munk, Goettingen, and Bernd-Wolfgang Igl, Luebeck.)

SCHW01 9th January 2008
11:30 to 12:30
W Stuetzle Nonparametric cluster analysis: estimating the cluster tree of a density

The general goal of clustering is to identify distinct groups in a collection of objects. To cast clustering as a statistical problem we regard the feature vectors characterizing the objects as a sample from some unknown probability density. The premise of nonparametric clustering is that groups correspond to modes of this density. The cluster tree summarizes the connectivity structure of the level sets of a density; leaves of the tree correspond to modes of the density. I will define the cluster tree, present methods for estimating it, show examples, and discuss some open problems.
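The branching behavior that the cluster tree records can be seen in one dimension with a few lines of code. This is an illustrative sketch (kernel density estimate on a grid, made-up bimodal data, arbitrary levels), not the estimation methods of the talk:

```python
import numpy as np

def level_set_components(f, level):
    """Connected components (index intervals) of {x : f(x) >= level} on a 1-D grid."""
    above = f >= level
    comps, start = [], None
    for i, a in enumerate(above):
        if a and start is None:
            start = i
        if not a and start is not None:
            comps.append((start, i - 1))
            start = None
    if start is not None:
        comps.append((start, len(f) - 1))
    return comps

# Kernel density estimate of a clearly bimodal 1-D sample.
rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
grid = np.linspace(-8, 8, 400)
h = 0.5
f = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2).mean(axis=1)
f /= h * np.sqrt(2 * np.pi)

# As the level rises past the valley between the two bumps, one connected
# component splits into two -- the branch point recorded by the cluster tree.
lo_comps = level_set_components(f, 0.02)
hi_comps = level_set_components(f, 0.12)
```

At the low level the level set is a single interval; above the saddle it splits into two, one per mode, which is exactly a branching of the cluster tree with two leaves.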

SCHW01 9th January 2008
14:00 to 15:00
M West Sparsity modelling in large-scale dynamic models for portfolio analysis

I will discuss some of our recent work in dynamic modelling for multivariate time series that combines stochastic volatility and graphical modelling ideas. I will describe the modelling ideas and resulting matrix-variate, dynamic graphical models, and aspects of Bayesian methodology and computation for model fitting and structure search. Practical implications of the framework when applied to financial time series for predictive portfolio analysis will highlight some of the reasons for interest in sparsely structured, conditional independence models of volatility matrices.

SCHW01 9th January 2008
15:30 to 16:30
Computationally tractable statistical estimation when there are more variables than observations

We consider the fundamental problem of estimating the mean of a vector $y = X\beta + z$, where $X$ is an $n \times p$ design matrix in which one can have far more variables than observations, and $z$ is a stochastic error term---the so-called "$p > n$" setup. When $\beta$ is sparse, or more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean response, we ask whether or not it is possible to accurately estimate the mean using a computationally tractable algorithm.

We show that in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which variables should be included in the model and which variables should not. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in an overwhelming majority of cases where X\beta is a sparse or nearly sparse superposition of variables, but not in all cases.

Our results are sharp, nonasymptotic and widely applicable since they simply require that pairs of predictor variables be not overly collinear.

SCHW01 9th January 2008
16:30 to 17:30
Learning in high dimensions, noise, sparsity and treelets

In recent years there is a growing practical need to perform learning (classification, regression, etc.) in high-dimensional settings where p >> n. Consequently, instead of the standard limit $n\to\infty$, learning algorithms are typically analyzed in the joint limit $p,n\to\infty$. In this talk we present a different approach that keeps $p,n$ fixed but considers noise as a small parameter. The resulting perturbation analysis reveals the importance of a robust low-dimensional representation of the noise-free signals, the possible failure of simple variable selection methods, and the key role of sparsity for the success of learning in high dimensions. We also discuss sparsity in an a-priori unknown basis, and a possible data-driven adaptive construction of such a basis, called treelets. We present a few applications of our analysis, mainly to errors-in-variables linear regression, principal component analysis, and rank determination.

SCHW01 10th January 2008
09:00 to 10:00
Estimating a response parameter in missing data models with high-dimensional covariates

We discuss a new method of estimation of parameters in semiparametric and nonparametric models. The method is based on estimating equations that are $U$-statistics in the observations. The $U$-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest, and represent higher derivatives of this parameter. For parameters for which the matching cannot be perfect the method leads to a bias-variance trade-off, and results in estimators that converge at a slower than root-n-rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at root-n-rate.

SCHW01 10th January 2008
10:00 to 11:00
Persistence: alternative proofs of some results of Greenshtein and Ritov
SCHW01 10th January 2008
11:30 to 12:30
Looking at models in high-dimensional data spaces

What do the fishing net models of self-organizing maps look like in the data space? How do the estimated mean vectors and variance-covariance ellipses from model-based clustering fit the clusters? How does small n, large p affect the variability in the estimates of the separating hyperplane from support vector machine models? These are a few of the things that we may discuss in this talk. The goal is to calibrate participants' eyes to viewing high-dimensional spaces and to stimulate thought about what types of plots might accompany high-dimensional statistical analysis.

SCHW01 10th January 2008
14:00 to 15:00
The surprising structure of Gaussian point clouds and its implications for signal processing

We will explore connections between the structure of high-dimensional convex polytopes and information acquisition for compressible signals. A classical result in the field of convex polytopes is that if N points are drawn i.i.d. from a Gaussian distribution in dimension n << N, then only order (log N)^n of the points are vertices of their convex hull. Recent results show that provided n grows slowly with N, then with high probability all of the points are vertices of their convex hull. More surprisingly, a rich "neighborliness" structure emerges in the faces of the convex hull. One implication of this phenomenon is that an N-vector with k non-zeros can be recovered computationally efficiently from only n random projections, with n = 2e k log(N/n). Alternatively, the best k-term approximation of a signal in any basis can be recovered from 2e k log(N/n) non-adaptive measurements, which is within a log factor of the optimal rate achievable for adaptive sampling. Additional implications for randomized error-correcting codes will be presented.
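The recovery statement can be checked numerically with basis pursuit, i.e. minimizing the l1 norm subject to the observed random projections, cast as a linear program. This is a generic illustration (a standard LP reformulation with made-up sizes), not the construction used in the talk:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """min ||x||_1 subject to A x = b, via the split x = u - v with u, v >= 0,
    so the objective 1'(u + v) is linear."""
    _, N = A.shape
    c = np.ones(2 * N)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * N))
    u, v = res.x[:N], res.x[N:]
    return u - v

# A k = 3-sparse vector in dimension N = 60, observed through n = 25
# random Gaussian projections.
rng = np.random.default_rng(6)
N, n, k = 60, 25, 3
x_true = np.zeros(N)
x_true[[5, 17, 40]] = [1.0, -2.0, 1.5]
A = rng.standard_normal((n, N)) / np.sqrt(n)
b = A @ x_true
x_hat = basis_pursuit(A, b)
```

With n comfortably above the 2e k log(N/n) scale, the LP recovers the sparse vector exactly (up to solver tolerance), even though the system A x = b is heavily underdetermined.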

SCHW01 10th January 2008
15:30 to 16:30
Finding low-dimensional structure in high-dimensional data

In high-dimensional data analysis, one is often faced with the problem that real data is noisy and in many cases given in coordinates that are not informative for understanding the data structure itself or for performing later tasks, such as clustering, classification and regression. The combination of noise and high dimensions (>100-1000) presents challenges for data analysis and calls for efficient dimensionality reduction tools that take the inherent geometry of natural data into account. In this talk, I will first describe treelets -- an adaptive multi-scale basis inspired by wavelets and hierarchical trees. I will then, in the second half of my talk, describe diffusion maps -- a general framework for dimensionality reduction, data set parameterization and clustering that combines ideas from eigenmaps, spectral graph theory and harmonic analysis. Our construction is based on a Markov random walk on the data, and allows one to define a system of coordinates that is robust to noise, and that reflects the intrinsic geometry or connectivity of the data points in a diffusion process. I will outline where we stand and what problems still remain.
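The Markov-random-walk construction behind diffusion maps can be sketched in a few lines. This is a minimal illustration with arbitrary kernel bandwidth `eps` and toy data, not the authors' implementation:

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Minimal diffusion-map sketch: Gaussian affinities define a Markov
    random walk on the data; the top non-trivial eigenvectors of the walk,
    scaled by eigenvalue^t, give the new coordinates."""
    D2 = ((X[:, None] - X[None, :]) ** 2).sum(axis=2)
    K = np.exp(-D2 / eps)                 # affinity matrix
    d = K.sum(axis=1)                     # degrees; row-normalizing K by d
    S = K / np.sqrt(np.outer(d, d))       # symmetric conjugate of the walk matrix
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    phi = vecs / np.sqrt(d)[:, None]      # right eigenvectors of the Markov matrix
    return (vals[1:n_coords + 1] ** t) * phi[:, 1:n_coords + 1]

# Noisy circle in R^2: the leading diffusion coordinates reflect its
# intrinsic (angular) geometry rather than the ambient noise.
rng = np.random.default_rng(7)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
Y = diffusion_map(X, eps=0.1)
```

The eigenvalue scaling `vals ** t` is what makes the embedding correspond to running the diffusion for t steps, damping coordinates associated with fine, noise-dominated structure.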

(Part of this work is joint with R.R. Coifman, S. Lafon, B. Nadler and L. Wasserman)

SCHW01 10th January 2008
16:30 to 17:30
P Niyogi A geometric perspective on learning theory and algorithms

Increasingly, we face machine learning problems in very high dimensional spaces. We proceed with the intuition that although natural data live in very high dimensions, they have relatively few degrees of freedom. One way to formalize this intuition is to model the data as lying on or near a low dimensional manifold embedded in the high dimensional space. This point of view leads to a new class of algorithms that are "manifold motivated" and a new set of theoretical questions that surround their analysis. A central construction in these algorithms is a graph or simplicial complex that is data-derived, and we will relate the geometry of these to the geometry of the underlying manifold. Applications to embedding, clustering, classification, and semi-supervised learning will be considered.

SCHW01 11th January 2008
09:00 to 10:00
High-dimensional variable selection and graphs: sparsity, faithfulness and stability

Over the last few years, substantial progress has been achieved on high-dimensional variable selection (and graphical modeling) using L1-penalization methods. Diametrically opposed to penalty-based schemes is the PC-algorithm, a special hierarchical multiple testing procedure, which exploits the so-called faithfulness assumption from graphical modeling. For asymptotic consistency in high-dimensional settings, the different approaches require very different "coherence" conditions, say for the design matrix in a linear model. From a conceptual point of view, the PC-algorithm allows one to identify not only regression-type associations but also directed edges in a graph and causal effects (in the sense of Pearl's intervention operator). Thereby, sparsity, faithfulness and stability play a crucial role. We will discuss potential and limitations from a theory and practical point of view.
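The elementary step that the PC-algorithm repeats is a conditional-independence test between a pair of variables given a conditioning set, typically a Fisher z-test of a partial correlation in the Gaussian case. A minimal sketch of that single test (illustrative chain data, not the full algorithm):

```python
import numpy as np
from math import atanh, sqrt, erf

def partial_corr(S, i, j, cond):
    """Partial correlation of variables i and j given the set cond,
    read off the inverse of the covariance submatrix on {i, j} | cond."""
    idx = [i, j] + list(cond)
    P = np.linalg.inv(S[np.ix_(idx, idx)])
    return -P[0, 1] / sqrt(P[0, 0] * P[1, 1])

def edge_pvalue(S, n, i, j, cond=()):
    """Fisher z-transform test of rho_{ij|cond} = 0 -- the test the
    PC-algorithm applies when deciding whether to delete edge (i, j)."""
    r = partial_corr(S, i, j, cond)
    z = sqrt(n - len(cond) - 3) * atanh(r)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal p-value

# Chain X0 -> X1 -> X2: X0 and X2 are marginally dependent,
# but conditionally independent given X1.
rng = np.random.default_rng(8)
n = 2000
x0 = rng.standard_normal(n)
x1 = 0.8 * x0 + rng.standard_normal(n)
x2 = 0.8 * x1 + rng.standard_normal(n)
S = np.cov(np.c_[x0, x1, x2], rowvar=False)
p_marginal = edge_pvalue(S, n, 0, 2)             # tiny: marginal dependence
p_given_x1 = edge_pvalue(S, n, 0, 2, cond=(1,))  # large: independence given X1
```

The hierarchical structure of the algorithm comes from running such tests with conditioning sets of growing size, deleting edges whose p-value exceeds the significance level; faithfulness is what licenses reading the surviving edges causally.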

SCHW01 11th January 2008
10:00 to 11:00
Time series regression with semiparametric factor dynamics

High-dimensional regression problems which reveal dynamic behavior are typically analyzed by time propagation of a small number of factors. The inference on the whole system is then based on low-dimensional time series analysis. Such high-dimensional problems occur frequently in many different fields of science. In this paper we address the problem of inference when the factors and factor loadings are estimated by semiparametric methods. This more flexible modelling approach poses an important question: is it justified, from an inferential point of view, to base statistical inference on the estimated time series factors? We show that the difference between inference based on the estimated time series and on the 'true' unobserved time series is asymptotically negligible. Our results justify fitting vector autoregressive processes to the estimated factors, which allows one to study the dynamics of the whole high-dimensional system with a low-dimensional representation. The talk reports on joint projects with Szymon Borak, Wolfgang Härdle, Jens Perch Nielsen and Byeong U. Park.

SCHW01 11th January 2008
11:30 to 12:30
B Y Yu Using side information for prediction

Extracting useful information from high-dimensional data is the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular. However, Lasso is often seen as not having enough regularization in the large p case.

In this talk, we propose two methods that take into account side information in the penalized L2 framework, in order to bring the needed extra regularization in the large p case. First, we combine different norms, including L1, to introduce the Composite Absolute Penalties (CAP) family. CAP allows grouping and hierarchical relationships between the predictors to be expressed. It covers and goes beyond existing work, including the grouped Lasso and elastic nets. Path-following algorithms and simulation results will be presented to compare with Lasso in terms of prediction and sparsity. Second, motivated by the problem of predicting fMRI signals from input natural images, we investigate a method that uses side information in the unlabeled data for prediction. We present a theoretical result in the case of p/n -> constant and apply the method to the fMRI data problem. (It is noted that the second part is a report on ongoing research.)