An Isaac Newton Institute Workshop

Contemporary Frontiers in High-Dimensional Statistical Data Analysis

Early stage exploration of high-dimensional data: clustering, visualisation and knowledge elicitation

Authors: Paulo JG Lisboa (School of Computing and Mathematical Sciences, Liverpool John Moores University), Terence A Etchells (School of Computing and Mathematical Sciences, Liverpool John Moores University), Ian H Jarman (School of Computing and Mathematical Sciences, Liverpool John Moores University), Andrew R Green (School of Molecular Medical Sciences, Nottingham University Hospitals Trust and University of Nottingham), Ian O Ellis (School of Molecular Medical Sciences, Nottingham University Hospitals Trust and University of Nottingham)

Abstract

The analysis of high-dimensional data in a range of practical domains, from histology to computational marketing, often starts with a baseline exploratory study using relatively simple methods such as k-means clustering and multi-dimensional scaling. However, it is surprising that where methods rely on randomly derived initializations, for instance in clustering, there is a lack of systematic methodologies to produce reproducible results with broadly accepted optimality properties. Moreover, visualization relies increasingly on non-linear methods for which axis projections, traditionally used to aid interpretation in a manner that is directly meaningful to the user, are not possible. Furthermore, cluster profiling is typically done through statistical data aggregation, from which it is not a simple matter to identify covariate interactions.

The poster will describe and evaluate a methodology to reproducibly generate similar partitions of the data, by jointly maximizing the cluster separation and the concordance between multiple initializations, for generally accepted clustering algorithms. The matrix argument of the well-known invariant cluster separation measure tr[inv(Sw).Sb], together with the clustering structure, then generates a low-dimensional linear projective space in which the separation between the cluster projections is maximized. As a third step, the composition of the clusters is described using low-order Boolean rules. These methods are generic and apply equally well to model-based clustering, where they are useful tools for benchmarking and evaluation.
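
The projection step can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: Sw and Sb are taken to be the standard within-cluster and between-cluster scatter matrices, and the projection axes are taken to be the leading eigenvectors of inv(Sw).Sb. The function names, the synthetic data and the use of known labels in place of a clustering result are all hypothetical.

```python
import numpy as np


def scatter_matrices(X, labels):
    """Within-cluster (Sw) and between-cluster (Sb) scatter matrices.

    Assumed (standard) definitions: Sw sums the scatter of each cluster
    about its own mean; Sb sums the size-weighted scatter of the cluster
    means about the overall mean.
    """
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centred = Xc - mean_c
        Sw += centred.T @ centred
        diff = (mean_c - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb


def separation_projection(X, labels, n_components=2):
    """Return tr[inv(Sw).Sb], the leading eigenvalues, and projected data."""
    Sw, Sb = scatter_matrices(X, labels)
    # Solve Sw @ M = Sb, i.e. M = inv(Sw) @ Sb, without forming the inverse.
    M = np.linalg.solve(Sw, Sb)
    separation = np.trace(M)
    eigvals, eigvecs = np.linalg.eig(M)
    # The eigenvalues of M are real in exact arithmetic; drop any
    # round-off imaginary parts returned by the general eigensolver.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    projected = (X - X.mean(axis=0)) @ W
    return separation, eigvals[order], projected


# Hypothetical data: three Gaussian clusters in five dimensions, with the
# generating labels standing in for the output of a clustering algorithm.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 5))
labels = np.repeat(np.arange(3), 50)
X = centres[labels] + rng.normal(size=(150, 5))

separation, top_eigvals, Z = separation_projection(X, labels)
```

With k clusters, Sb has rank at most k - 1, so under these assumptions at most k - 1 projection axes carry any of the separation, and the trace equals the sum of the corresponding eigenvalues. In the sketch, three clusters therefore yield a two-dimensional projection that retains the full value of tr[inv(Sw).Sb].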
