Workshop Programme

for period 31 March - 4 April 2008

High Dimensional Statistics in Biology


Monday 31 March
08:30-09:55 Registration
10:00-11:00 Birney, E (EBI)
  The evolution of promoter sequence Sem 1

We have produced an evolutionary model for promoters (and more generally for genomic regulatory sequence) analogous to the commonly used synonymous/nonsynonymous mutation models for protein-coding sequence. Although our model, called Sunflower, relies on some simple assumptions, it captures enough of the biology of transcription factor action to show clear correlation with other biological features. Sunflower predicts a binding profile of transcription factors to DNA sequence, in which different factors compete for the same potential binding sites. Sunflower can also model cooperative binding. We can control the apparent concentration of the factors by setting parameters uniformly or from gene expression data. The parameterized model simultaneously estimates a continuous measurement of binding occupancy across the genomic sequence for each factor. We can then introduce either a localized mutation (such as a SNP) or a coordinated set of mutations (for example, from a haplotype or another species), rerun the binding model and record the difference in binding profiles using their relative entropy. A single mutation can alter interactions both upstream and downstream of its position due to potential overlapping binding sites, and our statistic captures this domino effect.
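
The comparison step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a binding profile is stored as one probability distribution over factors (plus an "unbound" state) per sequence position, and that the per-position relative entropies are summed. The function names and the data layout are hypothetical and are not taken from Sunflower itself.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for two discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def profile_divergence(reference, mutant):
    """Sum of per-position relative entropies between two binding profiles.

    Each profile is a list of per-position distributions (hypothetical layout).
    """
    return sum(relative_entropy(p, q) for p, q in zip(reference, mutant))

# Toy profiles: two positions, three states (factor A, factor B, unbound).
reference = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
mutant = [[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
score = profile_divergence(reference, mutant)
```

In this toy case only the first position contributes to the score, because the second position is identical in both profiles.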

Results from Sunflower show many features in agreement with known biology. For example, the overall binding occupancy rises over transcription start sites, and CpG desert promoters show sharper localization signals relative to the transcription start site. More interesting are correlates to variation both between species and within them. Over evolutionary time, we observe a clear excess of low-scoring mutations fixed in promoters, consistent with most changes being neutral. However, this is not consistent across all promoters, and some promoters show more rapid divergence. This divergence often occurs in the presence of relatively constant protein coding divergence. Interestingly, different classes of promoters show different sensitivity to mutations, with developmental and immunological genes having promoters inherently more sensitive to mutations than housekeeping genes.

11:00-11:30 Coffee
11:30-12:30 Pachter, L (UC, Berkeley)
  Functional genomics and the forest of life Sem 1

We will discuss the 0-dimensional statistical problem of alignment, and its relation to the high-dimensional problem of phylogeny. In particular, we will discuss the relevance of the "space of phylogenetic oranges" and its relation to the above problems. We will also discuss "sequence annealing", which is a new alignment strategy based on these ideas.

12:30-13:30 Lunch at Wolfson Court
14:00-15:00 Brunak, S (Denmark)
  Understanding interactomes by data integration Sem 1
15:00-15:30 Tea
15:30-16:30 McLachlan, GJ (Queensland)
  On mixture models in high-dimensional testing for the detection of differential gene expression Sem 1

An important problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As there are usually thousands of genes to be considered simultaneously, one encounters high-dimensional testing problems. We provide a straightforward and easily implemented method for estimating the posterior probability that an individual gene is null (not differentially expressed). The problem can be expressed in a two-component mixture framework, using an empirical Bayes approach. Current methods of implementing this approach are limited either by the minimal assumptions they make or by the computationally intensive nature of more specific assumptions. By converting to a z-score the value of the test statistic used to test the significance of each gene, we propose a simple two-component normal mixture that models adequately the distribution of this score. The approach provides an estimate of the local false discovery rate (FDR) for each gene, which is taken to be the posterior probability that the gene is null. Genes with a local FDR less than a specified threshold C are taken to be differentially expressed. For a given C, this approach also provides estimates of the implied overall errors such as the (global) FDR and the false negative/positive rates.
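
As a rough illustration of the two-component idea, the sketch below computes the local FDR of a z-score as the posterior probability of the null component in a normal mixture, then selects genes below a threshold C. The mixture parameters (pi0, mu1, sd1) are supplied by hand here as assumptions; in practice they would be estimated from the data, for example by an EM algorithm. The null is taken to be standard normal, and the global FDR is estimated as the average local FDR of the selected genes, which is one common choice rather than necessarily the speaker's exact estimator.

```python
import math

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def local_fdr(z, pi0, mu1, sd1):
    """Posterior probability that a gene with z-score z is null.

    Mixture (assumed): pi0 * N(0, 1) + (1 - pi0) * N(mu1, sd1^2).
    """
    null = pi0 * normal_pdf(z)
    alt = (1.0 - pi0) * normal_pdf(z, mu1, sd1)
    return null / (null + alt)

def select_genes(z_scores, pi0, mu1, sd1, c):
    """Return indices of genes with local FDR < c and an estimate of the global FDR."""
    tau = [local_fdr(z, pi0, mu1, sd1) for z in z_scores]
    selected = [i for i, t in enumerate(tau) if t < c]
    # Average local FDR over the selected genes (0.0 if nothing is selected).
    global_fdr = sum(tau[i] for i in selected) / len(selected) if selected else 0.0
    return selected, global_fdr

# Hypothetical z-scores and hand-picked mixture parameters.
selected, global_fdr = select_genes([0.0, 4.0, 0.5, 3.5], pi0=0.9, mu1=2.0, sd1=1.0, c=0.2)
```

With these illustrative parameters, only the two genes with large z-scores fall below the threshold.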

16:30-17:30 Margulies, E (National Human Genome Research)
  Statistical challenges in using comparative genomics for the identification of functional sequences Sem 1

There are two main aspects of comparative sequence analysis that rely on high-dimensional statistical approaches: identifying evolutionarily constrained regions and determining the significance of their overlap with functional sequences. The identification of constrained sequences largely relies on our understanding of evolutionary models and applying them to multi-sequence alignments. However, our understanding of evolutionary processes is incomplete and our ability to generate “perfect” multi-sequence alignments is hampered by incomplete sequence datasets and general uncertainty in the process; these factors can lead to multiple equally plausible alignments, only one of which is typically represented in downstream analyses. In order to mitigate some of these issues, we have been developing new comparative genomics approaches that take into account the biochemical physical properties of DNA, such that we can understand which substitutions are more “tolerable” with respect to the three dimensional structure of DNA, and thus more “neutral” in evolution. We also plan to start taking alignment uncertainty into account in our predictions of constrained sequences. Determining the significance of our improved sequence constraint methods relies on a new statistical approach for determining the significance of overlap with known functional annotations. This new method, devised by Peter Bickel and colleagues, was applied to analyses performed within the ENCODE consortium and provides the basis for newer methods that will be discussed later in this meeting.

17:30-18:30 Welcome Wine Reception and Poster Session
18:45-19:30 Dinner at Wolfson Court (Residents only)
Tuesday 1 April
09:00-10:00 Hurles, M (Sanger)
  Structural variation in the human genome Sem 1

Over the past three years it has become rapidly appreciated that the human genome varies in its structure as well as its sequence, by virtue of a panoply of different chromosomal rearrangements, some that alter the number of copies of DNA segments, and others that alter orientation but not copy number. Evidence is growing from diverse sources that this source of genomic variation has an appreciable functional impact, and yet we remain far from a complete catalogue of this form of variation let alone its biological consequences. In my talk I will summarise the progress to date and highlight the appreciable statistical challenges that remain, with particular reference to the approaches being adopted in my group towards assaying copy number variation and assessing its impact on complex traits through genetic association studies.

10:00-11:00 Benjamini, Y (Tel Aviv)
  Selective inference in complex research problems Sem 1

We shall highlight the problem of selective inference in genomics using some recent studies. The False Discovery Rate (FDR) approach to this problem will be reviewed, and then we shall discuss: (i) advances in hierarchical testing with an example from a study associating gene expression in the brain with multiple traits of behavior; (ii) screening for partial conjunctions in order to address replicability; and (iii) selective confidence intervals in the frequentist and Bayesian frameworks.

11:00-11:30 Coffee
11:30-12:30 Durbin, R (Sanger)
  Efficient use of population genome sequencing data Sem 1

With the advent of new sequencing technologies that reduce the cost of DNA sequence by a factor of a hundred, we have moved into the era of population genomic sequencing, where we sample many individuals from a population to study natural genetic variation genome-wide. However, at this scale sequencing is still costly. I will discuss strategies to use low coverage sequencing on multiple samples from a population, and some of the complications in using the resulting data for population genetic analyses. Examples will be drawn from the Saccharomyces Genome Resequencing Project (SGRP) in which we have collected sequence data from 70 yeast strains, and planning for the 1000 Genomes Project to characterise human genetic variation down to 1% allele frequency.

12:30-13:30 Lunch at Wolfson Court
14:00-15:00 West, M (Duke)
  Sparsity modelling in gene expression pathway studies Sem 1

I will discuss aspects of large-scale multivariate modelling utilising sparsity priors for anova, regression and latent factor analysis in gene expression studies. Specific attention will be given to the development of experimental gene expression signatures in cell lines and animal models, and their extrapolation/evaluation in gene pathway-focused analyses of data from human disease contexts. The role of sparse statistical modelling in signature identification, and in evaluation of complex interacting "sub pathway" related patterns in gene expression in observational data sets, will be highlighted. I will draw on data and examples from some of our projects in cancer and cardiovascular genomics.

15:00-15:30 Tea
15:30-16:30 Dermitzakis, M (Sanger)
  Population genomics of human gene expression Sem 1

The recent comparative analysis of the human genome has revealed a large fraction of functionally constrained non-coding DNA in mammalian genomes. However, our understanding of the function of non-coding DNA is very limited. In this talk I will present recent analysis by my group and collaborators that aims at the identification of functionally variable regulatory regions in the human genome by correlating SNPs and copy number variants with gene expression data. I will also present some analysis on inference of trans regulatory interactions and evolutionary consequences of gene expression variation.

18:45-19:30 Dinner at Wolfson Court (Residents only)
Wednesday 2 April
09:00-10:00 Enright, A (Sanger)
  Computational analysis and prediction of microRNA binding sites Sem 1

MicroRNAs (miRNAs) are small 22 nucleotide RNA molecules that directly bind to the 3' untranslated regions of protein-coding messenger RNAs. This binding event represses the target transcript, rendering it unsuitable for protein production and causing its degradation. Many miRNAs have been found and a large number of them have already been implicated in human disease and development. We have developed a number of computational approaches for predicting the target transcripts of miRNAs. One method (miRanda) is purely computational and uses a simple dynamic programming algorithm and a statistical model to identify significant binding sites. Our second approach (Sylamer) is an algorithm for scanning genome sequences for 7mer words and testing gene-expression data to identify gene sets which are significantly enriched or depleted in such 7mer words using hypergeometric statistics. This combined computational/experimental approach has worked extremely well for identifying candidate miRNA targets in B and T blood cells, developing zebrafish embryos and in mouse mutants with deafness.
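
The word-enrichment idea can be sketched with a plain hypergeometric tail test. The example below is a simplified illustration and not the Sylamer implementation: it assumes a list of 3' UTR sequences already ranked by expression change, asks how many of the top-ranked sequences contain a given 7mer, and computes the probability of seeing at least that many by chance. The function names, the toy sequences and the single fixed cutoff are all hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total

def word_enrichment(utrs_ranked, word, top_n):
    """Enrichment p-value for `word` among the first `top_n` ranked sequences."""
    has_word = [word in seq for seq in utrs_ranked]
    successes = sum(has_word)        # sequences containing the word overall
    observed = sum(has_word[:top_n])  # sequences containing it in the top set
    return hypergeom_upper_tail(observed, len(utrs_ranked), successes, top_n)

# Toy ranked list: the two top-ranked sequences both contain the 7mer.
utrs = ["AAGCACTTA", "GCACTTAGG", "TTTTTTTTT", "CCCCCCCCC", "GGGGGGGGG", "ACGTACGTA"]
p_value = word_enrichment(utrs, "GCACTTA", top_n=2)
```

A depletion test follows the same pattern using the lower tail, and a fuller analysis would repeat the test across many cutoffs and words with an appropriate multiple-testing correction.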

10:00-11:00 Bühlmann, P (ETH Zürich)
  L1-regularisation, motif regression and ChIP-on-chip data analysis Sem 1

Motivated by the proposed format of talks, we include the following: (i) a review of statistical facts about L1-regularization for high-dimensional problems; (ii) some adaptations of motif regression (Conlon, Liu, Lieb & Liu, 2003) for scoring potential motifs or for presence/absence of other biological targets of interest (e.g. proteins) by integrating multiple data sources; (iii) using the concepts for analyzing ChIP-on-chip data from human liver cells (with a side remark on signal extraction) for HIF-dependent transcriptional networks.

Issue (i) deals with a general purpose method for variable selection or feature extraction which is potentially useful for a broad variety of (multiple) bio-molecular and high-dimensional data. Issue (ii) is - in our experience - an interesting method to improve upon some chosen "standard" methodology by making use of additional data sources. Finally, issue (iii) is work in progress with the Ricci lab at ETH Zurich: it is an illustration for statisticians and - of course - the "real thing" for biologists.

11:00-11:30 Coffee
11:30-12:30 Huber, W (EBI)
  Extraction and classification of cellular and genetic phenotypes from automated microscopy data Sem 1

I will start the presentation with an overview of the Bioconductor project, a large international open source and open development software project for the analysis and comprehension of genomic data. Its goals are to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data; to facilitate the integration of biological metadata in the analysis of experimental data, e.g. literature data, gene and genome annotation data; to allow the rapid development of extensible, scalable, and interoperable software; to promote high-quality documentation and reproducible research; and to provide training in computational and statistical methods for the analysis of genomic data. While much of the initial focus has been on microarray analysis, one recent development has been the creation of methods, and computational infrastructure, for the analysis of cell-based assays using various phenotypic readouts.

Changes in cell shape are important for many processes during development and disease. However, cellular mechanisms and molecular components that underlie these processes remain poorly understood. We here present a rapid and automated approach to identify and categorize genes based on their phenotypic signatures on a single-cell level. Perturbations by RNAi on a whole genome scale led to the identification of several hundred genes with distinct cell shape phenotypes. More than 6,000,000 cells were individually profiled into different phenotypic classes. The approach permits the ‘segmentation’ of the genome into phenotypic clusters using complex phenotypic signatures.

12:30-13:30 Lunch at Wolfson Court
14:00-15:00 Beerenwinkel, N (ETH Zürich)
  Ultra-deep sequencing of mixed virus populations Sem 1

The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response, vaccine design, and antiviral drug therapy. Recently developed ultra-deep sequencing technologies can be used for quantifying this diversity by direct sequencing of the mixed virus population. We present statistical and computational methods for the analysis of such sequence data. Inference of the population structure from observed reads is based on error correction, reconstruction of a minimal set of haplotypes that explain the data, and eventually estimation of haplotype frequencies. We demonstrate our approach by analyzing simulated data and by comparison to 165 sequences obtained from clonal Sanger sequencing of four independent, diverse HIV populations.

15:00-15:30 Tea
15:30-16:30 Poster Session
19:30-23:00 Conference Dinner at St Catharine's College (The Main Dining Hall)
Thursday 3 April
09:00-10:00 Segal, E (Weizmann Institute)
  Cracking the regulatory code: predicting expression patterns from DNA sequence Sem 1

Precise control of gene expression lies at the heart of nearly all biological processes. However, despite enormous advances in understanding this process from both experimental and theoretical perspectives, we are still missing a quantitative description of the underlying transcriptional control mechanisms, and the remaining questions, such as how regulatory sequence elements ‘compute’ expression from the inputs they receive, are still very basic.

In this talk, I will present our progress towards the ultimate goal of developing integrated quantitative models for transcription regulation, spanning all aspects of the process, including the DNA sequence, regulators, and expression patterns. I will first describe a novel thermodynamic model that computes expression patterns as a function of cis-regulatory sequence and the binding site preferences and expression of participating transcription factors. I will show that when applied to the segmentation gene network of Drosophila, the model accurately predicts the expression of many known cis-regulatory modules, even across species, and reveals important organizing principles of transcriptional regulation in the network: that both strong and large numbers of weaker binding sites contribute, leading to high occupancy of the module DNA, and conferring robustness against mutation; and that clustering of weaker sites permits cooperative binding, which is necessary to sharpen the patterns.

10:00-11:00 Bickel, PJ (UC Berkeley)
  Refined nonparametric methods for genomic inference Sem 1

Inference about genomic features faces the particular difficulty that, save for interspecies variation, we have only one copy of any of the genomes that Nature might have produced. We postulate a framework which includes an “ergodic hypothesis” which permits us to compute p-values and confidence bounds. These seem to be as conservative as could be hoped for. Our methods in crude form were applied to data for the ENCODE project (Birney et al (2007)). We will discuss our model and refinements of the methods previously proposed.

11:00-11:30 Coffee
11:30-12:30 Marcotte, EM (Texas at Austin)
  Steps toward directed identification of disease genes: predicting the consequences of genetic perturbations Sem 1

12:30-13:30 Lunch at Wolfson Court
14:00-15:00 Crawford, G (Duke)
  High-resolution identification of active gene regulatory elements Sem 1

I will discuss methods we use to identify active gene regulatory elements within the human genome and some of the current obstacles we still need to overcome.

15:00-15:30 Coffee
15:30-16:30 Bulyk, ML (Harvard Medical School)
  High-resolution binding specificity profiles of transcription factors and cis regulatory codes in DNA Sem 1
18:45-19:30 Dinner at Wolfson Court (Residents only)
Friday 4 April
09:00-10:00 Bertone, P (EBI)
  Functional genomic approaches to stem cell biology Sem 1

Embryonic stem (ES) cells are similar to the transient population of self-renewing cells within the inner cell mass of the preimplantation blastocyst (epiblast), capable of pluripotential differentiation to all specialised cell types comprising the adult organism. These cells undergo continuous self-renewal to produce identical daughter cells, or can develop into specialised progenitors and terminally differentiated cells. A variety of molecular pathways involved in embryonic development have been elucidated, including those influencing stem cell differentiation. As a result, we know of a number of key transcriptional regulators and signalling molecules that play essential roles in manifesting nuclear potency and self-renewal capacity of embryo-derived and tissue-specific stem cells. Despite these efforts, however, only a small number of components have been identified and large-scale characterisation of these processes remains incomplete. While the precise biological niche is believed to direct differentiation and development in vivo, it is now possible to utilise explanted stem cell lines as an in vitro model of cell fate assignment and differentiation. The aim of the studies discussed here is to map the global transcriptomic and proteomic activity of ES cells during various stages of differentiation and lineage commitment in tissue culture. This approach will help characterise the functional roles of key developmental regulators and yield more rational approaches to manipulating stem cell behaviour in vitro. The generation of large-scale data from microarray and functional genomic experiments will help to identify and characterise the regulatory influence of key transcription factors, signalling genes and non-coding RNAs involved in early developmental pathways, leading to a more detailed understanding of the molecular mechanisms of vertebrate embryogenesis.

10:00-11:00 McVean, G (Oxford)
  Approximate genealogical inference Sem 1

For many inferential problems in evolutionary biology and population genetics considerable power can be gained by explicitly modelling the genealogical relationship between DNA sequences. In the presence of recombination, genealogical relationships are described by a complex graph. While it is theoretically possible to explore the posterior distribution of such graphs using techniques such as MCMC, in most realistic situations the computational complexity of such methods makes them impractical. One possible solution is to develop approximations to full genealogical inference. I will discuss what properties such approximations should have and describe one approach that samples local genealogical relationships along a genome.

11:00-11:30 Coffee
11:30-12:30 Luscombe, N (EBI)
  Genomic principles for feedback regulation of metabolism Sem 1

Small molecule metabolism is the highly coordinated interconversion of chemical substrates through enzyme-catalysed reactions. It is central to the viability of all organisms as it enables the assimilation of nutrients for energy production and the synthesis of precursors for all cellular components. The system is tightly regulated so cells can respond efficiently to environmental changes. This is optimised to minimise the substantial cost of enzyme production and core metabolite depletion, and to maximise the benefit of cell growth and division. It is commonly known that this regulation is achieved by controlling either (i) the availability of enzymes or (ii) their activities. Though the molecular mechanisms behind these two regulatory processes have been elucidated in great detail, we still lack insight into how they are deployed and complement each other at a global level. Here, I will present a genome-scale analysis of how regulatory feedback by small molecules controls the metabolic system, and examine how the two modes of regulation are deployed throughout the system.

Bio: Nick Luscombe, Group Leader, EMBL-European Bioinformatics Institute. Nick completed his PhD with Professor Janet Thornton at University College London (1996-2000), studying the basis for specificity of DNA-binding proteins. He then moved to Yale University as a post-doctoral fellow with Professor Mark Gerstein (2000-2004). During this time, he shifted his research focus to genomics, with a particular emphasis on transcriptional regulation in yeast. He has been a Group Leader at EMBL-EBI since 2005, examining the control of interesting biological systems.

12:30-13:30 Lunch at Wolfson Court
14:00-15:00 Huang, H (UC Berkeley)
  A Bayesian probabilistic approach to transform public microarray repositories into disease diagnosis databases Sem 1

Predicting phenotypes from genotypes is one of the major challenges of functional genomics. In this talk, we aim to take the first step towards using microarray repositories to create a disease diagnosis database or, more generally, for phenotype prediction. This will provide an important application for the enormous amount of genomics data that was costly to generate yet is freely available. In many disease diagnosis cases, it is not obvious which potential disease should be targeted, and screening across the enormous accumulation of disease expression profiles will help to narrow down the disease candidates. In addition, such profile-based diagnosis is especially useful for those diseases that lack biochemical diagnostic tests.

15:00-15:30 Tea