Timetable (SCBW02)

Recent Advances in Statistical Genetics and Bioinformatics

Monday 11th December 2006 to Friday 15th December 2006

Monday 11th December 2006
08:30 to 10:00 Registration at Newton Institute
10:00 to 11:00 Overview of statistical issues in genome-wide association testing
11:00 to 11:45 Coffee at Newton Institute
11:45 to 12:30 Causal effects in functional genomics

The power of traditional genetics studies to identify the genetic determinants of diseases is limited by the fact that complex disease traits depend on small incremental contributions from many loci. Integrative functional genomics represents a relatively novel approach to the problem. The idea is to use hypotheses on the patho-physiological mechanisms underlying the studied disease to focus attention on a restricted collection of molecular pathways and corresponding inheritable molecular phenotypes. On each sampled individual, data are collected at the DNA sequence level, within or around candidate genes, as well as at the clinical phenotype and molecular phenotype levels.

We propose a general statistical framework for the design and analysis of functional genomics studies of the above kind. Our approach uses a directed graph representation of a probability model of the problem, incorporating "intervention nodes" for formal reasoning about causes and effects of causes, as proposed by Dawid. In fact, meaningful biological questions can often be formulated in terms of effects of specific interventions, for example, the effect of blocking a certain receptor by a drug. Our approach involves mapping available biological knowledge onto the graph, using graph semantics and conditional independence reasoning to formulate meaningful scientific questions, and identifying appropriate experimental designs for answering those questions. Finally, the graph can be used as a computational framework for estimating the quantities of interest.

The method will be illustrated with the aid of our study of the effect of platelet sensitivity on thrombotic occlusive events.

12:30 to 13:30 Lunch at Wolfson Court
14:00 to 15:00 Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

We consider methodology for Bayesian model-based clustering of gene expression profiles, that is, measurements of expression levels of a large number of genes, typically from microarray assays, across a number of different experimental conditions and/or biological subjects. We follow a familiar approach using Dirichlet-process-based models to cluster the genes implicitly, but depart from standard practice in several ways. First, we incorporate regression on covariate information at the condition/subject level by modelling regression coefficients, not the expectations of the data directly. More importantly, we replace the Dirichlet process by one of a richer family of models, generated from a stick-colouring-and-breaking construction, under which cluster identities are not exchangeable: this allows modelling a 'background' cluster, for example. Finally, we follow a formal decision-theoretic approach to point estimation of the clustering, using a pairwise coincidence loss function. This is joint work with John Lau at Bristol.
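The decision-theoretic step can be illustrated with a minimal sketch. It assumes a set of clusterings has already been sampled (the hard-coded `samples` below are hypothetical, not output of the model in the talk) and that the pairwise coincidence loss charges `a` for each pair wrongly split and `b` for each pair wrongly merged. The function names, and the restriction of the search to the sampled clusterings, are simplifying assumptions for illustration only.

```python
from itertools import combinations

def posterior_similarity(samples):
    """Estimate, for each pair (i, j), the posterior probability that i and j share a cluster."""
    n = len(samples[0])
    return {(i, j): sum(s[i] == s[j] for s in samples) / len(samples)
            for i, j in combinations(range(n), 2)}

def expected_pairwise_loss(candidate, similarity, a=1.0, b=1.0):
    """Posterior expected loss of a candidate clustering.

    a: cost of separating a pair that truly belongs together.
    b: cost of merging a pair that truly belongs apart.
    """
    loss = 0.0
    for (i, j), p in similarity.items():
        if candidate[i] == candidate[j]:
            loss += b * (1.0 - p)   # merged, but apart with probability 1 - p
        else:
            loss += a * p           # split, but together with probability p
    return loss

# Hypothetical sampled clusterings of four genes (cluster labels per gene).
samples = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 0, 0, 1],
           [0, 1, 2, 2]]
similarity = posterior_similarity(samples)

# Simplification: search only among the sampled clusterings.
best = min(samples, key=lambda s: expected_pairwise_loss(s, similarity))
```

Here `best` is the sampled clustering with the smallest estimated expected loss. A fuller treatment would search beyond the sampled clusterings.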

15:00 to 15:30 Tea at Newton Institute
15:30 to 16:30 P Brown (Kent)
Aspects of feature selection in Mass Spec proteomic functional data

We look at functional data arising from mass spectroscopy in proteomics. The data may involve experimental factors and covariates, but the aim is to provide interpretation and to discriminate between two or more groups. Modelling is often facilitated by the use of wavelets.

We review a variety of approaches to (i) modelling the functional data as the response and (ii) modelling the discriminatory categories directly, conditional on the functional data and experimental factors/covariates. Our ultimate focus will be on Bayesian models that allow regularisation. To this end we look at a variety of forms of scale mixture of normal prior distributions, including forms of hyper-lasso, and at approaches to robustness and stability of discrimination. We are particularly interested in fast algorithms that scale up to very many variables and are flexible enough to allow a variety of prior structures.

Keywords: Bayesian methods; Hyper-lasso; Bayesian Wavelet functional modelling; MCMC; EM algorithm.

18:00 to 18:45 Wine Reception at Newton Institute
18:45 to 19:30 Dinner at Wolfson Court (Residents only)
Tuesday 12th December 2006
09:00 to 10:00 JM Thornton (European Bioinformatics Institute)
From protein structure to biological function: progress and limitations

Understanding the relationship between protein structure and biological function has long been a major goal of structural biology. With the advent of many structural genomics projects, there is a practical need for tools to analyse and characterise the possible functional attributes for a new structure.

One of the major challenges in assigning function is to recognise a cognate ligand which may be a small molecule or a large macromolecule. At EBI we have been developing a range of methods which seek to annotate a functional site. These methods include:

• Using sequence data and global and local structure comparisons to recognise distant relatives or short sequence patterns that are characteristic of binding sites.
• Using 3-dimensional templates for functional sites defined from proteins of known structure and function which can identify similarities between the query protein and other proteins in the PDB.
• Using spherical harmonics to define the shape of a binding site and to compare this shape with all known binding sites in the PDB and with the small molecule metabolome.

These methods have some success, depending on the shape and flexibility of the binding site. In this presentation I will review our progress in this area and describe an application to the sulphur transferases. Some of this work has been integrated into the ProFunc pipeline (coordinated by R.A. Laskowski), a web server that provides automated annotation for a new protein structure.

Laskowski, R.A., Watson, J.D. & Thornton, J.M. (2005) ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res., 33, W89-W93.

Laskowski, R.A., Watson, J.D. & Thornton, J.M. (2005) Protein function prediction using local 3D templates. J. Mol. Biol., 351, 614-626.

Morris, R.J., Kahraman, A. & Thornton, J.M. (2005) Binding pocket shape analysis for protein function prediction. Acta Cryst. D61, C156-C157.

10:00 to 10:45 K Walter (MRC Biostatistics Unit, Cambridge)
Modelling the boundaries of highly conserved non-coding DNA sequences in vertebrates

A comparison of the human and fish genomes produced more than 1000 highly conserved non-coding elements (CNEs), sequences of DNA that show a remarkable degree of similarity between human and fish despite an evolutionary distance of about 900 million years.

The high sequence conservation suggests that these CNEs possess some kind of function, though neither their function nor which part of their sequence is functional has yet been well defined. Since each CNE was defined by a pairwise sequence alignment, its boundary might not be accurate enough to design biological experiments to help identify its role in the genome. As a first step, an examination of the CNEs' nucleotide composition revealed a striking A+T pattern at the CNE boundary in fish as well as in human.

As a further step, we propose a probabilistic model that takes into consideration not only nucleotide composition but also phylogenetic information, and that aims to define the functional part of CNEs by using multiple sequence alignments of human, mouse, chicken, frog and fish.

10:45 to 11:30 Coffee at Newton Institute
11:30 to 12:30 Bayesian analysis of gene expression data

Microarray experiments and gene expression data have a number of characteristics that make them attractive but challenging for Bayesian analysis. There are many sources of variability, the variability is structured at different levels (array specific, gene specific, ...) and the ratio of signal to noise is low. Typical experiments involve few samples but a large number of genes, so that borrowing information, e.g. across genes, to improve inference becomes essential. Hence embedding the inference in a hierarchical model formulation is natural.

Bayesian models adapted to the level of information processed have been developed to address some of the questions raised, which range from modelling the signal to synthesising gene lists across different experiments. In this talk, I shall illustrate their use in a variety of contexts: probe level models attempting to quantify uncertainty of the signal, differential expression mixture models and gene list synthesis. Cutting across these developments are important issues of MCMC performance and model checking. These issues will be illustrated on case studies.

This is joint work with colleagues on the BGX project: Marta Blangiardo, Natalia Bochkina, Anne Mette Hein (now at Aarhus), Alex Lewin, Ernest Turro and Peter Green (Bristol).

12:30 to 13:30 Lunch at Wolfson Court
14:00 to 15:00 M Dermitzakis (Wellcome Trust, Sanger Institute)
Inference of cis and trans regulatory variation in the human genome

The recent comparative analysis of the human genome has revealed a large number of conserved non-coding sequences in mammalian genomes. However, our understanding of the function of non-coding DNA is very limited. In this talk I will present recent analyses by my group and collaborators that aim to identify functionally variable regulatory regions in the human genome by correlating SNPs and copy number variants with gene expression data. I will also present some analyses on the inference of trans regulatory interactions and the evolutionary consequences of gene expression variation.

15:00 to 15:30 Tea at Newton Institute
15:30 to 16:15 A Bayesian approach to association mapping in admixed populations

Admixed populations, such as African Americans, are an important resource for identifying mutations contributing to human disease. Until recently, disease mapping in such populations has used so-called "admixture linkage disequilibrium" to map loci for diseases whose incidence varies markedly between populations. These analysis methods typically require unlinked markers spaced at wide intervals, while data now becoming available provide genotype information at much higher densities. High density data provide exquisite information about admixture, as well as association information, but present methodological challenges. We have developed and implemented an approach that uses such data to probabilistically infer admixture segments and to perform disease mapping. The approach uses the HapMap data as a framework and employs a fully Bayesian methodology, providing a natural weighting of both broad-scale admixture linkage disequilibrium and fine-scale association information. The software is currently being applied to data from a prostate cancer association study examining African American cases and controls. Future directions based on population genetics are briefly discussed.

16:15 to 17:00 C Barnes (Wellcome Trust, Sanger Institute)
Techniques for the detection of copy number variation using SNP genotyping arrays

The key goal of medical genetics is the search for the genetic variation responsible for disease. A major focus is on the use of single nucleotide polymorphism (SNP) genotyping arrays for genome wide association studies. However, recent studies suggest that copy number variation (CNV) accounts for a significant fraction of the total variation in the human genome. The susceptibility to a number of diseases, including HIV infection, is already known to be associated with copy number variants but the full functional and phenotypic impact of CNVs is not yet fully understood. In order to search for CNVs using SNP genotyping platforms we have developed a number of normalization schemes. These incorporate allele specific corrections, quantile normalization and corrections for the source, GC content and length of the PCR products. We have also developed methods of locating and categorizing CNVs using both existing algorithms, such as SWArray and CBS, and novel tools. These are implemented within a high throughput framework, essential for processing large datasets already available and from future projects. Here we present a map of common CNVs based on studies of a large set of healthy individuals.
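Of the normalization steps listed, quantile normalization is the easiest to show in isolation. The following is a minimal sketch of the generic technique, not the pipeline described in the talk: it forces every array to share one reference distribution, taken here as the mean of the sorted intensities across arrays. The function name and the toy intensities are assumptions, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists, one list per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest intensity across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Probe indices ordered by increasing intensity (stable, so ties keep their order).
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]   # replace each value by the reference quantile of its rank
        normalized.append(out)
    return normalized

# Hypothetical intensities for three probes on two arrays.
raw = [[5.0, 2.0, 3.0],
       [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
```

After this step every array contains the same set of values, and only the ranking of probes within each array is retained.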

17:00 to 18:00 Poster Session 1 and Wine Reception at Newton Institute
18:45 to 19:30 Dinner at Wolfson Court (Residents only)
Wednesday 13th December 2006
09:00 to 10:00 Minimal ancestral recombination graphs

Finding the ancestral recombination graph (ARG) that explains a data set with the fewest recombination events is the parsimony-ARG analogue of finding parsimonious phylogenies. This is a hard computational problem, and two main approaches will be discussed. The first is a "scan along sequences" dynamic programming approach that works for up to 10 sequences of any length. The second is a "trace history back in time" branch and bound approach that can work very fast for a much larger number of sequences, but can also fail totally, depending on the data. The second approach can also be extended to include gene conversion. Finally, the number of ancestral states that could be encountered for a given data set is counted for small numbers of sequences and segregating sites. It is also illustrated how likelihood calculations can be done on a restricted graph that contains close to minimal histories of a set of sequences.
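Neither of the two algorithms above is reproduced here. As a much simpler related illustration, the classical four-gamete test gives the Hudson-Kaplan lower bound on the number of recombination events, which any minimal ARG must meet or exceed. The sketch assumes binary (0/1) sites under the infinite-sites model. The function names and toy haplotypes are hypothetical.

```python
from itertools import combinations

def incompatible(col_a, col_b):
    """Four-gamete test: True when all four gametes 00, 01, 10 and 11 are observed."""
    return len(set(zip(col_a, col_b))) == 4

def hudson_kaplan_bound(haplotypes):
    """Lower bound on the number of recombinations for binary haplotype strings."""
    n_sites = len(haplotypes[0])
    columns = [[h[s] for h in haplotypes] for s in range(n_sites)]
    # Every incompatible site pair (i, j) forces a recombination somewhere between i and j.
    intervals = [(i, j) for i, j in combinations(range(n_sites), 2)
                 if incompatible(columns[i], columns[j])]
    # Greedy selection of the largest set of non-overlapping intervals.
    intervals.sort(key=lambda interval: interval[1])
    count, last_end = 0, 0
    for start, end in intervals:
        if start >= last_end:   # intervals may share an endpoint
            count += 1
            last_end = end
    return count

# Hypothetical sample of four haplotypes at three segregating sites.
sample = ["000", "011", "101", "110"]
bound = hudson_kaplan_bound(sample)
```

This bound is cheap to compute but can be far below the true minimum, which is one reason why sharper bounds and exact searches are studied in the references that follow.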

Allen, B. and Steel, M., Subtree transfer operations and their induced metrics on evolutionary trees, Annals of Combinatorics 5, 1-13 (2001)

Baroni, M., Grunewald, S., Moulton, V., and Semple, C. Bounding the number of hybridisation events for a consistent evolutionary history. Journal of Mathematical Biology 51 (2005), 171-182

Bordewich, M. and Semple, C. On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 8 (2004), 409-423

Hein, J.J., Jiang, T., Wang, L. & Zhang, K. (1996): "On the complexity of comparing evolutionary trees" Discrete Applied Mathematics 71, 153-169.

Hein, J., Schierup, M. & Wiuf, C. (2004) Gene Genealogies, Variation and Evolution, Oxford University Press

Lyngsø, R.B., Song, Y.S. & Hein, J. (2005) Minimum Recombination Histories by Branch and Bound. Lecture Notes in Bioinformatics: Proceedings of WABI 2005 3692: 239–250.

Myers, S. R. and Griffiths, R. C. (2003). Bounds on the minimum number of recombination events in a sample history. Genetics 163, 375-394.

Song, Y.S., Lyngsø, R.B. & Hein, J. (2005) Counting Ancestral States in Population Genetics. Submitted.

Song, Y.S. & Hein, J. (2005) Constructing Minimal Ancestral Recombination Graphs. J. Comp. Biol., 12:147–169

Song, Y.S. & Hein, J. (2004) On the minimum number of recombination events in the evolutionary history of DNA sequences. J. Math. Biol., 48:160–186.

Song, Y.S. & Hein, J. (2003) Parsimonious reconstruction of sequence evolution and haplotype blocks: finding the minimum number of recombination events, Lecture Notes in Bioinformatics, Proceedings of WABI'03, 2812:287–302.

10:00 to 10:45 Estimating the effects of SNPs on protein structure

Understanding the effects that non-synonymous single nucleotide polymorphisms have on the structures of the gene products, the proteins, is important in identifying the origins of complex diseases. A method based on environment-specific amino acid substitutions observed within homologous protein families with known 3D structures was used to predict changes in stability caused by mutations. In the task of predicting only the sign of the stability change, our method performs comparably to or better than other published methods, with an accuracy of 71%. The method was applied to a set of disease associated and non-disease associated mutations and was shown to distinguish the two sets in terms of protein stability. Our method may therefore have application in correlating SNPs with diseases caused by protein instability.

10:45 to 11:30 Coffee at Newton Institute
11:30 to 12:30 Probabilistic modelling of metabolic regulation in prokaryotes

Surprisingly little is known about regulatory processes in prokaryotes outside a small group of model species such as Escherichia coli. Probabilistic models can help to combine the comparatively sparse direct experimental evidence for regulation in less well known organisms such as Mycobacterium tuberculosis with gene expression data and results from the application of bioinformatics and genomic tools. I will discuss the challenges of such a project and some of the statistical concepts that might be useful for tackling them.

12:30 to 13:30 Lunch at Wolfson Court INI 1
14:30 to 16:30 Cambridge Walking Tour
20:00 to 18:00 Conference Dinner at Corpus Christi College (Dining Hall)
Thursday 14th December 2006
09:00 to 10:00 Estimating genealogies from marker data: a Bayesian approach

An issue often encountered in statistical genetics is whether, or to what extent, it is possible to estimate the degree to which individuals sampled from a background population are related to each other, on the basis of the available diploid multi-locus genotype data and some information on the demography of that population. In this talk, this question is considered by using an explicit modelling and reconstruction of the pedigrees and gene flows at the marker loci. For computational reasons, the analysis is restricted to a relatively recent history of the population, currently extending, depending on the data, up to ten or twenty generations backwards in time. As a computational tool, we use Markov Chain Monte Carlo numerical integration on the state space of genealogies of the sample individuals. The main technical challenge has been in devising a variety of joint proposal distributions which would guarantee that the algorithm has reasonable mixing properties. As illustrations of the method, we consider the question of relatedness both in terms of individuals (pedigree based relatedness estimation) and at the level of genes/genomes (IBD-estimation), using both simulated and real data.

10:00 to 10:45 Detecting natural selection with empirical codon models: a synthesis of population genetics and molecular phylogenetics

The estimation of empirical codon models sheds new light on recently discussed questions about biological pressures and processes acting during codon sequence evolution (Averof et al., Science 287:1283 (2000), Bazykin et al., Nature 429:558 (2004), Friedman and Hughes, MBE 22:1285 (2005), Keightley et al., PLoS Biol 3:282 (2005)).

My results show that modelling the evolutionary process is improved by allowing for single, double and triple nucleotide changes; the affiliation between DNA triplets and the amino acid they encode is a main factor driving evolution; and the nonsynonymous-synonymous rate ratio is a suitable measure to classify substitution patterns observed for different proteins. However, comparing models estimated from genomic data and polymorphism data indicates that double and triple changes are not instantaneous.

This new view of how codon evolution proceeds has consequences for selection studies. I will argue that, under the new empirical codon model, purifying selection appears less purifying and cases of positive selection appear weaker than under the standard codon models (Yang et al., Genetics 155: 431-449 (2000)).

10:45 to 11:30 Coffee at Newton Institute
11:30 to 12:30 R Nielsen (Copenhagen)
Detecting selection from population genetic data

We will present an analysis of several large scale human population genetic data sets. We use a combination of simulation and analytical approaches to identify genes and genomic regions targeted by Darwinian selection. The biological implications of some of the results are discussed.

12:30 to 13:30 Lunch at Wolfson Court
14:00 to 15:00 N Patterson (Broad Institute, MIT)
Population structure and eigenanalysis

When analyzing genetic data, one often wishes to determine if the samples are from a population that has structure. Can the samples be regarded as randomly chosen from a homogeneous population, or does the data imply that the population is not genetically homogeneous? We show that an old method (principal components) together with modern statistics (Tracy-Widom theory) can be combined to yield a fast and effective answer to this question. The technique is simple and practical on the largest datasets, and can be applied both to biallelic genetic markers and to markers that are highly polymorphic, such as microsatellites. The theory also allows us to estimate the data size needed to detect structure if our samples are in fact from two populations that have a given, but small, level of differentiation.
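The principal components step can be sketched with NumPy as follows. The abstract does not give the normalisation, so the centring and scaling of each biallelic marker by its estimated allele frequency is an assumption. The Tracy-Widom significance test on the leading eigenvalue is not implemented here. The function name and the simulated two-population genotypes are hypothetical and deliberately exaggerated.

```python
import numpy as np

def genotype_eigenanalysis(genotypes):
    """Eigen-decompose the sample-by-sample covariance of a 0/1/2 genotype matrix.

    genotypes: array of shape (samples, markers) holding allele counts.
    Returns eigenvalues in decreasing order and the matching eigenvectors as columns.
    """
    g = np.asarray(genotypes, dtype=float)
    mean = g.mean(axis=0)
    freq = mean / 2.0                      # estimated allele frequency per marker
    keep = (freq > 0) & (freq < 1)         # drop monomorphic markers
    scaled = (g[:, keep] - mean[keep]) / np.sqrt(freq[keep] * (1 - freq[keep]))
    cov = scaled @ scaled.T / scaled.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Hypothetical data: two populations of 10 samples with very different allele frequencies.
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.2, size=(10, 200))
pop_b = rng.binomial(2, 0.8, size=(10, 200))
eigvals, eigvecs = genotype_eigenanalysis(np.vstack([pop_a, pop_b]))
leading = eigvecs[:, 0]   # coordinates of the samples on the first principal axis
```

With structure this strong, the sign of `leading` separates the two groups. Deciding whether a leading eigenvalue is significant when differentiation is small is where the Tracy-Widom theory comes in.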

15:00 to 15:30 Tea at Newton Institute INI 1
15:30 to 16:15 C Bird (Wellcome Trust, Sanger Institute)
Exploring the role of noncoding DNA in the function of the human genome through variation

The function of conserved non-coding DNA sequences (CNCs) has been a matter of speculation since their discovery. We have begun to investigate this by using variation data to study the effect of genomic location upon these sequences. We have used the phase II HapMap consortium SNP data to first investigate the signature of selective constraint in non-coding regions. Our results show that new (derived) alleles of SNPs within CNCs are rarer than new alleles in nonconserved regions (P = 3 x 10^-18), indicating that evolutionary pressure has suppressed CNC-derived allele frequencies. We have used whole genome alignments of the human, chimp and macaque genomes to identify 1356 non-coding sequences, conserved across multiple mammals, which show a significantly accelerated substitution rate in the human lineage, as indicated by a relative rate test in the human-chimp-macaque alignments. We subsequently test which of these 1356 sequences are a result of relaxation of selective constraint and which of positive selection. Detectable segmental duplications are by their nature primate-specific events. An intriguing question is whether these rapidly evolving CNCs are enriched within segmental duplications. The accelerated CNCs could be due to a loss of selective constraint or to positive selection, and either of these scenarios could relate to differential gene expression patterns between their associated paralogous genes. We have identified an enrichment of accelerated CNCs in the most recently formed segmental duplications. We are currently investigating the potential for reciprocal changes in duplicated CNCs. We have also recently identified a group of accelerated CNCs that contain SNPs identified as significant contributors to gene expression variation. We will present our current computational and functional analysis of the evolutionary properties of CNCs within and between species and of the functional consequences for gene expression.

16:15 to 17:00 C Hoggart (Imperial College London)
A hybrid Bayesian method for detecting multiple causal variants from Genome-Wide association studies

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants of small effect, which is a plausible scenario for many complex diseases. Moreover, many simulation studies assume a single causal variant and so more complex realities are ignored. Analysing large numbers of variants simultaneously is now becoming feasible, thanks to developments in Bayesian stochastic search methods. We combine Bayesian shrinkage methods together with a local stochastic model search to identify complex interactions, both local and distal. Our approach can analyse up to 10,000 SNPs simultaneously, and finds multiple potential disease models each with an associated probability. We illustrate its power in comparison with a range of alternative methods, in simulations that incorporate multiple causal loci, acting singly or in interacting pairs, among 4,000 SNPs in a 20Mb region. We argue that, implemented in a two-stage procedure, our hybrid Bayesian analysis can provide a powerful solution to the problem of extracting maximal information from genome-wide association studies.

17:00 to 18:00 Poster Session 2 and Wine Reception at Newton Institute
18:45 to 19:30 Dinner at Wolfson Court (Residents only)
Friday 15th December 2006
09:00 to 10:00 A Thomas (Utah)
Towards linkage analysis with markers in linkage disequilibrium by graphical modelling

Recent developments of MCMC integration methods for computations on graphical models for two applications in statistical genetics are reviewed: modelling allelic association and pedigree based linkage analysis. Estimation of graphical models from haploid and diploid genotypes, and the importance of MCMC updating schemes beyond what is strictly necessary for irreducibility, are described and illustrated. We then outline an approach combining these methods to compute linkage statistics when alleles at the marker loci are in linkage disequilibrium. Other extensions suitable for analysis of SNP genotype data in pedigrees are also discussed and programs that implement these methods, and which are available from the author's web site, are described. We conclude with a discussion of how this still experimental approach might be further developed.

10:00 to 10:45 Approximate Bayesian computation vs Markov chain Monte Carlo

Approximate Bayesian Computation (ABC) is a recently developed Bayesian technique that can be used to extract information from DNA data. The method was first introduced to population genetics in 1997 by Pritchard.

Since Beaumont's paper on the subject in 2002, its use has increased strongly. This Bayesian approach is used to estimate several parameters of the demographic history of populations from DNA data. Its main advantages are reduced computation time and increased efficiency and flexibility when dealing with multiparameter models.

In this project, a particular ABC method, similar to the one used by Beaumont in 2006, was compared with a commonly used Markov chain Monte Carlo (MCMC) method (Hey and Nielsen, 2004) in order to assess the accuracy of the ABC method. The use of the ABC method with more complex demographic models was also explored. Both approaches use DNA sequence data to extract demographic information (e.g. population sizes, times of splitting events, migration rates).

The study confirms that the ABC method is competitive with an MCMC approach, and points to its potential role in research involving more complex, and therefore more realistic, models.
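The basic idea behind ABC can be shown with a minimal rejection sampler. This is a toy sketch and not the method evaluated in the talk: it infers a single allele frequency from a count of derived alleles rather than demographic parameters from sequence data. The prior, tolerance, summary statistic and all function names are assumptions chosen for illustration.

```python
import random

def abc_rejection(observed, simulate, prior_draw, distance, tolerance, n_draws, seed=0):
    """Keep the prior draws whose simulated summary lies within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)                 # 1. draw a parameter from the prior
        simulated = simulate(theta, rng)        # 2. simulate data under that parameter
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)              # 3. accept if close to the observation
    return accepted

def simulate_count(freq, rng, n_chromosomes=50):
    """Toy model: number of derived alleles among sampled chromosomes."""
    return sum(rng.random() < freq for _ in range(n_chromosomes))

# Hypothetical observation: 15 derived alleles among 50 chromosomes, uniform prior on the frequency.
accepted = abc_rejection(
    observed=15,
    simulate=simulate_count,
    prior_draw=lambda rng: rng.random(),
    distance=lambda simulated, observed: abs(simulated - observed),
    tolerance=1,
    n_draws=20000,
)
posterior_mean = sum(accepted) / len(accepted)
```

The accepted draws approximate the posterior without ever evaluating a likelihood, which is what makes the approach attractive for complex demographic models. The tolerance trades accuracy against acceptance rate.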

10:45 to 11:30 Coffee at Newton Institute
11:30 to 12:30 Z Yang (University College London)
Lindley's paradox, star-tree paradox, and Bayesian phylogenetics
12:30 to 13:30 Lunch at Wolfson Court
14:00 to 15:00 A genome-wide association study in breast cancer
18:45 to 19:30 Dinner at Wolfson Court (Residents only)