skip to content
 

Seminars (PLG)

Videos and presentation materials from other INI events are also available.

Search seminar archive

Event When Speaker Title Presentation Material
PLGW01 3rd September 2007
10:00 to 11:00
Computational and mathematical challenges involved in very large-scale phylogenetics

Phylogenetic inference presents enormous computational and mathematical challenges, but these are particularly exacerbated when dealing with very large datasets (containing thousands of sequences) or when sequences evolve under complex models of evoluiton. In this talk, I will describe some of the recent progress in large-scale phylogenetics. In particular, I will talk about multiple sequence alignment and its implications for large-scale phylogenetics.

Related Links

PLGW01 3rd September 2007
11:30 to 11:50
Dendroscope - an interactive viewer for large phylogenetic trees

Research in evolution requires software for visualizing and editing phylogenetic trees, for increasingly very large datasets, such as arise in expression analysis or metagenomics, for example. It would be desirable to have a program that provides these services in an efficient and user-friendly way, and that can be easily installed and run on all major operating systems. Although a large number of tree visualization tools are freely available, some as a part of more comprehensive analysis packages, all have drawbacks in one or more domains. They either lack some of the standard tree visualization techniques or basic graphics and editing features, or they are restricted to small trees containing only tens of thousands of taxa. Moreover, many programs are difficult to install or are not available for all common operating systems. We have developed a new program, Dendroscope, for the interactive visualization and navigation of phylogenetic trees. The program provides all standard tree visualizations and is optimized to run interactively on trees containing hundreds of thousands of taxa. The program provides tree editing and graphics export capabilities. To support the inspection of large trees, Dendroscope offers a magnification tool. The software is written in Java 1.4 and installers are provided for Linux/Unix, MacOS X and Windows XP. Dendroscope is a user-friendly program for visualizing and navigating phylogenetic trees, for both small and large datasets.

Related Links

PLGW01 3rd September 2007
11:50 to 12:10
Appropriate models for heterogeneous multi-gene data sites

Multigene data sets are becoming the norm for phylogenetic studies; so called ‘phylogenomic datasets’ may involve hundreds of genes for many species. These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes. The question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Here we discuss two approaches. The first approach involves stochastic partitioning of genes into different classes, the second approach is a form of penalised maximum likelihood where the penalty term 'encourages' parameter estimates to be similar between different gene trees.

PLGW01 3rd September 2007
12:10 to 12:30
H Li Incorporating speices phylogeny in the reconstruction of gene trees

Reconstructing accurate trees is one of the most challenging topics in phylogenetics. In this talk, two new algorithms will be presented in attempt to improve the accuracy of gene tree building by incorporating species phylogeny. Both of them are presented in the CFG (Context-Free Grammar) framework. The first algorithm is called tree merge, which takes a set of gene trees built by different algorithms and then merges them together to get a new tree that can be better than all the input. The second algorithm calculates a likelihood with respect to the species evolution and incorporates this likelihood to the likelihood calculated from the sequence alignment. Maximum likelihood tree can then be searched. The two algorithms have been implemented in NJTREE software which is the core engine of the TreeFam database. NJTREE is also extensively used in the orthology prediction pipeline of Ensembl Compara, and is proved to be more accurate than traditional algorithms in reconstructing phylogenetic trees in a multi-gene family.

Related Links

PLGW01 3rd September 2007
14:00 to 15:00
Mixture models in phylogenetic inference

I describe a general likelihood-based ‘mixture model’ approach for inferring phylogenetic trees from gene-sequence or other character- state data. Conventional models of gene-sequence evolution assume that all sites evolve according to a single homogeneous model of evolution or that rates of evolution vary among sites according to some statistical distribution, such as the gamma. In a phylogenetic context, a mixture model allows for more than one model of the evolutionary process to be fitted to each site in the alignment, with the likelihood being summed over models at each site. I describe how mixture modelling can accommodate cases in which different sites in the alignment evolve in qualitatively distinct ways, a phenomenon we call ‘pattern heterogeneity’. Mixture models can also accommodate sites whose rate of evolution varies in different parts of the tree. This is often called ‘heterotachy’ and I show how a mixture model approach can improve on a parametric covarion model for heterotachous data, and how the model can be used to identify which sites are evolving heterotachously and where in the tree this behaviour occurs. The mixture model does not require prior knowledge of the patterns or of the nature of the heterotachy in the data, nor does it partition the data. We present studies to show that the model correctly retrieves known patterns from simulated gene-sequence data, and present evidence that it generally improves the likelihood of the data substantially over simpler models of gene-sequence evolution. Mixture modelling seems a promising technique for discovering some of the complex evolutionary processes that underly sequence evolution, and may be of interest to researchers who wish to identify how specific sites respond adaptively to particular environments. We implement the model within a Bayesian Markov-Chain Monte Carlo framework, and make it available in our BayesPhylogenies software (www.evolution.rdg.ac.uk).

PLGW01 3rd September 2007
15:30 to 15:50
G Margos Multilocus sequence analysis of Borrelia burgdorferi s.l.

Lyme borreliosis, the most frequent vector-borne disease in the Northern Hemisphere, is caused by spirochaetes belonging to the genus Borrelia. These bacteria now comprise 12 named "species", thus forming a species complex referred to as Borrelia burdorferi sensu lato. All known genotypes of this species complex are maintained by vertebrate hosts and ticks, however, their maintenance systems and pathogenic properties are remarkably diverse. A major question addressed in our project is whether genotypically and ecologically defined Borrelia populations are congruent. So far, most genotyping schemes of Borrelia are based on non-coding spacer regions and genes encoding outer surface proteins, many of which are under frequency-dependent immune selection or prone to recombination. In order to better resolve the geographical population structure and the phylogenetic relationships of these bacteria, we are currently developing and exploring multilocus sequence analysis (MLSA) based on housekeeping genes. In this talk I will present some 'proof of principle' data on the suitability of MLSA for genotyping spirochaetal strains.

PLGW01 3rd September 2007
15:50 to 16:10
Comparing phylogenetic trees

Comparing phylogenetic trees is an important part of analyzing biological data sets. Many of the popular distance metrics are NP-hard but have been shown to efficient to approximate and tractable in small cases. We discuss recent results in comparing two phylogenetic trees under the tree-bisection-and-reconnection (TBR) and subtree-prune-and-regraft (SPR) distances.

PLGW01 3rd September 2007
16:10 to 16:30
Phylogenetic reconstructions based on RNA secondary structure

It is known that RNA secondary structure motifs are usually conserved, and have a vital influence on the activity of the RNA molecules. We present a novel approach for phylogenetic reconstruction, based on comparing secondary structures of homologous RNA molecules from different species. By manipulating the RSmatch algorithm for measuring RNA secondary structure similarity, we computed pairwise distances between secondary structures. Consequently, we applied the neighbor joining algorithm on the distance matrix, to produce phylogenetic trees. Initial analysis of the results exhibits a good agreement with the accepted taxonomy. Comparison with the trees based RNA sequences alone have also produced favorable results.

This is joint work with Moran Cabili, Assaf Meirovich, and Metsada Pasmanik-Chor

PLGW01 4th September 2007
09:00 to 10:00
Advances and limitations of maximum likelihood phylogenetics

I'll present recent developments in inferring phylogenies from sequences by maximum likelihood (ML). In fact, this approach involves two main components: probabilistic modelling of substitution events, and algorithmics to infer near-optimal trees. I'll show that algorithmics has considerably progressed during the last few years, allowing now for fast ML inference from large datasets. On the other hand, I'll underline the difficulties encountered with modelling, especially of protein evolution, and show that elaborating more realistic and improved substitution models is a major challenge.

PLGW01 4th September 2007
10:00 to 10:20
Restrictions on meaningful phylogenetic networks
PLGW01 4th September 2007
10:20 to 10:40
Inference rules for super-network construction

A fundamental aspect of understanding molecular evolution is species phylogeny reconstruction. A commonly used approach in this context is to infer it from a collection of gene tress using, for example, a supertree or supermatrix method. Although popular, one of the main drawbacks of such a method is that it can suppress conflict within the gene trees arising from, for example, lateral gene transfer or recombination. Until now, the only known approaches capable of accommodating such conflict have been Z-closure super-networks and Q-imputation super-networks.

In this talk, we introduce the novel Y-inference rule and show that under certain circumstances this inference rule on its own or in combination with one Meacham's inference rules constitute an attractive alternative approach that complements the above list of super-network approaches.

This is joint work with Stefan Gruenewald (AS-MPG Partner Institute for Computational Biology, China) and Qiong Wu (University of East Anglia, UK).

PLGW01 4th September 2007
10:40 to 11:00
Maximum parsimony on phylogenetic networks: some rigorous results

Phylogenies, the evolutionary histories of groups of organisms, play a major role in representing the interrelationships among biological entities. Although many biological processes can be effectively modeled and summarized as a binary tree, others cannot: recombination, hybrid speciation, and horizontal gene transfer result in networks of relationships rather than trees of relationships.

Maximum parsimony (MP) is one of the most popular methods used for phylogenetic tree reconstruction. Roughly, this method is based on the assumption that "evolution is parsimonious", that is, the best evolutionary trees are the ones that minimize the number of changes along the edges of the tree.

In previous works, we formulated a maximum parsimony (MP) criterion for reconstructing and evaluating phylogenetic networks, and demonstrated its quality on biological as well as synthetic datasets. In this work, we show further theoretical insights and results regarding various aspects of the MP criterion of phylogenetic networks.

PLGW01 4th September 2007
11:30 to 11:50
Integer programming for phylogenetic computations

We consider the following three related problems: 1) Given complete data (taxon by character data) remove the minimum number of characters so that the remaining data is pairwise compatible (contains a perfect phylogeny); 2) Given data with missing entries, impute the missing data in order to minimize, in the completed data, the number of pairs of characters that are incompatible; 3) Given data with missing entries, impute the missing data to minimize, in the completed data, the number of characters that must be removed so that the remaining sites are pairwise compatible. We consider in detail the cases of two, three and four states allowed per character and describe compact integer programming formulations that solve these problems (the approach extends in principle to any fixed number of states). We discuss empirical tests on problem instances of size of interest in phylogenetics, and establish that these instances can be solved very efficiently in practice, even though the problems are all NP-hard.

Related Links

PLGW01 4th September 2007
11:50 to 12:10
Quartet compatability: new results and surprising counterexamples

Many methods in phylogenetic tree reconstruction are not feasible for a large number of taxa. Hence, it is a natural approach to construct trees on small taxa sets first and then to look for a big tree (supertree) on the union of the taxa sets that contains all information of the small trees. However, even if all input trees are quartets, i.e. binary trees with four leaves, it is an NP-complete problem to decide whether they are compatible.

I will present a sufficient condition for a set of binary phylogenetic trees to be compatible. That result can be used to give a much shorter proof of the known characterization of quartet sets of minimal size which are defining, i.e. compatible with a unique supertree. Further, I will show some quartet sets with surprising properties which are counterintuitive and explain the hardness of the compatibility problem. One example is an incompatible collection of quartets for which every two quartets overlap in at most one taxon. It has been known for fifteen years that such quartet sets exist but the example is the first one that has been constructed.

PLGW01 4th September 2007
12:10 to 12:30
On k-leaf powers

In 2002, Nishimura, Ragde and Thilikos introduced the notion of k-leaf power and k-leaf root, motivated by the following: ``... a fundamental problem in computational biology is the reconstruction of the phylogeny, or evolutionary history, of a set of species or genes, typically represented as a phylogenetic tree ...''. The species occur as leaves of the phylogenetic tree. Let G=(V_G,E_G) be a finite undirected graph. For k at least 2, a tree T is a $k$-leaf root of G if V_G is the leaf set of T and two vertices x,y in V_G are adjacent in G if and only if their distance d_T(x,y) in T is at most k. The graph G is a $k$-leaf power if it has a k-leaf root. Obviously, a graph is a 2-leaf power if and only if it is the disjoint union of cliques, i.e., it contains no induced path of three vertices. Recent work on leaf powers contains characterisations of 3- and 4-leaf powers and their variants. It is well known that every k-leaf power is a strongly chordal graph. For k at least 6, no characterisation of $k$-leaf powers and no efficient recognition is known. We discuss some open problems and new results on leaf powers.

PLGW01 4th September 2007
14:00 to 15:00
Tripartitions in phylogenetic networks

Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of reticulate evolution events, like recombination, hybridization, and lateral gene transfer. In a recent series of papers devoted to the study of reconstructibility of phylogenetic networks, the well-known bipartition metric for phylogenetic trees was generalized to the so-called tripartition metric for phylogenetic networks. In this talk, we show that, in fact, this tripartition metric does not satisfy the separation axiom of distances (zero distance means isomorphism, or, in a more relaxed version, zero distance means indistinguishability in some specific sense) in any of the classes of phylogenetic networks where it is claimed to do so. We also present a class of phylogenetic networks whose members can be singled out by means of their sets of tripartitions, and hence, where the latter can be used to define a meaningful metric.

Related Links

PLGW01 4th September 2007
15:30 to 15:50
Branch and bound construction of balanced minimum evolution optimal trees

The Balanced Minimum Evolution (BME) score has been recently proposed as a criterion for reconstructing phylogenetic trees based on a distance matrix. It is based on Pauplin's formula, which provides a natural estimate of the total length of a tree, based on its topology and a matrix of estimated pairwise distances. The objective is to find the tree topology that minimizes this length estimate, as short trees are usually the ones that best reflect the data.

Recently, it has been shown that Neighbor Joining can be viewed as a greedy algorithm trying to minimise BME. Together with other theoretical reasons, there are strong experimental reasons supporting BME-guided tree reconstruction. However, the published methods are heuristic and do not attempt to construct BME-optimal trees.

The main aim of this talk will be to present a Branch and Bound approach for finding BME-optimal trees. We derived a simple bound on the BME score of a tree based on the score of a partially constructed tree. This allows us to eliminate the need to explore large parts of the space of all possible trees, but still guarantees that all optimal trees will be found. The efficiency of this approach compares well with that of other Branch and Bound approaches such as the ones for Maximum Parsimony. Finally the topological accuracy of the reconstructed trees will also be discussed.

PLGW01 4th September 2007
15:50 to 16:10
New analytical results for the speciation times in neutral models

Did a reconstructed phylogeny evolve under a neutral model? We provide an analytical way – opposed to the common simulation approach – for comparing the speciation times of a reconstructed tree with the dates proposed by the neutral models, e.g. via lineages-through-time plots. Further, if dates are missing in a reconstructed tree which evolved under a neutral model, we provide the time of the speciation events. These methods, which make use of the rank function on a phylogenetic tree, can also be applied for calculating analytically the loss in biodiversity when some species become extinct. The neutral models considered in this work are the Yule model and a generalization which includes extinction as well, the critical branching process (CBP) model, recently introduced by D. Aldous and L. Popovic. Trees under these neutral models are conditioned on having $n$ extant species today which is crucial for data analysis.

PLGW01 4th September 2007
16:10 to 16:30
Partitioning the sample space on five taxa for the neighbour joining algorithm

In this talk, we will analyze the behavior of the Neighbor Joining algorithm on five taxa and we will show that the partition of the sample (data) space for estimation of a tree topology with five taxa into subspaces, within each of which the Neighbor Joining algorithm returns the same tree topology. A key of our method to partition the sample space is the action of the symmetric group $S_5$ on the set of distance matrices by changing the labels of leaves. The method described in this paper can be generalized to trees with more than five taxa.

This is joint work with Kord Eickmeyer and a preprint is available at http://arxiv.org/abs/math.CO/0703081.

Related Links

PLGW01 5th September 2007
09:00 to 10:00
Reconstruction of phylogeny by stratophenetic tracing in the fossil record and its comparison with molecular data

The information on the relatedness between taxa and the direction of evolution are only indirectly coded in the input data of all molecular phylogenies. Fossil record, on the other hand, contains direct information on the order, absolute timing and nature of evolutionary events. There is, however, a small hitch: the fossil record only preserves a fraction of the phylogenetically relevant information and it is often so patchy and incomplete, that reconstructions of phylogeny based on fossil data are not superior to molecular methods. A notable exception is presented by the fossil record of marine microplankton, in particular planktonic foraminifera. Their calcite shells are rich in characters and their prolific production and excellent preservation in oceanic sediments produced probably the most complete fossil record on Earth. Here, phylogenetic relationships can be reconstructed by the method of stratophenetic tracing – the nearest one can come to direct observations from a time machine – where both the topology and the timing of divergences are inferred by back-peeling of layers of sediments and tracking changes in morphology of the evolving species through time. The fossil record of planktonic foraminifera thus offers a unique possibility to ground truce molecular phylogenies, the methods that produce them and the assumptions and processes on which the methods are based. We will present a comparison between fossil phylogeny of one monophylum of planktonic foraminifera with corresponding SSU rDNA phylogenies, showing unequivocal direct evidence for long-branch artefacts, enormous differences in substitution rates and incongruent tree topologies, underscoring the potential of foraminifera as a model organism in phylogenetics.

PLGW01 5th September 2007
10:00 to 10:20
Explicit bounds for the stability of maximum likelihood trees

The application of the maximum likelihood framework for inferring phylogenetic trees is based on the adoption of an explicit evolution model on which depend both the analysis of sequence evolution and the computation of the likelihood of a tree. Since this computation is based on the observed nucleotides, which are only in finite number, the likelihood has a random component, possibly large, and the robustness of the inferred tree has to be assessed.

Felsenstein's bootstrap test, along with the subsequently developed bootstrap-based tests, is the most commonly used test of reliability of a tree. However, it discards the relation between the size of the data, the number of species in the study and the stability of the phylogeny. The bootstrap is also based on resampling with replacement, which can be time consuming especially for large sets of data.

We propose to bound the variability of the empirical likelihood around its true value, for a given phylogeny. We also bound the probability of a phylogeny being better than another one "just by chance" when in reality it is worse. These bounds, obtained with measure concentration tools, account for the number of species and the number of nucleotides. In particular, they give the minimum number of nucleotides needed to achieve a given confidence level, when working with a given number of species.

Finally, to illustrate the behaviour of our method, we compare it to bootstrap on a toy example.

PLGW01 5th September 2007
10:20 to 10:40
D Metzler Insertion-deletion models for dequence evolution and Bayesian sampling methods for multiple alignment

In current practice the estimation of a phylogenetic tree is a two-step procedure: first a multiple alignment is computed and subsequently a phylogenetic tree is reconstructed, based on the alignment. However, it is well known that the alignment and the tree reconstruction problem are intertwined. Thus, it is of great interest to estimate alignment and tree simultaneously. A precondition for this is the availability of biologically plausible sequence evolution models which are compatible with efficient alignment algorithms. We discuss a variant of the Thorne-Kishino-Felsenstein model, having equal rates of insertions and of deletions of sequence fragments, for more than two sequences related by a phylogenetic tree. Finally, we discuss the perspectives for statistical multiple alignment based on this type of model.

Most parts of this talk are based on joint work with Roland Fleissner, Arndt von Haeseler, Ana Arribas-Gil, and Anton Wakolbinger.

Related Links

PLGW01 5th September 2007
10:40 to 11:00
Indentifiability of the GTR=gamma substitution model of DNA evolution

The GTR+Gamma+I model of DNA substitution is probably the most widely used model in practical phylogenetic inference. However, a valid proof of the identifiability of its parameters (tree topology, root distribution, rate matrix, proportion of invariable sites, and alpha shape parameter) has not been given.

After indicating the gaps in the previously-published proof of the GTR+Gamma+I model, and discussing various mathematical approaches for establishing identifiability results, a sketch of a recent proof of the identifiability of GTR+Gamma will be given.

This is joint work with Cecile Ane' and John Rhodes.

PLGW01 5th September 2007
11:30 to 12:30
Likelihood calculations on the ancestral recombination graph
PLGW01 6th September 2007
09:00 to 10:00
Model averaging in phylogenies

It is well known that the use of different probabilistic models of evolution may change the results of phylogenetic analyses, and during the last years much research has been devoted to the selection of best-fit models. An alternative to model selection is model averaging or multimodel inference, where all models are simultaneously used to infer different parameters of the substitution process, or the phylogenetic tree (topology and branches) itself. Here I will discuss some preliminary simulations that suggest that model-averaging might not be very useful to estimate substitution parameters, but that might reduce the variance associated to the estimation of tree topology. New software that implements model averaging for phylogenetics will also be introduced.

Related Links

PLGW01 6th September 2007
10:00 to 10:20
FA Matsen Glimpses from the strange world of phylogenetic mixtures

Rates of evolution and evolutionary history may vary across the genome of a given organism. The standard way of modeling the resulting data is a "mixture model" where a given phylogenetic pattern comes from one of a collection of trees with fixed probabilities. In this talk I will focus on recent results (joint with Mike Steel and Elchanan Mossel) which investigate what pattern probabilities are and are not possible when rates of evolution vary on a single topology.

In particular, we have shown that a mixture of two processes on a tree of one topology can produce exactly the same expected site pattern frequencies (i.e. sequence data) as an unmixed process on a different topology. For mixtures of a number of processes the situation is even more dire: in a certain sense the topology is not uniquely determined for a majority of data vectors. I will also present other related results along with the associated mathematics and indicate some interesting future directions for research.

PLGW01 6th September 2007
10:20 to 10:40
Neighbor joining algorithms for inferring phylogenies via LCA-distances

Reconstructing phylogenetic trees efficiently and accurately from distance estimates is an ongoing challenge in computational biology from both practical and theoretical considerations. We study algorithms which are based on a characterization of edge-weighted trees by distances to LCAs ({\em Least Common Ancestors}). This characterization enables a direct application of ultrametric reconstruction techniques to trees which are not necessarily ultrametric. A simple and natural neighbor joining criterion based on this observation is used to provide a family of efficient neighbor-joining algorithms. These algorithms are shown to reconstruct a refinement of the Buneman tree, which implies optimal robustness to noise under criteria defined by Atteson. In this sense, they outperform many popular algorithms such as Saitou\&Nei's NJ. Preliminary experimental results indicate that when executed from an appropriate root taxon, this algorithm provides reconstruction of phylogenies which are competitive with NJ and other common algorithms.

PLGW01 6th September 2007
10:40 to 11:00
Reconstruction of phylogenetic trees with very short branches

One of the central challenges in phylogenetics is the accurate reconstruction of phylogenetic trees from short taxon-sequences. The sequence length required for correct topological reconstruction depends on certain properties of the tree such as its depth and edge-weights. Of special interest are fast converging algorithms, which, under the assumption of a constant lower bound on edge weights, require only polynomial length sequences to guarantee correct topological reconstruction with high probability. However, when the original phylogenetic tree contains very short edges, the required sequence-length may be too long for practical purposes. Moreover, incorrect reconstruction of short edges may (and in practice often does) propagate in the sense that it prevents the correct reconstruction of long edges as well. Therefore, most known reconstruction algorithms assume that the original tree has a fully resolved topology and all its edges are sufficiently long.

In this talk we present a fast converging reconstruction algorithm which avoids this assumption. The algorithm returns a partially resolved topology which contains all sufficiently long edges of the original tree, and no edges which contradict the original topology. It is designed to reconstruct more edges when the input sequences grow longer. Unlike existing fast converging reconstruction algorithms, this method provides a reconstruction guarantee for every taxon-sequence length. This makes it more appropriate for practical use.

PLGW01 6th September 2007
11:30 to 11:50
Nature reserve selection problems

The Nature Reserve Selection Problem (NRS) is a problem that arises in the context of conservation biology. Subject to budgetary constraints, the problem is to select a set of conservation areas to preserve so that the phylogenetic diversity of the species contained within those areas is maximized. It was recently shown that NRS is NP-hard. In this talk, we give a tight polynomial-time approximation algorithm for NRS and describe a closely-related problem for which little is known.

PLGW01 6th September 2007
11:50 to 12:10
Computing phylogenetic diversity for split systems

In conservation biology, it is important to measure, predict, and preserve biodiversity as many species are facing extinction. In 1992, Faith proposed measuring the diversity of a collection of species in terms of their relationship on a phylogenetic tree, and using this information to identify collections of species with high diversity. Here we are interested in some variants of the resulting optimisation problem that arise when considering species whose evolution is better represented by a network rather than a tree. More specifically, we consider the problem of computing phylogenetic diversity relative to split systems. We show that for general split systems, this problem is NP-hard. In addition we provide some efficient algorithms for some special classes of split systems, in particular presenting an optimal O(n) time algorithm for phylogenetic trees and an O(n log n + nk) time algorithm for circular split systems.

Keywords: Phylogenetic tree, Phylogenetic network, Phylogenetic diversity, Biodiversity conservation, Split systems

PLGW01 6th September 2007
12:10 to 12:30
K Hartmann Prioritising species for biodoversity conservation

Conservation organisations are often faced with the problem of allocating limited funds to the conservation of many needy species. The ultimate goal is to ensure that the expected future biodiversity is as high as possible. Other factors that need to be considered are the costs of conserving different species and the expected increase in their survival probability that conservation would provide them. The Noah's Ark Problem (NAP) is a mathematical framework that formalises this problem and uses phylogenetic diversity as a measure of biodiversity.

Whilst much mathematical work has been done on the NAP actual applications of it to real problems has been limited. In this talk I consider some common criticisms of the NAP. I will also briefly introduce a new software package I have developed that bundles common algorithms for the NAP together with a graphical interface.

Related Links

PLGW01 6th September 2007
14:00 to 15:00
Big trees: challenges and opportunites in the phylogenetic analysis of large data sets

It is becoming easier to obtain genetic sequences. Many of these will be used for phylogenetic analyses. But these analyses have traditionally been hampered by the size of the data (usually, the number of sequences) that can be analysed satisfactorily. In this talk, I consider three problems that have arisen as a consequence of having "too much" data. First, supertree reconstruction may provide some relief with large datasets, but what is the best way of reconstructing supertrees? I explore the possibility of constructing maximum-likelihood estimates of supertrees, and discuss some of the features that such trees may possess. Second, I consider how we can break up datasets to make satisfactory inferences. Finally, I look at the new generation of automatic sequencing machines, and the promise of large amounts of data. I outline some computational approaches that may be useful in analysing this type of data. This talk provides very little in the way of concrete results but a great deal of informed speculation. It is not my intention to instruct, but to stimulate discussion.

PLGW01 6th September 2007
15:30 to 15:50
BC Emerson Mitochondrial DNA heteroplasmy and recombination in a reptilian secondary contact zone

Inheritance of mitochondrial DNA in animals is thought to be both strictly maternal and non-recombining. Nevertheless, in the last decade evidence for recombination of mitochondrial DNA has been observed in several animal species, but few reports come from natural populations and most of the time the frequency of individuals possessing more than one mtDNA type is very low. This talk presents evidence for extensive heteroplasmy and mitochondrial recombination in a region of secondary contact of a lizard species (Lacerta lepida). Individuals at the centre of the contact zone possess heteroplasmic point mutations at the mtDNA cytochrome b gene. The heteroplasmic condition was further tested by cloning 20 individuals at the centre of the zone and sequencing 8 to 10 clones per individual. All individuals tested had recombinant mtDNA sequences, often more than one. The results suggest that mtDNA recombination was derived from the fusion of the maternal and paternal mtDNAs due to paternal leakage, which is thought to be more frequent in hybrid zones. To our knowledge this is the first study reporting such extensive heteroplasmy and recombination in the mtDNA genome in natural populations of a species with typical maternal mtDNA inheritance. Our findings suggest that mtDNA may recombine regularly and that secondary contact zones, where highly divergent molecules come into contact, may provide natural laboratories for studying this phenomenon.

PLGW01 6th September 2007
15:50 to 16:10
Generalising from consensus to supertree methods

There have been a number of attempts to generalise from consensus methods to supertree analogues that are equivalent to the consensus method in the consensus case and which preserve desirable properties of the method beyond consensus. Recently, Cotton and Wilkinson (2007) used a characterisation of the classical majority-rule consensus tree in terms of its minimisation of the sum of consensus tree to input tree distances as a basis for generalisation to the supertree case. Here I use the same general approach to generalise from strict, loose (= semi-strict) classical and reduced consensus to the corresponding supertree methods. Not all properties of the consensus methods can be simultaneously generalised beyond consensus and not all properties are desirable in the supertree case. Useful generalisation requires that we distinguish between those consensus properties that are 'essential' or 'most desirable' beyond consensus and those that seem to be merely a consequence of the special case of consensus. The loose supertree method is compared to other supertree methods, particularly to an alternative generalisation of the semi-strict consensus due to Goloboff and Pol (2002) which, it is argued, preserves consensus properties that are unhelpful more generally.

Cotton, J. A., and M. Wilkinson. 2007. Majority-rule supertrees. Systematic Biology 56: 445-452.

Goloboff, P. A., and D. Pol. 2002. Semi-strict supertrees. Cladistics 18: 514-525.

PLGW01 7th September 2007
09:00 to 10:00
Mapping human genetic ancestry
The human genome is a mosaic with respect to its evolutionary history. Based on a phylogenetic analysis of 23,210 DNA sequences alignments from human, chimpanzee, gorilla, orangutan and rhesus, we present a map of human genetic ancestry. For about 23% of our genome we share no immediate genetic ancestry with our closest living relative, the chimpanzee. This encompasses genes and exons to the same extent as intergenic regions. We, conclude that about 1/3 of our genes evolved as human specific lineages before the differentiation of human, chimps and gorillas took place. This explains recurrent findings of very old human specific morphological traits in the fossils record, which predate the recent emergence of the human species about 5 million years ago. Furthermore, the sorting of such ancestral phenotypic polymorphisms in subsequent speciation events provides a parsimonious explanation why evolutionary derived characteristics are shared among species that are not each other's closest relatives.
PLGW01 7th September 2007
10:00 to 10:20
Consistency of phylogenetic tree search algorithms based on the balanced minimum evolution principle

In phylogenetics, it is common practise to infer evolutionary or phylogenetic trees by searching tree-space, that is, searching through the collection of all possible phylogenetic trees on a given set of species or taxa using topological rearrangements and some optimality criterion. Recently, such an approach called the balanced nearest neighbor interchange (BNNI) algorithm was introduced, which is based on the balanced minimum evolution (BME) principle. This algorithm takes as input a distance matrix and a putative phylogenetic tree on a given set of species. It modifies this tree using nearest neighbor interchange (NNI) operations in such a way that the total BME length of the resulting tree decreases relative to the distance matrix. This process is repeated until a tree with (locally) shortest BME length is found. Guided by numerous computer simulations it has been conjectured that the BNNI algorithm is consistent, that is, when it is applied to an input distance that is a tree-metric, then it will always converge to the (unique) tree corresponding to that metric. To date this conjecture remains open. Here, however, we prove that a related algorithm | which we call the BSPR algorithm for short | that results from replacing NNI operations by the more general subtree prune and regraft (SPR) operations in the BNNI algorithm is, in fact, consistent. Moreover, we show that even if the input distance matrix contains (small) errors relative to the tree-metric, then the BSPR algorithm will still return the corresponding tree.

PLGW01 7th September 2007
10:20 to 10:40
Optimal and hereditarily optimal realisation of metric spaces

Given a finite set X and a metric d on X, a realization of (X,d) is a weighted graph G=(V,E,w) such that X is a subset of V and the distance d(x,y) between any pair of elements in X is the shortest distance between x and y in G. A realization such that the total edge weight of G is minimal is known as an optimal realization. It is well-known that if there exists a tree which is a realization of (X,d) then this tree is the unique optimal realization. In general however, optimal realizations are not trees, not necessarily unique, and notoriously hard to find. It has been suggested by Andreas Dress that optimal realizations may be possible to find by deleting edges in a larger graph known as the hereditarily optimal realization of the given metric, which is unique and can be explicitly calculated. Both optimal and hereditarily optimal realizations can be useful concepts in phylogenetics, especially when studying e.g. plants or viruses, where a phylogenetic tree is not necessarily the best representation of the evolutionary relationships.

We will outline some recent progress on this problem, showing both that optimal realizations can be found within hereditarily optimal realizations for all metrics on less than five elements, but also that there exist counterexamples, which in turn give rise to new questions.

Related Links

PLGW01 7th September 2007
10:40 to 11:00
Properties of the TBR metric

Tree rearrangement operations are used in computational biology as a measure of the similarity between two binary phylogenetic trees on the same leaf set. For unrooted trees, the three operations of primary interest are nearest-neighbour interchange (NNI), subtree prune-and-regraft (SPR) and tree bisection-and-reconnection (TBR). All three of these have been shown to induce metrics on the set of unrooted binary phylogenetic trees.

The unit-neighbourhood of a tree T is the set of all trees that can be obtained by performing exactly one tree rearrangement on T. For both NNI and SPR, the size of this unit-neighbourhood is independent of the shape of T, and depends only on the size of the leaf set. The same is not true for TBR however, although both upper and lower bounds are known.

We present a recursion for calculating the size of the TBR unit- neighbourhood for any tree in the space of unrooted binary phylogenetic trees with n leaves. We also use this to establish improved upper and lower bounds on this size, and characterise those trees for which the given upper bound is sharp.

PLGW01 7th September 2007
11:30 to 12:30
Stars, nets, and the "war on error": Six mathematical challenges
PLG 11th September 2007
10:00 to 12:00
Informal discussions on Phylogenetic networks
PLG 11th September 2007
14:00 to 15:00
Some Monte Carlo methods for phylogenetic problems

In this chalk and talk seminar I will describe a generic scheme for exploiting multiple processors in a single long MCMC simulation. The method is likely to be useful in particular for problems with computationally intensive likelihood evaluations, or likelihood approximation schemes, such as MCMC-ABC. Such problems appear nowadays in the context of finite sites substitution models and models of ancestry requiring branching genelogies (such as host-parasite models, and models of recombination).

I will describe a second scheme for MCMC, in which the user makes no full likelihood evaluations, but nevertheless gets the exact same samples they would have got had they made full likelihood evaluations. The analysis is based on estimating the separation time of coupled Markov chains, one for the approximate MCMC, and one for the exact.

So far these ideas have been implemented in statistical applications in geoscience, but they are quite generic.

PLG 17th September 2007
10:00 to 12:00
Informal discussions on Phylogenetic networks
PLG 18th September 2007
14:00 to 15:00
Taxonomy and Phylogenetics in Metagenomics

Metagenomics is the study of microbial communities using DNA sequencing technologies. Key questions are determining and comparing the taxonomical content of different samples and also their "functional content". This talk will start with a brief introduction to the field. We will then discuss one approach to the problem that uses database comparison and tree-based algorithms to place DNA sequence reads into the NCBI taxonomy, as implemented in our metagenomic analysis tool MEGAN. We will then discuss the problem of comparing different metagenomic datasets and ask how one might best include phylogenetic methods to identify species in a sample.

PLG 25th September 2007
14:00 to 15:00
Recovering reticulation: a theoretical algorithm for a practical problem

Reticulate (non-tree-like) evolution is a fundamental process in the evolution of certain groups of species. This process results in species being a composite of DNA regions derived from different ancestors. Consequently, conflicting signals in a data set may not be the result of sampling or modelling errors, but due to the fact that reticulation has played a role in the evolutionary history of the species under consideration. Such species include certain birds and plants.

Assuming that our initial data set is correct, a fundamental problem for biologists is to compute the minimum number of reticulation events that explains this set. This smallest number sets a lower bound on the number of such events and provides an indication of the extent that reticulation has had on the evolutionary history of a collection of present-day species.

In this talk, we focus our attention on this problem for when the initial set consists of two (evolutionary) trees. This may seem rather special, but there are several reasons for this. Firstly, the problem is NP-hard even when the initial set consists of two such trees. Secondly, we are interested in finding a general solution rather than one that is restricted in some way. Lastly, the problem for when the initial data set consists of binary sequences can be interpreted as a sequence of two-tree problems.

PLG 2nd October 2007
14:00 to 15:00
Phylogenetic problems via telescopic lens

I will outline a few examples where looking at phylogenetic problems as specific examples of more general problems in statistics, statistical physics, information theory and other fields led to new insights.

PLG 9th October 2007
14:00 to 15:00
Reconstruction of certain normal networks from the genomes at leaves

A network N is a rooted acyclic directed graph. A base-set X for N is a subset of vertices including the root (or outgroup), all leaves, and all vertices of outdegree 1. A simple model of evolution is considered in which all characters are binary and in which back-mutations occur only at hybrid vertices. It is assumed that the genome is known for each member of the base-set X. If the network is known and is assumed to be "normal," then it is proved that the genome of every vertex is uniquely determined and can be explicitly reconstructed. Under additional hypotheses involving time-consistency and separation of the hybrid vertices, the network itself can also be reconstructed from the genomes of all members of X. An explicit polynomial-time procedure is described for performing the reconstruction.

PLG 11th October 2007
11:30 to 13:00
M Steel & V Moulton & D Huson Phylogenetics: interactions between mathematics and evolution
PLG 16th October 2007
14:00 to 15:00
On the identifiability of substitution models

A prerequisite for the consistency of model-based phylogenetic methods is that the underlying model be identifiable from the joint distribution of states at the leaves.

This talk will focus on the identifiability of substitution models for tree inference, discussing some mathematical approaches to obtaining identifiability results and some analysis of their strengths and weaknesses.

In the final part of the talk, there will be some mention of the more technical details of proving the identifiability of the GTR+Gamma model, some suggestions of open problems in this area, and some indication of the direction of current work on the covarion model in progress at INI.

PLG 30th October 2007
14:00 to 15:00
LA Szekely Phylogeny reconstruction, distinguishing and decision problems
A widely-studied model for generating binary sequences is to `evolve' them on a tree according to a symmetric Markov process. This model is known as the Cavender-Farris-Neyman model in phylogeny, and the `symmetric binary channel' and the `symmetric 2-state Poisson model' in other areas. The CFN model provides a simple model for the evolution of purine-pyrimidine sequences. The significance of this simple model is, that phenomena shown for the CFN model often extend to more realistic models of sequence evolution. The abstract phylogeny reconstruction problem is telling the true (model) tree from the generated sequences with high probability (whp). We show that under the CFN model distinguishing the true tree from a false one whp, using sequences generated on the true tree, is substantially `easier' (in terms of the sequence length needed) than determining the true tree whp. On the other hand the decision problem whether an input tree is true or false whp is not `easier' than reconstructing the true tree whp. This is joint work with E. Mossell and M. A. Steel.
PLG 6th November 2007
14:00 to 15:00
Walking through tree-space and stopping in time

We are presenting a new quartet and nearest neighbour interchange based tree reconstruction method, that overcomes some disadvantages of the original TREE-PUZZLE approach. Finally, we suggest a simple method when to stop the search.

(joined work with Vinh Le Sy, Minh Bui Quang, Heiko Schmidt)

PLG 13th November 2007
14:00 to 15:00
Worst-case optimal approximation algorithims for maximising triplet consistency within phylogenetic networks

This article concerns the following question. Assuming that we restrict the set of phylogenetic networks to some subclass, what is the maximum value of 0 <= p <= 1 such that for every input set T of rooted triplets, there exists some network N(T) from the subclass such that at least p|T| of the triplets are consistent with N(T)? Here we prove that the set containing all triplets (the full triplet set) in some sense defines p, and moreover that any network N achieving a ratio p' for the full triplet set can be converted in polynomial time into an isomorphic network N'(T) achieving ratio >= p' for an arbitrary triplet set T. The result also holds for weighted triplet sets. We demonstrate the power of this technique for the field of phylogenetics by giving worst-case optimal algorithms for the set of level-1 phylogenetic networks, improving considerably upon the 5/12 ratio obtained recently by Jansson, Nguyen and Sung. For level-2 phylogenetic networks we show that p >= 0.61

PLG 20th November 2007
14:00 to 15:00
Constructing level-2 phylogenetic trees from triplets

Jansson and Sung showed that, given a dense set of input triplets T (representing hypotheses about the local evolutionary relationships of triplets of species), it is possible to determine in polynomial time whether there exists a level-1 network consistent with T, and if so to construct such a network. Here we extend this work by showing that this problem is even polynomial-time solvable for the construction of level-2 networks. This shows that, assuming density, it is tractable to construct plausible evolutionary histories from input triplets even when such histories are heavily non-tree like. This further strengthens the case for the use of triplet-based methods in the construction of phylogenetic networks. We implemented the algorithm and applied it to yeast data.

PLG 27th November 2007
14:00 to 15:00
Hao Bailin's whole-genome based phylogenies

Recently, Hao Bailin suggested a method for deriving whole-genome based phylogenies from k-word statistics that --- in contrast to previous attempts --- really seems to work. In the seminar, I will describe his method and the results he could produce as well as extensions that are being currently worked out jointly with Hao Bailin and Alberto Apostolico,

PLG 3rd December 2007
17:00 to 18:00
Phylogenetic Nets: Examples and Theoretical Results
Phylogenetic nets have proved a useful tool in phylogenetic analysis highlighting problematic relationships, alternative solutsion, and even reticulate evoltuion. In the lecture I will provide examples and outline some relevant theoretical results.
PLG 4th December 2007
14:00 to 15:00
New methods for estimating language evolution

Languages evolve through genetic descent, but also through borrowing, and distinguishing between the two can be difficult due to a number of factors. This talk will describe the approach that our group is doing to estimate evolutionary histories in the presence of borrowing, and present our analysis of Indo-European. I will also present a simulation study of langauge evolution, showing how methods for reconstructing trees perform in the presence of borrowing.

PLGW04 6th December 2007
13:00 to 14:00
P Lockhart Phylogenetic models and the origins of chloroplasts
PLGW04 6th December 2007
14:00 to 15:00
Algorithm design for large-scale phylogenetic analysis
PLGW04 6th December 2007
15:30 to 16:30
Phylogenetic models and algebra
PLGW04 6th December 2007
16:30 to 17:30
Phylogenetic combinatorics: analysing branching patterns in evolutionary trees
PLG 11th December 2007
14:00 to 15:00
Approximating phylogenetic tree distances

Many popular distances between phylogenetic trees are difficult to calculate. These include the subtree-prune-reconnect (SPR) and the tree-bisection-reconnection (TBR) distances. We will survey the complexity results for these distances and discuss recent efforts to give approximation algorithms to these important metrics.

PLGW03 17th December 2007
10:00 to 11:00
The tree of life: algorithmic and software challenges

The Tree of Life initiative -- to reconstruct the evolutionary history of all organisms -- is the computational grand challenge of evolutionary biology. Current methods are limited to problems several orders of magnitude smaller and also fail to provide sufficient accuracy at the high end of their range.The Cyberinfrastructure for Phylogenetic Research (CIPRes) project, whcih is funded by a $11.6M Information Technology Grant from the US National Science Foundation, is a multi-institutional project whose goal is to help develop the computational infrastructure for evolutionary biologists so that they can analyze large datasets. In this talk, I will describe the work done by the CIPRES project towards these objectives, and will address some of the interesting and pressing problems that will remain.

Related Links

PLGW03 17th December 2007
11:30 to 11:50
Seeing the wood for trees: analysing multiple alternative phylogenies

Phylogenetic analysis very commonly produces several alternative trees for a given set of taxa. This can be due to having several, possibly conflicting, sources of information (such as different sets of orthologous genes) or as a result of the nature of the analytic method itself (for example, sampling a large number of trees from a posterior distribution). Faced with several alternative trees, biologists typically seek a single concensus phylogeny -- thereby discarding important information.

We review some recent approaches to dealing with alternative phylogenies that arise from a multivariate statistical view-point. We go on to describe a novel approach for comparing alternative phylogenies that is based on the idea of a "tree-of-trees": trees with similar topologies are clustered together in the same way that a single phylogeny clusters species with similar DNA sequences. The tree-of-trees is constructed in an analogous way to a most parsimonious tree for DNA data. We illustrate these ideas by applying our method to some simple data sets.

PLGW03 17th December 2007
11:50 to 12:10
Performance of new invariant methods for phylogenetic reconstuction

An attempt to use phylogenetic invariants for tree reconstruction was made at the end of the 80s and the beginning of the 90s by Lake and Cavender-Felsenstein. However, the efficiency of the methods based on invariants was not obvious ([Huelsenbeck, 1995]), probably because they only used few generators of the set of phylogenetic invariants. Results and simulation studies about the performance of a new method of phylogenetic reconstruction based on invariants ([Casanellas et al., 2005]) are presented in this talk. These simulations show that it is very competitive and highly efficient, especially for nonhomogeneous phylogenetic trees. Recent improvements for this method using algebraic geometry techniques are also discussed. Part of these results has appeared in [Casanellas and Fernandez-Sanchez, 2007].

References: [Casanellas and Fernandez-Sanchez, 2007] Casanellas, M and Fernandez-Sanchez, J. (2007). Performance of a new invariants method method on homogeneous and nonhomogeneous quarted trees. Mol Biol Evol 24 :288-293.

[Casanellas et al., 2005] Casanellas, M; Garcia, L and Sullivant, S (2005). Catalog of small trees. In Pachter, L. and Sturmfels, B., editors, Algebraic Statistics for computational biology, chapter 15. Cambridge University Press.

[Huelsenbeck, 1995] Huelsenbeck, J. (1995). Performance of phylogentic methods in simulation. Syst. Biol., 44:17–48.

PLGW03 17th December 2007
14:00 to 14:20
Incorporating uncertainty in distance-matrix phylogenetic reconstruction

One approach to estimating phylogenetic trees supposes that a matrix of estimated evolutionary distances between taxa is available. Agglomerative methods have been proposed in which closely related taxon-pairs are successively combined to form ancestral taxa. Several of these computationally efficient agglomerative algorithms involve steps to reduce the variance in estimated distances. However, formal statistical models and methods for agglomerative distance-based phylogenetic reconstruction have not previously been published.

We propose a statistical agglomerative phylogenetic method which formally considers uncertainty in distance estimates and how it propagates during the agglomerative process. It simultaneously produces two topologically identical rooted trees, one tree having branch lengths proportional to elapsed time, and the other having branch lengths proportional to underlying evolutionary divergence. The method models two major sources of variation which have been separately discussed in the literature: noise, reflecting inaccuracies in measuring divergences, and distortion, reflecting randomness in the amounts of divergence in different parts of the tree. The methodology is based on successive hierarchical generalised least-squares regressions. It involves only means, variances and covariances of distance estimates, thereby avoiding full distributional assumptions. Exploitation of the algebraic structure of the estimation leads to an algorithm with computational complexity comparable to the leading published agglomerative methods. A parametric bootstrap procedure allows full uncertainty in the phylogenetic reconstruction to be assessed.

PLGW03 17th December 2007
14:20 to 14:40
Adaptive fast convergence - towards optimal reconstruction guarantees for Phylogenetic trees

One of the central challenges in phylogenetics is to be able to reliably resolve as much of the topology of the evolutionary tree from short taxon-sequences. In the past decade much attention has been focused on studying fast converging reconstruction algorithms, which guarantee (w.h.p.) correct reconstruction of the entire tree from sequences of near-minimal length (assuming some accepted model of sequence evolution along the tree). The major drawback of these methods is that when the sequences are too short to correctly reconstruct the tree in its entirety, they do not provide any reconstruction guarantee for sufficiently long edges. Specifically, the presence of some very short edges in the model tree may prevent these algorithms from reconstructing even edges of moderate length.

In this talk we present a stronger reconstruction guarantee called "adaptive fast convergence", which provides guarantees for the correct reconstruction of all sufficiently long edges of the original tree. We then present a general technique, which (unlike previous reconstruction techniques) employs dynamic edge-contraction during the reconstruction of the tree. We conclude by demonstrating how this technique is used to achieve adaptive fast convergence.

Related Links

PLGW03 17th December 2007
14:40 to 15:00
On the hardness of inferring Phylogenies from triplet-dissimilarities

We considers the problem of reconstructing a phylogenetic tree from triplet dissimilarities, which are dissimilarities defined over taxon-triplets. Triplet dissimilarities are possibly the simplest generalization of pairwise dissimilarities, and were used for phylogenetic reconstructions in the past few years. We study the hardness of finding a tree best fitting a given triplet-dissimilarity table under the "maximal difference" criterion. We show that the corresponding decision problem is NP-hard and that the corresponding optimization problem cannot be approximated in polynomial time within a constant multiplicative factor smaller than 1.4. We also address the issue of best-fit under "maximal distortion", which corresponds to the largest ratio between matching entries in two triplet-dissimilarity tables. We show that it is NP-hard to approximate the corresponding optimization problem within any constant multiplicative factor. On the positive side, we present a polynomial time constant-rate approximation algorithm for this problem.

PLGW03 17th December 2007
15:30 to 15:50
A signal-to-noise analysis of phylogeny estimation by neighbour-joining

The property of fast-convergence describes phylogeny reconstruction methods that, with high probability, recover the true tree from sequences that grow polynomially in the number of taxa. While provably fast-converging methods have been developed, the neighbor-joining (NJ) algorithm of Saitou and Nei remains one of the most popular methods used in practice. This algorithm is known to converge for sequences that are exponential in n, but no lower bound for its convergence rate has been established. To address this theoretical question, we analyze the performance of the NJ algorithm on a type of phylogeny known as a "caterpillar tree." We find that, for sequences of polynomial length in the number of taxa, the variability of the NJ criterion is sufficiently high that the algorithm is likely to fail even in the first step of the phylogeny reconstruction process, regardless of the degree of polynomial considered. This result demonstrates that, for general trees, the exponential bound cannot be improved.

PLGW03 17th December 2007
15:50 to 16:10
M Fischer How many characters are needed to recontruct the true tree?

Maximum Parsimony, which basically favors the tree hypothesizing the fewest number of character state changes, is still one of the most important methods of tree reconstruction. Even so, MP is known to be prone to fail when the underlying phylogenetic tree has some extreme characteristics, as it is the case, for example, in the Felsenstein-zone or when interior edges are very short. While the Felsenstein-zone has already been well-examined over the years, some aspects of the latter problem remained unsolved. We investigated this on a binary unrooted 4-taxon phylogenetic tree with a short interior edge and pending edges of multiple length. In my talk, I will present an optimal branch length of the interior edge in this scenario and I will explain how many characters are at least needed to reconstruct the ‘true’ tree. Furthermore, we investigated some more properties of MP and found that for some trees, MP performs better when only a subset of the given set of taxa is considered. I will show such a phylogenetic tree and explain, for instance, why for a subset of size 1, i.e. a single leaf, this surprising scenario cannot happen.

PLGW03 17th December 2007
16:10 to 16:30
The deep phylogeny problem and its consequences

Some of the most interesting and long-standing evolutionary questions concern relationships among ancient lineages. Molecular sequence data have not really provided the anticipated clear-cut answers to these questions. In general relationships that were cloaked in uncertainty based on traditional comparative anatomy remain uncertain when subjected to molecular approaches. I will present an overview of some of the difficulties associated with estimating relationships among ancient vertebrate lineages based on sequence comparisons.

PLGW03 17th December 2007
16:30 to 16:50
P Lockhart On the nature of heterotachy

Heterotachy describes different lineage specific rates of evolution at individual sites in a multiple sequence alignment. Different underlying processes could contribute to this phenomenon. We discuss the form of heterotachy that is likely to occur in Nature. We show using simulations that under a biologically realistic model of heterotachy tree building can have low reconstruction accuracy. We discuss this problem.

PLGW03 18th December 2007
09:00 to 10:00
Recent progress in phylogenetic combinatorics
PLGW03 18th December 2007
10:00 to 10:20
An exact algorithm for computing the geodesic distance between phylogenetic trees

Most of the used measurements for the distinctness of two phylogenetic trees exclusively evaluate topological criteria. For example, the prevalent Robinson-Foulds-distance counts the number of differing splits between two trees. To present a measure which also incorporates branch lengths information, Billera et al. (2001, Adv. Appl. Math) gave a formal mathematical definition of the space of phylogenetic trees and introduced a metric on this space, the so-called geodesic distance. In this framework, a tree topology with n interior branches corresponds to an n-dimensional nonnegative Euclidean space, and special instances of branch lengths define a point in that space. Two tree topologies with m splits in common share an m-dimensional surface. Therefore, the tree space is connected and Billera et al. have shown that there exists a unique shortest path between every two points in this space, the geodesic path of two trees. The length of this path is a continuous distance measure for phylogenetic trees and knowing the exact progression of the path helps in adressing further questions like computing a consensus tree with branch lengths from a set of phylogenetic trees. We demonstrate how to compute the geodesic path and its length in tree space exactly. The method is worst-case exponential in the number of differing splits between the two trees. However, information about incompatibe splits and branch lengths is utilized to avoid the enumeration of futile paths and to improve the performance.

PLGW03 18th December 2007
10:20 to 10:40
Practical distance computation in the space of phylogenetic trees

There are many different methods for constructing phylogenetic trees from biological data. To either compare one such algorithm with another, or to find the likelihood of a certain tree generated from the data, researchers often compute a distance between trees. In this talk, we present a practical algorithm for computing the geodesic distance, as well as the geodesic path, between two phylogenetic trees in the tree space introduced by Billera, Holmes, and Vogtmann(2001). In doing so, we show that the possible shortest paths between two trees can be represented as chains in a partially ordered set, whose elements are the sets formed by a closure relation on the incompatible edges between the trees. Furthermore, we give a linear time algorithm for computing the geodesic candidate corresponding to each maximal chain in this partially ordered set. Dynamic programming techniques can be used to further prune the search of the partially ordered set for the shortest path.

PLGW03 18th December 2007
10:40 to 11:00
Optimal realisations

For a metric $(X,d)$ a weighted graph $G= (V, E, w)$ is called a realization of $d$ if (i) $X \subseteq V$ (ii) $d_g(x,y) = d(x,y)$ for all $x, y \in X$.

A realization is called {\em optimal} if the sum $\sum_{xy \in E} w(xy)$ is minimal among all the realizations. It is known that to find an optimal realization is NP-hard in general. But for the case of a tree-metric, i.e. a metric coming from a tree, it is the underlying (weighted) tree.

in this talk I will discuss properties of optimal realizations and why they are useful for the area of phyogenetic methods

PLGW03 18th December 2007
11:30 to 11:50
The identifyability of matrix parameters for covarion-like models

A covarion or site-specific-rate-variation model of sequence evolution describes a process in which the substitution rate of each character (nucleotide, codon, amino acid) can speed up or slow down over time, as might be caused by changing functional constraints. Though the model was formulated a decade ago, only in the last few years was it shown that the topology of the evolutionary tree is identifiable for this model. We now establish identifiability of the numerical parameters, such as edge lengths and substitution rates. Necessary technical assumptions are that the number of rate classes be less than the number of character states, the tree relates at least 7 taxa, and numerical parameters are `generic'.

PLGW03 18th December 2007
11:50 to 12:10
Optimising phylogenetic diversity across two trees

We present a polynomial-time algorithm for finding an optimal set of taxa that maximizes the weighted-sum of the phylogenetic diversity across two phylogenetic trees. This resolves one of the challenges proposed at the 'Current Challenges and Problems in Phylogenetics' workshop in the first week of the Phylogenetics programme. It also completely closes the gap between optimizing phylogenetic diversity on one tree, which is known to be in P, and optimizing phylogenetic diversity across three or more trees, which is known to be NP-hard.

PLGW03 18th December 2007
14:00 to 14:20
On the optimality of the neighbour-joining algorithm

The popular neighbor-joining (NJ) algorithm used for phylogenetic tree reconstruction is a greedy algorithm for finding the minimum evolution (ME) tree associated to a dissimilarity map. >From this point of view, NJ is ``optimal'' when the algorithm outputs the tree which minimizes the balanced minimum evolution criterion. We use the fact that the NJ tree topology and the ME tree topology are determined by polyhedral subdivisions of the spaces of dissimilarity maps ${\R}_{+}^{n \choose 2}$ to study the optimality of the neighbor-joining algorithm. In particular, we investigate and compare the polyhedral subdivisions for $n \leq 10$. A key requirement is the measurement of volumes of spherical polytopes in high dimension, which we obtain using a combination of traditional Monte Carlo methods and polyhedral algorithms. Our analysis reveals new insights into the performance of the NJ and ME algorithms for phylogenetic reconstruction. In particular we show that highly unrelated trees can be cooptimal in ME reconstruction, and that NJ optimality regions are not convex. We also conjecture that neighbor joining's ability to recover the ME tree depends on the diameter of the ME tree.

This is joint work with K. Eickmeyer, P. Huggins, and L. Pachter.

PLGW03 18th December 2007
14:20 to 14:40
Encoding phylogenetic trees with weighted quartets

Various methods have been proposed for constructing phylogenetic trees (and networks) that work by trying to piece together quartet trees, i.e. phylogenetic trees with four leaves. Hence, it is of interest to characterise when a collection of quartet trees corresponds to a (unique) phylogenetic tree. Recently, Dress and Erdos provided such a characterisation for binary phylogenetic trees, that is, phylogenetic trees all of whose internal vertices have degree 3. Here we provide a new characterisation for arbitrary phylogenetic trees - or, equivalently, compatible split systems - and discuss some extensions of this result to more general split systems.

This is joint work with Stefan Gruenewald, Katharina Huber, Charles Semple and Andreas Spillner.

PLGW03 18th December 2007
14:40 to 15:00
Identifying Phylogenies: extending the notion of unique

Uniqueness is an extremely strong property. Even if the input data contains no conflicting information, to find that there is exactly one evolutionary tree that is consistent with the data is very unlikely. A property that is somewhat weaker and yet essentially just as good is that of identifiability. In this talk, we introduce and describe some recent work on this property.

PLGW03 18th December 2007
15:30 to 15:50
On (k.l)-leaf powers

We say that, for k>1 and l>k, a tree T is a (k,l)-leaf root of a graph G=(V,E), if V is the set of leaves of T, for all edges xy in E, the distance d(x,y) between x and y in T (i.e., the number of edges of the unique path between the leaves x and y in T) is at most k and, for all non-edges xy outside E, d(x,y) is at least l. A graph G is a (k,l)-leaf power, if it has a (k,l)-leaf root.

This new notion generalises the concept of k-leaf power, which was introduced and studied by Nishimura, Ragde and Thilikos in 2002, motivated by the search for underlying phylogenetic trees: "...a fundamental problem in computational biology is the reconstruction of the phylogeny, or evolutionary history, of a set of species or genes, typically represented as a phylogenetic tree..." The species occur as leaves of the phylogenetic tree.

Recently, a lot of work has been done on k-leaf powers and roots as well as on their variants phylogenetic roots and Steiner roots. For k=3 and k=4, structural characterisations and linear time recognition algorithms of k-leaf powers are known, and, recently, a polynomial time recognition of 5-leaf powers was given. For larger k, the recognition problem is open. We give structural characterisations of (k,l)-leaf powers, for some k and l, which also imply efficient recognition of these classes, and in this way we also improve and extend a recent paper from 2006 by Kennedy, Lin and Yan on strictly chordal graphs and leaf powers.

PLGW03 18th December 2007
15:50 to 16:20
Phylogenetics of medieval manuscripts

There are remarkable similarities between the error-prone copying of DNA sequences and the incorporation of errors by scribes as they copied manuscripts in the days before printing. Since the mid-19th century, manuscript scholars have used the distribution of ‘common errors’ to try to reconstruct the copying history of sets of manuscripts in a very labour-intensive way. We have shown that phylogenetic tree-building programs can be a powerful tool for scholars to analyse manuscript copying history much more rapidly than before. Of particular interest is the phenomenon of ‘contamination’, where a scribe would use more than one version of a text in constructing a copy. This has obvious parallels to lateral gene transfer and recombination. We are also interested in analysing DNA from the parchment itself to provide insights into medieval husbandry patterns as well as manuscript provenance.

PLGW03 19th December 2007
09:00 to 10:00
D Aldous Some broad questions about the tree of life

Stepping away from technical issues in tree reconstruction from molecular data on extant species, I will describe broader questions that can be asked about the Tree of Life, and our attempts to devise a framework for their study. Here are three such questions. (i) It is appealing to say that any readily identifiable clade arose through adaptive radiation. Does data support this statement? What would trees look like if they didn't arise through adaptive radiation? (ii) For some clades (e.g. horse) the number $n$ of extant species is much smaller than the maximum (over past time) number $M$ of coexisting species. Is this remarkable? How large should we expect $M/n$ to be for a typical clade? (iii) Given the true phylogenetic tree on the extant species of a clade, but no information about extinct species, can one make any (even crude) estimates of past speciation and extinction rates?

More technical questions concern the likely shape of trees on higher-order taxa, or the likely size of fluctuation in time series of numbers of such taxa.

Based on joint work with Lea Popovic and Maxim Krikun.

Related Links

PLGW03 19th December 2007
10:00 to 10:20
Ranked tree shapes, shuffles, and a new test for neutrality

Phylogenetic tree shapes are an established object of study, and have proven useful for testing macroevolutionary hypotheses. However, tree shapes are typically considered without branch length information, which eliminates a considerable amount of useful data. We propose the use of ``ranked'' phylogenetic tree shapes-- tree shapes such that the order of diversification events is specified-- to test macroevolutionary hypotheses. We describe two classes of null models on ranked tree shapes, and then propose a test for deviation from these models. In particular, our test rejects the null hypothesis when relative rates of diversification for two daughter clades change over time. Thus, we consider the test as looking for evidence of ``bursting'' diversification on one hand, or ``refractory'' diversification on the other. Our test rejects neutral evolution for a sample of sequences of the Hepatitis C virus which will be discussed.

PLGW03 19th December 2007
10:20 to 10:40
B Faller Distribution of phylogenetic diversity under random extinction

Phylogenetic diversity is a measure for describing how much of an evolutionary tree is spanned by a subset of species. If one applies this to the (unknown) subset of current species that will still be present at some future time, then this 'future phylogenetic diversity' provides a measure of the impact of various extinction scenarios in biodiversity conservation. We have studied the distribution of future phylogenetic diversity under a simple model of extinction (a generalized 'field of bullets' model). In this talk, I present our finding that the distribution of future phylogenetic diversity converges to a normal distribution as the number of species grows. This result is significant for biodiversity conservation.

PLGW03 19th December 2007
11:10 to 11:30
10 Men and 1 woman?

Recently Patterson et al. 2006 speculated about a complex speciation of humans and chimpanzees. We re-analysed their data and will discuss a more parsimonious solution. Our interpretation of the data is supported by remains of behavioral traits in isolated human populations and in behavioral aspects of chimpanzees.

PLGW03 20th December 2007
09:00 to 10:00
A Roger Markov models of protein evolution: Lets get real!

Widely-used models of amino acid sequence evolution, such as the WAG and JTT models, are based on empirically-derived rate matrices that capture the exchangeability and frequencies of amino acids observed in large databases of alignments. For phylogenetic estimation, these are typically combined with some kind of parametric model for differeing rates amongst sites. However, these models ignore important effects in the evolution of real protein sequences such as changes in the rate of evolution at a position over the tree, changes in the amino acid frequencies in different lineages and site-specific amino acid exchangeabilities and frequencies that depend on the 3D structural and functional context of the site in the protein. Here I provide an overview of several models we have developed to capture these biologically important effects. We have addressed the problem of rate changes over lineages by implementing a general covarion model for amino acid sequences in the maximum likelihood framework. We have approached the problem of site-specific substitution processes in two ways. First we have implemented a “mixture of frequencies” model that captures the 3 dominant amino acid frequency patterns in a set of 21 taxon-rich alignments. Second, we have implemented an independence energy model that uses statistical energy potentials from the 3D structure of the protein of interest to construct site-specific substitution matrices. Likelihood ratio tests indicate that all of three of these models have significantly better fit to real protein sequence data than the standard empirical amino acid substitution models. Furthermore, some of these more biologically realistic substitution models avoid phylogenetic estimation biases that can arise if these important effects are ignored.

Related Links

PLGW03 20th December 2007
10:00 to 10:20
A phylogenetic definition of structure and further animations on the sequence space

We want to show pictures and animations illustrating possible novel ways to model sequence evolution, as well as discussing the importance of a definition of structure. Our phylogenetic definition of structure consists of three aspects: The substitution matrix, a neighbourhood system and the phylogenetic tree. The substitution matrix specifies the evolutionary dynamics of nucleotide evolution. However, the matrix is influenced by the neighbourhood system, that defines the interactions among sites in a DNA sequences. The phylogenetic tree introduces an additional dependency pattern in the observed sequences. To take all this aspects into account, we have developed SISSI (Simulating Site-Specific Interactions). SISSI simulates the evolution of a nucleotide sequence along a phylogenetic tree taking into account user defined site-specific interactions.Thus, it is possible to mimic sequence evolution under structural constraints. Based on illustrative examples and simulations, we will discuss the different definitions of structure, for example single minimum free energy and consensus structure, as well as our definition of structure from a phylogenetic point of view. This will be demonstrated with different phylogenetic animations.

PLGW03 20th December 2007
10:20 to 10:40
Selecting genomes for reconstruction of ancestral genomes

It is often impossible to sequence all descendent genomes to reconstruct an ancestral genome. This fact leads to studying the genome selection for reconstruction problem. The talk covers our recent work in reconstruction accuracy analysis and greedy algorithm for the problem of selecting a given number of genomes for the best reconstruction.

PLGW03 20th December 2007
10:40 to 11:00
MIP models for phylenetic reconstruction under minimum evolution

Molecular phylogenetics provides several criteria to select a phylogeny among plausible alternative ones. Usually, such criteria can be expressed in terms of objective functions, and the phylogenies optimizing them are referred as optimal. One of the most important criteria is Minimum Evolution (ME) which states that the optimal phylogeny for a given set of organisms is the one whose sum of the edge weights is minimal. Finding the phylogeny satisfying the minimum evolution criterion involves the solution of an optimization problem, called Minimum Evolution Problem (MEP), notoriously NP-Hard. Herewith, we introduce a number of mixed integer programming models and provide possible cuts and lower bounds for the optimal value.

PLGW03 20th December 2007
11:30 to 11:50
Challenges in modeling context-dependent evolution

Over the past decade, several proposals have been made for relaxing the assumption that sites evolve independently in phylogenetic inference, for example by modeling base-pair evolution, CpG effects and certain cases of site-dependent evolution. In this work, we model dependencies between sites by allowing the evolution at a site to depend upon its “context of evolution”, as given by its two immediate neighboring sites. Specifically, each site is assigned a specific evolutionary nucleotide model depending on the identities of its neighbors. To efficiently evaluate the corresponding models, we employ a “data augmentation” approach in an MCMC framework. In this presentation, we introduce the above approach and pay special attention to 2 potential problems. The first originates from our model’s implicit assumption that the neighboring sites remain fixed along each branch, which may be restrictive, especially for long branches. To accommodate this, we add intermediate (single-child) nodes on longer branches to explicitly allow for the neighboring sites to evolve along such branches. We evaluate the importance of these intermediate “ancestral sequences” by calculating the appropriate Bayes factor via thermodynamic integration. The second problem lies in the drastic increase in parameters that may arise when taking into account every possible context of evolution. We discuss efficient Bayesian approaches for model building which acknowledge the trade-off between added number of parameters and increased model fit.

PLGW03 20th December 2007
11:50 to 12:10
A simple model for a complex world

Models of nucleotide substitution have enabled accurate reconstruction of many evolutionary relationships. The majority of commonly used models are special cases of the General Time Reversible (GTR) model with rate variation across sites described by a Γ-distribution (+Γ) and by a proportion of invariant sites (+I). Unfortunately, GTR+Γ+I family models rarely describe all the variation present in real sequence data. Many cases of complex phenomena, such as changes in GC content and heterotachy, have been identified, and several studies have shown that analyses produced from simple models can lead to erroneous conclusions.

I introduce a simple general hidden Markov model of nucleotide substitution that allows many evolutionarily important factors to vary both spatially between sites in the alignment and temporally through the tree. The new model may be considered a generalisation of the covarion model of Tuffley and Steel, which enables the rate of evolution, nucleotide frequencies, and the transition/transversion rate ratio to vary according to a hidden process. I describe the new model’s relationship to existing models and use it to investigate spatial and temporal variation in real data. The results of these analyses show that all of the factors examined have the potential to vary both temporally and spatially. Furthermore, the new model provides substantially better fit to sequence data than GTR+Γ+I models.

PLGW03 20th December 2007
14:00 to 14:20
Evolution of Chloroplasts

The classical, text-book view is that chloroplasts in plants and algae arose from a single endosymbiotic event where an early eukaryote engulfed a photosynthetic cyanobacterium. This cyanobacterium became enslaved, giving rise to the three classes of chloroplast: those in green algae and plants, those in red algae and those in glaucophytes. By analysing the phylogenetic history of all retained, chloroplast genes and with comparison to cyanobacterial genomes, we present evidence showing that the history of the chloroplast may be more complex, and that more than one endosymbiosis event cannot be excluded.

RER Nisbet (Cambridge), M. Spencer (Liverpool), C. J. Howe (Cambridge), P.J. Lockhart (Massey).

PLGW03 20th December 2007
14:20 to 14:40
Untangling tanglegrams

Given two phylogenetic trees on the same leaf set, how best can they be displayed? If the trees are presented "sideways" with their roots on the outside and leaves in the middle, we can join the corresponding leaves in the trees by edges. Call this diagram a tanglegram. The crossing number of a tanglegram is the number of leaf-leaf edges that cross. Work by Dwyer and Schreiber `04 (and rediscovered by Valiente and co-workers, WABI `07) show that the tanglegram with optimal crossing number can be found in polynomial time if one tree is fixed. We present several new results for the more general case when both trees can be rearranged to give the tanglegram with the minimal crossing number.

PLGW03 20th December 2007
14:40 to 15:00
Hybridisation in non-binary trees

Molecular phylogenetics, the study of reconstructing evolutionary trees, is a well-established field of scientific endeavor. However, in certain circumstances, evolution is not best represented by trees. For example, a comparison of gene trees representing a set of taxa and reconstructed for different genetic loci sometimes reveals conflicting tree topologies. These discrepancies are often due to reticulation events such as horizontal gene transfer and hybridization. Assuming that the initial data set is correct, a basic problem is to compute the minimum number of reticulation events to explain the evolutionary history for a set of present-day species.

Recently, it has been shown that a theoretical framework - based on the notion of agreement forests - can be used to calculate the number of reticulation events for two rooted binary phylogenetic trees. Together with two reduction rules, this leads to an algorithm that runs efficiently for many biological examples and gives the exact solution. However, for many data sets, the reconstructed trees are not fully resolved. In this talk, we extend this framework to non-binary trees. Further, we show that the subtree prune and regraft distance between two non-binary trees, which is often used to model reticulate evolution, can be approached in a similar way.

PLGW03 20th December 2007
15:30 to 15:50
Phylogenetic networks

This will be a report on recent work on phylogenetic networks.

Related Links

PLGW03 20th December 2007
15:50 to 16:10
Reconstruction of certain phylogenetic networks from the genomes at leaves

A network N is a rooted acyclic directed graph. A base-set X for N is a subset of vertices including the root (or outgroup), all leaves, and all vertices of outdegree 1. A simple model of evolution is considered in which all characters are binary and in which back-mutations occur only at hybrid vertices. It is assumed that the genome is known for each member of the base-set X. If the network is known and is assumed to be "normal," then the genome of every vertex is uniquely determined and can be explicitly reconstructed. Under additional hypotheses involving time-consistency and separation of the hybrid vertices, the network itself can also be reconstructed from the genomes of all members of X. A polynomial-time procedure is outlined for performing the reconstruction.

PLGW03 21st December 2007
09:00 to 10:00
N Rosenberg Lineage sorting and multi-species coalescent models

Assuming that evolution follows a coalescent model within individual species, we examine the distribution of gene tree topologies conditional on species trees and the computation of this distribution. A particular focus of this talk will be on the properties of "anomalous gene trees," gene tree topologies that are more likely to evolve than the topology that matches the species tree. Some new results will be shown that illustrate the increase in complexity of anomalous gene trees with five or more taxa, and the results will be discussed in terms of their implications for phylogenetic inference from multigene data.

PLGW03 21st December 2007
10:00 to 10:20
J Degnan Coalescent consequences for consensus cladograms

To investigate the theoretical properties of consensus trees obtained from large numbers of gene trees evolving at different loci, we construct consensus trees from independent gene trees that occur in proportion to their probabilities from coalescent theory. We consider majority-rule, rooted triple (R*), and greedy consensus both asymptotically as the number of loci approaches infinity and for finite numbers of loci. We investigate the effects of species tree branch lengths and find different consistency results for the three methods: majority-rule consensus trees can be increasingly likely to be unresolved, although not misleading; greedy consensus trees can be misleading for all species tree topologies with at least 5 taxa; and R* consensus is statistically consistent, i.e., guaranteed to return the species tree topology given enough loci, for any binary species tree topology with any set of branch lengths. We also investigate the performance of consensus trees as species tree estimators when there are finite numbers of loci.

PLGW03 21st December 2007
10:20 to 10:40
The influence of effective population size on a genome-wide positive selection scan

There has been some discussion on the effect of population size on positive selection scans. It has been argued that positive selection as well as purifying selection is less efficient in humans and chimps than rodents because of their smaller population sizes than murids (Keightley et al., 2005; Rhesus Macaque Sequence and Analysis Consortium, 2007).

We have performed a positive selection scan on six mammalian genomes: human, chimpanzee, mouse, rat and dog. We find fewer positive selected genes on the primate clade than on the rodent clade. We try to quantify this effect by estimating ratios of population sizes of the species from the genomic data. The estimation is based on a extension of population genetic interpretation of a branch model of codon substitutions by Halpern and Bruno 1998 (see also Nielsen and Yang, 2003). The estimates we obtain for the ratio of macaque and human effective population sizes are in good agreement with the ratio calculated from polymorphim data (Wall et al., 2003, Hernandez et al, 2007). Calculations of further population size ratios on the mammalian tree with our method are underway. We will also discuss limitations of the estimation of population ratios from genomes of single individuals.

PLGW03 21st December 2007
10:40 to 11:00
A genealogical approach to studying asexuality

Given molecular genetic data from diploid individuals that, at present, reproduce mostly or exclusively asexually without recombination, an important problem in evolutionary biology is detecting evidence of past sexual reproduction (i.e., meiosis and mating) and recombination (both meiotic and mitotic). However, currently there is a lack of computational tools for carrying out such a study. In this talk, I will take a modest step toward filling that gap and describe a new method of testing asexuality by explicitly considering the evolutionary history of asexual diploid individuals.

PLGW03 21st December 2007
11:30 to 11:50
A Labarre Minimum common supergraphs and haplotype networks

In 2005 Cassens, Mardulyn and Milinkovitch proposed a new method for constructing phylogenetic networks in the context of intraspecific genealogies, also known as ``haplotype networks''. The proposed method, called ``Union of Maximum Parsimonious Trees'', is based on the global maximum parsimony approach, which aims at combining all most parsimonious trees into a single graph (in that context, trees are unrooted and undirected). However, their algorithm makes a number of arbitrary choices, produces solutions whose quality depends on the order in which the merging process is performed, and is a heuristic with an unclear objective function. We propose a combinatorial optimisation problem that can be used as a formal model for building such a graph, which consists in finding the minimum common supergraph of a given set of partially labelled trees. We propose a polynomial-time algorithm for solving the problem on two trees of a certain class, and a branch-and-bound algorithm in the case of two arbitrary trees. We will also discuss possible approaches when dealing with more than two trees.

PLGW03 21st December 2007
11:50 to 12:10
Incomplete lineage sorting: consistent phylogeny estimation from multiple loci

We introduce a simple algorithm for reconstructing phylogenies from multiple gene trees in the presence of incomplete lineage sorting, that is, when the topology of the gene trees may differ from that of the species tree. We show that our technique is statistically consistent, that is, it returns the correct tree given sufficiently many unlinked loci. We also show that our technique can tolerate moderate estimation errors.

PLGW05 20th June 2011
10:00 to 11:00
Recent Progress on Multi-State Perfect Phylogeny (Compatibility) Problems
The Multi-State Perfect Phylogeny Problem is an extension of the Binary Perfect Phylogeny (or Compatibility) Problem, allowing evolutionary characters to have more than two states. The problem is of interest due to prior elegant mathematical and algorithmic results, and due to integer data that is increasingly available.

In 1975, Buneman showed a how to view the Multi-State problem as one of triangulating non-chordal graphs, but the result has not been widely exploited as a computational tool. In this talk, I discuss our recent work on Multi-State Perfect Phylogeny problems, mostly by exploiting Buneman's result. I will talk about three sets of results.

The first set of results concerns problems that extend the multi-state problem: The Missing Data (MD) Problem, where some entries in the input are missing and the question is whether (bounded) values can be imputed so that the full data has a multi-state perfect phylogeny; The Character-Removal (CR) Problem, minimizing the number of characters to remove so that the resulting data has a multi-state perfect phylogeny. Using results on triangulation of non-chordal graphs, we establish new necessary and sufficient conditions for the existence of a perfect phylogeny with (or without) missing data. This leads to solutions of the MD and CR problems using integer linear programming.

The second set of results concerns generalization of the famous ``four-gametes test" that characterizes the compatibility of binary data. Such a generalization was sought since the 1970s. Our new generalization establishes that 3-state data has a perfect phylogeny if and only if every subset of three characters has a 3-state perfect phylogeny. We also characterize the patterns in the data that prohibit a 3-state perfect phylogeny. We briefly discuss efforts to generalize this to more states.

The third set of results concerns reductions of multi-state data to binary data, to the 2-SAT problem, and to sparser problem instances.
PLGW05 20th June 2011
11:30 to 11:50
The Complexity of the uSPR distance
We show that subtree prune and regraft (uSPR) distance on unrooted trees is fixed parameter tractable with respect to the distance. We also make progress on a conjecture of Steel on the preservation of uSPR distance under chain reduction.
PLGW05 20th June 2011
11:50 to 12:10
M Fischer Revisiting the question how many characters are needed to reconstruct the true tree
The question of how many sequence sites are required to recover the evolutionary relationship of the underlying species accurately is important for phylogeneticists. It is known that a particularly challenging problem for phylogenetic methods arises when a rapid divergence event occurred in the distant past, which leads to long pending branches and a short internal branch in the corresponding phylogenetic tree.

While most previous approaches tackling this problem considered only 2-state models, we investigate the scenario based on all four (DNA) character states. Particularly, we analyze a binary unrooted 4-taxon phylogenetic tree with a short interior edge and pending edges of multiple length. In my talk, I will present an optimal branch length of the interior edge in this case and I will explain how many characters are at least needed to reconstruct the 'true' tree.
PLGW05 20th June 2011
12:10 to 12:30
S Linz On the Complexity of the Quartet Challenge
A collection of quartets (unrooted phylogenetic trees on four taxa) is compatible if there exists a phylogenetic tree that explains all ancestral relationships given by the quartets. It is well-known that deciding whether or not a set of quartets is compatible is an NP-complete problem. A natural extension of this problem asks if, given a phylogenetic tree T that is compatible with the input, is T the only such tree. Several graph-theoretic characterizations have been established to approach this problem, which is known as the Quartet Challenge. However, the complexity of this challenge remains open. In this talk, we show that the Quartet Challenge is ASP-complete (i.e. it is computationally hard to decide whether another solution exists) by using a particular type of polynomial-time reduction from a variation of the Betweenness problem. This is joint work with Maria Luisa Bonet (UPC, Spain) and Katherine St. John (CUNY, USA).
PLGW05 20th June 2011
15:30 to 15:50
Metrics on Multi-labeled Trees: Interrelationships and Diameter Bounds
Multi-labeled trees or MUL-trees, for short, are trees whose leaves are labeled by elements of some non-empty finite set $X$ such that more than one leaf may be labeled by the same element of $X$. This class of trees includes phylogenetic trees and tree shapes. MUL-trees arise naturally in, for example, biogeography and gene evolution studies and also in the area of phylogenetic network reconstruction. In this talk we introduce novel metrics which may be used to compare MUL-trees, most of which generalize well-known metrics on phylogenetic trees and tree shapes. These metrics can be used, for example, to better understand the space of MUL-trees or to help visualize collections of MUL-trees. In addition, we describe some relationships between the MUL-tree metrics that we present and also give some novel diameter bounds for these metrics.

This is joint work with A. Spillner, University of Greifswald, Germany, and R. Suchecki and V. Moulton, both School of Computing Sciences, University of East Anglia, UK.
PLGW05 20th June 2011
15:50 to 16:10
M Thuillard Identifying Lateral Transfers with Neighbor-Nets
With the ever increasing availability of complete genomes, the importance of lateral transfers in evolution has been clearly recognized and increasing doubts are being raised about the feasibility of inferring a tree of life based on genome evolution. Identifying lateral gene transfers is therefore an important problem in evolutionary biology. A distance matrix in the Farris space that fulfills the Kalmanson inequalities can be exactly represented by an outer-planar split network or an X-tree. Using a simple model of lateral transfers one can show that a circular order of the taxa labeling the leaves of a phylogenetic tree is quite robust to lateral transfers. A circular order of the taxa is an order in which the taxa are encountered through a clockwise scanning of the tree embedded in the plane. If lateral transfers occur only between consecutive taxa in such an order, it can be shown that, within the usual distance-based framework for constructing phylogenetic trees, the o uter-planar network reconstructed by applying the Neighbor-Net algorithm furnishes a circular order of the nodes with the corresponding distance matrices in the Farris space fulfilling the Kalmanson inequalities. There are limits to the robustness of a circular order to lateral transfers. A new approach has been developed to deal with such cases. The approach combines the Neighbor-Net algorithm for computing phylogenetic networks with the Minimum Contradiction method. A subset of taxa, prescribed using Neighbor-Net, is obtained by measuring how closely the Kalmanson inequalities are fulfilled by each taxon. A criterion is then used to identify the taxa possibly involved in a lateral transfer between non-consecutive taxa. We illustrate the utility of the new approach by applying it to a distance matrix for Archaea, Bacteria and Eukaryota. Our new approach gives a way to use Neighbor-Nets to get further biological information, for instance to localize and identify lateral transfers or to detect large deviations to an X-tree or an outer-planar split network topology using distance matrices. In future, we expect that it will become increasingly important to develop new tools such as this that can allow one to extract useful biological information from trees and networks.
PLGW05 20th June 2011
16:10 to 16:30
Exploring Treespace
Phylogenies, or evolutionary histories, play a central role in modern biology, illustrating the interrelationships between species, and also aiding the prediction of structural, physiological, and biochemical properties. The reconstruction of the underlying evolutionary history from a set of morphological characters or biomolecular sequences is difficult since the optimality criteria favored by biologists are NP-hard, and the space of possible answers is huge. The number of possible phylogenetic trees for n taxa is (2n − 5)!!. Due to the hardness and the large number of possible answers, clever searching techniques and heuristics are used to estimate the underlying tree. We explore the underlying space of trees, under different metrics, in particular the nearest-neighbor-interchange (NNI), subtree-prune-and-regraft (SPR), tree-bisection-and-reconnection (TBR), and Robinson-Foulds (RF) distances.
PLGW05 20th June 2011
16:30 to 16:50
B Eldon Concordance between species trees and gene trees with multiple mergers of ancestral lineages
Concordance between species trees and gene genealogies are considered for gene genealogies allowing multiple mergers of ancestral lineages. Donnelly and Kurtz, Pitman, and Sagitov independently introduced coalescent processes (Lambda coalescent) that allow asynchronous multiple mergers of ancestral lineages. Kingman's coalescent allows only two ancestral lineages to merge at a time. Lambda coalescents may be more appropriate than Kingman's coalescent for some marine organisms with high fecundity and broadcast spawning. Multiple mergers, ie allowing any number of ancestral lineages present each time to coalesce, complicates computations. Two methods will be presented to compute the concordance, one relies on enumeration of all possible sequences of mergers, and the other is the spectral decomposition of the rate matrix. Results will be presented for species trees consisting of two to four species, and for different Lambda coalescents.
PLGW05 20th June 2011
16:50 to 17:10
Quantifying diversity and abundance in soil microbial communities from high-throughput sequence data
Soil microbial communities are essential in many different ways. For instance, they support crop productivity and maintain environmental stability through their roles in the carbon, nitrogen, oxygen and sulphur cycles. They also play a central part in plant health and contribute in the bioremediation of contaminated soils. Understanding the complex structure of soil microbial communities is crucial to better manage agricultural soils and minimise the negative impact of agricultural practices. However, this poses important biological, statistical and computational challenges because of the vast numbers and diversity of microbes present in each gram of soil, the paucity of knowledge concerning the majority of them and the sheer amount of data typically observed in soil metagenomics studies.

We address the problem of identifying bacterial groups present in soil and estimating their relative abundance using high-throughput data sampled from agricultural fields. Our procedure computes an overall phylogeny as the agreement among several individual phylogenies independently inferred from carefully selected marker genes. The principle being that such an agreement leads to the true phylogeny underlying all the observed taxa. The overall phylogeny provides information on the identity and abundance of bacterial groups through its clade structure and the relative sizes of these clades. Most importantly, because our method does not bin organisms according to reference databases, it sheds light into the identity and abundance of not only previously known soil bacteria but also of the 'unknowns'.

As a first step, we demonstrate the ability of our method to correctly discriminate between different bacterial groups and to quantify their relative abundance using simulated data. We then go on to analyse Roche/454 data sampled from one of Rothamsted's long-term experimental fields. We present preliminary results on this analysis.
PLGW05 21st June 2011
09:00 to 10:00
Reconciling Gene Trees in the Presence of Incomplete Lineage Sorting and Hybridization
In this talk, I will describe our work on two problems:

1. Inference of species trees from gene trees whose incongruence is assumed to be due to incomplete lineage sorting. Here, I will describe extension of the Minimize Deep Coalescence criterion to deal with gene tree estimates that may be unrooted, incompletely resolved, or both.

2. Inference of hybridization from gene trees, in the presence of incomplete lineage sorting. Here, I will describe our work on extending the concept of coalescent histories to phylogenetic networks, and how hybridization can be detected despite incomplete lineage sorting under both a maximum parsimony setting and a maximum likelihood setting.

The performance of the methods on both biological and simulated data will be shown.
PLGW05 21st June 2011
10:00 to 10:20
Quartets and unrooted phylogenetic networks
Phylogenetic networks were introduced to describe evolution in the presence of exchanges of genetic material between coexisting species or individuals. Split networks in particular were introduced as a special kind of abstract networks to visualize conflicts between phylogenetic trees which may correspond to such exchanges. More recently, methods were designed to reconstruct explicit phylogenetic networks (whose nodes can be interpreted as biological events) from triplet data.

In this presentation, we link abstract and explicit networks through their combinatorial properties, by introducing the unrooted analogue of level-k networks. In particular, we give an equivalence theorem between circular split systems and unrooted level-1 networks. We also show how to adapt to quartets some existing results on triplets, in order to reconstruct unrooted level-k phylogenetic networks. These results give an interesting perspective on the combinatorics of phylogenetic networks and also raise algorithmic and combinatorial questions.
PLGW05 21st June 2011
10:20 to 10:40
On the elusivity of softwired clusters
Rooted phylogenetic networks are often used to represent conflicting phylogenetic signals. One approach requires us to construct a phylogenetic network such that, for each cluster (i.e. clade) that is present in at least one input gene tree, at least one tree embedded in the network contains that cluster. This is often called the softwired cluster approach. Motivated by parsimony we might wish to construct such a network using as few reticulations as possible, or minimizing the maximum number of reticulations used in any "tangled" region of the network (known as the level of the network).

In this talk we present a number of new algorithmic results, both positive and negative, which emphasize the curious mathematical structure of the softwired cluster model. We also describe the relationship between our results and other recent results in constructing rooted phylogenetic networks. We conclude with a number of intruiging open questions.

This talk will be given by Steven Kelk, based on forthcoming joint work between Steven Kelk, Celine Scornavacca and Leo van Iersel.
PLGW05 21st June 2011
10:40 to 11:00
Ancestral Recombination Graphs for Missing and Removable Data
Recombination is a widespread feature of genomes and the study of recombination patterns is an important component in the design and analysis of disease association studies. Advances in algorithms for studying recombination must take into consideration the increasing size of data in both population size and number of sites typed. The goal of our work is to develop and implement algorithms to analyze recombination events in data that cannot be analyzed by existing methods, including data sets that are too large or data sets that include missing entries. We demonstrate a branch and bound approach that is practical, robust, and efficient, and furthermore show that the approach can be extended to analyze recombination bounds on circular data and input data with missing entries. We apply our methods to analyze simulated data and a benchmark lipoprotein gene and find that by including sites ignored in previous analysis (due to missing entries), the amount of overall detected reco mbination can be increased over prior methods.
PLGW05 21st June 2011
11:30 to 11:50
Tanglegrams for Rooted Phylogenetic Trees and Networks
In systematic biology, one is often faced with the task of comparing different phylogenetic trees, in particular in multi- gene analysis or cospeciation studies. One approach is to use a tanglegram in which two rooted phylogenetic trees are drawn opposite each other, using auxiliary lines to connect matching taxa. There is an increasing interest in using rooted phylogenetic networks to represent evolutionary history, so as to explicitly represent reticulate events, such as horizontal gene transfer, hybridization or reassortment. Thus, the question arises how to define and compute a tanglegram for such networks.

Here we present the first formal definition of a tanglegram for rooted phylogenetic networks and present a heuristic approach for computing one. We compare the performance of our method with existing tree tanglegram algorithms and also show a typical application to real biological data sets. For maximum usability, the algorithm does not require that the trees or networks are bifurcating or bicombining, or that they are on identical taxon sets. The algorithm is implemented in our program Dendroscope 3, which is freely available from www.dendroscope. org.
PLGW05 21st June 2011
11:50 to 12:10
Visualizing reticulate evolution with planar split networks
Split networks have been developed as a tool to visualize collections of bipartions of a finite set, so called split systems. We present a new method that allows to compute drawings of split networks without crossing edges for a rich class of split systems that include as special cases compatible and circular split systems. In addition, we show that the drawings produced by our method are optimal in the sense that any drawing with less edges would necessarily miss out some of the information contained in the given split system.
PLGW05 21st June 2011
12:10 to 12:30
Reconstructing the parameters of a network from its tree-average distances
A phylogenetic network is a rooted acyclic directed graph with labeled leaves which correspond to extant species. The network depicts the course of evolutionary history as species mutate and hybridize. Often each arc is weighted by a nonnegative real number measuring the amount of genetic change along the arc. A "tree-average distance" is defined which tells the average of the distances between the leaves in each displayed tree using these weights. Given a normal network N with certain restrictions, we show how the weights may be reconstructed from the tree-average distances. With additional assumptions we also indicate how the network N itself may be reconstructed. 

PLGW05 21st June 2011
14:00 to 15:00
BR Holland Phylogenies from DArTs: A stochastic Dollo proces with censored data
Diversity Array Technologies (DArT) are a relatively new kind of DNA marker system that seem like they could be usefully applied to phylogenetics (a few papers have already explored this). Like marker systems such AFLP and RFLP, the method produces presence absence data, but unlike these methods it is very unlikely for shared presences to occur by chance.

The basic idea is as follows. One or a small number of genomes are selected to form the genomic representation. Two enzymes are used to cut the DNA from these genomes at certain recognition sites (a rare 6bp recognition site and a more frequent 4bp recognition site). Fragments of DNA whose ends have been cut by two rare recognition sites are amplified. These fragments, which are said to form the genomic representation, are arranged on a microchip. Other genomes can then been checked to see which fragments within the genomic representation they have copies of in their own sequence. For each other genome that is compared to the genomic representation this results in a binary sequence that indicates presence (1) or absence (0) of each of the fragments.

The first obvious advantage of this approach is that it creates a representation of the whole genome rather than just a few genes. This alleviates the problem of picking a small set of genes that may not be representative of the evolutionary history of the species. The second advantage is that in comparison to an individual site, long fragments of DNA are very unlikely to be similar due to chance. So if two species share a fragment it is vastly more likely that they share it due to common ancestry rather than due to a chance similarity.

To use these data for phylogenetics it would be useful to develop a likelihood equivalent of Dollo parsimony (in which characters can be lost multiple times but gained only once), such models have already been explored in the context of language evolution and gene content evolution. However, another complicating issue is the censoring effect created by only being able to see those fragments that were in the original genomic representation, i.e. fragments that are shared by a group of species but that are not present in the original species used to make the genomic representation are missing from the data.
PLGW05 21st June 2011
15:30 to 15:50
Phylogenetic trees and k-dissimilarity maps
Phylogenetic trees are versatile tools in the analysis of sequence data. Several approaches exist to construct such trees from data using metrics or distances, and the Tree Metric Theorem of Dress gives an explicit condition when such a metric defines a tree. More recently, (see, e.g., Levy, Yoshida and Pachter [Mol. Biol. Evol. 23(3):491–498. 2006]), it was suggested to use the phylogenetic diversity not of pairs, but of triplets, quartets or, in general, k-tuples, to construct trees from the given data. Maps assigning values to such k-tuples of taxa as opposed to pairs are called k-dissimilarity maps.

In this talk, we will analyse when such k-dissimilarity maps correspond to trees and give generalisations of the Tree Metric Theorem.

This is joint works with Katharina Huber, Vincent Moulton and Andreas Spillner.
PLGW05 21st June 2011
15:50 to 16:10
Corrections to a class of methods for inferring species trees from gene trees
Although many methods for inferring species trees from gene trees while taking into account gene tree/species tree discordance provide consistent estimates of the species tree topology, many do not estimate branch lengths or are computationally slow. Exceptions include the GLASS method of Mossel and Roch (2010), the STEAC method of Liu et al. (2009), and the Shallowest Divergence method of Maddison and Knowles (2006). Given estimated gene trees for a set of loci, these methods estimate the divergence time between a pair of taxa by computing either the mean or minimum interspecific coalescence time between the taxa at each locus, and then taking either the mean or minimum of these times across all loci. The species tree is then constructed by applying a suitable hierarchical clustering method to the estimated divergence times. These methods estimate both species tree topologies and branch lengths and are computationally fast. However all of these methods systematically overest imate divergence times. We derive corrections that can substantially reduce the bias and mean squared error in the estimates of divergence times made by these methods and by a fourth method based on the same principle.
PLGW05 21st June 2011
16:10 to 16:30
A Kupczok Modeling the Evolutionary Dynamics of CRISPR spacers
We present a phylogenetic insertion- and deletion-model for CRISPR spacer turnover. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is an adaptive heritable immune system found in Eubacteria and Archaea. The system consists of a number of CRISPR associated (CAS) proteins and an array of repeats and spacers - the later represent the viral/plasmid targeting sequences and the system functions in an analogous way to the eukaryotic siRNA system. The length and content of the spacer array varies considerably among individuals within species (suggesting a rapid arms race) and it has been suggested that there is a selective cost, in the absence of parasites, associated with maintaining these arrays. At one point in time and space spacers can be beneficial, if a parasite with the corresponding sequence exists in the community, or they can be useless and thus neutral or slightly deleterious.

The rate at which spacers are gained and lost from these arrays provides insight into the evolutionary dynamics of host-parasite interactions. To this end we model spacer insertion and deletion as a continuous-time two-state Markov process. The model parameters are then estimated by maximum likelihood along the phylogeny. We assume different dynamics, modeled by different two-state Q-matrices, for beneficial and neutral spacers. An underlying switching process allows for changes between these Q-matrices. The information whether a spacer is beneficial or neutral is not known, thus we use ambiguous states. Using known spacer annotation we can also apply a partition model allowing the evolutionary rates to differ between spacers of different sources. We evaluate the switching model by simulation and also analyse bacterial data sets.
PLGW05 21st June 2011
16:30 to 16:50
Principal components analysis in tree space
Phylogenetic analysis commonly gives rise to a collection or sample of inferred evolutionary trees, each differing from the others. There is a need for methods that visualize, compare, and quantify variability in such sets of trees, in terms of both topological and geometrical differences. Standard tools of multivariate analysis such as multi-dimensional scaling and clustering have been applied to sets of trees, but Principal Components Analysis (PCA) cannot be applied directly since the space of evolutionary trees on a fixed set of taxa is not a vector space. I propose a novel geometrical approach to PCA in tree-space that works in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic path is sought that maximises the variance of the data under a form of projection within tree-space onto the path. Geodesic paths identified in this way reveal and quantify the principal sources of variation in the original collection of trees in terms of both topology and branch lengths, and can be visualized as animations of smoothly changing alternative evolutionary trees. The potential of the approach is illustrated by applying tree-space PCA to experimental data from metazoa and a simulation study of long-branch attraction.
PLGW05 21st June 2011
16:50 to 17:10
L Jermiin New methods of identifying fast-evolving sites in aligned sequence data
Rate-heterogeneity across sites is a widely recognized problem in molecular phylogenetic studies. Attempts to overcome the problem involve deleting the fastest-evolving sites from the data before the phylogenetic analysis and/or using phylogenetic methods that assume rate-heterogeneity across sites can be modeled by a distribution of probabilities. Each of these methods has its advantages and limitations. Here, we describe and compare five metrics that may be used to identify the fast-evolving sites. One of these metrics (i.e., the probability of compatibility) turned out to be particular good at identifying fast-evolving sites, and when these sites are removed from the data, the result is an increase in the consistency of the alignment and, in some cases, another phylogeny.
PLGW05 22nd June 2011
09:00 to 10:00
Evaluating the goodness of fit between a phylogenetic model and an alignment
As models of sequence evolution become more and more complicated, many criteria for model selection have been proposed, and tools are available to select the best model for an alignment under a particular criterion. However, in many instances the selected model fails to explain the data adequately as reflected by large deviations between observed pattern frequencies and the corresponding expectation. We present an approach to evaluate the goodness of fit. We introduces a minimum number of "extra substitutions" on the inferred tree to provide a biologically motivated explanation why the alignment may deviate from expectation. These extra substitutions plus the evolutionary model then fully explain the alignment. We illustrate the method on several examples and then give a survey about the goodness of fit of the selected models to the alignments in the PANDIT database.
PLGW05 22nd June 2011
10:00 to 10:20
T Downing Genome sequencing of Leishmania donovani clinical lines reveals dynamic phylogenetic variation
Leishmaniases are a range of debilitating neglected parasitic illnesses that have several forms: cutaneous, mucocutaneous and most potently, the visceral, which has a global incidence of approximately half a million cases per year. Consequently, a range of early-detection and treatment programs are implemented to limit the spread of the disease, but these are forced to combat the emergence of drug resistance in patients. The causative agent of visceral leishmanisis, the species of the Leishmania donovani complex, is found most frequently in equatorial regions and has a range from southern Europe to Africa to southern Asia. This presentation focuses on distinct regions of India and Nepal where the pathogen displays extensive variation in its response to the drugs most commonly used in treatment. Despite a low level of genetic diversity, clinical samples of these parasites manifest pronounced differences in phenotype. We sequenced, assembled and annotated the genome sequence fo r L. donovani, and used this as a basis for exploring variation in 17 samples cloned from patients with varying treatment outcomes. Many layers of diversity between these genome-sequenced strains were evident at the level of single DNA nucleotides, repetitive gene copies, whole chromosome duplications and extra-chromosomal fragments. By combining this population-level picture with the other sequenced Leishmania species genomes, candidate variants relevant to drug resistance were identified. This study provides evidence of the power and scope of genome-level surveillance of emergent drug resistance in pathogens.
PLGW05 22nd June 2011
10:20 to 10:40
AD Drummond Bayesian time-tree priors
Bayesian phylogenetic inference requires specification of a prior distribution on the space of trees. In BEAST, the phylogenetic history is explicitly separated into a time-tree (by definition ultrametric for contemporaneous taxa) and the rates of evolution, which may vary from lineage to lineage. The combination can account for phylogenetic trees with non-ultrametric branches in units of substitutions per site. Hierarchical Bayesian priors on the time-tree lead to inference of population and dynamical priors through application of the coalescent, Yule or Birth-death tree priors, and calibration of the individual nodes in the time-tree leads to estimation of the absolute age of other divergence times of interest. Here I discuss difficulties and recent advances in the development of calibrated time-tree priors for Bayesian phylogenetic inference, including calculation of the marginal distribution of divergence times under various time-tree priors and a class of new time-tree priors based on a "Birth-death skyline" model of branching.
PLGW05 22nd June 2011
10:40 to 11:00
D Kühnert Phylodynamic inference - accounting for the interaction of evolutionary and ecological processes
Rapidly evolving viruses such as HIV, HCV and Influenza virus are of major interest in phylogenetics. They are special in that their ecological and evolutionary processes act on the same timescale and are therefore entangled. The cross-reaction of the two processes must be accounted for when inferring epidemiological parameters and/or phylogenetic history.

Our aim is a joint epidemiological phylogenetic, i.e. phylodynamic, analysis of genomic data by incorporating the dynamics of an SIR model into Bayesian phylogenetic inference. A new version of the birth-death model (Stadler, 2010) that incorporates sampling-through-time and can also be extended to allow birth and death rates to change over time can be parameterized to facilitate modeling SIR-like population dynamics while reconstructing phylogenetic history and simultaneously estimating epidemiological parameters.

References Stadler, T., 2010. Sampling-through-time in birth-death trees. Journal of Theoretical Biology.
PLGW05 22nd June 2011
11:30 to 12:30
Sequence Classification using Reference Taxonomies
Next generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which were so far largely unknown. The analysis of metagenomic data is challenging, since a sample may contain a mixture of many different microbial species, whose genome has not necessarily been sequenced beforehand. In this talk, we address the problem of analyzing metagenomic data for which databases of reference sequences are already known. We discuss both composition and alignment-based methods for the classification of sequence reads, and present recent results on the assignment of ambiguous reads to microbial species at the best possible taxonomic rank.
PLGW05 22nd June 2011
14:00 to 15:00
O Gascuel Modelling protein evolution
I'll describe our recent efforts to model protein evolution and amino-acid substitutions by (1) designing an easy to use pipeline to estimate amino-acid replacement matrices for specific protein groups, (2) using the protein structure, (3) simplifying mixture models that have been proposed so far. I'll discuss several issues related to models (e.g. complexity trade-off), biology (e.g. independence of nucleotide mutations) and theory (e.g. identifiability, prediction of ancestral sequences).
PLGW05 22nd June 2011
15:00 to 15:20
B Blackburne An empirical study of the effect of sequence alignment on phylogenetic analysis
Phylogenetic analyses start with a multiple sequence alignment, which is often accepted as known despite wide recognition that errors may impact downstream phylogenetic analysis. Many phylogenetic methods involve testing which of a range of competing hypotheses best describe the evolution of a set of sequences. These tests may be justified statistically when using the correct alignment, but errors in the alignment lead to non-homologous characters being placed together, which in turn may systematically bias the test. We investigate empirically the impact of different alignment methods on phylogenetic analyses and assess the relative impact of different approximations used by different alignment methods.

We examine the effect of alignment on two phylogenetic analyses that are commonly used in computational biology: the inference of a maximum-likelihood tree using RAxML, and a test for positive selection by comparing the M7 and M8 models in PAML. We test 200 sets of sequences from the Adaptive Evolution Database using the popular aligners ClustalW, Muscle, MAAFT, ProbCons, and the phylogenetic aligner Prank. We also sample from the posterior distribution of the statistical aligner BAli-Phy, which enables us to compare the relative impact of aligner choice to uncertainty from a single aligner.

The algorithmic basis of an aligner tends to determine the outcome of the phylogenetic analysis. For example, trees estimated from progressive aligners tend to be more similar to one another than those estimated from phylogenetically aware (Prank) or consensus (ProbCons) aligners. Moreover the spread of phylogenetic parameter estimates inferred from BAli-Phy’s posterior distribution of alignments is much smaller than the differences between other aligners, suggesting differences are larger than could be expected by chance. Of the aligners examined, our results suggest that the phylogenetically informed Prank provides the closest approximation to full statistical alignment.
PLGW05 22nd June 2011
16:00 to 17:00
JO McInerney Are there alternatives to handling site-to-site rate variation in evolutionary characters
For more than 20 years, we have used the gamma distribution to effectively model site-to-site variation in evolutionary rate. This approach is usually implemented by first of all inferring a phylogenetic tree and then using a gamma distribution to improve the fit of the model and data. The result is almost always an improvement in likelihood scores, because apart from pseudogenes, most genes exhibit real variation in evolutionary rates among different sites. In our work we have sought to explore rate variation across characters and I will present our method and how it can work to improve phylogeny reconstruction. I will also speak about the need for future developments in this area and what I perceive to be the outstanding issues.
PLGW05 23rd June 2011
09:00 to 10:00
A Mooers Trees as representations of evolutionary information worth keeping
Weighted trees can be viewed as depictions of informational redundancy among tips. There is a rich literature in conservation biology on how this view can be used to maximize this information in triage situations or predict the future loss of such evolutionary information. Early views that redundancy is expected to be high is supported neither by theory on tree shapes nor by surveys of phylogenies. This means that work on maximization is relevant and timely. However, much triage is done within, rather than between, independent lineages; more theoretical work is needed to translate relevant metrics to, e.g., haplotype networks on landscapes.
PLGW05 23rd June 2011
10:00 to 10:20
W Hordijk Speciation & Extinction Rate Estimation from Phylogenetic Trees
Recently there has been an increasing interest in trying to estimate rates of evolution (in particular speciation and extinction rates) from phylogenetic tree data. However, this is currently still a difficult problem. We use a method based on lineage survival probabilities and actual speciation times to infer speciation and exitinction rates from (simulated) phylogenetic trees. Our results show that large trees (several hundred species) are necessary to get statistically significant results. This work is still ongoing, and we are investigating how to get more accurate and reliable results on smaller trees, and whether this is possible at all given the statistical nature of the problem.
PLGW05 23rd June 2011
10:20 to 10:40
Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood
Incomplete lineage sorting can cause incongruence between phylogenetic history of genes (the gene tree) and that of the species (the species tree), which can complicate the inference of phylogenies. Developing robust computational inference approaches is currently of interests in studying incomplete lineage sorting. In this talk, I will present a new coalescent-based algorithm for inferring species tree with maximum likelihood. I will first describe an improved method for computing the probability of a gene tree topology for a species tree, which is much faster than an existing algorithm. Based on this method, I will present a practical algorithm that takes a set of gene tree topologies and infers species trees with maximum likelihood. In this algorithm, we search for the best species tree by starting from candidate species trees found by a parsimony method and performing local search to obtain better trees with higher likelihood. This algorithm, called {STELLS}, has been imp lemented in a program that is downloadable from the author's web page. The simulation results show that the STELLS algorithm is more accurate than several existing methods for many datasets, especially when there is noise (in terms of topology, branch lengths and rooting) in gene trees. We also show the STELLS algorithm is efficient and can be applied to real biological datasets.
PLGW05 23rd June 2011
10:40 to 11:00
Tree reconciliations: beyond the LCA mapping
The least common ancestor (LCA) mapping is an important tool to reconcile discordance between gene tree and species tree, and has been well studied in the literature. In this talk, we report some recent results in studying general reconciliations, i.e., leaf-preserving and order-preserving maps, between gene tree and species tree. Here we introduce three cost models to infer, respectively, the gene duplication, gene loss and deep coalescent events. And we also show that the canonical partial order dened on the set of reconciliations is compatible with these cost models. Finally, we discuss how to generalize these results to non-binary gene trees.
PLGW05 23rd June 2011
11:30 to 11:50
A Lie algebraic classification of continuous-time Markov models
In recent work we have discussed the importance of multiplicative closure for the Markov models used in phylogenetics. For continuous-time Markov chains, a sufficient condition for multiplicative closure of a model class is ensured by demanding that the set of rate matrices belonging to the model class form a Lie algebra. It is the case that some Markov models do form Lie algebras (eg. JC, F81, K3ST and GMM), and we refer to these models as ``Lie Markov models''. Importantly, other Markov models unequivocally do not form Lie algebras (the most conspicuous example being GTR).

The question then naturally arises: How do we generate a full list of Lie Markov models? To answer this question in full generality seems quite difficult as the interaction between the operations of a Lie algebra and the stochastic requirements of rate matrices are somewhat at odds, and it seems that this question has not been addressed in the algebraic or stochastic mathematics literature.

In this talk, we will discuss how we have made signficant progress in generating Lie Markov models by demanding that the models have certain symmetrics under nucleotide permutations. >From a theoretical perspective, we show that the Lie Markov models include, and hence provide a unifying concept for, ``group-based'' and "equivariant" models, whilst also generating many other interesting examples. We also argue that our scheme is very pleasing in the context of applied phylogenetics, as, for a given symmetry of nucelotide substitution, it provides a natural hierarchy of models with increasing number of parameters.

This is joint work with Jesús Fernández-Sánchez and Peter Jarvis.
PLGW05 23rd June 2011
11:50 to 12:10
A generalization of Stirling numbers and distribution of phylogenetic trees
P.L. Erdos and L.A. Szekely provided a bijection between rooted semi-labeled trees and set partitions, and hence Stirling numbers of the second kind. This, with the asymptotic normality of the Sirling numbers of the second kind (Harper) translates into the asymptotic normality of rooted semi-labeled trees with a fixed number of vertices and a variable number of internal vertices. We apply Harper's method and the Erdos-szekely bijection to obtain the asymptotic normality of of phylogenetic trees.
PLGW05 23rd June 2011
12:10 to 12:30
Mammalian phylogeny reveals recent diversi cation rate shifts
Phylogenetic trees of present-day species allow investigation of the rate of evolution which led to the present-day diversity. A recent analysis of the complete mammalian phylogeny challenged the view of explosive mammalian evolution after the K/T boundary (65 Ma). However, due to lack of appropriate methods, the diversication rates in the more recent past of mammalian evolution could not be determined. Here, I provide a method which reveals that the tempo of mammalian evolution did not change until about 33 Ma. This constant period was followed by a peak of diversication rates between 33 and 30 Ma. Thereafter, diversication rates remained high and constant until 8.55 Ma. Diversication rates declined signicantly at 8.55 and 3.35 Ma. Investigation of mammalian subgroups (marsupials, placentals, and the six largest placental subgroups) reveals that the diversication rate peak at 33-30 Ma ago is mainly driven by rodents, cetartiodactyla and marsupials. The recent diversication rate decrease is signicant for all analyzed subgroups but eulipotyphla, cetartiodactyla and primates. My likelihood approach is not limited to mammalian evolution. It provides a robust framework to infer diversication rate changes and mass extinction events in phylogenies, reconstructed from e.g. present-day species or virus data. In particular, the method is very robust towards noise and uncertainty in the phylogeny, and can account for incomplete taxon sampling.
PLGW05 23rd June 2011
14:00 to 15:00
Integrative analysis if environmental sequences
In metagenomics, the aim is to understand the composition and operation of complex microbial assemblages in environmental samples through sequencing and analysis of their DNA. Similarly, metatranscriptomics and metaproteomics target the RNA and proteins obtained from such samples. A major challenge in the analysis of environmental sequences is data integration. The question is how to anaylze different types of data in a unified approach, addressing both taxonomic and functional aspects. We will discuss how to perform taxonomic analysis based on the NCBI taxonomy. Functional analysis can be based on classifications such as the SEED classification of functional roles and subsystems or the KEGG classification of pathways and enzymes. We will also discuss how to compare different datasets using statistical approaches, ecological indices and methods such as Neighbor-net and multi-dimensional scaling.
PLGW05 24th June 2011
09:00 to 10:00
O Bininda-Emonds Assessing the limits of phylogenomics: can too much data be a bad thing?
In contrast to the situation not even 20 years ago, molecular sequence data is now plentiful (if still patchily distributed) and phylogenomic studies of hundreds of taxa on a broad taxonomic scale are becoming increasingly common. Whereas the accuracy of phylogenetic analysis was limited until recently by a shortage of data (and then for both taxa and characters), the results of large and comprehensive phylogenomic studies where data are not limiting are also not without their problems. Analyses including large numbers of taxa run up against the superexponential increase in the number of possible solutions, requiring any or all of more time, faster computers in conjunction with parallel processing, and cleverer heuristics to find a hopefully near optimal solution. Perhaps less appreciated, however, is that the increasing taxonomic scope of our analyses demands the use of large amounts of molecular sequence data with significant rate heterogeneity across the data set (whether between or within partitions) to achieve full resolution throughout the tree. In this talk, I examine how the performance of phylogenetic analysis is affected when analyzing large number of taxa or a large multigene data set incorporating the degree of rate heterogeneity that is to be found, if not needed, in typical phylogenomic data sets.
PLGW05 24th June 2011
10:20 to 10:40
Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions
Distance-based phylogenetic reconstruction methods rely heavily on accurate pairwise distance estimates. There are two separate sources of error in this estimation process:

(1) the relatively short sequence alignments used to obtain distance estimates induce a "stochastic error" corresponding to estimation of model parameters from finite data;

(2) model misspecification leads to a "fixed error" which does not depend on sequence length.

It is common practice to assume some substitution model over the sequence data and use an additive substitution rate function for that model when computing pairwise distances. In the providential case when the assumed model coincides with the true model, which is typically unkown, the distance estimates will not be afflicted with fixed error. But even then, there is no reason to a-priori enforce a zero fixed error, when this causes elevated rates of stochastic error, especially in the case of short sequence alignments.

This work challenges this paradigm of "using the most additive distance function at any cost". We do this by studying the contribution and effect of both fixed and stochastic error in distance estimation. We present a formal framework for quantifying the fixed error associated with a specific distance function and a given phylogenetic tree in a homogeneous substitution model. As an example, we study the behavior of the Jukes-Cantor distance formula in homogeneous instances of Kimura's two parameter substitution model. The effects of fixed error are observed through analytic results and experiments on simulated data. In addition, we compare the performance of various distance functions on biological sequences. We evaluate reconstruction accuracy by comparing the reconstructed trees to an independently validated species tree. Our study indicates that often enough simple distance functions outperform more sophisticated functions, despite the fact that the given sequence data appears to have poor fit to the substitution model they assume.
PLGW05 24th June 2011
10:40 to 11:00
The combinatorics of distance-based tree inference
Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or any kind of objects): the objective is to construct a tree with branch lengths so that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, in some relevant cases the optimal values for the branch lengths can be expressed using simple combinatiorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties: first, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency); second, the branch lengths of all the simple (NNI) rearrangements of a tree can be calculated, optimally, in quadratric time in the size of the tree. The study presented here may form the basis for novel effcient search algorithms for distance-based tree reconstruction.
PLGW05 24th June 2011
11:30 to 12:30
Estimating ultra-large phylogenies and alignments
Phylogenetic (evolutionary) tree estimation is fundamental to biology, and has applications to much biomedical research. Phylogenetic tree estimation is enormously difficult: the best approaches are NP-hard, and many real datasets take weeks or months of analysis. In particular, the estimation of multiple sequence alignment turns out to be one of the most challenging and important steps in a phylogenetic analysis.

In this talk I will present several methods that produce greatly improved trees as compared to other phylogeny estimation methods. These methods utilize divide-and-conquer strategies in order to improve the accuracy of traditional phylogeny estimation methods. Among the methods I will present are SATe (Liu et al. 2009, Science Vol 324, no. 5934) and DACTAL (in preparation). SATe simultaneously estimates a tree and a multiple sequence alignment, while DACTAL estimates a tree without ever computing an alignment on the entire set of sequences. Both methods provide great improvements in tree accuracy over other methods, and do so fairly efficiently.
University of Cambridge Research Councils UK
    Clay Mathematics Institute London Mathematical Society NM Rothschild and Sons