# Workshop Programme

## for period 20 - 24 June 2011

### Phylogenetics: New Data, New Phylogenetic Challenges

20 - 24 June 2011

Timetable

Monday 20 June | ||||

08:30-09:45 | Registration | |||

09:45-10:00 | Opening remarks and welcome from Dr Ben Mestel (INI Deputy Director) | |||

10:00-11:00 | Gusfield, DM (University of California, Davis) |
|||

Recent Progress on Multi-State Perfect Phylogeny (Compatibility) Problems | Sem 1 | |||

The Multi-State Perfect Phylogeny Problem is an extension of the Binary Perfect Phylogeny (or Compatibility) Problem, allowing evolutionary characters to have more than two states. The problem is of interest due to prior elegant mathematical and algorithmic results, and due to integer data that is increasingly available. In 1975, Buneman showed a how to view the Multi-State problem as one of triangulating non-chordal graphs, but the result has not been widely exploited as a computational tool. In this talk, I discuss our recent work on Multi-State Perfect Phylogeny problems, mostly by exploiting Buneman's result. I will talk about three sets of results. The first set of results concerns problems that extend the multi-state problem: The Missing Data (MD) Problem, where some entries in the input are missing and the question is whether (bounded) values can be imputed so that the full data has a multi-state perfect phylogeny; The Character-Removal (CR) Problem, minimizing the number of characters to remove so that the resulting data has a multi-state perfect phylogeny. Using results on triangulation of non-chordal graphs, we establish new necessary and sufficient conditions for the existence of a perfect phylogeny with (or without) missing data. This leads to solutions of the MD and CR problems using integer linear programming. The second set of results concerns generalization of the famous ``four-gametes test" that characterizes the compatibility of binary data. Such a generalization was sought since the 1970s. Our new generalization establishes that 3-state data has a perfect phylogeny if and only if every subset of three characters has a 3-state perfect phylogeny. We also characterize the patterns in the data that prohibit a 3-state perfect phylogeny. We briefly discuss efforts to generalize this to more states. The third set of results concerns reductions of multi-state data to binary data, to the 2-SAT problem, and to sparser problem instances. |
||||

11:00-11:30 | Morning coffee | |||

11:30-11:50 | Bonet, ML (Politècnica de Catalunya) |
|||

The Complexity of the uSPR distance | Sem 1 | |||

We show that subtree prune and regraft (uSPR) distance on unrooted trees is fixed parameter tractable with respect to the distance. We also make progress on a conjecture of Steel on the preservation of uSPR distance under chain reduction. |
||||

11:50-12:10 | Fischer, M (CIBIV, Vienna) |
|||

Revisiting the question how many characters are needed to reconstruct the true tree | Sem 1 | |||

The question of how many sequence sites are required to recover the evolutionary relationship of the underlying species accurately is important for phylogeneticists. It is known that a particularly challenging problem for phylogenetic methods arises when a rapid divergence event occurred in the distant past, which leads to long pending branches and a short internal branch in the corresponding phylogenetic tree. While most previous approaches tackling this problem considered only 2-state models, we investigate the scenario based on all four (DNA) character states. Particularly, we analyze a binary unrooted 4-taxon phylogenetic tree with a short interior edge and pending edges of multiple length. In my talk, I will present an optimal branch length of the interior edge in this case and I will explain how many characters are at least needed to reconstruct the 'true' tree. |
||||

12:10-12:30 | Linz, S (Tübingen) |
|||

On the Complexity of the Quartet Challenge | Sem 1 | |||

A collection of quartets (unrooted phylogenetic trees on four taxa) is compatible if there exists a phylogenetic tree that explains all ancestral relationships given by the quartets. It is well-known that deciding whether or not a set of quartets is compatible is an NP-complete problem. A natural extension of this problem asks if, given a phylogenetic tree T that is compatible with the input, is T the only such tree. Several graph-theoretic characterizations have been established to approach this problem, which is known as the Quartet Challenge. However, the complexity of this challenge remains open. In this talk, we show that the Quartet Challenge is ASP-complete (i.e. it is computationally hard to decide whether another solution exists) by using a particular type of polynomial-time reduction from a variation of the Betweenness problem. This is joint work with Maria Luisa Bonet (UPC, Spain) and Katherine St. John (CUNY, USA). |
||||

12:30-13:30 | Lunch at Wolfson Court | |||

14:00-15:00 | Free Time | |||

15:00-15:30 | Afternoon tea | |||

15:30-15:50 | Huber, K (East Anglia) |
|||

Metrics on Multi-labeled Trees: Interrelationships and Diameter Bounds | Sem 1 | |||

Multi-labeled trees or MUL-trees, for short, are trees whose leaves are labeled by elements of some non-empty finite set $X$ such that more than one leaf may be labeled by the same element of $X$. This class of trees includes phylogenetic trees and tree shapes. MUL-trees arise naturally in, for example, biogeography and gene evolution studies and also in the area of phylogenetic network reconstruction. In this talk we introduce novel metrics which may be used to compare MUL-trees, most of which generalize well-known metrics on phylogenetic trees and tree shapes. These metrics can be used, for example, to better understand the space of MUL-trees or to help visualize collections of MUL-trees. In addition, we describe some relationships between the MUL-tree metrics that we present and also give some novel diameter bounds for these metrics. This is joint work with A. Spillner, University of Greifswald, Germany, and R. Suchecki and V. Moulton, both School of Computing Sciences, University of East Anglia, UK. |
||||

15:50-16:10 | Thuillard, M | |||

Identifying Lateral Transfers with Neighbor-Nets | Sem 1 | |||

With the ever increasing availability of complete genomes, the importance of lateral transfers in evolution has been clearly recognized and increasing doubts are being raised about the feasibility of inferring a tree of life based on genome evolution. Identifying lateral gene transfers is therefore an important problem in evolutionary biology. A distance matrix in the Farris space that fulfills the Kalmanson inequalities can be exactly represented by an outer-planar split network or an X-tree. Using a simple model of lateral transfers one can show that a circular order of the taxa labeling the leaves of a phylogenetic tree is quite robust to lateral transfers. A circular order of the taxa is an order in which the taxa are encountered through a clockwise scanning of the tree embedded in the plane. If lateral transfers occur only between consecutive taxa in such an order, it can be shown that, within the usual distance-based framework for constructing phylogenetic trees, the o uter-planar network reconstructed by applying the Neighbor-Net algorithm furnishes a circular order of the nodes with the corresponding distance matrices in the Farris space fulfilling the Kalmanson inequalities. There are limits to the robustness of a circular order to lateral transfers. A new approach has been developed to deal with such cases. The approach combines the Neighbor-Net algorithm for computing phylogenetic networks with the Minimum Contradiction method. A subset of taxa, prescribed using Neighbor-Net, is obtained by measuring how closely the Kalmanson inequalities are fulfilled by each taxon. A criterion is then used to identify the taxa possibly involved in a lateral transfer between non-consecutive taxa. We illustrate the utility of the new approach by applying it to a distance matrix for Archaea, Bacteria and Eukaryota. Our new approach gives a way to use Neighbor-Nets to get further biological information, for instance to localize and identify lateral transfers or to detect large deviations to an X-tree or an outer-planar split network topology using distance matrices. In future, we expect that it will become increasingly important to develop new tools such as this that can allow one to extract useful biological information from trees and networks. |
||||

16:10-16:30 | St John, K (CUNY) |
|||

Exploring Treespace | Sem 1 | |||

Phylogenies, or evolutionary histories, play a central role in modern biology, illustrating the interrelationships between species, and also aiding the prediction of structural, physiological, and biochemical properties. The reconstruction of the underlying evolutionary history from a set of morphological characters or biomolecular sequences is difficult since the optimality criteria favored by biologists are NP-hard, and the space of possible answers is huge. The number of possible phylogenetic trees for n taxa is (2n − 5)!!. Due to the hardness and the large number of possible answers, clever searching techniques and heuristics are used to estimate the underlying tree. We explore the underlying space of trees, under different metrics, in particular the nearest-neighbor-interchange (NNI), subtree-prune-and-regraft (SPR), tree-bisection-and-reconnection (TBR), and Robinson-Foulds (RF) distances. |
||||

16:30-16:50 | Eldon, B (Oxford) |
|||

Concordance between species trees and gene trees with multiple mergers of ancestral lineages | Sem 1 | |||

Concordance between species trees and gene genealogies are considered for gene genealogies allowing multiple mergers of ancestral lineages. Donnelly and Kurtz, Pitman, and Sagitov independently introduced coalescent processes (Lambda coalescent) that allow asynchronous multiple mergers of ancestral lineages. Kingman's coalescent allows only two ancestral lineages to merge at a time. Lambda coalescents may be more appropriate than Kingman's coalescent for some marine organisms with high fecundity and broadcast spawning. Multiple mergers, ie allowing any number of ancestral lineages present each time to coalesce, complicates computations. Two methods will be presented to compute the concordance, one relies on enumeration of all possible sequences of mergers, and the other is the spectral decomposition of the rate matrix. Results will be presented for species trees consisting of two to four species, and for different Lambda coalescents. |
||||

16:50-17:10 | Loza, E (Rothamsted Research) |
|||

Quantifying diversity and abundance in soil microbial communities from high-throughput sequence data | Sem 1 | |||

Soil microbial communities are essential in many different ways. For instance, they support crop productivity and maintain environmental stability through their roles in the carbon, nitrogen, oxygen and sulphur cycles. They also play a central part in plant health and contribute in the bioremediation of contaminated soils. Understanding the complex structure of soil microbial communities is crucial to better manage agricultural soils and minimise the negative impact of agricultural practices. However, this poses important biological, statistical and computational challenges because of the vast numbers and diversity of microbes present in each gram of soil, the paucity of knowledge concerning the majority of them and the sheer amount of data typically observed in soil metagenomics studies. We address the problem of identifying bacterial groups present in soil and estimating their relative abundance using high-throughput data sampled from agricultural fields. Our procedure computes an overall phylogeny as the agreement among several individual phylogenies independently inferred from carefully selected marker genes. The principle being that such an agreement leads to the true phylogeny underlying all the observed taxa. The overall phylogeny provides information on the identity and abundance of bacterial groups through its clade structure and the relative sizes of these clades. Most importantly, because our method does not bin organisms according to reference databases, it sheds light into the identity and abundance of not only previously known soil bacteria but also of the 'unknowns'. As a first step, we demonstrate the ability of our method to correctly discriminate between different bacterial groups and to quantify their relative abundance using simulated data. We then go on to analyse Roche/454 data sampled from one of Rothamsted's long-term experimental fields. We present preliminary results on this analysis. |
||||

17:10-18:10 | Welcome wine reception |

Tuesday 21 June | ||||

09:00-10:00 | Nakhleh, L (Rice) |
|||

Reconciling Gene Trees in the Presence of Incomplete Lineage Sorting and Hybridization | Sem 1 | |||

In this talk, I will describe our work on two problems: 1. Inference of species trees from gene trees whose incongruence is assumed to be due to incomplete lineage sorting. Here, I will describe extension of the Minimize Deep Coalescence criterion to deal with gene tree estimates that may be unrooted, incompletely resolved, or both. 2. Inference of hybridization from gene trees, in the presence of incomplete lineage sorting. Here, I will describe our work on extending the concept of coalescent histories to phylogenetic networks, and how hybridization can be detected despite incomplete lineage sorting under both a maximum parsimony setting and a maximum likelihood setting. The performance of the methods on both biological and simulated data will be shown. |
||||

10:00-10:20 | Gambette, P (la Méditerranée (Aix-Marseille II)) |
|||

Quartets and unrooted phylogenetic networks | Sem 1 | |||

Phylogenetic networks were introduced to describe evolution in the presence of exchanges of genetic material between coexisting species or individuals. Split networks in particular were introduced as a special kind of abstract networks to visualize conﬂicts between phylogenetic trees which may correspond to such exchanges. More recently, methods were designed to reconstruct explicit phylogenetic networks (whose nodes can be interpreted as biological events) from triplet data. In this presentation, we link abstract and explicit networks through their combinatorial properties, by introducing the unrooted analogue of level-k networks. In particular, we give an equivalence theorem between circular split systems and unrooted level-1 networks. We also show how to adapt to quartets some existing results on triplets, in order to reconstruct unrooted level-k phylogenetic networks. These results give an interesting perspective on the combinatorics of phylogenetic networks and also raise algorithmic and combinatorial questions. |
||||

10:20-10:40 | Kelk, SM (CWI) |
|||

On the elusivity of softwired clusters | Sem 1 | |||

Rooted phylogenetic networks are often used to represent conflicting phylogenetic signals. One approach requires us to construct a phylogenetic network such that, for each cluster (i.e. clade) that is present in at least one input gene tree, at least one tree embedded in the network contains that cluster. This is often called the softwired cluster approach. Motivated by parsimony we might wish to construct such a network using as few reticulations as possible, or minimizing the maximum number of reticulations used in any "tangled" region of the network (known as the level of the network). In this talk we present a number of new algorithmic results, both positive and negative, which emphasize the curious mathematical structure of the softwired cluster model. We also describe the relationship between our results and other recent results in constructing rooted phylogenetic networks. We conclude with a number of intruiging open questions. This talk will be given by Steven Kelk, based on forthcoming joint work between Steven Kelk, Celine Scornavacca and Leo van Iersel. |
||||

10:40-11:00 | Lam, F (UC, Davis) |
|||

Ancestral Recombination Graphs for Missing and Removable Data | Sem 1 | |||

Recombination is a widespread feature of genomes and the study of recombination patterns is an important component in the design and analysis of disease association studies. Advances in algorithms for studying recombination must take into consideration the increasing size of data in both population size and number of sites typed. The goal of our work is to develop and implement algorithms to analyze recombination events in data that cannot be analyzed by existing methods, including data sets that are too large or data sets that include missing entries. We demonstrate a branch and bound approach that is practical, robust, and efficient, and furthermore show that the approach can be extended to analyze recombination bounds on circular data and input data with missing entries. We apply our methods to analyze simulated data and a benchmark lipoprotein gene and find that by including sites ignored in previous analysis (due to missing entries), the amount of overall detected reco mbination can be increased over prior methods. |
||||

11:00-11:30 | Morning coffee | |||

11:30-11:50 | Scornavacca, C (Tübingen) |
|||

Tanglegrams for Rooted Phylogenetic Trees and Networks | Sem 1 | |||

In systematic biology, one is often faced with the task of comparing different phylogenetic trees, in particular in multi- gene analysis or cospeciation studies. One approach is to use a tanglegram in which two rooted phylogenetic trees are drawn opposite each other, using auxiliary lines to connect matching taxa. There is an increasing interest in using rooted phylogenetic networks to represent evolutionary history, so as to explicitly represent reticulate events, such as horizontal gene transfer, hybridization or reassortment. Thus, the question arises how to define and compute a tanglegram for such networks. Here we present the first formal definition of a tanglegram for rooted phylogenetic networks and present a heuristic approach for computing one. We compare the performance of our method with existing tree tanglegram algorithms and also show a typical application to real biological data sets. For maximum usability, the algorithm does not require that the trees or networks are bifurcating or bicombining, or that they are on identical taxon sets. The algorithm is implemented in our program Dendroscope 3, which is freely available from www.dendroscope. org. |
||||

11:50-12:10 | Spillner, A (Ernst-Moritz-Arndt-Universität Greifswald) |
|||

Visualizing reticulate evolution with planar split networks | Sem 1 | |||

Split networks have been developed as a tool to visualize collections of bipartions of a finite set, so called split systems. We present a new method that allows to compute drawings of split networks without crossing edges for a rich class of split systems that include as special cases compatible and circular split systems. In addition, we show that the drawings produced by our method are optimal in the sense that any drawing with less edges would necessarily miss out some of the information contained in the given split system. |
||||

12:10-12:30 | Willson, S (Iowa State) |
|||

Reconstructing the parameters of a network from its tree-average distances | Sem 1 | |||

A phylogenetic network is a rooted acyclic directed graph with labeled leaves which correspond to extant species. The network depicts the course of evolutionary history as species mutate and hybridize. Often each arc is weighted by a nonnegative real number measuring the amount of genetic change along the arc. A "tree-average distance" is defined which tells the average of the distances between the leaves in each displayed tree using these weights. Given a normal network N with certain restrictions, we show how the weights may be reconstructed from the tree-average distances. With additional assumptions we also indicate how the network N itself may be reconstructed. |
||||

12:30-13:30 | Lunch at Wolfson Court | |||

14:00-15:00 | Holland, BR (Massey) |
|||

Phylogenies from DArTs: A stochastic Dollo proces with censored data | Sem 1 | |||

Diversity Array Technologies (DArT) are a relatively new kind of DNA marker system that seem like they could be usefully applied to phylogenetics (a few papers have already explored this). Like marker systems such AFLP and RFLP, the method produces presence absence data, but unlike these methods it is very unlikely for shared presences to occur by chance. The basic idea is as follows. One or a small number of genomes are selected to form the genomic representation. Two enzymes are used to cut the DNA from these genomes at certain recognition sites (a rare 6bp recognition site and a more frequent 4bp recognition site). Fragments of DNA whose ends have been cut by two rare recognition sites are amplified. These fragments, which are said to form the genomic representation, are arranged on a microchip. Other genomes can then been checked to see which fragments within the genomic representation they have copies of in their own sequence. For each other genome that is compared to the genomic representation this results in a binary sequence that indicates presence (1) or absence (0) of each of the fragments. The first obvious advantage of this approach is that it creates a representation of the whole genome rather than just a few genes. This alleviates the problem of picking a small set of genes that may not be representative of the evolutionary history of the species. The second advantage is that in comparison to an individual site, long fragments of DNA are very unlikely to be similar due to chance. So if two species share a fragment it is vastly more likely that they share it due to common ancestry rather than due to a chance similarity. To use these data for phylogenetics it would be useful to develop a likelihood equivalent of Dollo parsimony (in which characters can be lost multiple times but gained only once), such models have already been explored in the context of language evolution and gene content evolution. However, another complicating issue is the censoring effect created by only being able to see those fragments that were in the original genomic representation, i.e. fragments that are shared by a group of species but that are not present in the original species used to make the genomic representation are missing from the data. |
||||

15:00-15:30 | Afternoon tea | |||

15:30-15:50 | Herrmann, S (East Anglia) |
|||

Phylogenetic trees and k-dissimilarity maps | Sem 1 | |||

Phylogenetic trees are versatile tools in the analysis of sequence data. Several approaches exist to construct such trees from data using metrics or distances, and the Tree Metric Theorem of Dress gives an explicit condition when such a metric defines a tree. More recently, (see, e.g., Levy, Yoshida and Pachter [Mol. Biol. Evol. 23(3):491–498. 2006]), it was suggested to use the phylogenetic diversity not of pairs, but of triplets, quartets or, in general, k-tuples, to construct trees from the given data. Maps assigning values to such k-tuples of taxa as opposed to pairs are called k-dissimilarity maps. In this talk, we will analyse when such k-dissimilarity maps correspond to trees and give generalisations of the Tree Metric Theorem. This is joint works with Katharina Huber, Vincent Moulton and Andreas Spillner. |
||||

15:50-16:10 | Jewett, E (Michigan) |
|||

Corrections to a class of methods for inferring species trees from gene trees | Sem 1 | |||

Although many methods for inferring species trees from gene trees while taking into account gene tree/species tree discordance provide consistent estimates of the species tree topology, many do not estimate branch lengths or are computationally slow. Exceptions include the GLASS method of Mossel and Roch (2010), the STEAC method of Liu et al. (2009), and the Shallowest Divergence method of Maddison and Knowles (2006). Given estimated gene trees for a set of loci, these methods estimate the divergence time between a pair of taxa by computing either the mean or minimum interspecific coalescence time between the taxa at each locus, and then taking either the mean or minimum of these times across all loci. The species tree is then constructed by applying a suitable hierarchical clustering method to the estimated divergence times. These methods estimate both species tree topologies and branch lengths and are computationally fast. However all of these methods systematically overest imate divergence times. We derive corrections that can substantially reduce the bias and mean squared error in the estimates of divergence times made by these methods and by a fourth method based on the same principle. |
||||

16:10-16:30 | Kupczok, A (IST Austria) |
|||

Modeling the Evolutionary Dynamics of CRISPR spacers | Sem 1 | |||

We present a phylogenetic insertion- and deletion-model for CRISPR spacer turnover. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is an adaptive heritable immune system found in Eubacteria and Archaea. The system consists of a number of CRISPR associated (CAS) proteins and an array of repeats and spacers - the later represent the viral/plasmid targeting sequences and the system functions in an analogous way to the eukaryotic siRNA system. The length and content of the spacer array varies considerably among individuals within species (suggesting a rapid arms race) and it has been suggested that there is a selective cost, in the absence of parasites, associated with maintaining these arrays. At one point in time and space spacers can be beneficial, if a parasite with the corresponding sequence exists in the community, or they can be useless and thus neutral or slightly deleterious. The rate at which spacers are gained and lost from these arrays provides insight into the evolutionary dynamics of host-parasite interactions. To this end we model spacer insertion and deletion as a continuous-time two-state Markov process. The model parameters are then estimated by maximum likelihood along the phylogeny. We assume different dynamics, modeled by different two-state Q-matrices, for beneficial and neutral spacers. An underlying switching process allows for changes between these Q-matrices. The information whether a spacer is beneficial or neutral is not known, thus we use ambiguous states. Using known spacer annotation we can also apply a partition model allowing the evolutionary rates to differ between spacers of different sources. We evaluate the switching model by simulation and also analyse bacterial data sets. |
||||

16:30-16:50 | Nye, T (Newcastle) |
|||

Principal components analysis in tree space | Sem 1 | |||

Phylogenetic analysis commonly gives rise to a collection or sample of inferred evolutionary trees, each differing from the others. There is a need for methods that visualize, compare, and quantify variability in such sets of trees, in terms of both topological and geometrical differences. Standard tools of multivariate analysis such as multi-dimensional scaling and clustering have been applied to sets of trees, but Principal Components Analysis (PCA) cannot be applied directly since the space of evolutionary trees on a fixed set of taxa is not a vector space. I propose a novel geometrical approach to PCA in tree-space that works in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic path is sought that maximises the variance of the data under a form of projection within tree-space onto the path. Geodesic paths identified in this way reveal and quantify the principal sources of variation in the original collection of trees in terms of both topology and branch lengths, and can be visualized as animations of smoothly changing alternative evolutionary trees. The potential of the approach is illustrated by applying tree-space PCA to experimental data from metazoa and a simulation study of long-branch attraction. |
||||

16:50-17:10 | Jermiin, L (CSIRO Ecosystem Sciences (CES)) |
|||

New methods of identifying fast-evolving sites in aligned sequence data | Sem 1 | |||

Rate-heterogeneity across sites is a widely recognized problem in molecular phylogenetic studies. Attempts to overcome the problem involve deleting the fastest-evolving sites from the data before the phylogenetic analysis and/or using phylogenetic methods that assume rate-heterogeneity across sites can be modeled by a distribution of probabilities. Each of these methods has its advantages and limitations. Here, we describe and compare five metrics that may be used to identify the fast-evolving sites. One of these metrics (i.e., the probability of compatibility) turned out to be particular good at identifying fast-evolving sites, and when these sites are removed from the data, the result is an increase in the consistency of the alignment and, in some cases, another phylogeny. |
||||

17:15-18:00 | Poster session |

Wednesday 22 June | ||||

09:00-10:00 | von Haeseler, A (MFPL, Vienna) |
|||

Evaluating the goodness of fit between a phylogenetic model and an alignment | Sem 1 | |||

As models of sequence evolution become more and more complicated, many criteria for model selection have been proposed, and tools are available to select the best model for an alignment under a particular criterion. However, in many instances the selected model fails to explain the data adequately as reflected by large deviations between observed pattern frequencies and the corresponding expectation. We present an approach to evaluate the goodness of fit. We introduces a minimum number of "extra substitutions" on the inferred tree to provide a biologically motivated explanation why the alignment may deviate from expectation. These extra substitutions plus the evolutionary model then fully explain the alignment. We illustrate the method on several examples and then give a survey about the goodness of fit of the selected models to the alignments in the PANDIT database. |
||||

10:00-10:20 | Downing, T (Wellcome Trust Sanger Institute) |
|||

Genome sequencing of Leishmania donovani clinical lines reveals dynamic phylogenetic variation | Sem 1 | |||

Leishmaniases are a range of debilitating neglected parasitic illnesses that have several forms: cutaneous, mucocutaneous and most potently, the visceral, which has a global incidence of approximately half a million cases per year. Consequently, a range of early-detection and treatment programs are implemented to limit the spread of the disease, but these are forced to combat the emergence of drug resistance in patients. The causative agent of visceral leishmanisis, the species of the Leishmania donovani complex, is found most frequently in equatorial regions and has a range from southern Europe to Africa to southern Asia. This presentation focuses on distinct regions of India and Nepal where the pathogen displays extensive variation in its response to the drugs most commonly used in treatment. Despite a low level of genetic diversity, clinical samples of these parasites manifest pronounced differences in phenotype. We sequenced, assembled and annotated the genome sequence fo r L. donovani, and used this as a basis for exploring variation in 17 samples cloned from patients with varying treatment outcomes. Many layers of diversity between these genome-sequenced strains were evident at the level of single DNA nucleotides, repetitive gene copies, whole chromosome duplications and extra-chromosomal fragments. By combining this population-level picture with the other sequenced Leishmania species genomes, candidate variants relevant to drug resistance were identified. This study provides evidence of the power and scope of genome-level surveillance of emergent drug resistance in pathogens. |
||||

10:20-10:40 | Drummond, AD (Auckland) |
|||

Bayesian time-tree priors | Sem 1 | |||

Bayesian phylogenetic inference requires specification of a prior distribution on the space of trees. In BEAST, the phylogenetic history is explicitly separated into a time-tree (by definition ultrametric for contemporaneous taxa) and the rates of evolution, which may vary from lineage to lineage. The combination can account for phylogenetic trees with non-ultrametric branches in units of substitutions per site. Hierarchical Bayesian priors on the time-tree lead to inference of population and dynamical priors through application of the coalescent, Yule or Birth-death tree priors, and calibration of the individual nodes in the time-tree leads to estimation of the absolute age of other divergence times of interest. Here I discuss difficulties and recent advances in the development of calibrated time-tree priors for Bayesian phylogenetic inference, including calculation of the marginal distribution of divergence times under various time-tree priors and a class of new time-tree priors based on a "Birth-death skyline" model of branching. |
||||

10:40-11:00 | Kühnert, D (Auckland) |
|||

Phylodynamic inference - accounting for the interaction of evolutionary and ecological processes | Sem 1 | |||

Rapidly evolving viruses such as HIV, HCV and Influenza virus are of major interest in phylogenetics. They are special in that their ecological and evolutionary processes act on the same timescale and are therefore entangled. The cross-reaction of the two processes must be accounted for when inferring epidemiological parameters and/or phylogenetic history. Our aim is a joint epidemiological phylogenetic, i.e. phylodynamic, analysis of genomic data by incorporating the dynamics of an SIR model into Bayesian phylogenetic inference. A new version of the birth-death model (Stadler, 2010) that incorporates sampling-through-time and can also be extended to allow birth and death rates to change over time can be parameterized to facilitate modeling SIR-like population dynamics while reconstructing phylogenetic history and simultaneously estimating epidemiological parameters. References Stadler, T., 2010. Sampling-through-time in birth-death trees. Journal of Theoretical Biology. |
||||

11:00-11:30 | Morning coffee | |||

11:30-12:30 | Valiente, G (Technical University of Catalonia) |
|||

Sequence Classification using Reference Taxonomies | Sem 1 | |||

Next generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which were so far largely unknown. The analysis of metagenomic data is challenging, since a sample may contain a mixture of many different microbial species, whose genome has not necessarily been sequenced beforehand. In this talk, we address the problem of analyzing metagenomic data for which databases of reference sequences are already known. We discuss both composition and alignment-based methods for the classification of sequence reads, and present recent results on the assignment of ambiguous reads to microbial species at the best possible taxonomic rank. |
||||

12:30-13:30 | Lunch at Wolfson Court | |||

14:00-15:00 | Gascuel, O (CNRS) |
|||

Modelling protein evolution | Sem 1 | |||

I'll describe our recent efforts to model protein evolution and amino-acid substitutions by (1) designing an easy to use pipeline to estimate amino-acid replacement matrices for specific protein groups, (2) using the protein structure, (3) simplifying mixture models that have been proposed so far. I'll discuss several issues related to models (e.g. complexity trade-off), biology (e.g. independence of nucleotide mutations) and theory (e.g. identifiability, prediction of ancestral sequences). |
||||

15:00-15:20 | Blackburne, B (Manchester) |
|||

An empirical study of the effect of sequence alignment on phylogenetic analysis | Sem 1 | |||

Phylogenetic analyses start with a multiple sequence alignment, which is often accepted as known despite wide recognition that errors may impact downstream phylogenetic analysis. Many phylogenetic methods involve testing which of a range of competing hypotheses best describe the evolution of a set of sequences. These tests may be justified statistically when using the correct alignment, but errors in the alignment lead to non-homologous characters being placed together, which in turn may systematically bias the test. We investigate empirically the impact of different alignment methods on phylogenetic analyses and assess the relative impact of different approximations used by different alignment methods. We examine the effect of alignment on two phylogenetic analyses that are commonly used in computational biology: the inference of a maximum-likelihood tree using RAxML, and a test for positive selection by comparing the M7 and M8 models in PAML. We test 200 sets of sequences from the Adaptive Evolution Database using the popular aligners ClustalW, Muscle, MAAFT, ProbCons, and the phylogenetic aligner Prank. We also sample from the posterior distribution of the statistical aligner BAli-Phy, which enables us to compare the relative impact of aligner choice to uncertainty from a single aligner. The algorithmic basis of an aligner tends to determine the outcome of the phylogenetic analysis. For example, trees estimated from progressive aligners tend to be more similar to one another than those estimated from phylogenetically aware (Prank) or consensus (ProbCons) aligners. Moreover the spread of phylogenetic parameter estimates inferred from BAli-Phy’s posterior distribution of alignments is much smaller than the differences between other aligners, suggesting differences are larger than could be expected by chance. Of the aligners examined, our results suggest that the phylogenetically informed Prank provides the closest approximation to full statistical alignment. |
||||

15:20-15:50 | Afternoon tea | |||

16:00-17:00 | McInerney, JO (National University of Ireland) |
|||

Are there alternatives to handling site-to-site rate variation in evolutionary characters | Sem 1 | |||

For more than 20 years, we have used the gamma distribution to effectively model site-to-site variation in evolutionary rate. This approach is usually implemented by first of all inferring a phylogenetic tree and then using a gamma distribution to improve the fit of the model and data. The result is almost always an improvement in likelihood scores, because apart from pseudogenes, most genes exhibit real variation in evolutionary rates among different sites. In our work we have sought to explore rate variation across characters and I will present our method and how it can work to improve phylogeny reconstruction. I will also speak about the need for future developments in this area and what I perceive to be the outstanding issues. |
||||

17:00-19:30 | Free time | |||

19:30-22:00 | Hog Roast Garden Party at The Moller Centre |

Thursday 23 June | ||||

09:00-10:00 | Mooers, A (Simon Fraser University) |
|||

Trees as representations of evolutionary information worth keeping | Sem 1 | |||

Weighted trees can be viewed as depictions of informational redundancy among tips. There is a rich literature in conservation biology on how this view can be used to maximize this information in triage situations or predict the future loss of such evolutionary information. Early views that redundancy is expected to be high is supported neither by theory on tree shapes nor by surveys of phylogenies. This means that work on maximization is relevant and timely. However, much triage is done within, rather than between, independent lineages; more theoretical work is needed to translate relevant metrics to, e.g., haplotype networks on landscapes. |
||||

10:00-10:20 | Hordijk, W (Lausanne) |
|||

Speciation & Extinction Rate Estimation from Phylogenetic Trees | Sem 1 | |||

Recently there has been an increasing interest in trying to estimate rates of evolution (in particular speciation and extinction rates) from phylogenetic tree data. However, this is currently still a difficult problem. We use a method based on lineage survival probabilities and actual speciation times to infer speciation and exitinction rates from (simulated) phylogenetic trees. Our results show that large trees (several hundred species) are necessary to get statistically significant results. This work is still ongoing, and we are investigating how to get more accurate and reliable results on smaller trees, and whether this is possible at all given the statistical nature of the problem. |
||||

10:20-10:40 | Wu, Y (Connecticut) |
|||

Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood | Sem 1 | |||

Incomplete lineage sorting can cause incongruence between phylogenetic history of genes (the gene tree) and that of the species (the species tree), which can complicate the inference of phylogenies. Developing robust computational inference approaches is currently of interests in studying incomplete lineage sorting. In this talk, I will present a new coalescent-based algorithm for inferring species tree with maximum likelihood. I will first describe an improved method for computing the probability of a gene tree topology for a species tree, which is much faster than an existing algorithm. Based on this method, I will present a practical algorithm that takes a set of gene tree topologies and infers species trees with maximum likelihood. In this algorithm, we search for the best species tree by starting from candidate species trees found by a parsimony method and performing local search to obtain better trees with higher likelihood. This algorithm, called {STELLS}, has been imp lemented in a program that is downloadable from the author's web page. The simulation results show that the STELLS algorithm is more accurate than several existing methods for many datasets, especially when there is noise (in terms of topology, branch lengths and rooting) in gene trees. We also show the STELLS algorithm is efficient and can be applied to real biological datasets. |
||||

10:40-11:00 | Wu, T (National University of Singapore) |
|||

Tree reconciliations: beyond the LCA mapping | Sem 1 | |||

The least common ancestor (LCA) mapping is an important tool to reconcile discordance between gene tree and species tree, and has been well studied in the literature. In this talk, we report some recent results in studying general reconciliations, i.e., leaf-preserving and order-preserving maps, between gene tree and species tree. Here we introduce three cost models to infer, respectively, the gene duplication, gene loss and deep coalescent events. And we also show that the canonical partial order dened on the set of reconciliations is compatible with these cost models. Finally, we discuss how to generalize these results to non-binary gene trees. |
||||

11:00-11:30 | Morning coffee | |||

11:30-11:50 | Sumner, J (Tasmania) |
|||

A Lie algebraic classification of continuous-time Markov models | Sem 1 | |||

In recent work we have discussed the importance of multiplicative closure for the Markov models used in phylogenetics. For continuous-time Markov chains, a sufficient condition for multiplicative closure of a model class is ensured by demanding that the set of rate matrices belonging to the model class form a Lie algebra. It is the case that some Markov models do form Lie algebras (eg. JC, F81, K3ST and GMM), and we refer to these models as ``Lie Markov models''. Importantly, other Markov models unequivocally do not form Lie algebras (the most conspicuous example being GTR). The question then naturally arises: How do we generate a full list of Lie Markov models? To answer this question in full generality seems quite difficult as the interaction between the operations of a Lie algebra and the stochastic requirements of rate matrices are somewhat at odds, and it seems that this question has not been addressed in the algebraic or stochastic mathematics literature. In this talk, we will discuss how we have made signficant progress in generating Lie Markov models by demanding that the models have certain symmetrics under nucleotide permutations. >From a theoretical perspective, we show that the Lie Markov models include, and hence provide a unifying concept for, ``group-based'' and "equivariant" models, whilst also generating many other interesting examples. We also argue that our scheme is very pleasing in the context of applied phylogenetics, as, for a given symmetry of nucelotide substitution, it provides a natural hierarchy of models with increasing number of parameters. This is joint work with Jesús Fernández-Sánchez and Peter Jarvis. |
||||

11:50-12:10 | Czabarka, E (South Carolina) |
|||

A generalization of Stirling numbers and distribution of phylogenetic trees | Sem 1 | |||

P.L. Erdos and L.A. Szekely provided a bijection between rooted semi-labeled trees and set partitions, and hence Stirling numbers of the second kind. This, with the asymptotic normality of the Sirling numbers of the second kind (Harper) translates into the asymptotic normality of rooted semi-labeled trees with a fixed number of vertices and a variable number of internal vertices. We apply Harper's method and the Erdos-szekely bijection to obtain the asymptotic normality of of phylogenetic trees. |
||||

12:10-12:30 | Stadler, T (ETH Zürich) |
|||

Mammalian phylogeny reveals recent diversi cation rate shifts | Sem 1 | |||

Phylogenetic trees of present-day species allow investigation of the rate of evolution which led to the present-day diversity. A recent analysis of the complete mammalian phylogeny challenged the view of explosive mammalian evolution after the K/T boundary (65 Ma). However, due to lack of appropriate methods, the diversication rates in the more recent past of mammalian evolution could not be determined. Here, I provide a method which reveals that the tempo of mammalian evolution did not change until about 33 Ma. This constant period was followed by a peak of diversication rates between 33 and 30 Ma. Thereafter, diversication rates remained high and constant until 8.55 Ma. Diversication rates declined signicantly at 8.55 and 3.35 Ma. Investigation of mammalian subgroups (marsupials, placentals, and the six largest placental subgroups) reveals that the diversication rate peak at 33-30 Ma ago is mainly driven by rodents, cetartiodactyla and marsupials. The recent diversication rate decrease is signicant for all analyzed subgroups but eulipotyphla, cetartiodactyla and primates. My likelihood approach is not limited to mammalian evolution. It provides a robust framework to infer diversication rate changes and mass extinction events in phylogenies, reconstructed from e.g. present-day species or virus data. In particular, the method is very robust towards noise and uncertainty in the phylogeny, and can account for incomplete taxon sampling. |
||||

12:30-13:30 | Lunch at Wolfson Court | |||

14:00-15:00 | Huson, D (University of Tubingen) |
|||

Integrative analysis if environmental sequences | Sem 1 | |||

In metagenomics, the aim is to understand the composition and operation of complex microbial assemblages in environmental samples through sequencing and analysis of their DNA. Similarly, metatranscriptomics and metaproteomics target the RNA and proteins obtained from such samples. A major challenge in the analysis of environmental sequences is data integration. The question is how to anaylze different types of data in a unified approach, addressing both taxonomic and functional aspects. We will discuss how to perform taxonomic analysis based on the NCBI taxonomy. Functional analysis can be based on classifications such as the SEED classification of functional roles and subsystems or the KEGG classification of pathways and enzymes. We will also discuss how to compare different datasets using statistical approaches, ecological indices and methods such as Neighbor-net and multi-dimensional scaling. |
||||

15:00-15:30 | Afternoon tea | |||

15:30-17:30 | Free afternoon |

Friday 24 June | ||||

09:00-10:00 | Bininda-Emonds, O (Carl von Ossietzky Universität Oldenburg) |
|||

Assessing the limits of phylogenomics: can too much data be a bad thing? | Sem 1 | |||

In contrast to the situation not even 20 years ago, molecular sequence data is now plentiful (if still patchily distributed) and phylogenomic studies of hundreds of taxa on a broad taxonomic scale are becoming increasingly common. Whereas the accuracy of phylogenetic analysis was limited until recently by a shortage of data (and then for both taxa and characters), the results of large and comprehensive phylogenomic studies where data are not limiting are also not without their problems. Analyses including large numbers of taxa run up against the superexponential increase in the number of possible solutions, requiring any or all of more time, faster computers in conjunction with parallel processing, and cleverer heuristics to find a hopefully near optimal solution. Perhaps less appreciated, however, is that the increasing taxonomic scope of our analyses demands the use of large amounts of molecular sequence data with significant rate heterogeneity across the data set (whether between or within partitions) to achieve full resolution throughout the tree. In this talk, I examine how the performance of phylogenetic analysis is affected when analyzing large number of taxa or a large multigene data set incorporating the degree of rate heterogeneity that is to be found, if not needed, in typical phylogenomic data sets. |
||||

10:20-10:40 | Doerr, D (Bielefeld) |
|||

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions | Sem 1 | |||

Distance-based phylogenetic reconstruction methods rely heavily on accurate pairwise distance estimates. There are two separate sources of error in this estimation process: (1) the relatively short sequence alignments used to obtain distance estimates induce a "stochastic error" corresponding to estimation of model parameters from finite data; (2) model misspecification leads to a "fixed error" which does not depend on sequence length. It is common practice to assume some substitution model over the sequence data and use an additive substitution rate function for that model when computing pairwise distances. In the providential case when the assumed model coincides with the true model, which is typically unkown, the distance estimates will not be afflicted with fixed error. But even then, there is no reason to a-priori enforce a zero fixed error, when this causes elevated rates of stochastic error, especially in the case of short sequence alignments. This work challenges this paradigm of "using the most additive distance function at any cost". We do this by studying the contribution and effect of both fixed and stochastic error in distance estimation. We present a formal framework for quantifying the fixed error associated with a specific distance function and a given phylogenetic tree in a homogeneous substitution model. As an example, we study the behavior of the Jukes-Cantor distance formula in homogeneous instances of Kimura's two parameter substitution model. The effects of fixed error are observed through analytic results and experiments on simulated data. In addition, we compare the performance of various distance functions on biological sequences. We evaluate reconstruction accuracy by comparing the reconstructed trees to an independently validated species tree. Our study indicates that often enough simple distance functions outperform more sophisticated functions, despite the fact that the given sequence data appears to have poor fit to the substitution model they assume. |
||||

10:40-11:00 | Pardi, F (CNRS) |
|||

The combinatorics of distance-based tree inference | Sem 1 | |||

Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or any kind of objects): the objective is to construct a tree with branch lengths so that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, in some relevant cases the optimal values for the branch lengths can be expressed using simple combinatiorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties: first, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency); second, the branch lengths of all the simple (NNI) rearrangements of a tree can be calculated, optimally, in quadratric time in the size of the tree. The study presented here may form the basis for novel effcient search algorithms for distance-based tree reconstruction. |
||||

11:00-11:30 | Morning coffee | |||

11:30-12:30 | Warnow, T (University of Texas at Austin) |
|||

Estimating ultra-large phylogenies and alignments | Sem 1 | |||

Phylogenetic (evolutionary) tree estimation is fundamental to biology, and has applications to much biomedical research. Phylogenetic tree estimation is enormously difficult: the best approaches are NP-hard, and many real datasets take weeks or months of analysis. In particular, the estimation of multiple sequence alignment turns out to be one of the most challenging and important steps in a phylogenetic analysis. In this talk I will present several methods that produce greatly improved trees as compared to other phylogeny estimation methods. These methods utilize divide-and-conquer strategies in order to improve the accuracy of traditional phylogeny estimation methods. Among the methods I will present are SATe (Liu et al. 2009, Science Vol 324, no. 5934) and DACTAL (in preparation). SATe simultaneously estimates a tree and a multiple sequence alignment, while DACTAL estimates a tree without ever computing an alignment on the entire set of sequences. Both methods provide great improvements in tree accuracy over other methods, and do so fairly efficiently. |
||||

12:30-12:35 | Closing remarks and end | |||

12:35-13:30 | Lunch at Wolfson Court |