Workshop Programme

for period 24 - 28 March 2014

Mathematical, Statistical and Computational Aspects of the New Science of Metagenomics



Monday 24 March
10:15-11:15 Registration and Morning Coffee
11:15-11:25 Welcome from John Toland (INI Director)
11:25-11:30 Welcome by the Programme Organisers
Session: Community profiling and comparative metagenomics - 1
Chair: D Huson/G Valiente
11:30-12:30 Meyer, F (Argonne National Laboratory, USA)
  Lessons learned from operating a big metagenomics resource Sem 1

Metagenomics is a relatively new technique for assaying microbial communities in both laboratory and natural settings. For the first time, the genomes of many of the millions of previously uncharacterized species can be explored, at least partially.

Metagenomics is fundamentally different from traditional genomics and requires new approaches that match the specific data types and the data volume.

Large scale comparative analysis of metagenomic data provides another unique opportunity for learning lessons about microbial biology, but also presents a significant number of challenges to both bioinformatics and computer science.

I will present a lessons-learned overview summarizing some of the pitfalls encountered in recent years, as well as new technologies developed to overcome them.

Related Links: - MG-RAST

12:30-13:45 Lunch at Wolfson Court
Session: Community profiling and comparative metagenomics - 2
Chair: D Huson/G Valiente
13:45-14:45 Garrity, GM (Michigan State University and NamesforLife, LLC)
  Reasonable names and reasonable terms for Bacteria and Archaea Sem 1

Co-authors: Charles T. Parker (NamesforLife, LLC), Nenad B. Krdzavac (Michigan State University), Kevin Petersen (Michigan State University), Grace Rodriguez (NamesforLife, LLC)

One of the principal goals of metagenomics studies is to better understand the complex relationships that exist between microbial communities and the environments in which they are found. A complicating factor is that each exerts an influence on the other in significant but poorly understood ways that are often in a state of flux. To gain useful information from such studies, one must understand not only the community composition, but also the metabolic capabilities of the community members under the observed conditions. High-throughput sequencing, coupled with various ordination and phylogenetic methods, provides a way of uncovering interesting patterns in the data, but extraction of knowledge from the data requires integration with other databases and the literature, both of which are typically accessed based on the inferred identity of community members. For such inference to be meaningful, one must have a fundamental understanding of both the nomenclature and the terminologies that are used to describe unambiguously the species of interest, preferably in a manner that is amenable to automation. This presentation will focus on the development of a generalized semantic model that has been developed to disambiguate biological nomenclature and to provide both humans and machines with direct access to the correct information about all of the validly named prokaryotic taxa. Current research efforts on developing an ontology of microbial phenotypes, which supports machine reasoning, will also be discussed.

Related Links: - Speaker bio (academic) - Speaker bio (general/industrial) - company website - Standards in Genomic Sciences (editor-in-chief) - International Committee on Systematics of Prokaryotes (chair) - International Journal of Systematic and Evolutionary Microbiology (list/nomenclature editor and reviewer)

14:45-15:30 Afternoon Tea
Session: Community profiling and comparative metagenomics - 3
Chair: D Huson/G Valiente
15:30-16:15 Huson, D (University of Tuebingen)
  Fast and accurate analysis of metagenomic data Sem 1

The number and size of metagenome datasets are continuously growing, calling for ever faster algorithms and methods for analyzing them. In this presentation we present a number of methods that we are currently developing to take some of the pain out of computational metagenome analysis.

16:15-17:00 Warnow, T (University of Texas at Austin)
  TIPP: Taxon Identification and Phylogenetic Profiling Sem 1

Co-authors: Nam Nguyen (UT-Austin), Siavash Mirarab (UT-Austin), Mihai Pop (University of Maryland, College Park), Bo Liu (University of Maryland, College Park)

Abundance profiling is a crucial step in understanding the diversity of a metagenomic sample. Key to estimating an accurate profile is the ability to perform taxonomic identification of the metagenomic reads. Many methods have been developed for taxonomic identification; however, the current best methods often fail to classify many fragments, especially at lower taxonomic levels.

We present TIPP (taxon identification and phylogenetic profiling), a new marker-based taxon identification and abundance profiling method. TIPP combines a highly accurate phylogenetic placement method (SEPP) with statistical techniques to control the precision and recall of the classification results. We show that TIPP is more robust to sequencing error and has better recall than other marker-based taxon identification methods. TIPP also yields highly accurate abundance profiles, matching or improving on many previous approaches, including NBC, PhymmBL, MetaPhyler, and MetaPhlAn.

17:00-18:00 Welcome Wine Reception
Tuesday 25 March
Session: Metagenome sequence assembly - 1
Chair: T Warnow
09:30-10:30 Pop, M (University of Maryland)
  Mission impossible: metagenomic assembly Sem 1

Computer scientists have repeatedly proven that genome assembly, even for a single organism, is an intractable computational problem. In metagenomics, the problem is even harder - we are trying to simultaneously reconstruct many different and not so different genomes. Should we even try?

In my talk I will outline the main reasons to be pessimistic about metagenomic assembly and discuss possible ways forward. In particular, I will argue that 'metagenomic assembly' is an ill-defined concept and that we, as a community, need to identify and formalize specific use cases that address relevant biological problems and that are computationally tractable. I will also discuss issues related to the validation of metagenomic assemblies, an important yet often overlooked problem.

10:30-11:00 Morning Coffee
Session: Metagenome sequence assembly - 2
Chair: T Warnow
11:00-11:45 Beerenwinkel, N (ETH Zurich)
  Estimating within-host viral genetic diversity from next-generation sequencing data Sem 1

Next-generation sequencing allows for cost-effective probing of pathogen populations at an unprecedented level of detail. The massively parallel sequencing approach can detect low-frequency alleles and it provides a snapshot of the structure of the entire population. However, analyzing ultra-deep sequencing data obtained from mixed samples is challenging, because reads contain amplification and sequencing errors and the read length is typically shorter than the genomic region of interest. Thus, ultra-deep sequencing experiments provide only indirect evidence of the underlying population structure. We will present computational and statistical methods for read error correction and haplotype reconstruction in intra-patient virus populations, and show how to infer evolutionary parameters from these data.

11:45-12:30 Leggett, R (The Genome Analysis Centre (TGAC))
  Metagenomic assembly and characterisation of viral signatures with MetaCortex Sem 1

Co-authors: Ricardo Ramirez Gonzalez (The Genome Analysis Centre (TGAC)), Kate Baker (Wellcome Trust Sanger Institute), Pablo Murcia (MRC-University of Glasgow Centre for Virus Research), Mario Caccamo (The Genome Analysis Centre (TGAC))

Most standard tools for assembly of next-generation sequencing reads are not designed for metagenomic datasets; rather, the aim is to assemble a single genome into a relatively small number of large contigs or scaffolds. In part this is achieved by implementing algorithms designed to remove sequencing errors and to smooth small variations that would otherwise break contiguity. However, this approach has limits when used to analyse environmental samples, in which fragments of DNA from multiple organisms may be present and where the aim is not to assemble a single genome, but to understand the wider biological composition of the sample.

The last two years have seen the emergence of a number of specialised metagenomic assemblers aimed at addressing this gap in tooling. One example is our own work, MetaCortex, which is capable of assembling sequencing reads and outputting contigs indicative of the entire set of organisms present in a sample. MetaCortex works by identifying sub-graphs within the wider de Bruijn graph assembly and building contigs which represent the consensus path through these components. In doing so, variation information is preserved and it is possible to detect organisms present in the sample with low read coverage. We applied this method to the analysis of DNA samples from Eidolon helvum, a species of bat implicated in the transmission of zoonotic disease to humans, and were able to identify hundreds of different viral DNA signatures within the samples.
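The subgraph-and-consensus-path idea described above can be illustrated with a toy sketch. This is not MetaCortex itself: the reads, the k value, and the greedy highest-coverage traversal below are all simplifying assumptions made purely for illustration.

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it,
    recording edge multiplicities (a crude coverage proxy)."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]][kmer[1:]] += 1
    return edges

def contigs_from_subgraphs(edges):
    """Emit one consensus contig per subgraph: start from nodes with no
    incoming edge and, at each branch, follow the highest-coverage edge."""
    indeg = defaultdict(int)
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    contigs = []
    for start in [u for u in edges if indeg[u] == 0]:
        contig, node, seen = start, start, {start}
        while node in edges and edges[node]:
            nxt = max(edges[node], key=edges[node].get)
            if nxt in seen:  # avoid walking around a cycle forever
                break
            contig += nxt[-1]
            seen.add(nxt)
            node = nxt
        contigs.append(contig)
    return contigs

# Reads from two hypothetical organisms fall into two disconnected
# subgraphs, yielding one contig per organism.
reads = ["ATGGCGTG", "GCGTGCAA", "TTTCCACC", "CCACCTT"]
print(sorted(contigs_from_subgraphs(build_debruijn(reads, k=4))))
```

With these toy reads the two subgraphs assemble back into the two source sequences, illustrating how distinct organisms can surface as separate components of one shared graph.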

12:30-13:45 Lunch at Wolfson Court
Session: Metagenome sequence assembly - 3
Chair: T Warnow
13:45-14:45 Durbin, R (Wellcome Trust Sanger Institute, Cambridge, UK)
  tba Sem 1
14:45-15:30 Afternoon Tea
Session: Metagenome sequence assembly - 4
Chair: T Warnow
15:30-16:15 Iqbal, Z (University of Oxford)
  Species identification from medical metagenomic sequence data Sem 1

Species identification from metagenomic sequence data has attracted considerable attention recently, and a number of software solutions are now available, based on whole-genome assemblies, mapping to references, or informative k-mers. In this talk I focus on a small corner of this area, that of species identification from medical samples, where we have an idea from the patient's condition (and perhaps from inspection of culture) what genus of bacterium we expect to find, but we do not know if there is a mixture. I specifically look at the case of Staphylococcus aureus, a commensal organism which lives in the noses of ~30% of us, and yet which can be a pathogen. Many assemblies of S. aureus strains are available, but generally only one assembly per species for other Staphylococcus species (some of which can cause illness). I'll talk about how we can address the issues which arise when we want confident answers in the face of such a biased set of prior information.

16:15-17:00 Chin, F (University of Hong Kong)
  NGS Sequence Assembly for Metagenomic Data Sem 1

Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling a set of mixed reads from thousands of different species into contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads sampled from a single genome, only a few assemblers work on metagenomic data. Moreover, the performance of these assemblers on metagenomic data is far from satisfactory, for the following reasons: (1) up to 99% of the species in the sample are unknown and have no reference genomes; (2) sequencing depths of genomes from different species are highly uneven, because different species in a sample have different abundances (varying by over 100 times); and (3) genomes of subspecies and species in a sample can be very similar, and the existence of common regions across different genomes makes the assembly problem much more complicated. In this talk, different techniques (IDBA-UD and MetaIDBA), with possible incorporation of binning results (MetaCluster), for solving these three problems will be presented. As the last two problems are more severe for RNA-Seq data from metagenomic samples (metatranscriptomic data) than for metagenomic data, no existing assemblers work well on metatranscriptome data (even though gene sequence information of some microbes might be known). Main features of a metatranscriptome assembler (IDBA-MT) will also be discussed.

Wednesday 26 March
Session: Beyond taxonomic and functional characterisation - 1
Chair: W Gilks
09:30-10:30 Rubin, E (Joint Genome Institute, Lawrence Berkeley National Laboratory)
  Understanding Biology from the Sequence Data of Uncultured Organisms Sem 1

The Joint Genome Institute (JGI) is a large-scale sequencing and analysis facility focused on understanding environmental processes through genomics. I will present a series of metagenomic studies carried out at the Institute and the resulting insights and discoveries. These range from the Neanderthal genome, to microbes living in the cow and sheep rumen, to the discovery of novel genetic codes operating in nature.

10:30-11:00 Morning Coffee
Session: Beyond taxonomic and functional characterisation - 2
Chair: W Gilks
11:00-11:45 Allen, R (University of Edinburgh)
  Predictability and unpredictability in the dynamics of nutrient-cycling microbial ecosystems Sem 1

Co-authors: Timothy Bush (School of Physics and Astronomy, University of Edinburgh), Eulyn Pagaling (School of Physics and Astronomy, University of Edinburgh), Fiona Strathdee (School of Biological Sciences, University of Edinburgh), Andrew Free (School of Biological Sciences, University of Edinburgh)

Microbial communities mediate crucial biogeochemical, biomedical and biotechnological processes, yet our understanding of their assembly, and our ability to control its outcome, remain poor. We investigate experimentally whether microbial ecosystem assembly is predictable, or inherently unpredictable. In our experiments, source microbial communities colonize a pristine microcosm environment to form complex, nutrient-cycling ecosystems. We find that when the source communities colonize a novel environment, final community composition and function are unpredictable, but when the source communities are pre-conditioned to their new habitat, community development is more reproducible. Our results suggest strategies for improving the design of complex microbial communities for biotechnological applications. I will also discuss recent attempts to construct mathematical models for the dynamics of our microcosm communities, and to use them to predict the sensitivity of microbial ecosystems to environmental change.

11:45-12:30 Falush, D (Max Planck Institute for Evolutionary Anthropology, Leipzig)
  Ecological genomics from metagenomic data? Sem 1

In this talk I will briefly review the information that isolate-based genome sequencing has provided on ecological questions. I will then describe how these insights could be built upon using metagenomic data. Finally, I will discuss how this can be expanded to test ideas about competition, dispersal and diversity from ecological theory. Throughout I will focus on Campylobacter as a model system.

12:30-13:45 Lunch at Wolfson Court
Session: Beyond taxonomic and functional characterisation - 3
Chair: W Gilks
13:45-14:30 O'Brien, J (Bowdoin College)
  A Bayesian approach to inferring the phylogenetic structure of microbial communities: the case of a seasonal Antarctic lake Sem 1

Co-authors: Daniel Falush (Max Planck Evolutionary Anthropology), Xavier Didelot (Imperial College London)

Whole-genome metagenomic sampling allows researchers to investigate microbial communities as they shape and are shaped by their environments. A particularly interesting case occurs when a species (or closely-related species) evolves within a set of samples. As the samples are often mixtures of these recently-evolved lineages, a statistical difficulty arises in attempting to infer the clonal history of this differentiation while accounting for the mixture among samples. In this talk, I present the lineage model, a Bayesian attempt to simultaneously infer both the underlying phylogeny and the sample mixing using metagenomic data. I discuss the assumptions of the model, how it differs from other approaches, and simulation results. I conclude with an extended example on Chlorobium limicola, an important photosynthetic bacterium, found to be the dominant species in samples from a seasonal lake in Antarctica.

Related Links: - The lineage model paper, currently under review

14:30-15:00 Pricop-Jeckstadt, M (University of Bonn)
  Convergence analysis of balancing principle for nonlinear Tikhonov regularization in Hilbert scales for statistical inverse problems Sem 1

In this talk we focus on results regarding inverse problems described by nonlinear operator equations in both a deterministic and a statistical framework. The latest developments in the methodology are reviewed, and similarities and differences related to the nature of the setting are emphasized. Furthermore, a convergence analysis is presented, leading to order-optimal rates in the deterministic case, and order-optimal rates up to a log-factor in the stochastic case, for the Lepskii choice of the regularization parameter, for a range of smoothness classes and under milder smallness assumptions. These assumptions are shown to be satisfied by a Volterra-Hammerstein nonlinear integral equation that has several applications as a population growth model in population dynamics.

References:
Hohage T. and Pricop M. "Nonlinear Tikhonov regularization in Hilbert scales for inverse boundary value problems with random noise". Inverse Problems and Imaging, Vol. 2, 271-290, 2008.
Bissantz N., Hohage T. and Munk A. "Consistency and rates of convergence of nonlinear Tikhonov regularization with random noise". Inverse Problems, Vol. 20, 1773-1791, 2004.

15:00-15:45 Afternoon Tea
Session: Microbial community transcriptomes, proteomes and metabolomes - 1
Chair: W Gilks
15:45-16:30 Mendes, R (Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA))
  The living layer: the microbial interface between superior organisms and the environment as revealed by metagenomics Sem 1

Co-authors: Emilie Chapelle (Wageningen University), Peter A. H. M. Bakker (Utrecht University), Jos M. Raaijmakers (Netherlands Institute of Ecology)

A staggering number of microbes live in close association with eukaryotic organisms, and they play a key role in the evolution and functioning of the host. The microbiome concept, i.e. the collection of commensal, symbiotic, and pathogenic microorganisms living in association with a given host, was first used in the context of microorganisms sharing the human body space; since then, several other authors have applied the term to other mammals, insects and plants. The human microbiome contributes significantly to human metabolism and provides traits that humans did not need to evolve on their own. The set of genes present in the microbiome can be considered the secondary genome of the host. The same view can be extended to other mammals, such as ruminants, or even plants, which also rely on their microbiomes for specific traits. Here, we will discuss how metagenomics has shed light on the understanding of microbiomes as the connecting layer between the host and the environment, mediating a range of host physiological processes and ultimately impacting health and disease by providing life support for the host.

16:30-17:00 Sczyrba, A (Universität Bielefeld)
  Metagenome, metatranscriptome and single cell genome sequencing of biogas-producing microbial communities from production-scale biogas plants Sem 1

Co-authors: Andreas Bremges (Bielefeld University), Yvonne Stolze (Bielefeld University), Irena Maus (Bielefeld University), Vera Ortseifen (Bielefeld University), Tanja Woyke (DOE Joint Genome Institute), Alfred Pühler (Bielefeld University), Andreas Schlüter (Bielefeld University)

Production of biogas by means of anaerobic digestion of biomass is becoming increasingly important, as biogas is regarded as a clean, renewable and environmentally compatible energy source. Moreover, generation of energy from biogas relies on a balanced carbon dioxide cycle. In 2013, more than 7,500 digesters were operating in Germany, generating 25 TWh of electricity per year. The microbiology of biogas fermentation from biomass is complex and involves interaction of different microorganisms. The majority of the participating microbes are still unknown, as is their influence on reactor performance.

Our study aims at the elucidation of the microbiology of biogas-producing microbial communities by applying metagenome, metatranscriptome and single cell sequencing of dominant community members, combined with metaproteomics. In this study we profile four different production-scale biogas plants. Extracted whole-community DNA and RNA were deeply sequenced on Illumina's HiSeq sequencer, while 16S rRNA gene sequencing was done on the MiSeq system. Metagenome sequencing resulted in 403 Gbp of total sequence information, while metatranscriptome sequencing yielded 290 Gbp.

The obtained sequencing depth enables a profound taxonomic and functional analysis, as well as the reconstruction of microbial genomes representing so far uncultured species and relevant metabolic pathways. This is the first study of production-scale biogas plant microbial communities combining metagenomics, metatranscriptomics, metaproteomics and single cell genome sequencing, providing comparative insights into the microbiology of different fermenters, with respect to differences in operation conditions and fed substrates, different pathways involved in substrate hydrolysis, acidogenesis, acetogenesis, methane production and stress response.

19:30-22:30 Conference Dinner at Cambridge Union Society hosted by Cambridge Dining Company
Thursday 27 March
Session: Microbial community transcriptomes, proteomes and metabolomes - 2
Chair: W Gilks
09:30-10:30 McHardy, AC (Heinrich Heine University Düsseldorf)
  Inferring genotype-phenotype relationships from (meta-)genomes Sem 1

Co-authors: Aaron Weimann (Heinrich Heine University Düsseldorf), Sebastian Konietzny (Heinrich Heine University Düsseldorf), Ivan Gregor (Heinrich Heine University Düsseldorf), Johannes Droege (Heinrich Heine University Düsseldorf), Phil B. Pope (Norwegian University of Life Sciences)

Next-generation sequencing allows extensive surveys of the genome-wide genetic diversity of microbial communities, as well as of populations from all domains of life. A major challenge is the development of computational methods for hypothesis generation and basic computational analysis of these large-scale data sets. I will present our recent work on computational methods for metagenome analysis. We are working on methods for predicting and characterizing microbial phenotypes, as well as identifying the relevant protein repertoire for a given phenotype, focusing on microbial plant biomass degradation.

10:30-11:00 Morning Coffee
Session: Microbial community transcriptomes, proteomes and metabolomes - 3
Chair: W Gilks
11:00-11:45 Hirsch, P (Rothamsted Research)
  Digging into the soil metagenome Sem 1

Co-authors: Elisa Loza (Rothamsted Research), Tim Mauchline (Rothamsted Research), Andy Neal (Rothamsted Research), Ian Clark (Rothamsted Research)

Soil is the most biodiverse environment on earth, typically containing 10^9 bacterial cells from 10^6 different species per gram. At most, 1% of these cells can grow in the laboratory; the majority remained obscure until the recent development of culture-independent methods. Usually taken for granted, soil is an invaluable resource providing essential ecosystem services as well as food to sustain the growing human population. More information on the genetic diversity of soil communities, in particular functional genes for the biological processes that underpin soil quality, is needed to establish the resilience of the system to perturbation. How many individuals and phylogenetic groups are potentially capable of performing a function? How many of these are active in particular conditions? Are there multiple possible combinations that provide similar functionality? Does the system always return to one combination after perturbation? This is especially relevant to arable agriculture where soils are deliberately manipulated to support crop rotations, during conversion of land to different management systems or with increasingly unpredictable climate change.

The advent of NGS-based metagenomics and metatranscriptomics, with quantitative data on the presence and activity of phylogenetic and functional groups, provides an unprecedented opportunity to describe and interpret soil biological systems. Long-term field experiments at Rothamsted Research enable comparison of the relative impact of different fertilizer inputs, crops and cultivation on microbial communities. Data amassed on these soils, including 16S rRNA gene amplicons, full metagenomes and transcriptomes, will be presented. At present, the large volume of the datasets and the lack of appropriate analytical pipelines and relevant statistics are constraints on processing and analysing the data. Nevertheless, the work has generated information on which groups respond to nitrogen fertilizer, ploughing and changes in plant cover.

11:45-12:30 Moulton, V (University of East Anglia)
  Using metatranscriptomics to make global predictions: The impact of temperature on marine phytoplankton resource allocation and metabolism Sem 1

Marine phytoplankton are responsible for roughly 50% of the carbon dioxide that is fixed annually worldwide, and contribute massively to other biogeochemical cycles in the oceans. In this talk we present an analysis of metatranscriptomes from eukaryotic phytoplankton sampled from distinct latitudinal temperature zones. We also present some predictions that were made by integrating this data within a global ecosystems model. These predictions add to current concerns on the effect of global warming on marine ecosystem functioning.

This is joint work with groups based at University of East Anglia, University of Exeter, Alfred-Wegener Institute for Polar and Marine Research, and University of the Algarve. More details can be found in Nature Climate Change 3:979-984, (2013), and also in a poster that will be presented in the workshop by Andrew Toseland.

12:30-13:45 Lunch at Wolfson Court
Session: Bioinformatic tools for metagenomics
Chair: D Huson/G Valiente
13:45-14:15 Koslicki, D (Oregon State University)
  Quikr: Rapid Bacterial Community Reconstruction Via Compressive Sensing Sem 1

Co-authors: Simon Foucart (University of Georgia), Gail Rosen (Drexel University)

Many metagenomic studies compare hundreds to thousands of environmental and health-related samples by extracting and sequencing their DNA. However, one of the first steps - determining which bacteria are actually in the sample - can be computationally time-consuming, since most methods rely on classifying each individual read out of tens to hundreds of thousands of reads. We introduce Quikr: a QUadratic, K-mer based, Iterative, Reconstruction method which computes a vector of taxonomic assignments and their proportions in the sample using an optimization technique motivated by the mathematical theory of compressive sensing. On both simulated and real biological data, we demonstrate that Quikr is typically more accurate, as well as orders of magnitude faster, than the most commonly used taxonomic assignment techniques, for both whole-genome methods (MetaPhyler, MetaPhlAn) and 16S rRNA methods (the Ribosomal Database Project's Naive Bayesian Classifier). We also show that, in general, nonnegative L1 minimization can be reduced to a simple nonnegative least squares problem.
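The final claim, reducing a nonnegative L1-constrained reconstruction to plain nonnegative least squares, can be sketched in a few lines. This is a toy illustration and not the Quikr implementation: the database matrix, the augmentation weight lam, and all data below are invented assumptions. The idea is that a sample's k-mer frequency vector y is modelled as a mixture A x of per-taxon k-mer profiles (columns of A), and the sum-to-one constraint on x is enforced softly by appending a weighted row of ones.

```python
import numpy as np
from scipy.optimize import nnls

def mixture_estimate(A, y, lam=10.0):
    """Estimate nonnegative taxon proportions x with A @ x ~ y.
    A: (num_kmers, num_taxa) matrix whose columns are k-mer profiles.
    y: observed k-mer frequency vector of the sample.
    lam: weight of the appended sum-to-one row (illustrative choice)."""
    A_aug = np.vstack([lam * np.ones(A.shape[1]), A])  # soft sum(x) = 1
    y_aug = np.concatenate([[lam], y])
    x, _ = nnls(A_aug, y_aug)                          # Lawson-Hanson NNLS
    return x / x.sum() if x.sum() > 0 else x

# Tiny invented database: 4 k-mers, 3 taxa (columns sum to 1).
A = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.6, 0.2],
              [0.1, 0.2, 0.5],
              [0.1, 0.1, 0.1]])
x_true = np.array([0.5, 0.3, 0.2])   # hidden community composition
y = A @ x_true                       # noiseless observed k-mer vector
print(mixture_estimate(A, y))
```

On this noiseless toy problem the NNLS solve recovers the hidden proportions exactly, which is the essence of why the reduction is attractive: NNLS solvers are fast and require no per-read classification.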

Related Links: - WGSQuikr preprint - Quikr preprint - Sparse recovery by means of nonnegative least squares

14:15-14:45 Santamaria, M (CNR)
  New bioinformatic resources for taxonomic assignment of metagenomic NGS data: BioMaS, ITSoneDB and SARMA Sem 1

Co-authors: Bruno Fosso (Institute of Biomembranes and Bioenergetics, CNR, via Amendola 165/A, 70126 Bari, Italy), Mattia D’Antonio (CINECA, Roma), Gabriel Valiente (Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, E-08034, Barcelona, Spain), Giacinto Donvito (National Institute of Nuclear Physics, Via Orabona 4, 70126 Bari, Italy), Alfonso Monaco (National Institute of Nuclear Physics, Via Orabona 4, 70126 Bari, Italy), Pasquale Notarangelo (National Institute of Nuclear Physics, Via Orabona 4, 70126 Bari, Italy), Giorgio Pietro Maggi (National Institute of Nuclear Physics and Politecnico di Bari, Via Orabona 4, 70126 Bari, Italy), Graziano Pesole (Institute of Biomembranes and Bioenergetics, CNR, via Amendola 165/A, 70126 Bari, Italy and Department of Biosciences, Biotechnology and Pharmacological Sciences, University of Bari, Bari)

Substantial advances in molecular microbiology have been made in recent years thanks to metagenomics. Next Generation Sequencing (NGS) technologies are massively supporting this approach but imply, at the same time, hard new challenges concerning the analysis of the huge amount of data they produce. In this scenario we present three new bioinformatic resources aimed at supporting researchers in advanced analyses of NGS metagenomic data. The first is BioMaS (Bioinformatic analysis of Metagenomic ampliconS), a comprehensive pipeline for the taxonomic analysis of meta-barcode sequences. BioMaS, in its current version, allows the analysis of both bacterial and fungal populations through the processing of data obtained from Roche 454 or Illumina platforms. BioMaS will soon be implemented as a user-friendly, publicly available web service. The second is ITSoneDB, a curated collection of taxonomically annotated ITS1 sequences suitable for metagenomic studies of fungal communities; ITSoneDB is publicly available online. Finally, SARMA (Species Assignment of Reads from Metagenomic Analysis), whose construction is still ongoing, has been designed for the taxonomic characterization of metagenomic data produced by shotgun NGS, particularly oriented towards the study of human or other host microbiomes. In the first step, the pipeline carries out serial searches against human, bacterial, fungal, protozoan and viral reference sequences, and reads uniquely mapping to the host are removed. In the second step, the taxonomic assignment of all reads uniquely mapping to the above life domains is carried out. A web-service/Galaxy implementation of SARMA is currently in progress, enabling the use of several heterogeneous computing infrastructures and thus remarkably optimizing the computational load and run time.

Chair: D Huson/G Valiente
14:45-15:15 Ruscheweyh, H-J (University of Tuebingen)
  Webserver-supported storage of metagenomic datasets using MEGANv5 Sem 1

Co-author: Daniel Huson (University of Tuebingen, Algorithms in Bioinformatics, Tuebingen, Germany)

Background: Metagenomics is a rapidly growing field of research that aims at studying assemblages of uncultured organisms with the help of sequencing, with the hope of understanding the true diversity of microbes, their functions, cooperation and evolution. While early papers studied isolated or small numbers of samples, there is now an increasing number of projects that systematically collect multiple samples of growing size, thanks to falling sequencing costs. Moreover, more attention is being paid to the problem of recording relevant environmental parameters (so-called metadata). There is a need for tools that allow one to store and analyze multiple metagenomic datasets in the context of their metadata.

Results: We announce an extension to our metagenome analysis tool MEGAN, called MeganServer, that allows one to store metagenomic datasets on a secure server in order to reduce redundancy and to make it easier to share large datasets between project members or to make them publicly available. The software additionally allows one to capture the metadata associated with datasets and then use it to form new composite datasets by combining primary datasets based on the values of their environmental parameters. While the user can analyze any such combined dataset exactly like a primary dataset using MEGAN, internally a combined dataset refers back to the primary datasets and thus does not duplicate any reads or matches.

Conclusions: With sinking sequencing costs, metagenomic datasets are growing to sizes too large to be stored locally. Installing MeganServer on a computer cluster, or using a publicly available instance, allows one to store datasets on a server without losing the benefits of using MEGAN locally. Moreover, combining datasets based on environmental features is an important step in the comparative analysis of metagenome datasets.

15:15-16:00 Afternoon Tea
Session: Statistical methods in metagenomics - 1
Chair: W Gilks
16:00-16:30 Greenman, C (The Genome Analysis Centre)
  Inferring Mixed Viral Populations Sem 1

Any growing population of RNA viruses will undergo processes of mutation and selection, resulting in a mixed population of viral particles. High-throughput sequencing of such a population consequently contains a mixed signal of the underlying haplotypes that need identifying. We utilise a pigeon-hole principle to extract haplotype information, estimate the prevalence of the underlying clones, and examine the evolutionary structure. The method exploits both linkage information from paired reads and depth information, accommodates tree-like and recombination-network evolutionary structures, and is best suited to time-series data sets of high-prevalence mutations within viral segments. Daily data from influenza are used to illustrate the method, where both recombination within segments and re-assortment between segments are observed.

16:30-17:30 Open discussion
Friday 28 March
Session: Statistical methods in metagenomics - 2
Chair: W Gilks
09:30-10:15 Li, H (University of Pennsylvania)
  Microbiome, Metagenomics and High-dimensional Compositional Data Analysis Sem 1

Next-generation sequencing technologies allow 16S ribosomal RNA gene surveys or whole-metagenome shotgun sequencing in order to characterize the taxonomic and functional compositions of gut microbiomes. The outputs from such studies are short sequence reads derived from a mixture of genomes of different species in a given microbial community. We first present a brief overview of the statistical methods we used for 16S rRNA data analysis. We then introduce a multi-sample model-based method to quantify bacterial compositions from shotgun metagenomics using species-specific marker genes. The resulting data are high-dimensional compositional data, which complicate many of the downstream analyses. We introduce GLMs with a linear constraint on the regression parameters in order to identify the bacterial taxa that are associated with clinical outcomes, and a composition-adjusted thresholding procedure to estimate correlation networks from compositional data. We demonstrate the methods using two ongoing gut microbiome studies at the University of Pennsylvania.
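A standard device for moving compositional data off the simplex before applying regression or correlation methods is the centered log-ratio (CLR) transform. The sketch below is illustrative of that general idea and is not necessarily the exact transform used in the talk:

```python
import math

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of one sample's taxon counts.
    A generic compositional-data device (illustrative sketch);
    `pseudo` is a pseudocount that handles zero counts."""
    x = [c + pseudo for c in counts]
    total = sum(x)
    props = [xi / total for xi in x]            # closure onto the simplex
    log_p = [math.log(p) for p in props]
    g = sum(log_p) / len(log_p)                 # log geometric mean
    return [lp - g for lp in log_p]             # CLR coordinates sum to zero

sample = [120, 30, 0, 5]
z = clr(sample)
print([round(v, 3) for v in z])
```

Because CLR coordinates always sum to zero, downstream correlation estimates must account for this induced constraint, which is exactly the issue the composition-adjusted thresholding procedure addresses.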

10:15-11:00 Morning Coffee
Session: Statistical methods in metagenomics - 3
Chair: W Gilks
11:00-11:45 Tsivtsivadze, E (TNO Research Institute)
  Statistical Machine Learning for Modeling Early Respiratory Microbiota Composition Sem 1

Co-authors: Giske Biesbroek (UMC Utrecht), Elisabeth A.M. Sanders (UMC Utrecht), Roy Montijn (TNO Research Institute), Reinier H. Veenhoven (Research Center Linnaeus Institute), Bart J.F. Keijser (TNO Research Institute), Debby Bogaert (UMC Utrecht)

Many bacterial pathogens causing respiratory infections in children are common residents of the respiratory tract. Insight into bacterial colonization patterns and stability at young age may allow identification of biomarker strains that elucidate healthy or susceptible conditions for development of respiratory disease.

We used statistical machine learning algorithms to analyse the complex nasopharyngeal microbiota profiles of 60 healthy children at the ages of 6 weeks and 6, 12 and 24 months. Our unsupervised and semi-supervised learning methods are particularly suitable for high-dimensional metagenomic datasets. The methods stem from a recently proposed class of multi-view algorithms (closely related to ensemble and consensus techniques) that combine multiple clustering hypotheses for increased accuracy and are not limited to a single similarity measure, leading to robust and reliable results. Furthermore, our algorithms allow identification of the optimal number of clusters via the construction of co-occurrence matrices, and detection of biomarker species using an unsupervised greedy forward feature selection approach.
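The co-occurrence matrix at the heart of such consensus clustering can be sketched generically: entry (i, j) is the fraction of clustering runs in which samples i and j land in the same cluster. The code below is an illustrative toy, not the authors' multi-view implementation:

```python
import itertools

def cooccurrence(clusterings, n):
    """Consensus co-occurrence matrix for n items over several
    clustering runs (each run is a list of n cluster labels).
    Illustrative sketch of the general consensus-clustering device."""
    M = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i, j in itertools.combinations(range(n), 2):
            if labels[i] == labels[j]:
                M[i][j] += 1
                M[j][i] += 1
    k = len(clusterings)
    return [[M[i][j] / k if i != j else 1.0 for j in range(n)]
            for i in range(n)]

# Three clustering runs over four samples:
runs = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
C = cooccurrence(runs, 4)
print(C[0][1])   # samples 0 and 1 co-cluster in all 3 runs -> 1.0
```

Stable blocks of high co-occurrence indicate robust clusters, and the number of such blocks suggests the number of clusters supported by the data.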

We identified 6 distinct microbiota profiles represented by the dominant genera Moraxella, Haemophilus, Streptococcus, or Staphylococcus, a combination of Dolosigranulum and Corynebacterium, plus cluster-specific low-abundance biomarker bacteria. The current study enabled us to gain insight into the dynamic nature of the nasopharyngeal microbiota in infants. Our results suggest that the composition of the early-life microbiota is associated with long-term stability and may predict susceptibility to disease.

Chair: W Gilks
11:45-12:30 Holmes, S (Stanford University)
  Waste Not, Want Not: Why Rarefying Microbiome Data is not an optimal normalization procedure Sem 1

Co-author: Paul Joey McMurdie (Stanford University)

The interpretation of metagenomic count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude within the same sequencing run, and the counts are overdispersed relative to a simple Poisson model. These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well characterized and implemented for RNA-Seq data in R packages such as edgeR and DESeq. We use statistical theory, extensive simulations, and empirical data to show that variance-stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided a systematic improvement in performance over crude proportions or rarefied counts -- both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance-stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package phyloseq.
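The mixture-model alternative to rarefying starts from per-library size factors. A minimal sketch of DESeq-style median-of-ratios size factors follows; it illustrates the idea only and is not the R package's implementation:

```python
import math

def size_factors(counts):
    """Median-of-ratios library-size factors in the style of DESeq
    (an illustrative sketch, not the R package's code).
    counts: list of samples, each a list of per-taxon counts."""
    n_taxa = len(counts[0])
    # Reference: log geometric mean of each taxon across samples,
    # skipping taxa that are zero in any sample.
    ref = {}
    for t in range(n_taxa):
        col = [s[t] for s in counts]
        if all(c > 0 for c in col):
            ref[t] = sum(math.log(c) for c in col) / len(col)
    factors = []
    for s in counts:
        ratios = sorted(math.log(s[t]) - r for t, r in ref.items())
        mid = len(ratios) // 2
        median = (ratios[mid] if len(ratios) % 2
                  else 0.5 * (ratios[mid - 1] + ratios[mid]))
        factors.append(math.exp(median))
    return factors

# A toy pair of libraries where the second was sequenced twice as deep:
print([round(f, 3) for f in size_factors([[10, 20, 30], [20, 40, 60]])])
# -> [0.707, 1.414]
```

Dividing counts by these factors equalizes library depth without discarding reads, which is the key contrast with rarefying.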

Related Links: - Arxiv Version of Paper. - Phyloseq Package Description and Philosophy

12:30-13:45 Lunch at Wolfson Court
Session: Statistical methods in metagenomics - 4
Chair: W Gilks
13:45-14:30 Quince, C (University of Glasgow)
  Linking taxa to function through contig clustering of microbial metagenomes Sem 1

Co-authors: Johannes Alneberg (KTH Royal Institute of Technology, Stockholm, Sweden), Brynjar Smaari Bjarnason (KTH Royal Institute of Technology, Stockholm, Sweden), Ino de Bruijn (KTH Royal Institute of Technology, Stockholm, Sweden), Melanie Schirmer (University of Glasgow), Joshua Quick (University of Birmingham), Nicholas J. Loman (University of Birmingham), Anders F. Andersson (KTH Royal Institute of Technology, Stockholm, Sweden), Konstantinos Gerasimidis (University of Glasgow)

Taxonomic profiling of microbial communities can answer the question of “Who is there?” This can be achieved either through marker gene sequencing or through true shotgun metagenomics. The latter, because the functional genes of all community members are sequenced, allows us to answer the additional question: “What are they doing?” However, there is a third question that is key to understanding microbial communities: “Who is doing what?” This question has received much less attention because answering it requires the extraction of complete genomes from metagenomes. Assembly of metagenomes can generate millions of contigs -- assembled genome fragments -- with no information on which contig derives from which genome. Here I will present CONCOCT, a novel algorithm that combines sequence composition, coverage across multiple samples, and read-pair linkage to automatically cluster contigs into genomes. CONCOCT uses a dimensionality reduction coupled to a Gaussian mixture model, fit using a variational Bayesian algorithm which automatically identifies the optimal number of clusters. We demonstrate high recall and precision rates on artificial as well as real human gut metagenome datasets. Linking contigs into genome clusters allows the frequencies of those clusters to be related to metadata, revealing function. We apply this approach to fecal metagenomes obtained from the E. coli O104:H4 epidemic (Germany, 2011) and are able to directly extract the outbreak genome. We also use it to identify organisms associated with inflammation in samples from children with Crohn’s disease.
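The joint sequence-composition and coverage features that such contig clustering operates on can be sketched as follows. This is a hypothetical illustration of the feature construction only, not the CONCOCT code; the dimensionality reduction and Gaussian mixture steps are omitted:

```python
import itertools
import math

def tetramer_freqs(seq):
    """Tetranucleotide composition of a contig -- the sequence-composition
    half of the feature vector (illustrative sketch)."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
    counts = {k: 0 for k in kmers}
    for i in range(len(seq) - 3):
        k = seq[i:i + 4]
        if k in counts:          # skip k-mers containing ambiguous bases
            counts[k] += 1
    total = max(1, sum(counts.values()))
    return [counts[k] / total for k in kmers]

def contig_features(seq, coverages):
    """Joint feature vector: log coverage across samples plus composition,
    the kind of input a Gaussian mixture model could then cluster."""
    return [math.log(c + 1.0) for c in coverages] + tetramer_freqs(seq)

v = contig_features("ACGTACGTACGTACGT", [12.0, 0.5, 3.0])
print(len(v))   # 3 coverage dimensions + 256 tetramer dimensions = 259
```

Contigs from the same genome share both their tetramer signature and their coverage profile across samples, which is why clustering this joint vector can separate genomes that composition alone cannot.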

Related Links - arXiv preprint

14:30-15:00 Morfopoulou, S (University College London)
  Application of Bayesian model averaging and population Monte Carlo to inference from metagenomic mixture Sem 1

Co-author: Vincent Plagnol (University College London Genetics Institute)

For many practical applications, for example to uncover the pathogen that caused an infection after the acute phase, very deep short-read sequencing can be effective provided that we can reliably assign short sequencing reads to species. This assignment problem is complicated by the fact that, in the absence of very large contigs, most short reads match multiple species. This is essentially a mixture model, where complete knowledge of all species present in the mixture provides information about the assignment of each read individually. However, metagenomic data analysis rarely formulates the problem in these terms because the very large number of potential species typically makes the inference intractable. Here, we propose a Bayesian model averaging strategy designed to explore the high-dimensional space of species present in a metagenomic mixture. We use approximate Bayesian computation and a Monte Carlo strategy to implement the search for the most appropriate mixture models. Owing to the computationally intensive aspects of the work, we used a population Monte Carlo Markov chain to leverage parallel computing. We find that the methodology is effective in providing a full Bayesian inference for samples with > 10M reads, hence providing interpretable Bayes factors and posterior probabilities for practical problems that regularly arise in a clinical context.
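The mixture view of read assignment can be illustrated with a plain EM sketch that estimates species proportions from ambiguous matches, assuming uniform match likelihoods -- a much simpler scheme than the Bayesian model averaging and population Monte Carlo machinery described in the talk:

```python
def em_mixture(read_matches, species, n_iter=50):
    """EM estimate of species proportions when reads match multiple genomes.
    Simplified mixture sketch (uniform match likelihoods), illustrating the
    'each read informs and is informed by the mixture' idea from the abstract."""
    pi = {s: 1.0 / len(species) for s in species}
    for _ in range(n_iter):
        expected = {s: 0.0 for s in species}
        for matches in read_matches:
            z = sum(pi[s] for s in matches)
            for s in matches:
                expected[s] += pi[s] / z        # E-step: soft read assignment
        total = sum(expected.values())
        pi = {s: e / total for s, e in expected.items()}  # M-step: proportions
    return pi

# Each read lists the species it matches; ambiguous reads match several.
reads = [["A"], ["A"], ["A", "B"], ["B", "C"], ["C"]]
pi = em_mixture(reads, ["A", "B", "C"])
print({s: round(p, 3) for s, p in pi.items()})
```

Ambiguous reads are split softly in proportion to the current mixture estimate, so unambiguous reads anchor the proportions that then resolve the ambiguous ones.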

15:00-15:45 Afternoon Tea
