Population-based analysis of genome structural variation using broad, highly parallel population sequencing
McCarroll, S (Harvard University)
Thursday 15 July 2010, 12:00-12:30
Seminar Room 1, Newton Institute
Abstract
Accurate and complete characterization of genome variation in the largest possible number of individuals will be required to understand the role of genome variation in disease. We present a novel analytical framework, informed by population genetics, for characterizing genome structural polymorphism using sequence data distributed across hundreds or thousands of genomes. Our approach integrates diverse features of next-generation sequencing (NGS) data – read pairs, read depth, and the distribution of evidence across samples and around each genomic locus – to identify genomic loci where multiple features of NGS data from many genomes coalesce around a population-genetic model of alternate structural alleles segregating within a population. We instantiated this framework in software, Genome STRiP (Genome STRucture in Populations). Genome STRiP identified structural polymorphisms across 180 genomes with unprecedented sensitivity and accuracy, even when sequencing was shallow (2-4x) in each individual genome. We developed a framework for genotyping structural polymorphisms by integrating multiple types of NGS information (read depth, read pairs, split reads), and used this approach to accurately genotype more than nine thousand deletion polymorphisms, allowing the production of a phased SNP/deletion haplotype map for the 1000 Genomes Project.
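
As a rough illustration of how read depth alone can inform a deletion genotype, the sketch below scores copy numbers 0, 1 and 2 at a single locus with a simple Poisson model. It is not Genome STRiP's actual model, which the abstract does not specify. The function name, the flat prior over copy numbers, and the `background_fraction` parameter (a small floor for mismapped reads at a homozygous deletion) are all assumptions made for the example.

```python
import math


def depth_genotype_posteriors(observed_reads, expected_reads_diploid,
                              background_fraction=0.01):
    """Posterior over copy number (0, 1, 2) at one locus from read depth.

    Hypothetical sketch: Poisson read counts, flat prior over copy numbers.
    """
    log_likelihoods = []
    for copies in (0, 1, 2):
        # Expected read count scales with copy number; a small floor keeps
        # the Poisson rate positive when both copies are deleted.
        rate = expected_reads_diploid * max(copies / 2.0, background_fraction)
        log_likelihoods.append(
            observed_reads * math.log(rate)
            - rate
            - math.lgamma(observed_reads + 1)
        )
    # Normalize in log space to avoid underflow.
    peak = max(log_likelihoods)
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [weight / total for weight in weights]


def most_likely_copy_number(observed_reads, expected_reads_diploid):
    """Return the copy number (0, 1, or 2) with the highest posterior."""
    posteriors = depth_genotype_posteriors(observed_reads, expected_reads_diploid)
    return posteriors.index(max(posteriors))
```

At 2-4x coverage a short locus yields few reads in any one genome, so a depth-only call like this is often ambiguous for a single sample. That is consistent with the abstract's emphasis on combining depth with read pairs, split reads, and the distribution of evidence across many samples.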