Isaac Newton Institute for Mathematical Sciences

Mathematical and Statistical Aspects of Molecular Biology

A hidden Markov model of evolution and structure for multiple sequence alignment

Authors: Ari Loytynoja (EMBL-EBI), Nick Goldman (EMBL-EBI)

Abstract

Motivation: Progressive algorithms are widely used heuristics for multiple sequence alignment. Probabilistic methods can more realistically model the different processes of evolution than traditional deterministic methods do, and, additionally, provide measures of global and/or local reliability of individual solutions. Our aim is to develop a computationally efficient probabilistic alignment method that incorporates the current knowledge of sequence structure and evolution. Results: Recursions for a new multiple sequence alignment method are presented. The method combines a HMM modelling the match/gap-process as well as an internal structure within sequences, progressive algorithms defining ancestral sequences from the pairwise alignments of their child nodes, and a probabilistic evolution model describing the character substitution processes in different structure states. The structure within sequences, e.g., a gene structure with introns and exons, a protein secondary structure with either buried and exposed sites, or alpha, beta and coil elements, or just slowly and fast evolving sequence regions, can be pre-defined based on external knowledge, or estimated during the alignment process. The posterior probabilities of sequence sites being aligned and the process being in a given structure state at a given position, provide a measure of local reliability of the alignment and the probability of a sequence site belonging to one of the structural categories.