Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK

MTA-ELTE Theoretical Biology and Ecology Group, Pázmány Péter sétány 1/c 1117 Budapest, Hungary

Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK

Department of Mathematical Sciences, University of Aarhus, Ny Munkegade, Building 530, DK-8000 Aarhus C, Denmark

Abstract

Background

Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to these problems is to first build a multiple alignment of the sequences, and then to reconstruct the phylogeny from this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimates. Whether the main interest is in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both.

Results

We developed a fully Bayesian Markov chain Monte Carlo method for coestimating phylogeny and sequence alignment, under the Thorne-Kishino-Felsenstein model of substitution and single nucleotide insertion-deletion (indel) events. In our earlier work, we introduced a novel and efficient algorithm, termed the "indel peeling algorithm", which includes indels as phylogenetically informative evolutionary events, and resembles Felsenstein's peeling algorithm for substitutions on a phylogenetic tree. For a fixed alignment, our extension analytically integrates out both substitution and indel events within a proper statistical model, without the need for data augmentation at internal tree nodes, allowing for efficient sampling of tree topologies and edge lengths. To additionally sample multiple alignments, we here introduce an efficient partial Metropolized independence sampler for alignments, and combine these two algorithms into a fully Bayesian co-estimation procedure for the alignment and phylogeny problem.

Our approach results in estimates for the posterior distribution of evolutionary rate parameters, for the maximum

Conclusion

Joint analysis of multiple sequence alignments, evolutionary trees and additional evolutionary parameters can now be done within a single coherent statistical framework.

Background

Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. Current methods first align the sequences, and then infer the phylogeny given this fixed alignment. Several software packages are available that deal with one or both of these sub-problems. For example, ClustalW

This leads on to the second issue, which is that heuristic methods are used to deal with insertions and deletions (indels), and sometimes also substitutions. This lack of a proper statistical framework makes it very difficult to accurately assess the reliability of the alignment estimate, and the phylogeny depending on it.

The relevance of statistical approaches to evolutionary inference has long been recognised. Time-continuous Markov models for substitution processes were introduced more than three decades ago

Statistical modelling and MCMC approaches have a long history in population genetic analysis. In particular, coalescent approaches to genealogical inference have been very successful, both in maximum likelihood

The only missing ingredient for a full co-estimation procedure is an alignment sampler. Unfortunately, there exists no Gibbs alignment sampler that corresponds to the analytic algorithm referred to above. In this paper we introduce a partial importance sampler to resample alignments, based on a proposal mechanism built on a partial score-based alignment procedure. This type of sampler supports the data format we need for efficient likelihood calculations, while still achieving good mixing in reasonable running time (see Results).

We implemented the likelihood calculator and the alignment sampler in Java, and interfaced them with an existing MCMC kernel for phylogenetics and population genetics

Results

Definition of the TKF model

The TKF91 model is a continuous-time reversible Markov model describing the evolution of nucleotide (or amino acid) sequences. It models three of the main processes in sequence evolution, namely

Since subsequences evolve independently, it is sufficient to describe the evolution of a single character-link pair. In a given finite time span, this pair evolves into a finite subsequence of characters and links. Since insertions originate from links, only the first character of this descendant subsequence may be homologous to the original character, while subsequent ones will have been inserted and therefore not be homologous to ancestral characters. The model as applied to pairwise alignments was solved analytically in

Because the TKF91 model is time reversible, the root placement does not influence the likelihood, an observation known as Felsenstein's "Pulley Principle"
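The fate of a single character-link pair described above can be made concrete with a short numerical sketch. The closed-form probabilities used here are the ones usually quoted for the TKF91 model in the literature; they do not appear in this text, so treat them, and the function names `tkf91_beta` and `tkf91_fates`, as assumptions made for illustration. The symbols `lam` and `mu` stand for the insertion and deletion rates, with `lam < mu`.

```python
import math

def tkf91_beta(lam, mu, t):
    """Auxiliary quantity beta(t) of the TKF91 model (assumed standard form)."""
    if not 0.0 < lam < mu:
        raise ValueError("this sketch assumes 0 < lam < mu")
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)

def tkf91_fates(lam, mu, t, n_max):
    """Return three lists of probabilities, truncated at n_max descendants.

    survives[k]: the character survives and the pair leaves k+1 characters
    dies[k]:     the character is deleted and the pair leaves k characters
    immortal[k]: the leftmost (immortal) link leaves k+1 characters
    """
    beta = tkf91_beta(lam, mu, t)
    lb = lam * beta
    surv = math.exp(-mu * t)  # probability that the original character survives
    # Geometric distribution over the number of descendants, n = 1..n_max.
    geo = [(1.0 - lb) * lb ** (n - 1) for n in range(1, n_max + 1)]
    survives = [surv * g for g in geo]
    dies = [mu * beta] + [(1.0 - surv - mu * beta) * g for g in geo]
    immortal = list(geo)
    return survives, dies, immortal
```

The first entry of `survives` is the only case in which the leading descendant is homologous to the ancestral character; every other descendant counted here is an insertion. Summing `survives` and `dies` over a long enough range should give a value very close to one, which is a useful sanity check on the parameter values.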

Computing the likelihood of a homology structure

The concept of _{1}, _{2}, ..._{m }be sequences, related by a tree _{i}, and let _{1}, ..., _{m }is an equivalence relation ~ on the set of all the characters of the sequences, _{h}, exists on

(Here, _{h }_{h }_{h }_{h }corresponds to the ordering of columns of homologous characters in an alignment. Note that for a given homology structure, this ordering may not be unique (see Fig.


**Alignments and homology structure**. (Left) Two alignments representing the same homology structure. A "homology structure" is defined as the set of all homology relationships between residues from the given sequences; residues are homologous if they appear in the same alignment column. Our recursion includes contributions from all alignments compatible with a given homology structure (itself represented by a single alignment). (Right) Due to the evolutionary process acting on the sequences, homology relationships (arrows) will never 'cross' as depicted. This restriction on the equivalence relation ~ is codified by <_h (see text).

The one-state recursion, which calculates the likelihood of a homology structure, is a convolution of two dynamic programming algorithms. The top-level algorithm traverses the prefix set of the multiple alignments representing the homology structure (see Figure


**Dynamic programming table traversal**. The multiple alignment prefixes (represented by o symbols) traversed by the one-state recursion, when the input is the homology structure of Fig. 1. (For clarity, the vectors are plotted in two dimensions instead of the actual three.) The homology structure is represented by the graph, and each directed path on this graph uniquely corresponds to an alignment that is compatible with the homology structure.

A partial Metropolized independence sampler

Because our algorithm does not require the phylogenetic tree to be augmented with missing data, proposing changes to the evolutionary tree is easy, and mixing in tree space is very good. The drawback, however, is that without data augmentation it is unclear how to perform Gibbs sampling of alignments, and we have to resort to other sampling schemes. One straightforward choice would be a standard Metropolis-Hastings procedure with random changes to the alignment, but we expect slow mixing from such an approach. Another general approach is Metropolized independence sampling. Its performance depends on the difference between the proposal distribution and the target distribution, and this difference will inevitably become appreciable as the dimension of the problem grows, as measured by the number and length of the sequences to be aligned. We therefore opted for a

The proposal algorithm

The proposal algorithm is as follows. A window size and location are proposed, the alignment of subsequences within this window is removed, and a new alignment is proposed by a stochastic version of the standard score-based progressive pairwise alignment method. First, dynamic programming (DP) tables are filled as for a deterministic score-based multiple alignment, starting at the tree tips and working towards the root, aligning sequences and profiles. We used linear gap penalties, and a similarity scoring matrix that was obtained by taking the log-odds of a probabilistic substitution matrix. The underlying phylogeny was used to define divergence times, and served as the alignment guide tree. After filling the DP tables, we applied stochastic traceback. The probabilities for the three possible directions at each position were taken to be proportional to exp(


**Generating the proposal alignment**. This figure illustrates the stochastic sequence aligner. In the deterministic fill-in process, the three scores are _{1}, _{2 }and _{3}, hence the value in this cell is max{_{1}, _{2}, _{3}}. In the stochastic traceback phase, the three neighbor cells are chosen with probabilities proportional to exp(_{i}), where
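The deterministic fill-in followed by stochastic traceback can be sketched for the simplest case of two sequences. This is a minimal illustration, not the implementation described in the paper: it aligns plain sequences rather than profiles along a guide tree, the toy scores `MATCH`, `MISMATCH` and `GAP` replace the log-odds matrix, and the scale factor `alpha` in the exponent is an assumption, since the exact form of the exponent is not recoverable from the text. The function also returns the log-probability of the traceback path, which a Metropolis-Hastings step would need.

```python
import math
import random

# Toy scores, chosen for illustration only (assumptions, not the paper's values).
MATCH, MISMATCH, GAP = 1.0, -1.0, -2.0

def sub_score(a, b):
    return MATCH if a == b else MISMATCH

def fill_table(x, y):
    """Deterministic score-based fill-in with a linear gap penalty."""
    n, m = len(x), len(y)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * GAP
    for j in range(1, m + 1):
        table[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1]),
                table[i - 1][j] + GAP,
                table[i][j - 1] + GAP,
            )
    return table

def stochastic_traceback(x, y, table, alpha=1.0, rng=random):
    """Sample an alignment; each move is chosen with weight exp(alpha * score)."""
    i, j = len(x), len(y)
    columns, log_q = [], 0.0
    while i > 0 or j > 0:
        moves = []  # (score through this move, next i, next j, alignment column)
        if i > 0 and j > 0:
            s = table[i - 1][j - 1] + sub_score(x[i - 1], y[j - 1])
            moves.append((s, i - 1, j - 1, (x[i - 1], y[j - 1])))
        if i > 0:
            moves.append((table[i - 1][j] + GAP, i - 1, j, (x[i - 1], "-")))
        if j > 0:
            moves.append((table[i][j - 1] + GAP, i, j - 1, ("-", y[j - 1])))
        best = max(move[0] for move in moves)
        # Subtracting the best score avoids overflow and leaves the ratios unchanged.
        weights = [math.exp(alpha * (move[0] - best)) for move in moves]
        k = rng.choices(range(len(moves)), weights=weights)[0]
        log_q += math.log(weights[k] / sum(weights))
        _, i, j, column = moves[k]
        columns.append(column)
    columns.reverse()
    return columns, log_q
```

A large `alpha` makes the traceback nearly deterministic and recovers the usual optimal alignment, while a small `alpha` spreads the proposal more widely over suboptimal alignments.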

Correctness of the sampler

There are two problems with the proposal sampler introduced above. First, we propose alignments instead of homology structures. We need the latter, since the algorithm derived in this paper calculates the likelihood of the homology structure, not the particular alignment. Although it would be conceptually and (for the sampler) computationally simpler to use alignments, we are not aware of any efficient algorithm that can calculate such

Fortunately, we can solve both problems efficiently. We can sample alignments uniformly inside a homology structure, and at the same time sample homology structures according to their posterior probabilities. As biologically meaningful questions refer to homologies and not particular alignments, it seems reasonable to impose a simple uniform distribution over alignments within homology structures. The second problem is solved by not calculating an alignment proposal probability, but the proposal probability of the combination of an alignment and a resampling window. For a proposal of alignment _{2 }and window _{1}, we use the following Metropolis-Hastings ratio:

where _{1 }and _{2 }are homology structures corresponding to the alignments _{1 }and _{2 }respectively, |_{1}| and |_{2}| are their cardinalities (i.e. the number of alignments representing these homology structures), and

where the final equality holds because of the symmetry of the left-hand side. The cardinality of a homology structure, |_{1}|, is the number of possible directed paths in the graph spanned by the one-state recursion; in other words, the number of permutations of alignment columns that result in alignments compatible with the given homology structure (see Fig.
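The cardinality of a homology structure, described above as the number of column permutations that give a compatible alignment, can be computed by brute force for small examples. The sketch below is an illustration under stated assumptions: it counts column orderings that preserve the residue order within every sequence, using a memoised search over sets of already-placed columns, and it is not the path-counting over the one-state recursion graph used in the paper. The function name `count_compatible_alignments` and the column representation (one tuple per column, with `"-"` for a gap) are hypothetical.

```python
from functools import lru_cache

def count_compatible_alignments(columns):
    """Count orderings of alignment columns that keep each sequence's residues in order.

    columns: list of tuples, one entry per sequence, with "-" marking a gap.
    The input must itself be a valid alignment (columns listed in a compatible order).
    """
    n = len(columns)
    n_seqs = len(columns[0])
    # preds[c] is a bitmask of the columns that must come before column c.
    preds = [0] * n
    for s in range(n_seqs):
        prev = None
        for c, col in enumerate(columns):
            if col[s] != "-":
                if prev is not None:
                    preds[c] |= 1 << prev
                prev = c
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def extensions(placed):
        # Number of ways to order the remaining columns, given the placed set.
        if placed == full:
            return 1
        total = 0
        for c in range(n):
            bit = 1 << c
            if not placed & bit and preds[c] & ~placed == 0:
                total += extensions(placed | bit)
        return total

    return extensions(0)
```

For two sequences with one unaligned residue each, the two gap columns can be written in either order, so the function returns 2. The search is exponential in the number of columns in the worst case, so it is only suitable for checking small windows.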

Discussion

The one-state recursion provides a method for calculating the likelihood

where

This XML file specifies the MCMC run for the example phylogeny and alignment co-estimation given in this paper (see Figs.



**Posterior distribution of deletion rate μ**. Estimated posterior densities of the deletion rate

For alignments, the maximum


**Maximum posterior decoding alignment, and column reliabilities**. The maximum posterior decoding alignment of ten globins (human, chicken and turtle alpha hemoglobin, beta hemoglobin, myoglobin and bean leghemoglobin). Posterior probabilities for aligned columns were estimated as their rate in the Markov chain. Common alpha helices are indicated with '
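Estimating column posterior probabilities as frequencies in the chain can be sketched as follows. This is a minimal illustration that assumes the sampled alignments are available as lists of equal-length gapped strings, one string per sequence in a fixed order; the function names and this input format are hypothetical and are not taken from the BEAST implementation. A column is identified by the set of (sequence, residue index) pairs it aligns, so that the same homology statement is recognised regardless of where the column sits in a particular alignment.

```python
from collections import Counter

def columns_as_homology_sets(alignment):
    """Turn an alignment into a list of columns, each a frozenset of
    (sequence index, residue index) pairs."""
    positions = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        members = []
        for s, row in enumerate(alignment):
            if row[c] != "-":
                members.append((s, positions[s]))
                positions[s] += 1
        columns.append(frozenset(members))
    return columns

def column_reliabilities(samples, reference):
    """Fraction of sampled alignments that contain each column of `reference`."""
    counts = Counter()
    for alignment in samples:
        # A set ensures each column is counted at most once per sample.
        counts.update(set(columns_as_homology_sets(alignment)))
    return [counts[col] / len(samples)
            for col in columns_as_homology_sets(reference)]
```

In practice the samples would be taken after burn-in and thinned, and the reference would be the summary alignment whose columns are being annotated.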

Figure


**Maximum a-posteriori phylogeny**. The maximum

The estimated times of the most recent common ancestor of the alpha, beta and myoglobin families are all mutually compatible (result not shown), suggesting that the molecular clock hypothesis is at least approximately valid. Analysis of a four-sequence dataset demonstrates consistency in

Conclusion

In this paper we present a new cosampling procedure for phylogenetic trees and sequence alignments. The underlying likelihood engine uses recently introduced and highly efficient algorithms based on an evolutionary model (the Thorne-Kishino-Felsenstein model) that combines both the substitution and insertion-deletion processes in a principled way

One motivation for using a fully probabilistic model, and for using a co-estimation procedure for alignments and phylogeny, is that this makes it possible to assess the uncertainties in the inferences. Fixing either the alignment or the phylogeny leads to an underestimate of the uncertainty in the other, and score-based methods give no assessment of uncertainty whatsoever.

We show that the confidence estimates so obtained can contain biologically meaningful information. In the case of the multiple alignment of globin sequences, peaks in the posterior column reliabilities correspond broadly to the various conserved alpha helices that constitute the sequences (see Fig.

At the heart of the method lies a recently introduced algorithm, termed the "indel peeling algorithm", that extends Felsenstein's peeling algorithm to incorporate insertion and deletion events under the TKF91 model

We also developed a method for sampling multiple alignments, which is applicable for the data augmentation scheme we used for the efficient likelihood calculations. By combining the two samplers, we can co-sample alignments, evolutionary trees and other evolutionary parameters such as indel and substitution rates. The resulting samples from the posterior distribution can be summarized in traditional ways. We obtained maximum

As was already mentioned in

Availability and requirements

The BEAST package (AJ Drummond and A Rambaut), which includes the algorithm described in this paper, is available from

Authors' contributions

IM conjectured and GL proved the one-state recursion. GL and IM independently implemented the algorithms, and wrote the paper. JLJ simplified the proof of the recursion, GL suggested using it within an MCMC phylogeny cosampler, and IM suggested using a Metropolised importance sampler and proved its correctness. GL and AD interfaced the Java algorithms to the BEAST phylogeny sampling package

Acknowledgements

The authors thank Yun Song, Dirk Metzler, Anton Wakolbinger and Ian Holmes for several useful suggestions and discussions. This research is supported by EPSRC (code HAMJW) and MRC (code HAMKA). I.M. was further supported by a Békésy György postdoctoral fellowship.