Abstract
Background
Recent years have seen the emergence of genome annotation methods based on the phylogrammar, a probabilistic model combining continuoustime Markov chains and stochastic grammars. Previously, phylogrammars have required considerable effort to implement, limiting their adoption by computational biologists.
Results
We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic contextfree grammars. xrate efficiently estimates maximumlikelihood parameters and phylogenetic trees using a novel "phyloEM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures.
Conclusion
Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.
Background
Hidden Markov models [HMMs], together with related probabilistic models such as stochastic contextfree grammars [SCFGs], are the basis of many algorithms for the analysis of biological sequences [11,8,10,16]. An appealing feature of such models is that once the general structure of the model is specified, the parameters of the model can be estimated from representative "training data" with minimal user intervention (typically using the Expectation Maximization [EM] algorithm [14]). Combined with the continuoustime Markov chain theory of likelihoodbased phylogeny, stochastic grammar approaches are finding similarly broad application in comparative sequence analysis, in particular the annotation of multiple alignments [83,26,53,46,74,80] (and, in some cases, simultaneous alignment and annotation [2,58]). This combined model has been dubbed the phylogrammar. By contrast to the singlesequence case (for which there is much prior art in the field of computational linguistics [72,51]), the automated parameterization of phylogrammars from training data is somewhat uncharted territory, partly because the application of the EM algorithm to phylogenetics is a recent addition to the theoretical toolbox. The phylogrammar approaches that have been used to date have often used approximate and/or inefficient versions of EM to estimate parameters [59,81], or have been limited to particular subclasses of model, e.g. reversible or otherwise constrained models [9,38].
Previously, we showed how to apply the EM algorithm to estimate substitution rates in a phylogenetic reversible continuoustime Markov chain model [38]. This EM algorithm is exact and without approximation, using an eigenvector decomposition of the rate matrix to estimate summary statistics for the evolutionary history. We refer to this version of EM as "phyloEM".
Here, we report several extensions to the phyloEM method. Specifically, we give a version of the phyloEM algorithm for the fully general, irreversible substitution model on a phylogenetic tree (noting that the irreversible model is a generalisation of the reversible case). We then present a flexible package for multiple alignment annotation using phyloHMMs and phyloSCFGs that implements these algorithms and is similar, in spirit, to the Dynamite package for generic dynamic programming using HMMs [5].
Using this package, it is extremely easy to design, train and apply a novel phylogrammar, since new models can be loaded from an external, userspecified grammar file. Our hope is that the algorithms and software presented here will aid in the establishment of phylogrammars in bioinformatics and that such methods will be as widely adopted for comparative genomics as HMMs and SCFGs have been.
Overview
In 1981, Felsenstein published dynamic programming (DP) recursions for computing the likelihood of a phylogenetic tree for aligned sequence data, given an underlying substitution model [21]. Together with seminal papers by Neyman [64] and DayhofF et al.[12,13], this work heralded the widespread use probabilistic models in bioinformatics and molecular evolution. Felsenstein's underlying model is a finitestate continuoustime Markov chain, as described e.g. by Karlin and Taylor [43]. It is characterised by an instantaneous rate matrix R describing the instantaneous rates R_{ij }of point substitutions from residue i to j. In the unifying language of contemporary "Machine Learning" approaches, Felsenstein's trees are recognisable as a form of graphical model [66] or factor graph [50], and the DP recursions an instance of the sumproduct algorithm. (The connection to graphical models has been made more explicit with recent approaches that model other stochastic processes on phylogenetic trees, such as the evolution of molecular function [20].) Many parametric versions of this model have been explored, such as the "HKY85" model [32].
Beginning in the late 1980s, another class of probabilistic models for biological sequence analysis was developed. These models included HMMs for DNA [11] and proteins [8], and SCFGs for RNA [78,18]. Collectively, such models form a subset of the stochastic grammars. Originally used to annotate individual sequences, stochastic grammars were soon also combined with phylogenetic models to annotate alignments. Thus, trees have been combined with HMMs and/or SCFGs to predict genes [68] and conserved regions [23] in DNA sequences, secondary structures [83,26] and transmembrane topologies [53] in protein sequences, and basepairing structures in RNA sequences [46]. We refer to such hybrid models as phylogrammars. Associated with these advances were novel methods to approximate context dependence of substitution models, such as CpG and other dinucleotide effects [81,55]. The phylogrammars can also be viewed as a subclass of the "statistical alignment" grammars [34,37,60,36], which are derived from more rigorous assumptions about the underlying evolutionary model, including indels [84].
A compelling attraction of stochastic grammars (and probabilistic models in general) is that parameters can be systematically "learned" from data by maximum likelihood (ML). One reasonably good, general, albeit greedy and imperfect, approximation to ML is the EM algorithm [14]. EM applies to models which generate both "hidden" and "observed" data; e.g., the transcriptional/translational structure of a gene (hidden) and the raw genomic sequence (observed). The applications of EM to training HMMs (the BaumWelch algorithm) [4] and SCFGs (InsideOutside) [51] are wellestablished (reviewed in [16]), but what of phylogrammars? While a limited version of EM for substitution models was published in 1996 [9,31], the full derivation for the general reversible rate matrix did not appear until 2002 [38]. The phyloEM algorithm for rate matrices has since been further developed [94,35]. (Various alternatives to phyloEM, such as eigenvector projections [3] and the "resolvent" [63], have also been used to estimate rate matrices; some approximate versions of phyloEM have also been described [81,82].)
Conceptually, EM is straightforward: one simply alternates between imputing the hidden data (the "Estep") and optimizing the parameters (the "Mstep"). The Estep typically results in a set of "expected counts" which are intuitively easy to interpret. (For example, the Estep for phylogenetic trees returns the number of times each substitution is expected to have occurred on each branch.) The EM algorithm has been intensely scrutinized and has been shown to be versatile, adaptable and fast [25,57], particularly the special case of phyloEM [94]. We therefore argue that there are strong advantages to combining the form of EM used to train stochastic grammars (i.e. the BaumWelch and InsideOutside algorithms [16]) with the phyloEM form used for parameterizing substitution models on phylogenetic trees [38].
Previous applications of phylogrammars
The program we have developed can handle a broad class of phylogrammars within one framework. The following is a brief review of prior work that either uses phylogrammars, or is ideally suited to the phylogrammar framework.
This section is subclassified according to the complexity of the grammar, beginning with the simplest. Generally speaking, a phylogrammar can be used to annotate a multiple sequence alignment in any context where a stochastic grammar could be used to annotate an individual sequence. The applications span DNA, RNA and protein sequence annotation.
Point substitution models
A subset of the class of phylogrammars is the class of homogeneous substitution models, where the mutation rate is not a function of position but rather is identical for every site. Such models can be represented as a singlestate phyloHMM. Examples include
The JukesCantor model [41], Kimura's twoparameter model [44], the HKY85 model [32], the general reversible model [92], and the general irreversible model [91]. In the case of the Kimura and HKY85 models, the rate matrices are formulated parametrically: that is, each substitution rate is expressed as a function of a small set of rate and/or probability parameters (e.g. in Kimura's model, there are two rate parameters: the transition rate and the transversion rate).
Variablerate models, where the evolutionary rate is allowed to vary from site to site [90]. Yang used a finite number of discrete, fixed rate categories to approximate a continuous gamma distribution over sitespecific rates. In essence, this can be viewed as special cases of the phyloHMM of Felsenstein and Churchill [23], with the autocorrelation explicitly set to zero.
Hiddenstate models [48,38]. A relative of the variablerate model, the hiddenstate model allows a variety of different substitution rate matrices to be used, depending on a hidden state variable that specifies the structural context of the site [48]. For example, a hydrophobicallyinclined rate matrix might be used for buried amino acids and a hydrophilic matrix for exposed amino acids. An extension to the hiddenstate model allows the hidden state variable itself to change over time at some slow rate, modeling rare changes in structural context [38]. An alternative extension allows correlations between hidden state variables at adjacent sites: this is essentially the idea behind the phyloHMM, described below.
Models for synonymous/nonsynonymous substitution ratio measurement; empirical rate matrices for codon evolution [27,87]. Codon substitution matrices such as WAG [87] can be used to measure the ratio r of synonymous to nonsynonymous substitution rates, which may be indicative of purifying (r < 1), neutral (r = 1) or diversifying (r > 1) selection. These models are also related to the exon prediction phyloHMMs in EVOGENE [68] and EXONIPHY [80], described below.
Amino acid substitution models [12,28]. Likelihood calculations using these models can, as with the other substitution models discussed above, be viewed as trivial applications of phylogrammars.
Contextsensitive substitution models [81]. Siepel and Haussler introduced several alternate approximations for calculating the likelihood of alignments assuming a nearest neighbor substitution model, suitable for capturing the contextsensitivity of the substitution process that is observed in real sequence alignments (most notoriously in genomes wherein CpG methylation is used as a mechanism of epigenetic regulation, leading to elevated rates for the mutations CpG→TpG and CpG→ApG). Siepel and Haussler's method ignores longerrange correlations induced by nearestneighbor effects, but is effective in practice. (It may be regarded as an approximation to the more rigorous analysis of Lunter and Hein [55].)
Many of these models can be expressed using the General Parametric Substitution Model, which we define as the substitution model wherein all substitution rates and initial probabilities can be expressed as simple functions of a (reduced) set of rate and probability parameters. As an example, Kimura's twoparameter model [44] is shown (see figure 1) along with the HKY85 sixparameter model [32] (see figure 2).
Figure 1. Kimura's twoparameter model. The state order is {A, C, G, T}. Each entry is a function of the reduced parameter set (α, β) where α and β are rates.
Figure 2. Hasegawa et al's sixparameter model. The state order is {A, C, G, T}. The negative ondiagonal elements have been omitted for brevity (they are constrained by the requirement that each row sums to zero). Each entry is a function of the reduced parameter set (α, β, π_{A}, π_{C}, π_{G}, π_{T}) where (α, β) are rates and (π_{A}, π_{C}, π_{G}, π_{T}) are probabilities.
As long as each parameter in a parametric substitution model can be interpreted either as a rate (such as Kimura's transition and transversion rates) or a probability (such as the HKY85 equilibrium distribution over nucleotides), the phyloEM algorithm can be adapted to estimate such parameters via the computation of expected event counts. A formal description of the sets of allowable rate and probability functions is given in the Supplementary Material [see 1].
Additional File 1. XRate: a fast prototyping, training and annotation tool for phylogrammars. Supplementary material. A full description of the phyloEM algorithm for irreversible substitution models. Also contains details of experimental procedures.
Format: PDF Size: 185KB Download file
This file can be viewed with: Adobe Acrobat Reader
Although the particular models used above (Kimura and HKY85) are reversible, matrices of allowable rate functions can in general be irreversible. Our General Parametric Model may thus be regarded as a generalisation of the General Irreversible Model.
PhyloHMMs
PhyloHMMs form a class of models slightly more complex than point substitution models. In a phyloHMM, each column (or group of adjacent columns) is associated with a hidden state, representing the evolutionary context of the site. Each hidden state is conditionally dependent upon the immediately preceding state (the Markov property).
Tasks that have been addressed using phyloHMMs include:
Measurement of variation of evolutionary rate among sites in DNA [23]. Felsenstein and Churchill construct an HMM with three states. Each state generates an alignment column according to a point substitution process on a tree [21]. The overall evolutionary rate for the column depends on the state from which it is emitted: each state thus corresponds to a "rate category" (the relative rates for the three states are 0.3, 2.0 and 10.0). The use of an HMM allows for an autocorrelated model of rate variation.
Modeling sitespecific residue usage in proteins[9,31]. While sitespecific profiles are familiar tools in bioinformatics, early tools such as Gribskov profiles [29] and hidden Markov models [8] ignored phylogenetic correlations in the dataset, leading to biased sampling. Phylogrammars incorporate these correlations directly. In these papers, Bruno et al. introduced an initial EM algorithm for estimating rate matrices.
Prediction of secondary structure in proteins[83,26]. In a similar manner to Felsenstein and Churchill, a threestate HMM is constructed wherein each state emits an alignment column using a substitution rate matrix. Here, however, the states correspond to different units of secondary structure (loop, αhelix and βsheet). The substitution rate matrix for each state reflects the frequency distribution and substitution patterns for that secondary structural class. The method performs less well than established secondary structure prediction algorithms, but shows promise, in particular given the simplicity of the model (three states only). Later work expanded the number of states in the phyloHMM to eight (correspondingly increasing the number of parameters). Note that, as more parameters are introduced into this kind of phyloHMM, the problem of "training" those parameters grows in importance.
Prediction of exons and proteincoding gene structures in DNA [68,80]. The basis for the gene prediction programs EVOGENE and EXONIPHY, respectively, these phyloHMMs are based on substitution models for codon triplets with 4^{3 }= 64 states. The paper by Siepel and Haussler introduced the term "phyloHMM" and used an approximate version of the EM algorithm introduced by Holmes and Rubin for parameterization [38].
Detection, modeling and annotation of transcription factor binding sites in DNA [62]. Here, the EM algorithm and other formulae of Bruno and Halpern [9,31] is used to model sitespecific residue frequencies in alignments of promoter regions (rather than proteins, as addressed by Bruno and Halpern).
Detection of conserved regions in multiple alignments of genomic DNA [79]. PhyloHMMs to detect conserved regions can be viewed as extensions of Felsenstein and Churchill's original model with more rate categories. This approach has been used to detect highlyconserved regions in vertebrate, insect, nematode and yeast genomes. Approaches measuring the substitution rate per site [79,85], the local indel rate [54] and/or the CpG mutation bias [81,55] have all shown merit.
Analogously to some of the point substitution models, many phyloHMMs can be expressed parametrically. An example of such a model is the one used by Siepel's PHASTCONS program, whose phyloHMM has ten states ranging from slow to fast overall substitution rate. Moving from one state to another, the relative substitution rates between different nucleotides do not change (i.e. the ratio R_{ij}/R_{kl }is constant for any i, j, k, l ∈ {A, C, G, T}); only the overall substitution rate varies (i.e. the absolute value R_{ij }is not constant). Such consistency across states can be achieved by writing the rate matrices for the ten states as k_{1}R, k_{2 }× R, k_{3 }× R... k_{10 }× R where the k_{i }are scalar multipliers and R is a relative rate matrix shared by all the states. Similarly, the rate matrices of Felsenstein and Churchill's threestate phyloHMM can be written 0.3 × R, 2 × R and 10 × R. Both are examples of the general parametric phyloHMM.
PhyloSCFGs
The most complex class of phylogrammar considered here is the phyloSCFG. Most commonly used to model RNA secondary structure, these grammars are capable of modeling covariation between paired sites. In an SCFG, covarying sites must be strictly nested, allowing the modeling of foldback structures but not pseudoknots, kissing loops or other topologically elaborate RNA structures [45].
Tasks that have been addressed using phyloSCFGs include:
Prediction of RNA secondary structure [46,47]. The Pfold program in this paper introduced the first phyloSCFG, combining stochastic contextfree grammars (used to model RNA structure) with evolutionary substitution models. Since HMMs are a subset of SCFGs, the framework of phyloSCFGs includes the previously discussed phyloHMMs. The Pfold program also allowed for userspecified grammars; however, it lacked a fast EMlike algorithm for estimating grammar parameters from data (by contrast, the nonphylogenetic SCFGs used elsewhere in bioinformatics can be rapidly trained using the InsideOutside algorithm [16]). A key feature of these models is the use of 16state "basepair models" for modeling the simultaneous coevolution of functional basepairs in RNA structures. Again, fast and effective parameterization of the model is an important issue.
Detection of noncoding RNA genes [67]. A similar model to Pfold was used by the Evofold program, which uses a phyloSCFG to parse genomic alignments into noncoding RNA and other features [67].
Detection of RNA secondary structure within exons [69]. The RNADecoder program uses a parametric phyloSCFG to model exonic regions in which there is simultaneous selection on both the translated protein sequence and the secondary structure of the premRNA. Such regions have been found in viral genomes and hypothesized to fulfil a regulatory role [69]. Due to the complexity of these models and the sparsity of training data, parametric rate functions are required to limit the number of free parameters that must be estimated.
Detection of accelerated selection in human noncoding RNA [70]. Pollard et al used phyloHMMs and phyloSCFGs to identify a neurallyexpressed RNA gene, HARF1, that had undergone recent accelerated evolution in the lineage separating humans from the humanchimp ancestor.
Implementation
In practice, users of phylogrammars need to do a similar core set of tasks in order to perform data analysis. These tasks may include model development, structured parameterization, estimation of parameter values and application of the model to annotate alignments. Using the framework of phylogrammars, an implementation enabling all these tasks is possible. The EM algorithm provides a general and consistent approach to parameter estimation, while standard "parsing" algorithms (the Viterbi and CockeYoungerKasami (CYK) algorithms [16]) address the problem of annotation.
We have implemented EM and Viterbi/CYK parsing algorithms in our software. The general irreversible phyloEM algorithm, using eigenvector decompositions, is described in the Supplementary Material to this paper [see 1]. (Note that this model is more general than the "general reversible model" [92], which can be regarded as a special case wherein the rates obey a detailed balance symmetry so that π_{i}R_{ij }= π_{j}R_{ji}.) The main advance over previous descriptions of this algorithm [38,81] is a complete closedform solution for the Mstep of EM for irreversible models, including a full algebraic treatment of the complex conjugate eigenvector pairs [see 1]. This closedform solution for the Mstep eliminates the need for numerical optimization code as part of EM. The Viterbi and CYK algorithms are described in full elsewhere [16].
The essential idea of EM is iteratively to maximize the expected loglikelihood with respect to the rate parameters, where the expectation is taken over the posterior distribution of the missing data using the current parameters. In the case of phyloEM, the missing data are the sequences ancestral to the observed sequence data.
As with many instances of EM, the posterior distribution over the missing data in phyloEM can be summarized via a representative set of "counts" that, being expectations, have convenient additive properties.
These counts have the following intuitive meaning with respect to the ancestral states of the evolutionary process: (i) the expected residue composition at the root node of the tree; (ii) the expected number of times each type of point mutation occurred; (iii) the expected amount of evolutionary time each residue was extant.
Each of these counts is summed over all branches of the phylogenetic tree and then over all columns in the alignment (or groups of columns). The sum over columns is weighted by the posterior probability that each column (or group of columns) was generated by a particular state.
Note that it is relatively easy to obtain naive estimates for the phyloEM counts (e.g. using parsimony), but that such naive estimates are in general systematically biased. In particular, they tend to underestimate the number of substitutions that actually occurred.
A stochastic grammar consists of a set of "nonterminal" symbols (equivalent to the "states" of an HMM), a set of "terminal" symbols and a set of "production rules" for transforming nonterminals. In a contextfree grammar, each production rule transforms a single nonterminal into a (possibly empty) sequence of terminals and/or nonterminals. The iterative application of such rules can be represented as a tree structure known as the "parse tree" [16]. In biological applications, there is typically a large number of parse trees that can explain the observed data. This contrasts with applications in computational linguistics, where there are typically only a small number of parses consistent with the data.
To apply EM to a stochastic grammar, one must compute the expected number of times each production rule was used in the derivation of the observed alignment. These expected counts are summed over the posterior distribution of parse trees, and are calculated using the InsideOutside algorithm.
The set of terminal symbols for a phylogrammar is the set of possible alignment columns (in contrast to a singlesequence grammar, where the set of terminal symbols corresponds to the residue alphabet). The phyloEM algorithm is used to estimate the rate parameters associated with the emission of these symbols by the grammar.
Programs
The following open source software tools, implementing the algorithms and models described in this paper, are freely available (see Availability and Requirements).
xgram – a implementation of the EM algorithm for training phylogrammars, i.e. the InsideOutside and ForwardBackward algorithms combined with the EM algorithm for the general irreversible (and reversible) substitution models. This program implements the general irreversible EM algorithm described in the Supplementary Material [see 1], along with the general reversible EM algorithm described previously [38]. The grammar can be userspecified via an extensible file format, described below. Parametric grammars are allowed (so that individual substitution rates and/or rule probabilities can be constrained to arbitrary functions of a smaller set of model parameters). The xgram tool is capable of reproducing most of the phylogrammar models listed in this paper. In its generic applicability, xgram is similar to the dynamic programming engine Dynamite [5], although the class of models is different (phylogrammars vs single and pairHMMs) and the functionality broader (including parameterization by phyloEM, as well as Viterbi and CYK annotation codes). Also included is an implementation of the neighborjoining algorithm for fast estimation of tree topologies [77], and another version of the EM algorithm for rapidly optimising branch lengths of trees with fixed topology [24]. The model underlying xgram also allows for dynamically evolving "hidden states" associated with each site, again as previously described [38].
xrate – a version of xgram including several "preset" grammars for point substitution models, including the general irreversible and reversible substitution models.
xfold – a version of xgram including several "preset" grammars for RNA analysis, including that of the Pfold program [46].
xprot – a version of xgram including several "preset" grammars for protein analysis, including a grammar similar to that used by Thorne et al. for protein secondary structure prediction [83].
All of the above programs can be driven by any userspecified phylogrammar. Having specified a grammar, or chosen one of the presets, the user can
• Estimate the ML parameterization of the grammar for the training set via EM, using InsideOutside or ForwardBackward algorithms (autoselected by program) [16], together with the phyloEM algorithm described in the Supplementary Material [see 1];
• Find the maximum likelihood (ML) parse tree, using CockeYoungerKasami (CYK) or Viterbi algorithms (autoselected by program) [16], with phylogenetic likelihoods calculated by pruning [21];
• Annotate the alignment, columnbycolumn, with userspecified labels, using the ML parse tree;
• Find the posterior probability of each node in the ML parse tree.
The parse tree can also be constrained, completely or partially, by including complete or partial annotations in the input alignment. For example, one can annotate several known examples of a TF binding site in a multiple alignment. One can then allow the grammar to "learn" these examples and predict new binding sites.
File formats
The input and output format for sequence alignment data is the Stockholm format, as used by PFAM and RFAM. The wildcard character is the period ".". Annotation of columns with the wildcard character allows for incompletely labeled data and hence partially supervised learning. If a given annotation is specified in the grammar but absent from the training data, it will be treated as a string of wildcards and all compatible possibilities will be summed over.
Any phylogrammar can be specified, using a format based on LISP Sexpressions [56,75]. The format is humanreadable and succinct, while being machineparseable and extensible.
Phylogrammar specification files contain several elements:
• An alphabet, describing valid sequence tokens (e.g. nucleotides or amino acids) along with any degenerate or (in the case of nucleotides) complementary tokens.
• One or more chains, each describing a finitestate continuoustime Markov chain, including rate parameters;
• Optionally (for parametric models) a set of rate and probability parameter values;
• A set of transformation rules, which also serve to define the nonterminals in the grammar.
As an example, the grammar for the Kimura twoparameter rate matrix is shown (see figure 3). A more complete and uptodate description of the format can be found online [88], as can discussion of the latest version of xrate and its companion programs [89].
Figure 3. An xgramformat grammar for Kimura's twoparameter model.
Results and discussion
We illustrate the potential of xrate as a quick tool for prototyping phylogrammars by reimplementing several prior applications and testing on real and simulated data. As applications we choose firstly a codon substitution model which is both computationally intensive and parameterrich (due to the size of the rate matrix). Secondly, we compare xrate's performance in predicting protein structure to a previously used phyloHMM. Thirdly, we compare xrate to a previously used phyloSCFG for predicting RNA secondary structure.
To visualize rate matrices, we use figures that we refer to as "bubbleplots" (see figure 11). In a bubbleplot, the area of a circle in the main matrix is proportional to the rate of the corresponding substitution, with the grey circle in the upperleft repesenting the scale. The offset row shows the equilibrium probability distribution over states: here, the area of a circle is proportional to the equilibrium probability of the corresponding state. Additional colorcoding is used on a casebycase basis.
Fitting codon models
In the past, various amino acid substitution models have been estimated using ML techniques (e.g., mtREV [15], WAG [87]). An ML estimation of codon substitution models, however, has seemed infeasible for a long time because of the computational burden involved with such parameterrich models. This section shows that xrate is capable of tackling the problem. The full results of a particular study are being published elsewhere (Kosiol, Holmes and Goldman, in prep.); here, we will restrict attention to simulation results showing that xrate can do these sorts of analyses reliably.
The number of independent parameters for a reversible substitution model with N character states can be calculated as . This means that for the estimation of a 20state amino acid model, 208 independent parameters need to be calculated. In contrast, to estimate a 61state codon model (excluding stop codons), 1889 independent parameters have to be determined.
To test the robustness of xrate's ability to fit parameterrich models to aligned sequence data, we simulated a data set using all phylogenies of the Pandit database of protein domain alignments [86], using a standard model of codon evolution (the MO model [93] [see 1]). In this model, rates of substitutions involving changes to multiple nucleotides are zero, so that the rate matrix is sparsely populated.
xrate is able to recover M0 well from this 'artifical' Pandit database. The true rates used in the simulation are shown (see figure 8). These may be compared with the recovered rates (see figure 9).
A scatter plot of true vs estimated rates allows a more detailed analysis (see figure 10). This plot shows the true instantaneous rates of M0 plotted versus the instantaneous rates estimated from data simulated from M0. If = the points would lie on the bisection line y = x. Thus the deviation of the points from the bisection line indicates how different the rates are.
If one is interested in drawing biological conclusions from the estimated rate parameters, then it is of interest to consider xrate's estimates of rates which are zero in the true model, xrate sometimes inferred erroneously very small nonzero values for the instantaneous rates of double and triple changes from the simulated data set (in the M0 model, which was used to generate the data, such substitutions have zero rate). However, this error can be correctly identified by comparing loglikelihoods calculated by xrate under the following nested models: For the general model allowing for single, double and triple nucleotide changes 1889 parameters had to be estimated. The best likelihood calculated for general estimation is In L_{general }= 28930383.06. Using xrate we can also restrict the rate matrices to single nucleotide changes only. For this model 322 parameters had to be estimated. The best likelihood calculated for restricted estimation is lnL_{restricted }= 28930894.86.
Although the loglikelihood for the general rate matrix allowing for single, double and triple changes is better we can show that the improvement is not significant. Significance is tested using a standard likelihood ratio test between the two models, comparing twice the difference in loglikelihood with a distribution, where 1567 is the degrees of freedom by which the two models differ. Using the normal approximation for we compare (2(ln L_{general } ln L_{restricted})1567)/ = 9.71 with the relevant 99% critical value of 2.33 taken from a standard normal (0,1). The difference is seen to be insignificant; the Pvalue is almost 1.
Predicting protein secondary structure
We compared xrate to the phyloHMM for prediction of protein secondary structure developed by Goldman, Thorne, and Jones [26] (here referred to as GTJ). This section uses a fullyconnected threestate phyloHMM with general reversible Markov chains. Training sets were taken from the HOMSTRAD database of structural alignments of homologous protein families [61].
We trained the phyloHMM on alphabeta barrel alignments from HOMSTRAD, leaving out the betaglycanase SCOP family. xrate was then benchmarked on this betaglycanase SCOP family to compare the annotation predicted by xrate to the experimentally determined HOMSTRAD annotation. We also tried a more comprehensive training regime, training xrate on the complete HOMSTRAD database (excluding the betaglycanase SCOP family) and again comparing predicted and database annotations.
The performance of xrate was compared to that of GTJ. The results show that xrate can be used to quickly prototype and train a phyloHMM with comparable performance to that reported by Goldman et al.
Grammar
The PROT3 phylogrammar has state labels for the three secondary structure classes of alphahelix (H), betasheet (E) and loop (L). An excerpt of the grammar is shown (see figure 4).
Figure 4. An excerpt from an xgramformat grammar reproducing the protein secondary structure phyloHMM of Goldman, Thorne and Jones. This excerpt shows only the transformation rules, and omits the alphabet and chain definitions. Three separate Markov chains for amino acid substitution are used (and are assumed to be defined elsewhere in the file): alpha_col denotes an amino acid in an alpha helix (annotated with character H), beta_col denotes an amino acid in a beta sheet (annotated with character E) and loop_col denotes an amino acid in a loop region (annotated with character L).
An example of usage for this grammar follows. We also show an alignment from HOMSTRAD, too small to predict secondary structure with any confidence, but useful for illustrative purposes (see figure 5). Suppose we want to: (1) read in this alignment from a file named ' pp. stk'; (2) load a point substitution matrix from a file named 'dart/data/nullprot.eg' (this is an aminoacid matrix distributed with xrate; the filename path assumes that the DART package was downloaded to the current working directory); (3) use the above point substitution matrix to estimate a phylogenetic tree (by neighborjoining followed by EM on the branch lengths); (4) load the PROT3 model from a file named 'dart/data/prot3.eg' (again, this is distributed with xrate); and (5) use the PROT3 model to predict secondary structure classes for this protein family, printing the annotated alignment to the standard output. The following commandline syntax achieves this:
Figure 5. Example Stockholmformat input file for the protein secondary structure grammar (see figure 4). The alignment is of the pancreatic hormone family.
xrate pp.stk tree dart/data/nullprot.eg grammar dart/data/prot3.eg
The output of this command is shown (see figure 6).
Figure 6. Example Stockholmformat output using the protein secondary structure grammar (see figure 4) and the pancreatic hormone alignment (see figure 5). Line numbers have been added for reference; note the embedded New Hampshireformat tree at line 2, the Viterbi bitscore at line 3 and the Viterbi secondary structure annotation at line 7.
Figure 7. An excerpt from an xgramformat grammar reproducing the RNA secondary structure phyloSCFG of Knudsen and Hein. This excerpt shows only the transformation rules, and omits the alphabet and chain definitions. Two separate Markov chains for nucleotide substitution are used (and are assumed to be defined elsewhere in the file): LNUC and RNUC denote the left and right (i.e. 5' and 3') nucleotides of a coevolving basepair in a 16state Markov chain (annotated with characters < and >), while NUC denotes an unpaired nucleotide in a 4state Markov chain (annotated with character _).
Figure 8. True codon mutation rate matrix for the M0 mechanistic codon mutation model benchmark (see Results and Discussion). These rates were used to generate simulated data; rates were then estimated from these data and compared to the true rates (see figure 9).
Figure 9. Estimated codon mutation rate matrix for the codon model benchmark (see Results and Discussion). These rates were estimated by xrate from simulated data, generated using a mechanistic rate model (see figure 8).
Figure 10. Scatter plot comparing true instantaneous rates with estimated rates from simulated data for the codon model benchmark (see Results and Discussion).
Figure 11. Bubbleplot of amino acid substitution rates for alphahelices. See Results and Discussion for colorcoding and explanation of bubbleplots.
More such examples can be found in DART (the software library with which xrate is distributed) and on the wiki pages for the xrate program [89]. A full list of commandline options for xrate can be obtained by typing xrate –help or, equivalently, xrate h.
Results
Both xrate and the GTJ program were evaluated on the xylanase alignment used by GTJ, hereafter referred to as gtjxyl. xrate was trained on the subset of HOMSTRAD corresponding to alphabeta barrel structures, with members of the betaglycanase SCOP family (which includes the gtjxyl proteins) removed to prevent overlap between the training and test sets.
We report the prediction accuracy collectively for all secondary structure categories, and the sensitivity and specificity with respect to each individual category. These metrics are defined as follows
Sensitivity(n) = TP_{n}/(TP_{n }+ FN_{n})
Speciflcity(n) = TP_{n}/(TP_{n }+ FP_{n})
where (for secondary structure class n) TP_{n }is the number of true positives (columns correctly predicted as class n), FN_{n }is the number of false negatives (columns that should have been predicted as class n but were not) and FP_{n }is the number of false positives (columns that were incorrectly predicted as class n).
Bubbleplots were used to visualize the amino acid substitution rates. Substitutions are colored red if between aromatic amino acids, green if between hydrophobics and blue if between hydrophilics. Substitutions from one such group to another (e.g. from hydrophobic to hydrophilic) are colored gray.
Figures 11, 12 and 13 show the amino acid substitution matrices for the alphahelix, betasheet and loop states, respectively. The relative rates displayed in the figures in general agree with what one would expect from each of those states: the alphahelix and betasheet states substitute more slowly (and thus amino acid conservation is higher) than for the loop states (loop regions being more variable in structure [7]).
Figure 12. Bubbleplot of amino acid substitution rates for betasheets. See Results and Discussion for colorcoding and explanation of bubbleplots.
Figure 13. Bubbleplot of amino acid substitution rates for loop regions. See Results and Discussion for colorcoding and explanation of bubbleplots.
Table 1 shows the log likelihood scores of the training alignments, log P(Dθ), along with the logposterior probability of the HOMSTRAD reference annotation, log P(AD, θ). In this case, maximumlikelihood training also yields an increase in the annotation posterior probability P(AD, θ). This is not in general a guaranteed result of the EM algorithm, and alternative training procedures (such as maximumdiscrimination training [19]) have been proposed to achieve this effect. It appears in this case that such procedures are not required.
Table 1. Loglikelihood scores of training sets and logposterior probabilities of the true annotations for the PROT3 benchmark. Here D denotes the training alignment data (the HOMSTRAD database without the betaglycanase SCOP family), A denotes the DSSP annotations of the alignment data, θ_{D }denotes the model with parameters obtained from training on D, and θ_{G }denotes the model with parameters obtained from the GTJ datafiles.
Table 2 reports likelihoods, accuracies and runtimes for training set 2 as the EM convergence criteria are tightened. As expected, the likelihood increases as the convergence criteria are made more stringent. The annotation accuracy for the gtjxyl benchmark alignment also consistently increases.
Table 2. Effect of tightening the EM convergence criteria for the PROT3 benchmark. The "mininc" parameter is the minimum fractional loglikelihood increase per iteration of EM. Accuracies for the gtjxyl benchmark alignment are reported, along with loglikelihoods. See Table 1 for additional notation.
Table 3 summarizes the results of running xrate and the GTJ program on all the test cases. In general the accuracy of xrate is comparable to or even slightly better than the accuracy of the GTJ program.
Table 3. Summary of prediction performance for the PROT3 benchmark. "Sn" and "Sp" are the sensitivity and specificity for each secondary structure category; "Acc" is the overall accuracy.
Predicting RNA secondary structure
To illustrate the capability of xrate as a tool for RNA secondary structure prediction/annotation, we compare it to Pfold, a phyloSCFG developed by Knudsen and Hein [46,47].
There are two goals of this section: (1) to see if xrate can exactly emulate the Pfold phylogrammar using the same parameters as Pfold, and (2) to see if the EM algorithm can estimate parameters that yield comparable performance to those produced by other methods.
We benchmarked the Pfold phyloSCFG running on xrate against the original Pfold program using alignments from the Rfam database [30]. To address goal (2), we used xrate to estimate the substitution rates and initial frequencies of basepairs and single nucleotides from annotated Rfam alignments.
Our results show that the Pfold phyloSCFG is effectively emulated by xrate, that the EM algorithm can estimate a more likely parameterization for a given training set and that the parameters so obtained are comparable in performance to the Pfold program itself. We conclude that xrate is a suitable platform for developing, parameterizing, and testing phylogrammars without the necessity of writing source code or performing manual parameterization.
Grammar
The PFOLD grammar is taken from the Pfold program and is described in the paper by Knudsen and Hein [46].
An excerpt of the grammar, containing the production rules, is seen in figure 7 .
Results
We report the sensitivity and positive predictive value (PPV) of basepair predictions. These accuracy metrics are defined as follows
Sensitivity = TP/(TP + FN)
PPV = TP/(TP + FP)
where TP is the number of true positives (base pairs that are predicted correctly per the Rfam annotation), FN the number of false negatives (base pairs that are not predicted but are in the Rfam annotation) and FP the false positives (predicted base pairs that are not in the Rfam annotation).
Training and testing sets were obtained by selecting the 148 RNA gene families in Rfam version 7 with experimentallydetermined structures, discarding pseudoknots, removing excessively gappy columns (as this step is also performed by Pfold), grouping the families into superfamilies and randomly partitioning these superfamilies into two sets [see 1]. This yielded a training set of 71 alignments and a testing set of 77 alignments.
The benchmark results, shown in Table 4, indicate that the sensitivity and PPV of the Pfold program and its emulation on xrate are comparable. It should be noted, however, that the sets of base pairs predicted by the two programs are slightly different [see 1]. After examination, we attribute this to differences in implementation and loss of precision due to numerical calculations.
Table 4. Accuracy of RNA secondary structure prediction. Comparison of sensitivities and PPVs for the Pfold program, its phyloSCFG running on xrate with its original rates, and its phyloSCFG running on xrate with rates estimated from Rfam by the phyloEM algorithm.
We also tested whether parameterizing the phyloSCFG using the EM algorithm is comparable to the Pfold parameterization [46]. A comparison of Pfold's original rates with the EMestimated rates is shown in Figures 14–16. Both sets of parameters display similar trends. Substitutions that create or preserve canonical base pairs are more frequent than substitutions that destroy basepairs (see figure 14). Transitions are more common than transversions, both within basepairs (see figure 15) and unpaired sites (see figure 16). There is a difference in the magnitude of many of the rates, which we attribute to differences in the training sets.
Figure 14. Comparison of basepair substitution rates, colored by basepairing conservation, gain, or loss. Rates and equilibrium frequencies from the Pfold phyloSCFG (left panel) are compared with those estimated by the phyloEM algorithm from Rfam (right panel). Substitutions from noncanonical to canonical basepairs are blue (pairing gain), canonical to canonical are red (pairing conservation), noncanonical to noncanonical are black (unpaired and no change), and canonical to noncanonical are yellow (pairing loss).
Figure 15. Comparison of basepair substitution rates, colored by transitions/transversions. The rates were obtained from the Pfold program and by training on Rfam (see figure 14). Transition of a single base in a pair is dark red, transversion is light red; transitions in both bases is dark green, transition of one and transversion of the other is medium green, transversions of both is light green.
Figure 16. Comparison of substitution rates of nucleotides in unpaired alignment columns. Rates and equilibrium frequencies from the Pfold phyloSCFG (left panel) are compared with those estimated by the phyloEM algorithm from Rfam (right panel). Transitions are green, transversions are black.
The predictive accuracy of Pfold is compared to that of the xratetrained phyloSCFG in Table 4, while loglikelihoods are compared in Tables 5 and 6. The results are similar, indicating that the combination of training set and xrateimplemented EM is comparable to the training procedure used in the development of Pfold.
Table 5. Loglikelihoods of alignments, and logposteriors of alignment annotations, for training and testing datasets under various EM convergence regimes in the PFOLD benchmark. The "mininc" parameter is the minimal fractional increase in the loglikelihood that is considered by our EM implementation to be an improvement, while the "forgive" parameter is the number of iterations of EM without such an improvement that will be tolerated before the algorithm terminates. The default settings are mininc = le3, forgive = 0. Here D denotes the alignment data, A denotes the RFAM secondary structure annotations of the alignment data and θ denotes the model with parameters optimized for the training set using the specified EM convergence criteria.
Table 6. Loglikelihoods of alignments, and logposteriors of alignment annotations, for training and testing datasets using the original Pfold program. Comparison with Table 5 shows that EM training increases all probabilities, as desired.
An important point to check is whether the EM algorithm actually performs as designed. We expect to see certain phenomena if the algorithm is indeed working as expected:
• The algorithm, over the course of its iterations, should refine the parameter set (denoted at the n'th iteration by θ^{(n)}) to maximize the likelihood of the alignment data D and (if supplied) the annotation A. Therefore, the loglikelihood log P(Dθ^{(n)}) should increase with n towards an asymptotic maximum value. This is indeed observed to be the case for this example (see figure 17).
• In practice, the EM algorithm is not run for an infinite number of iterations; rather, the algorithm stops when some "convergence criteria" are met (relating to the fractional increase of the loglikelihood) and the parameters at this point are considered to be the "convergent parameters". We denote this convergent parameter set by θ*.
Figure 17. Loglikelihoods (log2 P(alignment, annotationparameters), red line) increase as the EM algorithm optimizes the model parameters on the training set. The accuracy results for this parameterization are reported in Table 4. The blue line represents the asymptotic best loglikelihood, reached at iteration 27.
• If the EM algorithm is performing effectively (i.e. finding a parameterization whose likelihood is close to the global maximum), we would also expect P(Dθ*) to be greater than P(Dθ') for some arbitrarily chosen parameterization θ' (for example, the KnudsenHein parameters, which were optimized for a dataset other than D). A comparison of Tables 5 and 6 confirms this to be the case.
• As the convergence criteria become more strict, log P(Dθ*) should increase. The results in Table 5 confirm this to be the case.
• If the training set is representative of the test set, then the above statements should also hold true when D is taken to mean the test set. Again, Tables 5 and 6 confirms this.
We note that Tables 5 and 6 shows that the posterior probability of the true annotation, P(AD, θ) = P(A, Dθ)/P(Aθ), is also increased after phyloEM training. As mentioned above, this is not a provably guaranteed result of the EM algorithm, which is designed to maximize only P(A, Dθ).
Conclusion
We have developed a tool, xrate, that combines the power of stochastic grammars, phylogenetic models, and fast automated parameter estimation from training data. The tool combines a novel EM algorithm for estimating rate parameters of the general irreversible substitution model (extending our earlier results for reversible models [38]) with the ForwardBackward and InsideOutside algorithms familiar from the stochastic grammar literature [16]. Novel grammars can be designed by the user, trained automatically, and evaluated without the need for writing or compiling any code. Example grammars that we have used with xrate so far include the phyloHMMs used by Thorne, Goldman and Jones to predict protein secondary structure [83], the phyloSCFGs used by Knudsen and Hein to predict ncRNA structure [46] and the DNA phyloHMMs used by Siepel and Haussler to predict proteincoding genes and find highlyconserved elements [81,80,39,79].
There are many useful applications of stochastic grammars in bioinformatics. Past triumphs of HMMs include protein homology detection [49]; prediction of proteincoding genes [10]; transmembrane and signal peptide annotation [42]; and profiles of fragment libraries for de novo protein structure prediction [76]. Applications of "higherpower" stochastic grammars (i.e. grammars that are situated further up the Chomsky hierarchy, such as TreeAdjoining Grammars [40]) include betasheet prediction [1]; RNA genefinding [74], homology detection [17] and structure prediction [73]; and operon prediction [6].
There are also many useful applications of phylogenetic models. These include reconstruction of phylogenetic trees [22], measurement of K_{a}/K_{s }ratios [27], modeling residue usage [9,31], modeling covariation [71], detecting of conserved residues [90] and sequence alignment [84,33,37]. Furthermore, there are many applications of probabilistic modeling in sequence analysis, e.g. "evolutionary trace" [52] or prediction of deleterious SNPs [65], that are either directly related to the above kinds of models or might productively be linked.
xrate and associated tools comprise an uptodate, friendly implementation of these models for the advanced user. We believe these are powerful tools with broad utility. Our results show that the performance of xrate is comparable to previously described phyloHMM and phyloSCFG implementations customized to specific tasks, and furthermore that the rate estimates produced by xrate can be interpreted in a biologically meaningful way. In releasing this general implementation, our hope is that we and others will use these computational tools to further the application of molecular evolution in biomedical research.
Availability and requirements
Project name : xrate
Project home page : http://biowiki.org/dart webcite
Operating system(s) : Platform independent
Programming language : C++
Other requirements : gcc version 3.3 or higher; GNU build tools (make, ar)
License : GNU GPL
Restrictions to use : None
Abbreviations
CYK : CockeYoungerKasami
DP : Dynamic Programming
EM : Expectation Maximization
HMM : Hidden Markov Model
ML : Maximum Likelihood
PPV : Positive Predictive Value
SCFG : Stochastic ContextFree Grammar
Authors' contributions
PK implemented the irreversible phyloEM algorithm and contributed to the supplementary material describing the algorithm. NG and RB developed the bubbleplot code. CK and NG performed the codon benchmark. YB performed the protein secondary structure benchmark. AU performed the RNA secondary structure. RB and SC performed additional benchmarks and testing of xrate. IH developed the remaining code and drafted the manuscript. IH, CK, NG, YB and AU contributed to the final version of the manuscript.
Acknowledgements
Richard Goldstein, Gerton Lunter and Dawn Brooks gave helpful feedback during the development of xrate.
IH, AU and YB were funded in part by NIH/NHGRI grant 1R01GM07670501. R.B was supported under a National Science Foundation Graduate Research Fellowship. YB was supported in part by the UC Berkeley Graduate Opportunity Fellowship. CK is a member of Wolfson College, University of Cambridge, and was funded by a Wellcome Trust Prize Studentship and an EMBL predoctoral fellowship. NG was partially supported by a Wellcome Trust fellowship.
References

Abe N, Mamitsuka H: Prediction of betasheet structures using stochastic tree grammars. In Proceedings Genome Informatics Workshop V. Universal Academy Press; 1994:1928.

Alexandersson M, Cawley S, Pachter L: SLAM crossspecies gene finding and alignment with a generalized pair hidden Markov model.
Genome Research 2003, 13(3):496502. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Arvestad L, Bruno WJ: Estimation of reversible substitution matrices from multiple pairs of sequences.
Journal of Molecular Evolution 1997, 45(6):696703. PubMed Abstract  Publisher Full Text

Baum LE: An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes.

Birney E, Durbin R: Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Edited by Gaasterland T, Karp P, Karplus K, Ouzounis C, Sander C, Valencia A. Menlo Park, CA, AAAI Press; 1997:5664.

Bockhorst J, Qiu Y, Glasner J, Liu M, Blattner F, Craven M: Predicting bacterial transcription units using sequence and expression data. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA, AAAI Press; 2003:3443.

Branden C, Tooze J: Introduction to Protein Structure. Garland, New York; 1991.

Brown M, Hughey R, Krogh A, Mian IS, Sjölander K, Haussler D: Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. Edited by Hunter L, Searls DB, Shavlik J. Menlo Park, CA, AAAI Press; 1993:4755.

Bruno WJ: Modelling residue usage in aligned protein sequences via maximum likelihood.

Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA.
Journal of Molecular Biology 1997, 268(1):7894. PubMed Abstract  Publisher Full Text

Churchill GA: Stochastic models for heterogeneous DNA sequences.
Bulletin of Mathematical Biology 1989, 51:7994. PubMed Abstract

Dayhoff MO, Eck RV, Park CM: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by Dayhoff MO. National Biomedical Research Foundation, Washington, DC; 1972::8999.

Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by Dayhoff MO. National Biomedical Research Foundation, Washington, DC; 1978::345352.

Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm.

Dimmic MW, Mindell DP, Goldstein RA: Modeling evolution at the protein level using an adjustable amino acid fitness model.
Proceedings of the Fifth Pacific Symposium on Biocomputing 2000, 1829.

Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK; 1998.

Eddy SR: A memoryefficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure.

Eddy SR, Durbin R: RNA sequence analysis using covariance models.
Nucleic Acids Research 1994, 22:20792088. PubMed Abstract  PubMed Central Full Text

Eddy SR, Mitchison GJ, Durbin R: Maximum discrimination hidden Markov models of sequence consensus.
Journal of Computational Biology 1995, 2:923. PubMed Abstract

Engelhardt BE, Jordan MI, Muratore KE, Brenner SE: Protein molecular function prediction by Bayesian phylogenomics.
PLoS Computational Biology 2005., 1(5) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach.
Journal of Molecular Evolution 1981, 17:368376. PubMed Abstract  Publisher Full Text

Felsenstein J: Inferring Phylogenies. Sinauer Associates, Inc; 2003.
ISBN 0878931775.

Felsenstein J, Churchill GA: A hidden Markov model approach to variation among sites in rate of evolution.

Friedman N, Ninio M, Pe'er I, Pupko T: A structural EM algorithm for phylogenetic inference.
Journal of Computational Biology 2002, 9:331353. PubMed Abstract  Publisher Full Text

Gilks W, Richardson S, Spiegelhalter D: Markov Chain Monte Carlo in Practice. Chapman & Hall, London, UK; 1996.

Goldman N, Thorne JL, Jones DT: Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses.
Journal of Molecular Biology 1996, 263(2):196208. PubMed Abstract  Publisher Full Text

Goldman N, Yang Z: A codonbased model of nucleotide substitution for proteincoding DNA sequences.

Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database.
Science 1992, 256(5062):14431445. PubMed Abstract  Publisher Full Text

Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: detection of distantly related proteins.
Proceedings of the National Academy of Sciences of the USA 1987, 84:43554358. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

GriffithsJones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database.
Nucleic Acids Research 2003, 31(1):439441. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Halpern AL, Bruno WJ: Evolutionary distances for proteincoding sequences: modeling sitespecific residue frequencies.

Hasegawa M, Kishino H, Yano T: Dating the humanape splitting by a molecular clock of mitochondrial DNA.
Journal of Molecular Evolution 1985, 22:160174. PubMed Abstract  Publisher Full Text

Hein J: An algorithm for statistical alignment of sequences related by a binary tree. In Pacific Symposium on Biocomputing. Edited by Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE. Singapore, World Scientific; 2001:179190. PubMed Abstract

Hein J, Wiuf C, Knudsen B, Moller MB, Wibling G: Statistical alignment: computational properties, homology testing and goodnessoffit.
Journal of Molecular Biology 2000, 302:265279. PubMed Abstract  Publisher Full Text

Hobolth A, Jensen JL: Statistical inference in evolutionary models of DNA sequences via the EM algorithm.
Statistical applications in Genetics and Molecular Biology 2005., 4(1)

Holmes I: A probabilistic model for the evolution of RNA structure.
BMC Bioinformatics 2004., 5(166) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holmes I, Bruno WJ: Evolutionary HMMs: a Bayesian approach to multiple alignment.
Bioinformatics 2001, 17(9):803820. PubMed Abstract  Publisher Full Text

Holmes I, Rubin GM: An Expectation Maximization algorithm for training hidden substitution models.
Journal of Molecular Biology 2002, 317(5):757768. Publisher Full Text

Jojic V, Jojic N, Meek C, Geiger D, Siepel A, Haussler D, Heckerman D: Efficient approximations for learning phylogenetic HMM models from data.
Bioinformatics 2004, 20(Supplement 1):161168. Publisher Full Text

Joshi A, Schabes Y: Treeadjoining grammars.
1997.

Jukes TH, Cantor C: Evolution of protein molecules. In Mammalian Protein Metabolism. Academic Press, New York; 1969:21132.

Kall L, Krogh A, Sonnhammer EL: A combined transmembrane topology and signal peptide prediction method.
Journal of Molecular Biology 2004, 338(5):10271036. PubMed Abstract  Publisher Full Text

Karlin S, Taylor H: A First Course in Stochastic Processes. Academic Press, San Diego, CA; 1975.

Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.
Journal of Molecular Evolution 1980, 16:111120. PubMed Abstract  Publisher Full Text

Klosterman PS, Tamura M, Holbrook SR, Brenner SE: SCOR: a structural classification of RNA database.
Nucleic Acids Research 2002, 30:392394. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Knudsen B, Hein J: RNA secondary structure prediction using stochastic contextfree grammars and evolutionary history.
Bioinformatics 1999, 15(6):446454. PubMed Abstract  Publisher Full Text

Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic contextfree grammars.
Nucleic Acids Research Evaluation Studies 2003, 31(13):34233428. Publisher Full Text

Koshi JM, Goldstein RA: Contextdependent optimal substitution matrices.
Protein Engineering 1995, 8:641645. PubMed Abstract

Krogh A, Brown M, Mian IS, Sjölander K, Haussler D: Hidden Markov models in computational biology: applications to protein modeling.
Journal of Molecular Biology 1994, 235:15011531. PubMed Abstract  Publisher Full Text

Kschischang FR, Frey BJ, Loeliger HA: Factor graphs and the sumproduct algorithm.
IEEE Transactions on Information Theory 1998, 47(2):498519. Publisher Full Text

Lari K, Young SJ: The estimation of stochastic contextfree grammars using the insideoutside algorithm.
Computer Speech and Language 1990, 4:3556. Publisher Full Text

Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families.
Journal of Molecular Biology 1996, 257:342358. PubMed Abstract  Publisher Full Text

Liò P, Goldman N: Using protein structural information in evolutionary inference: transmembrane proteins.

Lunter G, Ponting CP, Hein J: Genomewide identification of human functional DNA using a neutral indel model.
PLoS Computational Biology 2006., 2(1) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lunter GA, Hein J: A nucleotide substitution model with nearestneighbour interactions.
Bioinformatics 2004, 20(Suppl 1):I216I223. PubMed Abstract  Publisher Full Text

McCarthy JL: Recursive functions of symbolic expressions and their computation by machine.
Communications of the ACM 1960, 3(4):184195. Publisher Full Text

McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. Wiley Interscience; 1996.

Meyer IM, Durbin R: Gene structure conservation aids similarity based gene prediction.
Nucleic Acids Research 2004, 32(2):776783. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Michalek S, Timmer J: Estimating rate constants in hidden Markov models by the EM algorithm.
IEEE Transactions in Signal Processing 1999, 47:226228. Publisher Full Text

Miklós I, Lunter G, Holmes I: A long indel model for evolutionary sequence alignment.
Molecular Biology and Evolution 2004, 21(3):529540. Publisher Full Text

Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families.
Protein Science 1998, 7:24692471. PubMed Abstract  Publisher Full Text

Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB: MONKEY: identifying conserved transcriptionfactor binding sites in multiple alignments using a binding sitespecific evolutionary model.
Genome Biology 2004., 5(12) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Muller T, Vingron M: Modeling amino acid replacement.
Journal of Computational Biology 2000, 7(6):761776. PubMed Abstract  Publisher Full Text

Neyman J: Molecular studies of evolution: a source of novel statistical problems. In Statistical Decision Theory and Related Topics. Edited by Gupta SS, Yackel J. Academic Press, New York; 1971.

Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function.
Nucleic Acids Research 2003, 31(13):38123814. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pearl J: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, San Mateo, California; 1988.

Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, LindbladToh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome.
PLoS Computational Biology 2006, 2(4):e33. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pedersen JS, Hein J: Gene finding with a hidden Markov model of genome structure and evolution.
Bioinformatics 2003, 19(2):219227. PubMed Abstract  Publisher Full Text

Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J: A comparative method for finding and folding RNA secondary structures within proteincoding regions.
Nucleic Acids Research 2004, 32(16):49254923. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pollard KatherineS, Salama SofleR, Lambert Nelle, Lambot MarieAlexandra, Coppens Sandra, Pedersen JakobS, Katzman Sol, King Bryan, Onodera Courtney, Siepel Adam, Kern AndrewD, Dehay Colette, Igel Haller, Ares Manuel Jr, Vanderhaeghen Pierre, Haussler David: An RNA gene expressed during cortical development evolved rapidly in humans.
Nature 2006, 443(7108):167172. PubMed Abstract  Publisher Full Text

Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: maximum likelihood identification and relationship to structure.
Journal of Molecular Biology 1999, 287(1):187198. PubMed Abstract  Publisher Full Text

Rabiner LR, Juang BH: An introduction to hidden Markov models.

Rivas E, Eddy SR: The language of RNA: a formal grammar that includes pseudoknots.
Bioinformatics 2000, 16(4):334340. PubMed Abstract  Publisher Full Text

Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis.
BMC Bioinformatics 2001., 2(8) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rivest R: Sexpressions. Internet Draft. [http://theory.lcs.mit.edu/~rivest/sexp.txt] webcite
1997.

Rohl CA, Strauss CE, Misura KM, Baker D: Protein structure prediction using Rosetta.
Methods in Enzymology 2004, 383:6693. PubMed Abstract  Publisher Full Text

Saitou N, Nei M: The neighborjoining method: a new method for reconstructing phylogenetic trees.

Sakakibara Y, Brown M, Hughey R, Saira Mian I, Kimmen Sjölander, Underwood RC, Haussler D: Stochastic contextfree grammars for tRNA modeling.
Nucleic Acids Research 1994, 22:51125120. PubMed Abstract  PubMed Central Full Text

Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
Genome Research 2005, 15(8):10341050. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Siepel A, Haussler D: Combining phylogenetic and hidden Markov models in biosequence analysis.
Journal of Computational Biology 2004, 11(2–3):413428. PubMed Abstract  Publisher Full Text

Siepel A, Haussler D: Phylogenetic estimation of contextdependent substitution rates by maximum likelihood.
Molecular Biology and Evolution 2004, 21(3):468488. Publisher Full Text

Soyer OS, Goldstein RA: Predicting functional sites in proteins: sitespecific evolutionary models and their application to neurotransmitter transporters.
Journal of Molecular Biology 2004, 339(1):227242. PubMed Abstract  Publisher Full Text

Thorne JL, Goldman N, Jones DT: Combining protein evolution and secondary structure.

Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences.
Journal of Molecular Evolution 1991, 33:114124. PubMed Abstract  Publisher Full Text

Wasserman WW, Fickett JW: Identification of regulatory regions which confer musclespecific gene expression.
Journal of Molecular Biology 1998, 278(1):167181. PubMed Abstract  Publisher Full Text

Whelan S, de Bakker PI, Goldman N: Pandit: a database of protein and associated nucleotide domains with inferred trees.
Bioinformatics 2003, 19(12):15561563. PubMed Abstract  Publisher Full Text

Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximumlikelihood approach.

The xgram file format [http://biowiki.org/XgramFormat] webcite

Information on xrate, xgram, xprot, xfold and related tools [http://biowiki.org/XgramSoftware] webcite

Yang Z: Maximumlikelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites.

Yang Z: Estimating the pattern of nucleotide substitution.
Journal of Molecular Evolution 1994, 39:105111. PubMed Abstract

Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods.
Journal of Molecular Evolution 1994, 39:306314. PubMed Abstract  Publisher Full Text

Yang Z, Nielsen R, Goldman N, Pedersen AM: Codonsubstitution models for heterogeneous selection pressure at amino acid sites.

Yap VB, Speed TP: Statistical Methods in Molecular Evolution, chapter Estimating substitution matrices. Springer; 2005.