<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-428</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>XRate: a fast prototyping, training and annotation tool for phylo-grammars</p>
         </title>
         <aug>
            <au id="A1" ca="no">
               <snm>Klosterman</snm>
               <mi>S</mi>
               <fnm>Peter</fnm>
               <insr iid="I1"/>
               <email>petek@accesscom.com</email>
            </au>
            <au id="A2">
               <snm>Uzilov</snm>
               <mi>V</mi>
               <fnm>Andrew</fnm>
               <insr iid="I1"/>
               <email>andrew.uzilov@gmail.com</email>
            </au>
            <au id="A3">
               <snm>Benda&#241;a</snm>
               <mi>R</mi>
               <fnm>Yuri</fnm>
               <insr iid="I1"/>
               <email>ybendana@berkeley.edu</email>
            </au>
            <au id="A4">
               <snm>Bradley</snm>
               <mi>K</mi>
               <fnm>Robert</fnm>
               <insr iid="I1"/>
               <email>rbradley@berkeley.edu</email>
            </au>
            <au id="A5">
               <snm>Chao</snm>
               <fnm>Sharon</fnm>
               <insr iid="I1"/>
               <email>schao@berkeley.edu</email>
            </au>
            <au id="A6">
               <snm>Kosiol</snm>
               <fnm>Carolin</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>ck285@cornell.edu</email>
            </au>
            <au id="A7">
               <snm>Goldman</snm>
               <fnm>Nick</fnm>
               <insr iid="I2"/>
               <email>goldman@ebi.ac.uk</email>
            </au>
            <au ca="yes" id="A8">
               <snm>Holmes</snm>
               <fnm>Ian</fnm>
               <insr iid="I1"/>
               <email>ihh@berkeley.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Bioengineering, University of California, Berkeley CA, USA</p>
            </ins>
            <ins id="I2">
               <p>European Bioinformatics Institute, Hinxton, Cambridgeshire, UK</p>
            </ins>
            <ins id="I3">
               <p>Department of Biological Statistics and Computational Biology, Cornell University, Ithaca NY, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>428</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/428</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17018148</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-428</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>24</day>
               <month>2</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>03</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>03</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Klosterman et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Recent years have seen the emergence of genome annotation methods based on the <it>phylo-grammar</it>, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Hidden Markov models [HMMs], together with related probabilistic models such as stochastic context-free grammars [SCFGs], are the basis of many algorithms for the analysis of biological sequences <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B8">8</abbr><abbr bid="B10">10</abbr><abbr bid="B16">16</abbr></abbrgrp>. An appealing feature of such models is that once the general structure of the model is specified, the parameters of the model can be estimated from representative "training data" with minimal user intervention (typically using the Expectation Maximization [EM] algorithm <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>). Combined with the continuous-time Markov chain theory of likelihood-based phylogeny, stochastic grammar approaches are finding similarly broad application in comparative sequence analysis, in particular the annotation of multiple alignments <abbrgrp><abbr bid="B83">83</abbr><abbr bid="B26">26</abbr><abbr bid="B53">53</abbr><abbr bid="B46">46</abbr><abbr bid="B74">74</abbr><abbr bid="B80">80</abbr></abbrgrp> (and, in some cases, simultaneous alignment and annotation <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B58">58</abbr></abbrgrp>). This combined model has been dubbed the <it>phylo-grammar</it>. By contrast to the single-sequence case (for which there is much prior art in the field of computational linguistics <abbrgrp><abbr bid="B72">72</abbr><abbr bid="B51">51</abbr></abbrgrp>), the automated parameterization of phylo-grammars from training data is somewhat uncharted territory, partly because the application of the EM algorithm to phylogenetics is a recent addition to the theoretical toolbox. The phylo-grammar approaches that have been used to date have often used approximate and/or inefficient versions of EM to estimate parameters <abbrgrp><abbr bid="B59">59</abbr><abbr bid="B81">81</abbr></abbrgrp>, or have been limited to particular subclasses of model, e.g. reversible or otherwise constrained models <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B38">38</abbr></abbrgrp>.</p>
         <p>Previously, we showed how to apply the EM algorithm to estimate substitution rates in a phylogenetic reversible continuous-time Markov chain model <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. This EM algorithm is exact, using an eigenvector decomposition of the rate matrix to estimate summary statistics for the evolutionary history. We refer to this version of EM as "phylo-EM".</p>
         <p>Here, we report several extensions to the phylo-EM method. Specifically, we give a version of the phylo-EM algorithm for the fully general, irreversible substitution model on a phylogenetic tree (noting that the irreversible model is a generalisation of the reversible case). We then present a flexible package for multiple alignment annotation using phylo-HMMs and phylo-SCFGs that implements these algorithms and is similar, in spirit, to the Dynamite package for generic dynamic programming using HMMs <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>Using this package, it is extremely easy to design, train and apply a novel phylo-grammar, since new models can be loaded from an external, user-specified grammar file. Our hope is that the algorithms and software presented here will aid in the establishment of phylo-grammars in bioinformatics and that such methods will be as widely adopted for comparative genomics as HMMs and SCFGs have been.</p>
      </sec>
      <sec>
         <st>
            <p>Overview</p>
         </st>
         <p>In 1981, Felsenstein published dynamic programming (DP) recursions for computing the likelihood of a phylogenetic tree for aligned sequence data, given an underlying substitution model <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Together with seminal papers by Neyman <abbrgrp><abbr bid="B64">64</abbr></abbrgrp> and Dayhoff <it>et al</it>.<abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>, this work heralded the widespread use of probabilistic models in bioinformatics and molecular evolution. Felsenstein's underlying model is a finite-state continuous-time Markov chain, as described e.g. by Karlin and Taylor <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. It is characterised by an instantaneous rate matrix <b>R </b>describing the instantaneous rates <it>R</it><sub><it>ij </it></sub>of point substitutions from residue <it>i </it>to <it>j</it>. In the unifying language of contemporary "Machine Learning" approaches, Felsenstein's trees are recognisable as a form of graphical model <abbrgrp><abbr bid="B66">66</abbr></abbrgrp> or factor graph <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, and the DP recursions an instance of the sum-product algorithm. (The connection to graphical models has been made more explicit with recent approaches that model other stochastic processes on phylogenetic trees, such as the evolution of molecular function <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.) Many parametric versions of this model have been explored, such as the "HKY85" model <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
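         <p>As an illustration of the DP recursion described above, the following is a minimal Python sketch (not taken from xrate) that computes the likelihood of one alignment column on a small tree. It assumes the Jukes-Cantor model with every off-diagonal rate equal to 1, so that the transition probabilities have a closed form; the tree encoding, the function names and the three-taxon example are hypothetical, and gaps are not handled.</p>

```python
import math

ALPHABET = "ACGT"


def jc69_transition(t, rate=1.0):
    """Return the 4x4 Jukes-Cantor transition matrix P(t) as nested lists.

    Every off-diagonal entry of the rate matrix equals `rate`, which gives
    the closed form used here, so no matrix exponential is needed.
    """
    decay = math.exp(-4.0 * rate * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial_likelihoods(node, column):
    """Pruning recursion: entry i is P(leaves below node | node in state i).

    A leaf is a string naming a sequence; an internal node is a list of
    (child, branch_length) pairs. This encoding is an assumption of the sketch.
    """
    if isinstance(node, str):
        observed = column[node]
        return [1.0 if ALPHABET[i] == observed else 0.0 for i in range(4)]
    result = [1.0] * 4
    for child, branch_length in node:
        child_l = partial_likelihoods(child, column)
        p = jc69_transition(branch_length)
        for i in range(4):
            # Sum over the child's state, then multiply across children.
            result[i] *= sum(p[i][j] * child_l[j] for j in range(4))
    return result


def column_likelihood(tree, column):
    """Sum over root states, weighted by the uniform JC69 equilibrium."""
    return sum(0.25 * x for x in partial_likelihoods(tree, column))


# Hypothetical three-taxon tree: ((human, chimp), mouse)
tree = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.5)]
print(column_likelihood(tree, {"human": "A", "chimp": "A", "mouse": "G"}))
```

         <p>Under the usual assumption that columns are independent, the likelihood of a whole alignment is the product of such column likelihoods.</p>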
         <p>Beginning in the late 1980s, another class of probabilistic models for biological sequence analysis was developed. These models included HMMs for DNA <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and proteins <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, and SCFGs for RNA <abbrgrp><abbr bid="B78">78</abbr><abbr bid="B18">18</abbr></abbrgrp>. Collectively, such models form a subset of the <b>stochastic grammars</b>. Originally used to annotate individual sequences, stochastic grammars were soon also combined with phylogenetic models to annotate alignments. Thus, trees have been combined with HMMs and/or SCFGs to predict genes <abbrgrp><abbr bid="B68">68</abbr></abbrgrp> and conserved regions <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> in DNA sequences, secondary structures <abbrgrp><abbr bid="B83">83</abbr><abbr bid="B26">26</abbr></abbrgrp> and transmembrane topologies <abbrgrp><abbr bid="B53">53</abbr></abbrgrp> in protein sequences, and basepairing structures in RNA sequences <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. We refer to such hybrid models as <b>phylo-grammars</b>. Associated with these advances were novel methods to approximate context dependence of substitution models, such as CpG and other dinucleotide effects <abbrgrp><abbr bid="B81">81</abbr><abbr bid="B55">55</abbr></abbrgrp>. The phylo-grammars can also be viewed as a subclass of the "statistical alignment" grammars <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B37">37</abbr><abbr bid="B60">60</abbr><abbr bid="B36">36</abbr></abbrgrp>, which are derived from more rigorous assumptions about the underlying evolutionary model, including indels <abbrgrp><abbr bid="B84">84</abbr></abbrgrp>.</p>
         <p>A compelling attraction of stochastic grammars (and probabilistic models in general) is that parameters can be systematically "learned" from data by maximum likelihood (ML). One reasonably good, general, albeit greedy and imperfect, approximation to ML is the EM algorithm <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. EM applies to models which generate both "hidden" and "observed" data; e.g., the transcriptional/translational structure of a gene (hidden) and the raw genomic sequence (observed). The applications of EM to training HMMs (the Baum-Welch algorithm) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and SCFGs (Inside-Outside) <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> are well-established (reviewed in <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>), but what of phylo-grammars? While a limited version of EM for substitution models was published in 1996 <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B31">31</abbr></abbrgrp>, the full derivation for the general reversible rate matrix did not appear until 2002 <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. The phylo-EM algorithm for rate matrices has since been further developed <abbrgrp><abbr bid="B94">94</abbr><abbr bid="B35">35</abbr></abbrgrp>. (Various alternatives to phylo-EM, such as eigenvector projections <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and the "resolvent" <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>, have also been used to estimate rate matrices; some approximate versions of phylo-EM have also been described <abbrgrp><abbr bid="B81">81</abbr><abbr bid="B82">82</abbr></abbrgrp>.)</p>
         <p>Conceptually, EM is straightforward: one simply alternates between imputing the hidden data (the "E-step") and optimizing the parameters (the "M-step"). The E-step typically results in a set of "expected counts" which are intuitively easy to interpret. (For example, the E-step for phylogenetic trees returns the number of times each substitution is expected to have occurred on each branch.) The EM algorithm has been intensely scrutinized and has been shown to be versatile, adaptable and fast <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B57">57</abbr></abbrgrp>, particularly the special case of phylo-EM <abbrgrp><abbr bid="B94">94</abbr></abbrgrp>. We therefore argue that there are strong advantages to combining the form of EM used to train stochastic grammars (i.e. the Baum-Welch and Inside-Outside algorithms <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>) with the phylo-EM form used for parameterizing substitution models on phylogenetic trees <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>.</p>
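         <p>To make the role of the expected counts concrete, the sketch below shows a plausible M-step for an unconstrained rate matrix, assuming the E-step has already supplied the expected number of substitutions between each pair of states and the expected time spent in each state. It uses the standard complete-data maximum-likelihood estimate for a continuous-time Markov chain (count divided by waiting time); the function name and the two-state input values are purely illustrative, and the E-step itself is not shown.</p>

```python
def m_step_rates(expected_counts, expected_waits):
    """Re-estimate a rate matrix from E-step summary statistics.

    expected_counts[i][j]: expected number of i->j substitutions (i != j).
    expected_waits[i]:     expected total time spent in state i.
    Returns R with R[i][j] = counts[i][j] / waits[i] off the diagonal and
    each diagonal entry set so that its row sums to zero.
    """
    n = len(expected_waits)
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                rates[i][j] = expected_counts[i][j] / expected_waits[i]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        rates[i][i] = -sum(rates[i])
    return rates


# Illustrative (made-up) statistics for a two-state chain.
counts = [[0.0, 3.0], [1.5, 0.0]]
waits = [10.0, 5.0]
for row in m_step_rates(counts, waits):
    print(row)
```

         <p>In a full EM loop, the re-estimated matrix would be fed back into the E-step and the two steps repeated until the likelihood stops improving.</p>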
      </sec>
      <sec>
         <st>
            <p>Previous applications of phylo-grammars</p>
         </st>
         <p>The program we have developed can handle a broad class of phylo-grammars within one framework. The following is a brief review of prior work that either uses phylo-grammars, or is ideally suited to the phylo-grammar framework.</p>
         <p>This section is organized according to the complexity of the grammar, beginning with the simplest. Generally speaking, a phylo-grammar can be used to annotate a multiple sequence alignment in any context where a stochastic grammar could be used to annotate an individual sequence. The applications span DNA, RNA and protein sequence annotation.</p>
         <sec>
            <st>
               <p>Point substitution models</p>
            </st>
            <p>A subset of the class of phylo-grammars is the class of homogeneous substitution models, where the mutation rate is not a function of position but rather is identical for every site. Such models can be represented as a single-state phylo-HMM. Examples include:</p>
            <p><b>The Jukes-Cantor model </b><abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, <b>Kimura's two-parameter model </b><abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, <b>the HKY85 model </b><abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, <b>the general reversible model </b><abbrgrp><abbr bid="B92">92</abbr></abbrgrp>, <b>and the general irreversible model </b><abbrgrp><abbr bid="B91">91</abbr></abbrgrp>. In the case of the Kimura and HKY85 models, the rate matrices are formulated parametrically: that is, each substitution rate is expressed as a function of a small set of rate and/or probability parameters (e.g. in Kimura's model, there are two rate parameters: the transition rate and the transversion rate).</p>
            <p><b>Variable-rate models, where the evolutionary rate is allowed to vary from site to site </b><abbrgrp><abbr bid="B90">90</abbr></abbrgrp>. Yang used a finite number of discrete, fixed rate categories to approximate a continuous gamma distribution over site-specific rates. In essence, this can be viewed as a special case of the phylo-HMM of Felsenstein and Churchill <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, with the autocorrelation explicitly set to zero.</p>
            <p><b>Hidden-state models </b><abbrgrp><abbr bid="B48">48</abbr><abbr bid="B38">38</abbr></abbrgrp>. A relative of the variable-rate model, the hidden-state model allows a variety of different substitution rate matrices to be used, depending on a hidden state variable that specifies the structural context of the site <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. For example, a hydrophobically-inclined rate matrix might be used for buried amino acids and a hydrophilic matrix for exposed amino acids. An extension to the hidden-state model allows the hidden state variable itself to change over time at some slow rate, modeling rare changes in structural context <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. An alternative extension allows correlations between hidden state variables at adjacent sites: this is essentially the idea behind the phylo-HMM, described below.</p>
            <p><b>Models for synonymous/nonsynonymous substitution ratio measurement; empirical rate matrices for codon evolution </b><abbrgrp><abbr bid="B27">27</abbr><abbr bid="B87">87</abbr></abbrgrp>. Codon substitution matrices such as WAG <abbrgrp><abbr bid="B87">87</abbr></abbrgrp> can be used to measure the ratio <it>r </it>of nonsynonymous to synonymous substitution rates, which may be indicative of purifying (<it>r </it>&lt; 1), neutral (<it>r </it>= 1) or diversifying (<it>r </it>> 1) selection. These models are also related to the exon prediction phylo-HMMs in EVOGENE <abbrgrp><abbr bid="B68">68</abbr></abbrgrp> and EXONIPHY <abbrgrp><abbr bid="B80">80</abbr></abbrgrp>, described below.</p>
            <p><b>Amino acid substitution models </b><abbrgrp><abbr bid="B12">12</abbr><abbr bid="B28">28</abbr></abbrgrp>. Likelihood calculations using these models can, as with the other substitution models discussed above, be viewed as trivial applications of phylo-grammars.</p>
            <p><b>Context-sensitive substitution models </b><abbrgrp><abbr bid="B81">81</abbr></abbrgrp>. Siepel and Haussler introduced several alternate approximations for calculating the likelihood of alignments assuming a nearest neighbor substitution model, suitable for capturing the context-sensitivity of the substitution process that is observed in real sequence alignments (most notoriously in genomes wherein CpG methylation is used as a mechanism of epigenetic regulation, leading to elevated rates for the mutations CpG&#8594;TpG and CpG&#8594;ApG). Siepel and Haussler's method ignores longer-range correlations induced by nearest-neighbor effects, but is effective in practice. (It may be regarded as an approximation to the more rigorous analysis of Lunter and Hein <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>.)</p>
            <p>Many of these models can be expressed using the <b>General Parametric Substitution Model</b>, which we define as the substitution model wherein all substitution rates and initial probabilities can be expressed as simple functions of a (reduced) set of rate and probability parameters. As an example, Kimura's two-parameter model <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> is shown (see figure <figr fid="F1">1</figr>) along with the HKY85 six-parameter model <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> (see figure <figr fid="F2">2</figr>).</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Kimura's two-parameter model</p>
               </caption>
               <text>
                  <p>Kimura's two-parameter model. The state order is {<it>A</it>, <it>C</it>, <it>G</it>, <it>T</it>}. Each entry is a function of the reduced parameter set (<it>&#945;</it>, <it>&#946;</it>) where <it>&#945; </it>and <it>&#946; </it>are rates.</p>
               </text>
               <graphic file="1471-2105-7-428-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Hasegawa <it>et al</it>'s six-parameter model</p>
               </caption>
               <text>
                  <p>Hasegawa <it>et al</it>'s six-parameter model. The state order is {<it>A</it>, <it>C</it>, <it>G</it>, <it>T</it>}. The negative on-diagonal elements have been omitted for brevity (they are constrained by the requirement that each row sums to zero). Each entry is a function of the reduced parameter set (<it>&#945;</it>, <it>&#946;</it>, <it>&#960;</it><sub><it>A</it></sub>, <it>&#960;</it><sub><it>C</it></sub>, <it>&#960;</it><sub><it>G</it></sub>, <it>&#960;</it><sub><it>T</it></sub>) where (<it>&#945;</it>, <it>&#946;</it>) are rates and (<it>&#960;</it><sub><it>A</it></sub>, <it>&#960;</it><sub><it>C</it></sub>, <it>&#960;</it><sub><it>G</it></sub>, <it>&#960;</it><sub><it>T</it></sub>) are probabilities.</p>
               </text>
               <graphic file="1471-2105-7-428-2"/>
            </fig>
            <p>As long as each parameter in a parametric substitution model can be interpreted either as a rate (such as Kimura's transition and transversion rates) or a probability (such as the HKY85 equilibrium distribution over nucleotides), the phylo-EM algorithm can be adapted to estimate such parameters via the computation of expected event counts. A formal description of the sets of allowable rate and probability functions is given in the Supplementary Material [see <supplr sid="S1">Additional file 1</supplr>].</p>
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p>XRate: a fast prototyping, training and annotation tool for phylo-grammars. Supplementary material. A full description of the phylo-EM algorithm for irreversible substitution models. Also contains details of experimental procedures.</p>
               </text>
               <file name="1471-2105-7-428-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Although the particular models used above (Kimura and HKY85) are reversible, matrices of allowable rate functions can in general be irreversible. Our General Parametric Model may thus be regarded as a generalisation of the General Irreversible Model.</p>
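            <p>The parametric formulation can be sketched in a few lines of Python. The code below builds the Kimura and HKY85 rate matrices from their reduced parameter sets and checks the detailed balance condition that makes them reversible. It assumes the commonly used convention in which the first rate parameter applies to transitions and the second to transversions, and in which each HKY85 rate is multiplied by the equilibrium probability of the destination nucleotide; the exact layout in Figures 1 and 2 may differ, and all function names and parameter values are hypothetical.</p>

```python
NUCLEOTIDES = "ACGT"  # state order used in the figures
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine-purine or pyrimidine-pyrimidine changes (A-G, C-T)."""
    return (x in PURINES) == (y in PURINES)


def build_rate_matrix(off_diagonal):
    """Fill a 4x4 matrix from a function giving each off-diagonal rate.

    Diagonal entries are set so that every row sums to zero.
    """
    rates = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCLEOTIDES):
        for j, y in enumerate(NUCLEOTIDES):
            if i != j:
                rates[i][j] = off_diagonal(x, y, j)
        rates[i][i] = -sum(rates[i])
    return rates


def kimura_rate_matrix(alpha, beta):
    """Two rate parameters: alpha for transitions, beta for transversions."""
    return build_rate_matrix(
        lambda x, y, j: alpha if is_transition(x, y) else beta
    )


def hky85_rate_matrix(alpha, beta, pi):
    """Two rates plus four equilibrium probabilities pi (order A, C, G, T)."""
    return build_rate_matrix(
        lambda x, y, j: (alpha if is_transition(x, y) else beta) * pi[j]
    )


def is_reversible(rates, pi, tol=1e-12):
    """Detailed balance: pi[i] * R[i][j] equals pi[j] * R[j][i] for all i, j."""
    return all(
        tol >= abs(pi[i] * rates[i][j] - pi[j] * rates[j][i])
        for i in range(4)
        for j in range(4)
    )


pi = [0.3, 0.2, 0.2, 0.3]  # illustrative equilibrium frequencies
hky = hky85_rate_matrix(2.0, 1.0, pi)
print(is_reversible(hky, pi))
```

            <p>Passing a different function to <it>build_rate_matrix</it>, for example one that assigns a distinct rate to a single ordered pair of nucleotides, yields a matrix that generally fails this check, which is the sense in which parametric models can be irreversible.</p>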
         </sec>
         <sec>
            <st>
               <p>Phylo-HMMs</p>
            </st>
            <p>Phylo-HMMs form a class of models slightly more complex than point substitution models. In a phylo-HMM, each column (or group of adjacent columns) is associated with a hidden state, representing the evolutionary context of the site. Each hidden state is conditionally dependent upon the immediately preceding state (the Markov property).</p>
            <p>Tasks that have been addressed using phylo-HMMs include:</p>
            <p><b>Measurement of variation of evolutionary rate among sites in DNA </b><abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Felsenstein and Churchill construct an HMM with three states. Each state generates an alignment column according to a point substitution process on a tree <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The overall evolutionary rate for the column depends on the state from which it is emitted: each state thus corresponds to a "rate category" (the relative rates for the three states are 0.3, 2.0 and 10.0). The use of an HMM allows for an autocorrelated model of rate variation.</p>
            <p><b>Modeling site-specific residue usage in proteins</b><abbrgrp><abbr bid="B9">9</abbr><abbr bid="B31">31</abbr></abbrgrp>. While site-specific profiles are familiar tools in bioinformatics, early tools such as Gribskov profiles <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> and hidden Markov models <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> ignored phylogenetic correlations in the dataset, leading to biased sampling. Phylo-grammars incorporate these correlations directly. In these papers, Bruno <it>et al</it>. introduced an initial EM algorithm for estimating rate matrices.</p>
            <p><b>Prediction of secondary structure in proteins</b><abbrgrp><abbr bid="B83">83</abbr><abbr bid="B26">26</abbr></abbrgrp>. In a similar manner to Felsenstein and Churchill, a three-state HMM is constructed wherein each state emits an alignment column using a substitution rate matrix. Here, however, the states correspond to different units of secondary structure (loop, <it>&#945;</it>-helix and <it>&#946;</it>-sheet). The substitution rate matrix for each state reflects the frequency distribution and substitution patterns for that secondary structural class. The method performs less well than established secondary structure prediction algorithms, but shows promise, in particular given the simplicity of the model (three states only). Later work expanded the number of states in the phylo-HMM to eight (correspondingly increasing the number of parameters). Note that, as more parameters are introduced into this kind of phylo-HMM, the problem of "training" those parameters grows in importance.</p>
            <p><b>Prediction of exons and protein-coding gene structures in DNA </b><abbrgrp><abbr bid="B68">68</abbr><abbr bid="B80">80</abbr></abbrgrp>. The basis for the gene prediction programs EVOGENE and EXONIPHY, respectively, these phylo-HMMs are based on substitution models for codon triplets with 4<sup>3 </sup>= 64 states. The paper by Siepel and Haussler introduced the term "phylo-HMM" and used an approximate version of the EM algorithm introduced by Holmes and Rubin for parameterization <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>.</p>
            <p><b>Detection, modeling and annotation of transcription factor binding sites in DNA </b><abbrgrp><abbr bid="B62">62</abbr></abbrgrp>. Here, the EM algorithm and other formulae of Bruno and Halpern <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B31">31</abbr></abbrgrp> are used to model site-specific residue frequencies in alignments of promoter regions (rather than proteins, as addressed by Bruno and Halpern).</p>
            <p><b>Detection of conserved regions in multiple alignments of genomic DNA </b><abbrgrp><abbr bid="B79">79</abbr></abbrgrp>. Phylo-HMMs to detect conserved regions can be viewed as extensions of Felsenstein and Churchill's original model with more rate categories. This approach has been used to detect highly-conserved regions in vertebrate, insect, nematode and yeast genomes. Approaches measuring the substitution rate per site <abbrgrp><abbr bid="B79">79</abbr><abbr bid="B85">85</abbr></abbrgrp>, the local indel rate <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> and/or the CpG mutation bias <abbrgrp><abbr bid="B81">81</abbr><abbr bid="B55">55</abbr></abbrgrp> have all shown merit.</p>
            <p>Analogously to some of the point substitution models, many phylo-HMMs can be expressed parametrically. An example of such a model is the one used by Siepel's PHASTCONS program, whose phylo-HMM has ten states ranging from slow to fast overall substitution rate. Moving from one state to another, the <it>relative </it>substitution rates between different nucleotides do not change (i.e. the ratio <it>R</it><sub><it>ij</it></sub>/<it>R</it><sub><it>kl </it></sub>is constant for any <it>i</it>, <it>j</it>, <it>k</it>, <it>l </it>&#8712; {<it>A</it>, <it>C</it>, <it>G</it>, <it>T</it>}); only the <it>overall </it>substitution rate varies (i.e. the absolute value <it>R</it><sub><it>ij </it></sub>is not constant). Such consistency across states can be achieved by writing the rate matrices for the ten states as <it>k</it><sub>1 </sub>&#215; <b>R</b>, <it>k</it><sub>2 </sub>&#215; <b>R</b>, <it>k</it><sub>3 </sub>&#215; <b>R</b>... <it>k</it><sub>10 </sub>&#215; <b>R </b>where the <it>k</it><sub><it>i </it></sub>are scalar multipliers and <b>R </b>is a relative rate matrix shared by all the states. Similarly, the rate matrices of Felsenstein and Churchill's three-state phylo-HMM can be written 0.3 &#215; <b>R</b>, 2 &#215; <b>R </b>and 10 &#215; <b>R</b>. Both are examples of the general parametric phylo-HMM.</p>
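            <p>A rate-category phylo-HMM of this kind can be sketched as follows. The Python code below runs the forward algorithm over the columns of a pairwise alignment, with three hidden states whose rate matrices are 0.3, 2 and 10 times a shared Jukes-Cantor matrix. This is a simplified illustration rather than the model of any published program: the restriction to two sequences, the unit off-diagonal rate, the uniform initial distribution and the self-transition probability of 0.9 are all assumptions made for brevity.</p>

```python
import math

RATE_MULTIPLIERS = [0.3, 2.0, 10.0]  # relative rates quoted in the text


def pair_column_likelihood(x, y, distance, multiplier):
    """Likelihood of an aligned residue pair under Jukes-Cantor scaled by k.

    Scaling the rate matrix to k * R is equivalent to scaling time, so the
    closed-form JC69 transition probability is evaluated at k * distance.
    """
    decay = math.exp(-4.0 * multiplier * distance)
    p = 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay
    return 0.25 * p  # uniform equilibrium probability of the first residue


def forward_likelihood(seq_a, seq_b, distance, stay=0.9):
    """Forward algorithm; hidden states are the rate categories.

    `stay` is the (hypothetical) probability of keeping the same category
    in the next column; the remainder is split evenly among the others.
    """
    n = len(RATE_MULTIPLIERS)
    move = (1.0 - stay) / (n - 1)
    forward = None
    for x, y in zip(seq_a, seq_b):
        emit = [pair_column_likelihood(x, y, distance, k) for k in RATE_MULTIPLIERS]
        if forward is None:
            prior = [1.0 / n] * n  # uniform initial distribution
        else:
            prior = [
                sum(forward[r] * (stay if r == s else move) for r in range(n))
                for s in range(n)
            ]
        forward = [prior[s] * emit[s] for s in range(n)]
    return sum(forward) if forward is not None else 1.0


print(forward_likelihood("ACGTAC", "ACGAAT", 0.1))
```

            <p>For long alignments the forward values would need to be rescaled or kept in log space to avoid numerical underflow, and a multi-sequence version would replace the pairwise column likelihood with a tree-based one.</p>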
         </sec>
         <sec>
            <st>
               <p>Phylo-SCFGs</p>
            </st>
            <p>The most complex class of phylo-grammar considered here is the phylo-SCFG. Most commonly used to model RNA secondary structure, these grammars are capable of modeling covariation between paired sites. In an SCFG, covarying sites must be strictly nested, allowing the modeling of foldback structures but not pseudoknots, kissing loops or other topologically elaborate RNA structures <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>.</p>
            <p>Tasks that have been addressed using phylo-SCFGs include:</p>
            <p><b>Prediction of RNA secondary structure </b><abbrgrp><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>. The Pfold program described in these papers introduced the first phylo-SCFG, combining stochastic context-free grammars (used to model RNA structure) with evolutionary substitution models. Since HMMs are a subset of SCFGs, the framework of phylo-SCFGs includes the previously discussed phylo-HMMs. The Pfold program also allowed for user-specified grammars; however, it lacked a fast EM-like algorithm for estimating grammar parameters from data (by contrast, the non-phylogenetic SCFGs used elsewhere in bioinformatics can be rapidly trained using the Inside-Outside algorithm <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>). A key feature of these models is the use of 16-state "basepair models" for modeling the simultaneous coevolution of functional base-pairs in RNA structures. Again, fast and effective parameterization of the model is an important issue.</p>
            <p><b>Detection of noncoding RNA genes </b><abbrgrp><abbr bid="B67">67</abbr></abbrgrp>. A similar model to Pfold was used by the Evofold program, which uses a phylo-SCFG to parse genomic alignments into noncoding RNA and other features <abbrgrp><abbr bid="B67">67</abbr></abbrgrp>.</p>
            <p><b>Detection of RNA secondary structure within exons </b><abbrgrp><abbr bid="B69">69</abbr></abbrgrp>. The RNA-Decoder program uses a parametric phylo-SCFG to model exonic regions in which there is simultaneous selection on both the translated protein sequence and the secondary structure of the pre-mRNA. Such regions have been found in viral genomes and hypothesized to fulfil a regulatory role <abbrgrp><abbr bid="B69">69</abbr></abbrgrp>. Due to the complexity of these models and the sparsity of training data, parametric rate functions are required to limit the number of free parameters that must be estimated.</p>
            <p><b>Detection of accelerated selection in human noncoding RNA </b><abbrgrp><abbr bid="B70">70</abbr></abbrgrp>. Pollard <it>et al</it>. used phylo-HMMs and phylo-SCFGs to identify a neurally-expressed RNA gene, HAR1F, that had undergone recent accelerated evolution in the lineage separating humans from the human-chimp ancestor.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>In practice, users of phylo-grammars need to do a similar core set of tasks in order to perform data analysis. These tasks may include model development, structured parameterization, estimation of parameter values and application of the model to annotate alignments. Using the framework of phylo-grammars, an implementation enabling all these tasks is possible. The EM algorithm provides a general and consistent approach to parameter estimation, while standard "parsing" algorithms (the Viterbi and Cocke-Younger-Kasami (CYK) algorithms <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>) address the problem of annotation.</p>
         <p>We have implemented EM and Viterbi/CYK parsing algorithms in our software. The general irreversible phylo-EM algorithm, using eigenvector decompositions, is described in the Supplementary Material to this paper [see <supplr sid="S1">Additional file 1</supplr>]. (Note that this model is more general than the "general reversible model" <abbrgrp><abbr bid="B92">92</abbr></abbrgrp>, which can be regarded as a special case wherein the rates obey a detailed balance symmetry so that <it>&#960;</it><sub><it>i</it></sub><it>R</it><sub><it>ij </it></sub>= <it>&#960;</it><sub><it>j</it></sub><it>R</it><sub><it>ji</it></sub>.) The main advance over previous descriptions of this algorithm <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B81">81</abbr></abbrgrp> is a complete closed-form solution for the M-step of EM for irreversible models, including a full algebraic treatment of the complex conjugate eigenvector pairs [see <supplr sid="S1">Additional file 1</supplr>]. This closed-form solution for the M-step eliminates the need for numerical optimization code as part of EM. The Viterbi and CYK algorithms are described in full elsewhere <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>.</p>
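         <p>The detailed balance condition above can be checked numerically. The following is a minimal sketch only: the function name is_reversible, the tolerance and the two toy rate matrices are illustrative assumptions and are not part of the xrate code.</p>

```python
def is_reversible(rates, pi, tol=1e-9):
    """Return True if pi[i] * rates[i][j] equals pi[j] * rates[j][i] for all i, j."""
    n = len(pi)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pi[i] * rates[i][j] - pi[j] * rates[j][i]) > tol:
                return False
    return True


# Toy example 1: all off-diagonal rates equal (obeys detailed balance).
uniform_pi = [0.25, 0.25, 0.25, 0.25]
symmetric = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]

# Toy example 2: a one-way cycle 0 to 1 to 2 to 3 to 0. The uniform
# distribution is still stationary, but detailed balance fails.
cyclic = [[0.0] * 4 for _ in range(4)]
for i in range(4):
    cyclic[i][(i + 1) % 4] = 1.0
    cyclic[i][i] = -1.0

print(is_reversible(symmetric, uniform_pi))  # True
print(is_reversible(cyclic, uniform_pi))     # False
```

         <p>The second example illustrates why the irreversible model is strictly more general: a chain can have a well-defined equilibrium distribution without satisfying detailed balance.</p>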
         <p>The essential idea of EM is iteratively to maximize the <it>expected log-likelihood </it>with respect to the rate parameters, where the expectation is taken over the posterior distribution of the missing data using the current parameters. In the case of phylo-EM, the missing data are the sequences ancestral to the observed sequence data.</p>
         <p>As with many instances of EM, the posterior distribution over the missing data in phylo-EM can be summarized via a representative set of "counts" that, being expectations, have convenient additive properties.</p>
         <p>These counts have the following intuitive meaning with respect to the ancestral states of the evolutionary process: (i) the expected residue composition at the root node of the tree; (ii) the expected number of times each type of point mutation occurred; (iii) the expected amount of evolutionary time each residue was extant.</p>
         <p>Each of these counts is summed over all branches of the phylogenetic tree and then over all columns in the alignment (or groups of columns). The sum over columns is weighted by the posterior probability that each column (or group of columns) was generated by a particular state.</p>
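         <p>To make the role of the three kinds of counts concrete, the sketch below applies the textbook maximum-likelihood update for an unconstrained continuous-time Markov chain: each rate is the expected number of substitutions divided by the expected time spent in the source state, and the root distribution is the normalized root composition. This is an illustration under stated assumptions, not the closed-form M-step derived in Additional file 1; the function name m_step and the numerical counts are hypothetical.</p>

```python
def m_step(root_counts, subst_counts, wait_times):
    """Update root distribution and rate matrix from expected counts.

    root_counts[i]     : expected number of times residue i occurs at the root
    subst_counts[i][j] : expected number of i-to-j substitutions
    wait_times[i]      : expected evolutionary time spent in residue i
    """
    n = len(root_counts)
    total = sum(root_counts)
    pi = [c / total for c in root_counts]
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                rates[i][j] = subst_counts[i][j] / wait_times[i]
        # Diagonal entry makes the row sum to zero (it is 0.0 when summed).
        rates[i][i] = -sum(rates[i])
    return pi, rates


# Made-up counts for a two-state chain, for illustration only.
pi, rates = m_step(
    root_counts=[3.0, 1.0],
    subst_counts=[[0.0, 2.0], [1.0, 0.0]],
    wait_times=[4.0, 2.0],
)
print(pi)     # [0.75, 0.25]
print(rates)  # [[-0.5, 0.5], [0.5, -0.5]]
```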
         <p>Note that it is relatively easy to obtain naive estimates for the phylo-EM counts (e.g. using parsimony), but that such naive estimates are in general systematically biased. In particular, they tend to underestimate the number of substitutions that actually occurred.</p>
         <p>A stochastic grammar consists of a set of "nonterminal" symbols (equivalent to the "states" of an HMM), a set of "terminal" symbols and a set of "production rules" for transforming nonterminals. In a context-free grammar, each production rule transforms a single nonterminal into a (possibly empty) sequence of terminals and/or nonterminals. The iterative application of such rules can be represented as a tree structure known as the "parse tree" <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. In biological applications, there is typically a large number of parse trees that can explain the observed data. This contrasts with applications in computational linguistics, where there are typically only a small number of parses consistent with the data.</p>
         <p>To apply EM to a stochastic grammar, one must compute the expected number of times each production rule was used in the derivation of the observed alignment. These expected counts are summed over the posterior distribution of parse trees, and are calculated using the Inside-Outside algorithm.</p>
         <p>The set of terminal symbols for a phylo-grammar is the set of possible alignment columns (in contrast to a single-sequence grammar, where the set of terminal symbols corresponds to the residue alphabet). The phylo-EM algorithm is used to estimate the rate parameters associated with the emission of these symbols by the grammar.</p>
         <sec>
            <st>
               <p>Programs</p>
            </st>
            <p>The following open source software tools, implementing the algorithms and models described in this paper, are freely available (see Availability and Requirements).</p>
            <p>xgram &#8211; an implementation of the EM algorithm for training phylo-grammars, i.e. the Inside-Outside and Forward-Backward algorithms combined with the EM algorithm for the general irreversible (and reversible) substitution models. This program implements the general irreversible EM algorithm described in the Supplementary Material [see <supplr sid="S1">Additional file 1</supplr>], along with the general reversible EM algorithm described previously <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. The grammar can be user-specified via an extensible file format, described below. Parametric grammars are allowed (so that individual substitution rates and/or rule probabilities can be constrained to arbitrary functions of a smaller set of model parameters). The xgram tool is capable of reproducing most of the phylo-grammar models listed in this paper. In its generic applicability, xgram is similar to the dynamic programming engine Dynamite <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, although the class of models is different (phylo-grammars <it>vs </it>single- and pair-HMMs) and the functionality broader (including parameterization by phylo-EM, as well as Viterbi and CYK annotation codes). Also included is an implementation of the neighbor-joining algorithm for fast estimation of tree topologies <abbrgrp><abbr bid="B77">77</abbr></abbrgrp>, and another version of the EM algorithm for rapidly optimising branch lengths of trees with fixed topology <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The model underlying xgram also allows for dynamically evolving "hidden states" associated with each site, again as previously described <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>.</p>
            <p>xrate &#8211; a version of xgram including several "preset" grammars for point substitution models, including the general irreversible and reversible substitution models.</p>
            <p>xfold &#8211; a version of xgram including several "preset" grammars for RNA analysis, including that of the Pfold program <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>.</p>
            <p>xprot &#8211; a version of xgram including several "preset" grammars for protein analysis, including a grammar similar to that used by Thorne <it>et al</it>. for protein secondary structure prediction <abbrgrp><abbr bid="B83">83</abbr></abbrgrp>.</p>
            <p>All of the above programs can be driven by any user-specified phylo-grammar. Having specified a grammar, or chosen one of the presets, the user can</p>
            <p>&#8226; Estimate the ML parameterization of the grammar for the training set via EM, using Inside-Outside or Forward-Backward algorithms (auto-selected by program) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, together with the phylo-EM algorithm described in the Supplementary Material [see <supplr sid="S1">Additional file 1</supplr>];</p>
            <p>&#8226; Find the maximum likelihood (ML) parse tree, using Cocke-Younger-Kasami (CYK) or Viterbi algorithms (auto-selected by program) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, with phylogenetic likelihoods calculated by pruning <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>;</p>
            <p>&#8226; Annotate the alignment, column-by-column, with user-specified labels, using the ML parse tree;</p>
            <p>&#8226; Find the posterior probability of each node in the ML parse tree.</p>
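            <p>The pruning calculation of phylogenetic likelihoods mentioned in the list above can be sketched for a single alignment column as follows. This is a minimal illustration, not xrate's implementation: it assumes a Jukes-Cantor substitution model (chosen because its transition probabilities have a simple closed form), a uniform root distribution, gaps treated as missing data, and a hypothetical tree encoding in which a leaf is a sequence name and an internal node is a list of (child, branch length) pairs.</p>

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of observing b after time t, starting from a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial(node, column):
    """Return {x: P(leaves observed below node | residue x at node)}."""
    if isinstance(node, str):
        observed = column[node]
        # A gap character is treated as missing data (an assumption here).
        return {x: 1.0 if observed in (x, "-") else 0.0 for x in ALPHABET}
    like = {x: 1.0 for x in ALPHABET}
    for child, t in node:
        below = partial(child, column)
        for x in ALPHABET:
            like[x] *= sum(jc_prob(x, y, t) * below[y] for y in ALPHABET)
    return like


def column_likelihood(tree, column):
    """Likelihood of one alignment column, summing over the root residue."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in ALPHABET)


# Hypothetical three-leaf tree: ((seq1:0.1, seq2:0.2):0.05, seq3:0.3)
TREE = [([("seq1", 0.1), ("seq2", 0.2)], 0.05), ("seq3", 0.3)]
print(column_likelihood(TREE, {"seq1": "A", "seq2": "A", "seq3": "G"}))
```

            <p>In a phylo-grammar, a quantity of this kind plays the role of the emission probability of an alignment column from a given state, with the Jukes-Cantor model replaced by whichever Markov chain that state specifies.</p>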
            <p>The parse tree can also be constrained, completely or partially, by including complete or partial annotations in the input alignment. For example, one can annotate several known examples of a TF binding site in a multiple alignment. One can then allow the grammar to "learn" these examples and predict new binding sites.</p>
         </sec>
         <sec>
            <st>
               <p>File formats</p>
            </st>
            <p>The input and output format for sequence alignment data is the Stockholm format, as used by PFAM and RFAM. The wildcard character is the period ".". Annotation of columns with the wildcard character allows for incompletely labeled data and hence partially supervised learning. If a given annotation is specified in the grammar but absent from the training data, it will be treated as a string of wildcards and all compatible possibilities will be summed over.</p>
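            <p>As an illustration of how per-column annotations and wildcards appear in a Stockholm file, the sketch below reads a toy alignment and reports which columns are unlabeled. The parser handles only the subset of the format needed for this example; the function names, the annotation tag "SS" and the toy sequences are assumptions made for illustration and are not taken from xrate or from any preset grammar.</p>

```python
EXAMPLE = """# STOCKHOLM 1.0
seqA     APLEPVYPG
seqB     APMEPVYPG
#=GC SS  LL..HHH..
//
"""


def parse_stockholm(text):
    """Return (sequences, column_annotations) from Stockholm-format text."""
    seqs, column_annot = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC"):
            # Per-column annotation: "#=GC tag annotation-string"
            _, tag, annot = line.split(None, 2)
            column_annot[tag] = column_annot.get(tag, "") + annot
        elif line.startswith("#"):
            continue  # header line and other markup are ignored in this sketch
        else:
            name, residues = line.split(None, 1)
            seqs[name] = seqs.get(name, "") + residues
    return seqs, column_annot


def unlabeled_columns(annot):
    """Indices of columns whose annotation is the wildcard character."""
    return [i for i, c in enumerate(annot) if c == "."]


seqs, annots = parse_stockholm(EXAMPLE)
print(unlabeled_columns(annots["SS"]))  # [2, 3, 7, 8]
```

            <p>Columns returned by unlabeled_columns correspond to positions where, during training, all labels compatible with the grammar would be summed over.</p>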
            <p>Any phylo-grammar can be specified, using a format based on LISP S-expressions <abbrgrp><abbr bid="B56">56</abbr><abbr bid="B75">75</abbr></abbrgrp>. The format is human-readable and succinct, while being machine-parseable and extensible.</p>
            <p>Phylo-grammar specification files contain several elements:</p>
            <p>&#8226; An <it>alphabet</it>, describing valid sequence tokens (e.g. nucleotides or amino acids) along with any degenerate or (in the case of nucleotides) complementary tokens.</p>
            <p>&#8226; One or more <it>chains</it>, each describing a finite-state continuous-time Markov chain, including rate parameters;</p>
            <p>&#8226; Optionally (for parametric models) a set of rate and probability <it>parameter values</it>;</p>
            <p>&#8226; A set of <it>transformation rules</it>, which also serve to define the nonterminals in the grammar.</p>
            <p>As an example, the grammar for the Kimura two-parameter rate matrix is shown (see figure <figr fid="F3">3</figr>). A more complete and up-to-date description of the format can be found online <abbrgrp><abbr bid="B88">88</abbr></abbrgrp>, as can discussion of the latest version of xrate and its companion programs <abbrgrp><abbr bid="B89">89</abbr></abbrgrp>.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>An xgram-format grammar for Kimura's two-parameter model</p>
               </caption>
               <text>
                  <p>An xgram-format grammar for Kimura's two-parameter model.</p>
               </text>
               <graphic file="1471-2105-7-428-3"/>
            </fig>
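            <p>For readers unfamiliar with Kimura's two-parameter model, the sketch below builds its rate matrix directly: transitions (purine to purine, or pyrimidine to pyrimidine) occur at one rate and transversions at another. This is an independent illustration and does not reproduce the xgram-format grammar of figure 3; the function name and the parameter names alpha and beta are assumptions.</p>

```python
PURINES = {"A", "G"}


def kimura_rate_matrix(alpha, beta, alphabet="ACGT"):
    """Return {(a, b): rate} with transition rate alpha, transversion rate beta."""
    rates = {}
    for a in alphabet:
        for b in alphabet:
            if a != b:
                same_class = (a in PURINES) == (b in PURINES)
                rates[a, b] = alpha if same_class else beta
        # Each residue has one transition and two transversion partners.
        rates[a, a] = -(alpha + 2.0 * beta)
    return rates


K2P = kimura_rate_matrix(alpha=2.0, beta=0.5)
print(K2P["A", "G"], K2P["A", "C"], K2P["A", "A"])  # 2.0 0.5 -3.0
```

            <p>The resulting matrix is symmetric, so it satisfies detailed balance with a uniform equilibrium distribution and has only two free rate parameters.</p>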
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>We illustrate the potential of xrate as a quick tool for prototyping phylo-grammars by re-implementing several prior applications and testing on real and simulated data. As applications we choose firstly a codon substitution model which is both computationally intensive and parameter-rich (due to the size of the rate matrix). Secondly, we compare xrate's performance in predicting protein structure to a previously used phylo-HMM. Thirdly, we compare xrate to a previously used phylo-SCFG for predicting RNA secondary structure.</p>
         <p>To visualize rate matrices, we use figures that we refer to as "bubbleplots" (see figure <figr fid="F11">11</figr>). In a bubbleplot, the area of a circle in the main matrix is proportional to the rate of the corresponding substitution, with the grey circle in the upper-left representing the scale. The offset row shows the equilibrium probability distribution over states: here, the area of a circle is proportional to the equilibrium probability of the corresponding state. Additional color-coding is used on a case-by-case basis.</p>
         <sec>
            <st>
               <p>Fitting codon models</p>
            </st>
            <p>In the past, various amino acid substitution models have been estimated using ML techniques (e.g., mtREV <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, WAG <abbrgrp><abbr bid="B87">87</abbr></abbrgrp>). An ML estimation of codon substitution models, however, has seemed infeasible for a long time because of the computational burden involved with such parameter-rich models. This section shows that xrate is capable of tackling the problem. The full results of a particular study are being published elsewhere (Kosiol, Holmes and Goldman, in prep.); here, we will restrict attention to simulation results showing that xrate can do these sorts of analyses reliably.</p>
            <p>The number of independent parameters for a reversible substitution model with <it>N </it>character states can be calculated as <m:math name="1471-2105-7-428-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:mi>N</m:mi><m:mo stretchy="false">(</m:mo><m:mi>N</m:mi><m:mo>+</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:mfrac><m:mo>&#8722;</m:mo><m:mn>2</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabd6eaojabcIcaOiabd6eaojabgUcaRiabigdaXiabcMcaPaqaaiabikdaYaaacqGHsislcqaIYaGmaaa@355B@</m:annotation></m:semantics></m:math>. This means that for the estimation of a 20-state amino acid model, 208 independent parameters need to be calculated. In contrast, to estimate a 61-state codon model (excluding stop codons), 1889 independent parameters have to be determined.</p>
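            <p>The two parameter counts quoted above follow directly from the formula. As a quick check (the function name is hypothetical, and the decomposition given in the comment is one common way to arrive at the formula rather than a derivation stated in the text):</p>

```python
def reversible_free_parameters(n_states):
    """Free parameters of a reversible model: N(N+1)/2 - 2.

    One common reading: N(N-1)/2 exchangeabilities plus N-1 equilibrium
    frequencies, minus 1 for the overall rate scale.
    """
    return n_states * (n_states + 1) // 2 - 2


print(reversible_free_parameters(20))  # 208  (amino acid model)
print(reversible_free_parameters(61))  # 1889 (codon model without stop codons)
```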
            <p>To test the robustness of xrate's ability to fit parameter-rich models to aligned sequence data, we simulated a data set using all phylogenies of the Pandit database of protein domain alignments <abbrgrp><abbr bid="B86">86</abbr></abbrgrp>, using a standard model of codon evolution (the M0 model <abbrgrp><abbr bid="B93">93</abbr></abbrgrp> [see <supplr sid="S1">Additional file 1</supplr>]). In this model, rates of substitutions involving changes to multiple nucleotides are zero, so that the rate matrix is sparsely populated.</p>
            <p>xrate is able to recover M0 well from this 'artificial' Pandit database. The true rates used in the simulation are shown (see figure <figr fid="F8">8</figr>). These may be compared with the recovered rates (see figure <figr fid="F9">9</figr>).</p>
            <p>A scatter plot of true <it>vs </it>estimated rates allows a more detailed analysis (see figure <figr fid="F10">10</figr>). This plot shows the true instantaneous rates <m:math name="1471-2105-7-428-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>q</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>t</m:mi><m:mi>r</m:mi><m:mi>u</m:mi><m:mi>e</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGXbqCdaqhaaWcbaGaemyAaKMaemOAaOgabaGaeiikaGIaemiDaqNaemOCaiNaemyDauNaemyzauMaeiykaKcaaaaa@3852@</m:annotation></m:semantics></m:math> of M0 plotted versus the instantaneous rates <m:math name="1471-2105-7-428-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>q</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>e</m:mi><m:mi>s</m:mi><m:mi>t</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGXbqCdaqhaaWcbaGaemyAaKMaemOAaOgabaGaeiikaGIaemyzauMaem4CamNaemiDaqNaeiykaKcaaaaa@36E1@</m:annotation></m:semantics></m:math> estimated from data simulated from M0. If <m:math name="1471-2105-7-428-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>q</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>t</m:mi><m:mi>r</m:mi><m:mi>u</m:mi><m:mi>e</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGXbqCdaqhaaWcbaGaemyAaKMaemOAaOgabaGaeiikaGIaemiDaqNaemOCaiNaemyDauNaemyzauMaeiykaKcaaaaa@3852@</m:annotation></m:semantics></m:math> = <m:math name="1471-2105-7-428-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>q</m:mi><m:mrow><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>e</m:mi><m:mi>s</m:mi><m:mi>t</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGXbqCdaqhaaWcbaGaemyAaKMaemOAaOgabaGaeiikaGIaemyzauMaem4CamNaemiDaqNaeiykaKcaaaaa@36E1@</m:annotation></m:semantics></m:math> the points would lie on the bisection line <it>y </it>= <it>x</it>. Thus the deviation of the points from the bisection line indicates how different the rates are.</p>
            <p>If one is interested in drawing biological conclusions from the estimated rate parameters, then it is of interest to consider xrate's estimates of rates that are zero in the true model. From the simulated data set, xrate sometimes erroneously inferred very small non-zero values for the instantaneous rates of double and triple changes (in the M0 model, which was used to generate the data, such substitutions have zero rate). However, this error can be identified by comparing log-likelihoods calculated by xrate under the following nested models. For the general model allowing for single, double and triple nucleotide changes, 1889 parameters had to be estimated. The best log-likelihood calculated for general estimation is ln <it>L</it><sub><it>general </it></sub>= -28930383.06. Using xrate we can also restrict the rate matrices to single nucleotide changes only. For this model, 322 parameters had to be estimated. The best log-likelihood calculated for restricted estimation is ln <it>L</it><sub><it>restricted </it></sub>= -28930894.86.</p>
            <p>Although the log-likelihood for the general rate matrix allowing for single, double and triple changes is better we can show that the improvement is not significant. Significance is tested using a standard likelihood ratio test between the two models, comparing twice the difference in log-likelihood with a <m:math name="1471-2105-7-428-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#967;</m:mi><m:mrow><m:mn>1</m:mn><m:mn>5</m:mn><m:mn>6</m:mn><m:mn>7</m:mn></m:mrow><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFhpWydaqhaaWcbaGaeyymaeJaeyynauJaeyOnayJaey4naCdabaGaeyOmaidaaaaa@335D@</m:annotation></m:semantics></m:math> distribution, where 1567 is the degrees of freedom by which the two models differ. Using the normal approximation for <m:math name="1471-2105-7-428-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#967;</m:mi><m:mrow><m:mo stretchy="false">(</m:mo><m:mn>1</m:mn><m:mn>5</m:mn><m:mn>6</m:mn><m:mn>7</m:mn><m:mo>,</m:mo><m:mn>0</m:mn><m:mo>.</m:mo><m:mn>0</m:mn><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFhpWydaqhaaWcbaGaeyikaGIaeyymaeJaeyynauJaeyOnayJaey4naCJaeyilaWIaeyimaaJaeyOla4IaeyimaaJaeyymaeJaeyykaKcabaGaeyOmaidaaaaa@39A9@</m:annotation></m:semantics></m:math> we compare (2(ln <it>L</it><sub><it>general </it></sub>- ln <it>L</it><sub><it>restricted</it></sub>)-1567)/<m:math name="1471-2105-7-428-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mrow><m:mn>2</m:mn><m:mo>&#215;</m:mo><m:mn>1567</m:mn></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabikdaYiabgEna0kabigdaXiabiwda1iabiAda2iabiEda3aWcbeaaaaa@33AE@</m:annotation></m:semantics></m:math> = -9.71 with the relevant 99% critical value of 2.33 taken from a standard normal <m:math name="1471-2105-7-428-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">N</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFneVtaaa@383B@</m:annotation></m:semantics></m:math> (0,1). The difference is seen to be insignificant; the P-value is almost 1.</p>
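            <p>The arithmetic of this test can be reproduced from the two log-likelihoods and parameter counts reported above. The sketch below uses the same normal approximation to the chi-squared distribution (mean equal to the degrees of freedom, variance equal to twice the degrees of freedom); the function and variable names are illustrative.</p>

```python
import math


def lrt_z_score(lnl_general, lnl_restricted, df):
    """Standardized likelihood ratio statistic under the normal approximation."""
    statistic = 2.0 * (lnl_general - lnl_restricted)
    return (statistic - df) / math.sqrt(2.0 * df)


df = 1889 - 322  # difference in number of free parameters
z = lrt_z_score(-28930383.06, -28930894.86, df)
significant = z > 2.33  # one-sided 99% critical value of N(0,1)

print(df)            # 1567
print(round(z, 2))   # -9.71
print(significant)   # False
```

            <p>Because the standardized statistic lies far below the critical value, the additional 1567 parameters of the general model do not yield a significant improvement in fit.</p>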
         </sec>
         <sec>
            <st>
               <p>Predicting protein secondary structure</p>
            </st>
            <p>We compared xrate to the phylo-HMM for prediction of protein secondary structure developed by Goldman, Thorne, and Jones <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> (here referred to as GTJ). This section uses a fully-connected three-state phylo-HMM with general reversible Markov chains. Training sets were taken from the HOMSTRAD database of structural alignments of homologous protein families <abbrgrp><abbr bid="B61">61</abbr></abbrgrp>.</p>
            <p>We trained the phylo-HMM on alpha-beta barrel alignments from HOMSTRAD, leaving out the beta-glycanase SCOP family. xrate was then benchmarked on this beta-glycanase SCOP family to compare the annotation predicted by xrate to the experimentally determined HOMSTRAD annotation. We also tried a more comprehensive training regime, training xrate on the complete HOMSTRAD database (excluding the beta-glycanase SCOP family) and again comparing predicted and database annotations.</p>
            <p>The performance of xrate was compared to that of GTJ. The results show that xrate can be used to quickly prototype and train a phylo-HMM with comparable performance to that reported by Goldman <it>et al</it>.</p>
            <sec>
               <st>
                  <p>Grammar</p>
               </st>
               <p>The PROT3 phylo-grammar has state labels for the three secondary structure classes of alpha-helix (H), beta-sheet (E) and loop (L). An excerpt of the grammar is shown (see figure <figr fid="F4">4</figr>).</p>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>An excerpt from an xgram-format grammar reproducing the protein secondary structure phylo-HMM of Goldman, Thorne and Jones</p>
                  </caption>
                  <text>
                     <p>An excerpt from an xgram-format grammar reproducing the protein secondary structure phylo-HMM of Goldman, Thorne and Jones. This excerpt shows only the transformation rules, and omits the alphabet and chain definitions. Three separate Markov chains for amino acid substitution are used (and are assumed to be defined elsewhere in the file): alpha_col denotes an amino acid in an alpha helix (annotated with character H), beta_col denotes an amino acid in a beta sheet (annotated with character E) and loop_col denotes an amino acid in a loop region (annotated with character L).</p>
                  </text>
                  <graphic file="1471-2105-7-428-4"/>
               </fig>
               <p>An example of usage for this grammar follows. We also show an alignment from HOMSTRAD, too small to predict secondary structure with any confidence, but useful for illustrative purposes (see figure <figr fid="F5">5</figr>). Suppose we want to: (1) read in this alignment from a file named 'pp.stk'; (2) load a point substitution matrix from a file named 'dart/data/nullprot.eg' (this is an amino-acid matrix distributed with xrate; the filename path assumes that the DART package was downloaded to the current working directory); (3) use the above point substitution matrix to estimate a phylogenetic tree (by neighbor-joining followed by EM on the branch lengths); (4) load the PROT3 model from a file named 'dart/data/prot3.eg' (again, this is distributed with xrate); and (5) use the PROT3 model to predict secondary structure classes for this protein family, printing the annotated alignment to the standard output. The following command-line syntax achieves this:</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Example Stockholm-format input file for the protein secondary structure grammar (see figure 4)</p>
                  </caption>
                  <text>
                     <p>Example Stockholm-format input file for the protein secondary structure grammar (see figure 4). The alignment is of the pancreatic hormone family.</p>
                  </text>
                  <graphic file="1471-2105-7-428-5"/>
               </fig>
               <p>xrate pp.stk --tree dart/data/nullprot.eg --grammar dart/data/prot3.eg</p>
               <p>The output of this command is shown (see figure <figr fid="F6">6</figr>).</p>
               <fig id="F6">
                  <title>
                     <p>Figure 6</p>
                  </title>
                  <caption>
                     <p>Example Stockholm-format output using the protein secondary structure grammar (see figure 4) and the pancreatic hormone alignment (see figure 5)</p>
                  </caption>
                  <text>
                     <p>Example Stockholm-format output using the protein secondary structure grammar (see figure 4) and the pancreatic hormone alignment (see figure 5). Line numbers have been added for reference; note the embedded New Hampshire-format tree at line 2, the Viterbi bit-score at line 3 and the Viterbi secondary structure annotation at line 7.</p>
                  </text>
                  <graphic file="1471-2105-7-428-6"/>
               </fig>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p>An excerpt from an xgram-format grammar reproducing the RNA secondary structure phylo-SCFG of Knudsen and Hein</p>
                  </caption>
                  <text>
                     <p>An excerpt from an xgram-format grammar reproducing the RNA secondary structure phylo-SCFG of Knudsen and Hein. This excerpt shows only the transformation rules, and omits the alphabet and chain definitions. Two separate Markov chains for nucleotide substitution are used (and are assumed to be defined elsewhere in the file): LNUC and RNUC denote the left and right (i.e. 5' and 3') nucleotides of a co-evolving basepair in a 16-state Markov chain (annotated with characters &lt; and >), while NUC denotes an unpaired nucleotide in a 4-state Markov chain (annotated with character _).</p>
                  </text>
                  <graphic file="1471-2105-7-428-7"/>
               </fig>
               <fig id="F8">
                  <title>
                     <p>Figure 8</p>
                  </title>
                  <caption>
                     <p>True codon mutation rate matrix for the M0 mechanistic codon mutation model benchmark (see Results and Discussion)</p>
                  </caption>
                  <text>
                     <p>True codon mutation rate matrix for the M0 mechanistic codon mutation model benchmark (see Results and Discussion). These rates were used to generate simulated data; rates were then estimated from these data and compared to the true rates (see figure 9).</p>
                  </text>
                  <graphic file="1471-2105-7-428-8"/>
               </fig>
               <fig id="F9">
                  <title>
                     <p>Figure 9</p>
                  </title>
                  <caption>
                     <p>Estimated codon mutation rate matrix for the codon model benchmark (see Results and Discussion)</p>
                  </caption>
                  <text>
                     <p>Estimated codon mutation rate matrix for the codon model benchmark (see Results and Discussion). These rates were estimated by xrate from simulated data, generated using a mechanistic rate model (see figure 8).</p>
                  </text>
                  <graphic file="1471-2105-7-428-9"/>
               </fig>
               <fig id="F10">
                  <title>
                     <p>Figure 10</p>
                  </title>
                  <caption>
                     <p>Scatter plot comparing true instantaneous rates with estimated rates from simulated data for the codon model benchmark (see Results and Discussion)</p>
                  </caption>
                  <text>
                     <p>Scatter plot comparing true instantaneous rates with estimated rates from simulated data for the codon model benchmark (see Results and Discussion).</p>
                  </text>
                  <graphic file="1471-2105-7-428-10"/>
               </fig>
               <fig id="F11">
                  <title>
                     <p>Figure 11</p>
                  </title>
                  <caption>
                     <p>Bubbleplot of amino acid substitution rates for alpha-helices</p>
                  </caption>
                  <text>
                     <p>Bubbleplot of amino acid substitution rates for alpha-helices. See Results and Discussion for color-coding and explanation of bubbleplots.</p>
                  </text>
                  <graphic file="1471-2105-7-428-11"/>
               </fig>
               <p>More such examples can be found in DART (the software library with which xrate is distributed) and on the wiki pages for the xrate program <abbrgrp><abbr bid="B89">89</abbr></abbrgrp>. A full list of command-line options for xrate can be obtained by typing xrate --help or, equivalently, xrate -h.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Both xrate and the GTJ program were evaluated on the xylanase alignment used by GTJ, hereafter referred to as gtjxyl. xrate was trained on the subset of HOMSTRAD corresponding to alpha-beta barrel structures, with members of the beta-glycanase SCOP family (which includes the gtjxyl proteins) removed to prevent overlap between the training and test sets.</p>
               <p>We report the prediction <it>accuracy </it>collectively for all secondary structure categories, and the <it>sensitivity </it>and <it>specificity </it>with respect to each individual category. These metrics are defined as follows:</p>
               <p>Sensitivity(<it>n</it>) = TP<sub><it>n</it></sub>/(TP<sub><it>n </it></sub>+ FN<sub><it>n</it></sub>)</p>
               <p>Specificity(<it>n</it>) = TP<sub><it>n</it></sub>/(TP<sub><it>n </it></sub>+ FP<sub><it>n</it></sub>)</p>
               <p>Accuracy = (<m:math name="1471-2105-7-428-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:munder><m:mo>&#8721;</m:mo><m:mi>n</m:mi></m:munder><m:mrow><m:msub><m:mrow><m:mtext>TP</m:mtext></m:mrow><m:mi>n</m:mi></m:msub></m:mrow></m:mstyle></m:mrow></m:semantics></m:math>)/(<m:math name="1471-2105-7-428-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:munder><m:mo>&#8721;</m:mo><m:mi>n</m:mi></m:munder><m:mrow><m:msub><m:mrow><m:mtext>TP</m:mtext></m:mrow><m:mi>n</m:mi></m:msub></m:mrow></m:mstyle></m:mrow></m:semantics></m:math> + FN<sub><it>n</it></sub>)</p>
               <p>where (for secondary structure class <it>n</it>) TP<sub><it>n </it></sub>is the number of true positives (columns correctly predicted as class <it>n</it>), FN<sub><it>n </it></sub>is the number of false negatives (columns that should have been predicted as class <it>n </it>but were not) and FP<sub><it>n </it></sub>is the number of false positives (columns that were incorrectly predicted as class <it>n</it>).</p>
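               <p>As an illustration of these definitions, the following minimal Python sketch computes the three metrics from two per-column label strings. The function name per_class_metrics and the toy label strings are hypothetical and are not part of xrate or the GTJ program; note that specificity here follows the definition above, TP/(TP + FP).</p>

```python
from collections import Counter


def per_class_metrics(true_labels, predicted_labels):
    """Return ({class: (sensitivity, specificity)}, accuracy) for aligned label sequences."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    tp, fn, fp = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1      # column correctly predicted as class `truth`
        else:
            fn[truth] += 1      # column should have been predicted as `truth`
            fp[guess] += 1      # column incorrectly predicted as `guess`
    metrics = {}
    for n in sorted(set(true_labels) | set(predicted_labels)):
        sensitivity = tp[n] / (tp[n] + fn[n]) if tp[n] + fn[n] else None
        specificity = tp[n] / (tp[n] + fp[n]) if tp[n] + fp[n] else None
        metrics[n] = (sensitivity, specificity)
    total = sum(tp.values()) + sum(fn.values())
    accuracy = sum(tp.values()) / total if total else None
    return metrics, accuracy


# Toy per-column annotations (H = helix, E = strand, L = loop); not real data.
truth = "HHHEELLL"
guess = "HHEEELLH"
metrics, accuracy = per_class_metrics(truth, guess)
print(metrics, accuracy)
```

               <p>Because every column belongs to exactly one true class, the denominator of the accuracy is simply the total number of annotated columns.</p>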
               <p>Bubbleplots were used to visualize the amino acid substitution rates. Substitutions are colored red if between aromatic amino acids, green if between hydrophobics and blue if between hydrophilics. Substitutions from one such group to another (e.g. from hydrophobic to hydrophilic) are colored gray.</p>
               <p>Figures <figr fid="F11">11</figr>, <figr fid="F12">12</figr> and <figr fid="F13">13</figr> show the amino acid substitution matrices for the alpha-helix, beta-sheet and loop states, respectively. The relative rates displayed in the figures generally agree with what one would expect for each state: residues in the alpha-helix and beta-sheet states are substituted more slowly (so amino acid conservation is higher) than residues in the loop state, loop regions being more variable in structure <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
               <fig id="F12">
                  <title>
                     <p>Figure 12</p>
                  </title>
                  <caption>
                     <p>Bubbleplot of amino acid substitution rates for beta-sheets</p>
                  </caption>
                  <text>
                     <p>Bubbleplot of amino acid substitution rates for beta-sheets. See Results and Discussion for color-coding and explanation of bubbleplots.</p>
                  </text>
                  <graphic file="1471-2105-7-428-12"/>
               </fig>
               <fig id="F13">
                  <title>
                     <p>Figure 13</p>
                  </title>
                  <caption>
                     <p>Bubbleplot of amino acid substitution rates for loop regions</p>
                  </caption>
                  <text>
                     <p>Bubbleplot of amino acid substitution rates for loop regions. See Results and Discussion for color-coding and explanation of bubbleplots.</p>
                  </text>
                  <graphic file="1471-2105-7-428-13"/>
               </fig>
               <p>Table <tblr tid="T1">1</tblr> shows the log likelihood scores of the training alignments, log <it>P</it>(<it>D</it>|<it>&#952;</it>), along with the log-posterior probability of the HOMSTRAD reference annotation, log <it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>). In this case, maximum-likelihood training also yields an increase in the annotation posterior probability <it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>). This is not in general a guaranteed result of the EM algorithm, and alternative training procedures (such as maximum-discrimination training <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>) have been proposed to achieve this effect. It appears in this case that such procedures are not required.</p>
               <tbl id="T1">
                  <title>
                     <p>Table 1</p>
                  </title>
                  <caption>
                     <p>Log-likelihood scores of training sets and log-posterior probabilities of the true annotations for the PROT3 benchmark. Here <it>D </it>denotes the training alignment data (the HOMSTRAD database without the beta-glycanase SCOP family), <it>A </it>denotes the DSSP annotations of the alignment data, <it>&#952;</it><sub><it>D </it></sub>denotes the model with parameters obtained from training on <it>D</it>, and <it>&#952;</it><sub><it>G </it></sub>denotes the model with parameters obtained from the GTJ datafiles.</p>
                  </caption>
                  <tblbdy cols="4">
                     <r>
                        <c ca="left">
                           <p>
                              <it>&#952;</it>
                           </p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>, <it>D</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>)</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="4">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>&#952;</it>
                              <sub>
                                 <it>D</it>
                              </sub>
                           </p>
                        </c>
                        <c ca="right">
                           <p>-173038</p>
                        </c>
                        <c ca="right">
                           <p>-162491</p>
                        </c>
                        <c ca="right">
                           <p>-10547</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>&#952;</it>
                              <sub>
                                 <it>G</it>
                              </sub>
                           </p>
                        </c>
                        <c ca="right">
                           <p>-238632</p>
                        </c>
                        <c ca="right">
                           <p>-227979</p>
                        </c>
                        <c ca="right">
                           <p>-10653</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
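               <p>The three columns of Table 1 are linked by the product rule, log <it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>) = log <it>P</it>(<it>A</it>, <it>D</it>|<it>&#952;</it>) - log <it>P</it>(<it>D</it>|<it>&#952;</it>). The short Python sketch below checks this relation against the values copied from the table; the dictionary keys and the function name log_posterior are illustrative only.</p>

```python
# Values copied from Table 1, in bits:
# (log2 P(A,D|theta), log2 P(D|theta), log2 P(A|D,theta))
TABLE_1 = {
    "theta_D": (-173038, -162491, -10547),
    "theta_G": (-238632, -227979, -10653),
}


def log_posterior(log_joint, log_marginal):
    """Product rule in log space: log P(A|D,theta) = log P(A,D|theta) - log P(D|theta)."""
    return log_joint - log_marginal


for name, (log_joint, log_marginal, reported) in TABLE_1.items():
    # Each reported posterior should equal the joint minus the marginal.
    assert log_posterior(log_joint, log_marginal) == reported, name

# Difference in annotation log-posterior between the two parameter sets, in bits.
gain_bits = TABLE_1["theta_D"][2] - TABLE_1["theta_G"][2]
print(gain_bits)
```

               <p>The positive difference corresponds to the increase in the annotation posterior probability noted in the text.</p>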
               <p>Table <tblr tid="T2">2</tblr> reports likelihoods, accuracies and runtimes for training set 2 as the EM convergence criteria are tightened. As expected, the likelihood increases as the convergence criteria are made more stringent. The annotation accuracy for the gtjxyl benchmark alignment also consistently increases.</p>
               <tbl id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>Effect of tightening the EM convergence criteria for the PROT3 benchmark. The "mininc" parameter is the minimum fractional log-likelihood increase per iteration of EM. Accuracies for the gtjxyl benchmark alignment are reported, along with log-likelihoods. See Table 1 for additional notation.</p>
                  </caption>
                  <tblbdy cols="6">
                     <r>
                        <c ca="right">
                           <p>mininc</p>
                        </c>
                        <c ca="right">
                           <p>Runtime (min)</p>
                        </c>
                        <c ca="right">
                           <p>Acc(gtjxyl)</p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>, <it>D</it>|<it>&#952;</it><sub><it>D</it></sub>)</p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>|<it>&#952;</it><sub><it>D</it></sub>)</p>
                        </c>
                        <c ca="right">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it><sub><it>D</it></sub>)</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="6">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>1e-3</p>
                        </c>
                        <c ca="right">
                           <p>14</p>
                        </c>
                        <c ca="right">
                           <p>64.1</p>
                        </c>
                        <c ca="right">
                           <p>-2696469</p>
                        </c>
                        <c ca="right">
                           <p>-2549947</p>
                        </c>
                        <c ca="right">
                           <p>-146522</p>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>1e-4</p>
                        </c>
                        <c ca="right">
                           <p>35</p>
                        </c>
                        <c ca="right">
                           <p>64.7</p>
                        </c>
                        <c ca="right">
                           <p>-2686598</p>
                        </c>
                        <c ca="right">
                           <p>-2539908</p>
                        </c>
                        <c ca="right">
                           <p>-146690</p>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>1e-5</p>
                        </c>
                        <c ca="right">
                           <p>84</p>
                        </c>
                        <c ca="right">
                           <p>68.0</p>
                        </c>
                        <c ca="right">
                           <p>-2682667</p>
                        </c>
                        <c ca="right">
                           <p>-2536849</p>
                        </c>
                        <c ca="right">
                           <p>-145818</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
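               <p>The role of the convergence criterion can be sketched as follows. This is a toy Python illustration and not the xrate implementation: the names run_em and make_toy_step are hypothetical, the fractional increase is assumed to be measured relative to the magnitude of the last accepted log-likelihood, and the "forgive" argument follows the description given in the caption of Table 5.</p>

```python
def run_em(em_step, initial_loglike, mininc=1e-3, forgive=0, max_iter=1000):
    """Call em_step() until the fractional log-likelihood gain no longer exceeds mininc.

    em_step is a hypothetical callable that performs one EM iteration and
    returns the new log-likelihood. Returns the list of log-likelihoods visited.
    """
    reference = initial_loglike          # last value accepted as an improvement
    history = [initial_loglike]
    failures = 0
    for _ in range(max_iter):
        loglike = em_step()
        history.append(loglike)
        # Assumed definition of the fractional increase (reference must be nonzero).
        gain = (loglike - reference) / abs(reference)
        if gain > mininc:
            reference = loglike
            failures = 0
        else:
            failures += 1
            if failures > forgive:       # tolerate at most `forgive` such iterations
                break
    return history


def make_toy_step(limit=-900.0, gap=300.0):
    """Toy stand-in for EM: the log-likelihood approaches `limit` geometrically."""
    state = {"k": 0}

    def step():
        state["k"] += 1
        return limit - gap * 0.5 ** state["k"]

    return step


loose = run_em(make_toy_step(), -1200.0, mininc=1e-3)
tight = run_em(make_toy_step(), -1200.0, mininc=1e-5)
print(len(loose) - 1, loose[-1])
print(len(tight) - 1, tight[-1])
```

               <p>In this toy setting, the smaller mininc value runs for more iterations and stops at a higher log-likelihood, which mirrors the trade-off between runtime and likelihood reported in Table 2.</p>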
               <p>Table <tblr tid="T3">3</tblr> summarizes the results of running xrate and the GTJ program on all the test cases. In general the accuracy of xrate is comparable to or even slightly better than the accuracy of the GTJ program.</p>
               <tbl id="T3">
                  <title>
                     <p>Table 3</p>
                  </title>
                  <caption>
                     <p>Summary of prediction performance for the PROT3 benchmark. "Sn" and "Sp" are the sensitivity and specificity for each secondary structure category; "Acc" is the overall accuracy.</p>
                  </caption>
                  <tblbdy cols="8">
                     <r>
                        <c ca="right">
                           <p>Program</p>
                        </c>
                        <c ca="right">
                           <p>Sn (<it>&#945;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>Sp (<it>&#945;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>Sn (<it>&#946;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>Sp (<it>&#946;</it>)</p>
                        </c>
                        <c ca="right">
                           <p>Sn (L)</p>
                        </c>
                        <c ca="right">
                           <p>Sp (L)</p>
                        </c>
                        <c ca="right">
                           <p>Acc</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="8">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>GTJ</p>
                        </c>
                        <c ca="right">
                           <p>66.7</p>
                        </c>
                        <c ca="right">
                           <p>91.3</p>
                        </c>
                        <c ca="right">
                           <p>63.5</p>
                        </c>
                        <c ca="right">
                           <p>84.0</p>
                        </c>
                        <c ca="right">
                           <p>73.5</p>
                        </c>
                        <c ca="right">
                           <p>77.3</p>
                        </c>
                        <c ca="right">
                           <p>69.6</p>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>xrate</p>
                        </c>
                        <c ca="right">
                           <p>71.6</p>
                        </c>
                        <c ca="right">
                           <p>95.7</p>
                        </c>
                        <c ca="right">
                           <p>82.7</p>
                        </c>
                        <c ca="right">
                           <p>79.0</p>
                        </c>
                        <c ca="right">
                           <p>65.2</p>
                        </c>
                        <c ca="right">
                           <p>81.2</p>
                        </c>
                        <c ca="right">
                           <p>70.2</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Predicting RNA secondary structure</p>
            </st>
            <p>To illustrate the capability of xrate as a tool for RNA secondary structure prediction/annotation, we compare it to Pfold, a phylo-SCFG developed by Knudsen and Hein <abbrgrp><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>.</p>
            <p>This section has two goals: (1) to see if xrate can exactly emulate the Pfold phylo-grammar using the same parameters as Pfold, and (2) to see if the EM algorithm can estimate parameters that yield comparable performance to those produced by other methods.</p>
            <p>We benchmarked the Pfold phylo-SCFG running on xrate against the original Pfold program using alignments from the Rfam database <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. To address goal (2), we used xrate to estimate the substitution rates and initial frequencies of basepairs and single nucleotides from annotated Rfam alignments.</p>
            <p>Our results show that the Pfold phylo-SCFG is effectively emulated by xrate, that the EM algorithm can estimate a more likely parameterization for a given training set and that the parameters so obtained are comparable in performance to the Pfold program itself. We conclude that xrate is a suitable platform for developing, parameterizing, and testing phylo-grammars without the necessity of writing source code or performing manual parameterization.</p>
            <sec>
               <st>
                  <p>Grammar</p>
               </st>
               <p>The PFOLD grammar is taken from the Pfold program and is described in the paper by Knudsen and Hein <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>.</p>
               <p>An excerpt of the grammar, containing the production rules, is shown in figure <figr fid="F7">7</figr>.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We report the <it>sensitivity </it>and <it>positive predictive value </it>(PPV) of basepair predictions. These accuracy metrics are defined as follows:</p>
               <p>Sensitivity = TP/(TP + FN)</p>
               <p>PPV = TP/(TP + FP)</p>
               <p>where TP is the number of true positives (base pairs that are predicted correctly per the Rfam annotation), FN the number of false negatives (base pairs that are not predicted but are in the Rfam annotation) and FP the false positives (predicted base pairs that are not in the Rfam annotation).</p>
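               <p>A minimal Python sketch of these two metrics is given below. It assumes that both the reference and the predicted structure are available as pseudoknot-free dot-bracket strings using round brackets; the function names and the toy structures are hypothetical and do not reproduce the benchmark scripts.</p>

```python
def parse_dot_bracket(structure):
    """Convert a pseudoknot-free dot-bracket string into a set of (i, j) base pairs."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def basepair_accuracy(reference, predicted):
    """Return (sensitivity, PPV) of predicted base pairs against a reference."""
    ref = parse_dot_bracket(reference)
    pred = parse_dot_bracket(predicted)
    tp = len(ref.intersection(pred))   # predicted pairs present in the reference
    fn = len(ref - pred)               # reference pairs that were not predicted
    fp = len(pred - ref)               # predicted pairs absent from the reference
    sensitivity = tp / (tp + fn) if ref else None
    ppv = tp / (tp + fp) if pred else None
    return sensitivity, ppv


# Toy structures of equal length; not taken from Rfam.
reference = "((((....))))"
predicted = "(((.(..).)))"
sensitivity, ppv = basepair_accuracy(reference, predicted)
print(sensitivity, ppv)
```

               <p>Treating each structure as a set of index pairs means that a predicted pair counts as correct only if both of its positions match the reference annotation.</p>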
               <p>Training and testing sets were obtained by selecting the 148 RNA gene families in Rfam version 7 with experimentally-determined structures, discarding pseudoknots, removing excessively gappy columns (as this step is also performed by Pfold), grouping the families into superfamilies and randomly partitioning these superfamilies into two sets [see <supplr sid="S1">Additional file 1</supplr>]. This yielded a training set of 71 alignments and a testing set of 77 alignments.</p>
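               <p>The partitioning step can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the procedure actually used: the function name, the family and superfamily labels, the fixed random seed and the even split of superfamilies are all hypothetical. The point of the sketch is that families are assigned to a set through their superfamily, so that related families never end up on both sides of the split.</p>

```python
import random


def partition_by_superfamily(family_to_superfamily, train_fraction=0.5, seed=0):
    """Randomly split families into (train, test), keeping each superfamily together."""
    groups = {}
    for family, superfamily in family_to_superfamily.items():
        groups.setdefault(superfamily, []).append(family)
    names = sorted(groups)                   # fixed order before shuffling
    random.Random(seed).shuffle(names)       # reproducible random permutation
    cut = round(len(names) * train_fraction)
    train = sorted(f for name in names[:cut] for f in groups[name])
    test = sorted(f for name in names[cut:] for f in groups[name])
    return train, test


# Hypothetical labels for illustration; these are not Rfam identifiers.
families = {
    "fam_a": "super_1", "fam_b": "super_1",
    "fam_c": "super_2",
    "fam_d": "super_3", "fam_e": "super_3",
    "fam_f": "super_4",
}
train, test = partition_by_superfamily(families)
print(train, test)
```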
               <p>The benchmark results, shown in Table <tblr tid="T4">4</tblr>, indicate that the sensitivity and PPV of the Pfold program and its emulation on xrate are comparable. It should be noted, however, that the sets of base pairs predicted by the two programs are slightly different [see <supplr sid="S1">Additional file 1</supplr>]. After examination, we attribute this to differences in implementation and to loss of precision in the numerical calculations.</p>
               <tbl id="T4">
                  <title>
                     <p>Table 4</p>
                  </title>
                  <caption>
                     <p>Accuracy of RNA secondary structure prediction. Comparison of sensitivities and PPVs for the Pfold program, its phylo-SCFG running on xrate with its original rates, and its phylo-SCFG running on xrate with rates estimated from Rfam by the phylo-EM algorithm.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c ca="right">
                           <p>Sensitivity</p>
                        </c>
                        <c ca="right">
                           <p>PPV</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>Pfold</p>
                        </c>
                        <c ca="right">
                           <p>45.0%</p>
                        </c>
                        <c ca="right">
                           <p>58.3%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>xrate emulating Pfold</p>
                        </c>
                        <c ca="right">
                           <p>44.4%</p>
                        </c>
                        <c ca="right">
                           <p>61.7%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="right">
                           <p>xrate trained on Rfam</p>
                        </c>
                        <c ca="right">
                           <p>42.8%</p>
                        </c>
                        <c ca="right">
                           <p>58.2%</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <p>We also tested whether parameterizing the phylo-SCFG using the EM algorithm is comparable to the Pfold parameterization <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. A comparison of Pfold's original rates with the EM-estimated rates is shown in Figures 14&#8211;16. Both sets of parameters display similar trends. Substitutions that create or preserve canonical base pairs are more frequent than substitutions that destroy basepairs (see figure <figr fid="F14">14</figr>). Transitions are more common than transversions, both within basepairs (see figure <figr fid="F15">15</figr>) and at unpaired sites (see figure <figr fid="F16">16</figr>). Many of the rates differ in magnitude, which we attribute to differences in the training sets.</p>
               <fig id="F14">
                  <title>
                     <p>Figure 14</p>
                  </title>
                  <caption>
                     <p>Comparison of basepair substitution rates, colored by basepairing conservation, gain, or loss</p>
                  </caption>
                  <text>
                     <p>Comparison of basepair substitution rates, colored by basepairing conservation, gain, or loss. Rates and equilibrium frequencies from the Pfold phylo-SCFG (left panel) are compared with those estimated by the phylo-EM algorithm from Rfam (right panel). Substitutions from non-canonical to canonical basepairs are blue (pairing gain), canonical to canonical are red (pairing conservation), non-canonical to non-canonical are black (unpaired and no change), and canonical to non-canonical are yellow (pairing loss).</p>
                  </text>
                  <graphic file="1471-2105-7-428-14"/>
               </fig>
               <fig id="F15">
                  <title>
                     <p>Figure 15</p>
                  </title>
                  <caption>
                     <p>Comparison of basepair substitution rates, colored by transitions/transversions</p>
                  </caption>
                  <text>
                     <p>Comparison of basepair substitution rates, colored by transitions/transversions. The rates were obtained from the Pfold program and by training on Rfam (see figure 14). Transition of a single base in a pair is dark red, transversion is light red; transitions in both bases are dark green, transition of one and transversion of the other is medium green, transversions of both are light green.</p>
                  </text>
                  <graphic file="1471-2105-7-428-15"/>
               </fig>
               <fig id="F16">
                  <title>
                     <p>Figure 16</p>
                  </title>
                  <caption>
                     <p>Comparison of substitution rates of nucleotides in unpaired alignment columns</p>
                  </caption>
                  <text>
                     <p>Comparison of substitution rates of nucleotides in unpaired alignment columns. Rates and equilibrium frequencies from the Pfold phylo-SCFG (left panel) are compared with those estimated by the phylo-EM algorithm from Rfam (right panel). Transitions are green, transversions are black.</p>
                  </text>
                  <graphic file="1471-2105-7-428-16"/>
               </fig>
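               <p>The categories used to color figures 14&#8211;16 can be made concrete with a short Python sketch. The set of canonical base pairs used here (Watson-Crick pairs plus the G-U wobble pair) is an assumption made for illustration, as are the function names; purines are A and G, and pyrimidines are C and U.</p>

```python
PURINES = {"A", "G"}
# Assumption: canonical pairs are the Watson-Crick pairs plus G-U wobble pairs.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def base_change(old, new):
    """Classify a single-nucleotide change as 'none', 'transition' or 'transversion'."""
    if old == new:
        return "none"
    same_ring_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_ring_class else "transversion"


def pairing_change(old_pair, new_pair):
    """Classify a basepair substitution by its effect on canonical pairing."""
    before, after = old_pair in CANONICAL, new_pair in CANONICAL
    if before and after:
        return "conservation"
    if after:
        return "gain"
    if before:
        return "loss"
    return "unpaired"


def pair_change(old_pair, new_pair):
    """Classify the change at each of the two positions of a basepair."""
    return tuple(base_change(o, n) for o, n in zip(old_pair, new_pair))


print(pairing_change("GC", "AU"), pair_change("GC", "AU"))
print(pairing_change("GA", "GC"), pair_change("GA", "GC"))
print(pairing_change("GC", "GA"), pair_change("GC", "GA"))
```

               <p>Under these assumptions, a GC to AU substitution conserves pairing through two transitions, whereas GC to GA loses pairing through a single transversion.</p>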
               <p>The predictive accuracy of Pfold is compared to that of the xrate-trained phylo-SCFG in Table <tblr tid="T4">4</tblr>, while log-likelihoods are compared in Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr>. The results are similar, indicating that the combination of training set and xrate-implemented EM is comparable to the training procedure used in the development of Pfold.</p>
               <tbl id="T5">
                  <title>
                     <p>Table 5</p>
                  </title>
                  <caption>
                     <p>Log-likelihoods of alignments, and log-posteriors of alignment annotations, for training and testing datasets under various EM convergence regimes in the PFOLD benchmark. The "mininc" parameter is the minimal fractional increase in the log-likelihood that is considered by our EM implementation to be an improvement, while the "forgive" parameter is the number of iterations of EM without such an improvement that will be tolerated before the algorithm terminates. The default settings are mininc = 1e-3, forgive = 0. Here <it>D </it>denotes the alignment data, <it>A </it>denotes the Rfam secondary structure annotations of the alignment data and <it>&#952; </it>denotes the model with parameters optimized for the training set using the specified EM convergence criteria.</p>
                  </caption>
                  <tblbdy cols="6">
                     <r>
                        <c ca="left">
                           <p>Dataset</p>
                        </c>
                        <c ca="left">
                           <p>"mininc"</p>
                        </c>
                        <c ca="left">
                           <p>"forgive"</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>, <it>A</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>)</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="6">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-466330.6649</p>
                        </c>
                        <c ca="left">
                           <p>-453589.9251</p>
                        </c>
                        <c ca="left">
                           <p>-12740.7398</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-465397.0642</p>
                        </c>
                        <c ca="left">
                           <p>-453403.7081</p>
                        </c>
                        <c ca="left">
                           <p>-11993.3561</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-5</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-465397.0642</p>
                        </c>
                        <c ca="left">
                           <p>-453403.7081</p>
                        </c>
                        <c ca="left">
                           <p>-11993.3561</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>2</p>
                        </c>
                        <c ca="left">
                           <p>-465821.5239</p>
                        </c>
                        <c ca="left">
                           <p>-453476.0389</p>
                        </c>
                        <c ca="left">
                           <p>-12345.4850</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>4</p>
                        </c>
                        <c ca="left">
                           <p>-465565.9224</p>
                        </c>
                        <c ca="left">
                           <p>-453437.5353</p>
                        </c>
                        <c ca="left">
                           <p>-12128.3871</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>6</p>
                        </c>
                        <c ca="left">
                           <p>-465397.0642</p>
                        </c>
                        <c ca="left">
                           <p>-453403.7081</p>
                        </c>
                        <c ca="left">
                           <p>-11993.3561</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>8</p>
                        </c>
                        <c ca="left">
                           <p>-465291.1983</p>
                        </c>
                        <c ca="left">
                           <p>-453356.6841</p>
                        </c>
                        <c ca="left">
                           <p>-11934.5142</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>4</p>
                        </c>
                        <c ca="left">
                           <p>-465147.9174</p>
                        </c>
                        <c ca="left">
                           <p>-453318.4543</p>
                        </c>
                        <c ca="left">
                           <p>-11829.4631</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>10</p>
                        </c>
                        <c ca="left">
                           <p>-465010.8431</p>
                        </c>
                        <c ca="left">
                           <p>-453209.0744</p>
                        </c>
                        <c ca="left">
                           <p>-11801.7687</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-360472.7960</p>
                        </c>
                        <c ca="left">
                           <p>-343832.6014</p>
                        </c>
                        <c ca="left">
                           <p>-16640.1946</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-360190.7940</p>
                        </c>
                        <c ca="left">
                           <p>-344117.5123</p>
                        </c>
                        <c ca="left">
                           <p>-16073.2817</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-5</p>
                        </c>
                        <c ca="left">
                           <p>0</p>
                        </c>
                        <c ca="left">
                           <p>-360190.7940</p>
                        </c>
                        <c ca="left">
                           <p>-344117.5123</p>
                        </c>
                        <c ca="left">
                           <p>-16073.2817</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>2</p>
                        </c>
                        <c ca="left">
                           <p>-360148.9090</p>
                        </c>
                        <c ca="left">
                           <p>-343841.2775</p>
                        </c>
                        <c ca="left">
                           <p>-16307.6315</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>4</p>
                        </c>
                        <c ca="left">
                           <p>-360178.4500</p>
                        </c>
                        <c ca="left">
                           <p>-344016.2558</p>
                        </c>
                        <c ca="left">
                           <p>-16162.1942</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>6</p>
                        </c>
                        <c ca="left">
                           <p>-360190.7940</p>
                        </c>
                        <c ca="left">
                           <p>-344117.5123</p>
                        </c>
                        <c ca="left">
                           <p>-16073.2817</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-3</p>
                        </c>
                        <c ca="left">
                           <p>8</p>
                        </c>
                        <c ca="left">
                           <p>-360092.2930</p>
                        </c>
                        <c ca="left">
                           <p>-344078.8868</p>
                        </c>
                        <c ca="left">
                           <p>-16013.4062</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>4</p>
                        </c>
                        <c ca="left">
                           <p>-360057.4880</p>
                        </c>
                        <c ca="left">
                           <p>-344116.5923</p>
                        </c>
                        <c ca="left">
                           <p>-15940.8957</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>1e-4</p>
                        </c>
                        <c ca="left">
                           <p>10</p>
                        </c>
                        <c ca="left">
                           <p>-360108.0100</p>
                        </c>
                        <c ca="left">
                           <p>-344166.2108</p>
                        </c>
                        <c ca="left">
                           <p>-15941.7992</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T6">
                  <title>
                     <p>Table 6</p>
                  </title>
                  <caption>
                     <p>Log-likelihoods of alignments, and log-posteriors of alignment annotations, for training and testing datasets using the original Pfold program. Comparison with Table 5 shows that EM training increases all probabilities, as desired.</p>
                  </caption>
                  <tblbdy cols="4">
                     <r>
                        <c ca="left">
                           <p>Dataset</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>, <it>A</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>D</it>|<it>&#952;</it>)</p>
                        </c>
                        <c ca="left">
                           <p>log<sub>2 </sub><it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>)</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="4">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Training set</p>
                        </c>
                        <c ca="left">
                           <p>-487422.5964</p>
                        </c>
                        <c ca="left">
                           <p>-464828.9148</p>
                        </c>
                        <c ca="left">
                           <p>-22593.6816</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Test set</p>
                        </c>
                        <c ca="left">
                           <p>-370490.5284</p>
                        </c>
                        <c ca="left">
                           <p>-348550.7516</p>
                        </c>
                        <c ca="left">
                           <p>-21939.7768</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <p>It is important to check that the EM algorithm actually performs as designed. If it does, we expect to observe the following phenomena:</p>
               <p>&#8226; The algorithm, over the course of its iterations, should refine the parameter set (denoted at the <it>n</it>'th iteration by <it>&#952;</it><sup>(<it>n</it>)</sup>) to maximize the likelihood of the alignment data <it>D </it>and (if supplied) the annotation <it>A</it>. Therefore, the log-likelihood log <it>P</it>(<it>D</it>|<it>&#952;</it><sup>(<it>n</it>)</sup>) should increase with <it>n </it>towards an asymptotic maximum value. This is indeed observed to be the case for this example (see Figure <figr fid="F17">17</figr>).</p>
               <p>&#8226; In practice, the EM algorithm is not run for an infinite number of iterations; rather, the algorithm stops when some "convergence criteria" are met (relating to the fractional increase of the log-likelihood) and the parameters at this point are considered to be the "convergent parameters". We denote this convergent parameter set by <it>&#952;</it>*.</p>
               <fig id="F17">
                  <title>
                     <p>Figure 17</p>
                  </title>
                  <text>
                     <p>Log-likelihoods (log2 P(alignment, annotation|parameters), red line) increase as the EM algorithm optimizes the model parameters on the training set. The accuracy results for this parameterization are reported in Table 4. The blue line represents the asymptotic best log-likelihood, reached at iteration 27.</p>
                  </text>
                  <graphic file="1471-2105-7-428-17"/>
               </fig>
               <p>&#8226; If the EM algorithm is performing effectively (i.e. finding a parameterization whose likelihood is close to the global maximum), we would also expect <it>P</it>(<it>D</it>|<it>&#952;</it>*) to be greater than <it>P</it>(<it>D</it>|<it>&#952;</it>') for some arbitrarily chosen parameterization <it>&#952;</it>' (for example, the Knudsen-Hein parameters, which were optimized for a dataset other than <it>D</it>). A comparison of Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr> confirms this to be the case.</p>
               <p>&#8226; As the convergence criteria become more strict, log <it>P</it>(<it>D</it>|<it>&#952;</it>*) should increase. The results in Table <tblr tid="T5">5</tblr> confirm this to be the case.</p>
               <p>&#8226; If the training set is representative of the test set, then the above statements should also hold true when <it>D </it>is taken to mean the test set. Again, Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr> confirm this.</p>
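               <p>The effect of the stopping rule described in the points above can be illustrated with a minimal sketch. It is not xrate's code: the function <it>run_until_converged</it>, its parameters <it>min_frac_inc</it> and <it>max_iter</it>, and the toy update <it>toy_step</it> (whose log-likelihood approaches an asymptote of -100 geometrically) are all hypothetical names and assumptions introduced here for illustration only.</p>

```python
def run_until_converged(step, theta, min_frac_inc=1e-3, max_iter=1000):
    """Iterate `step` until the fractional log-likelihood gain is too small.

    `step` is a hypothetical callable: theta -> (new_theta, log_likelihood).
    Returns the final parameters and the log-likelihood at every iteration.
    """
    history = []
    prev = None
    for _ in range(max_iter):
        theta, loglike = step(theta)
        history.append(loglike)
        if prev is not None and prev != 0.0:
            # Fractional increase relative to the previous log-likelihood.
            frac_inc = (loglike - prev) / abs(prev)
            if not frac_inc > min_frac_inc:
                break  # convergence criterion met
        prev = loglike
    return theta, history


def toy_step(n):
    """Toy stand-in for one EM iteration (an assumption, not a real model).

    The 'parameter' is just an iteration counter, and the log-likelihood
    rises monotonically towards an asymptote of -100.
    """
    n = n + 1
    return n, -100.0 - 50.0 * 0.5 ** n


_, loose = run_until_converged(toy_step, 0, min_frac_inc=1e-3)
_, strict = run_until_converged(toy_step, 0, min_frac_inc=1e-4)
print(len(loose), loose[-1])
print(len(strict), strict[-1])
```

               <p>In this toy setting, the stricter threshold runs for more iterations and stops at a higher log-likelihood, which is the qualitative behaviour expected of the convergent parameters as the convergence criteria are tightened.</p>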
               <p>We note that Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr> show that the posterior probability of the true annotation, <it>P</it>(<it>A</it>|<it>D</it>, <it>&#952;</it>) = <it>P</it>(<it>A</it>, <it>D</it>|<it>&#952;</it>)/<it>P</it>(<it>D</it>|<it>&#952;</it>), is also increased after phylo-EM training. As mentioned above, this is not a provably guaranteed result of the EM algorithm, which is designed to maximize only <it>P</it>(<it>A</it>, <it>D</it>|<it>&#952;</it>).</p>
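               <p>The relationship between the three log-probability columns of Tables 5 and 6 can be checked directly: in log space, the posterior of the annotation is the joint log-likelihood minus the marginal log-likelihood of the alignment. The short sketch below uses values copied from Table 6 and, as an arbitrary illustrative choice, one EM-trained row per dataset from Table 5. The function and variable names are introduced here for illustration and are not part of xrate.</p>

```python
# Log-probabilities in bits (log base 2), copied from Tables 5 and 6.
# Each entry is (log2 P(D,A|theta), log2 P(D|theta)).
pfold_original = {
    "Training set": (-487422.5964, -464828.9148),
    "Test set": (-370490.5284, -348550.7516),
}
# One EM-trained row per dataset from Table 5 (an arbitrary choice of rows).
em_trained = {
    "Training set": (-465397.0642, -453403.7081),
    "Test set": (-360190.7940, -344117.5123),
}


def log_posterior(log_joint, log_marginal):
    """Return log2 P(A|D,theta) = log2 P(D,A|theta) - log2 P(D|theta)."""
    return log_joint - log_marginal


def posterior_gain(dataset):
    """Gain in log2 P(A|D,theta), in bits, of EM-trained over original parameters."""
    trained = log_posterior(*em_trained[dataset])
    original = log_posterior(*pfold_original[dataset])
    return trained - original


for name in pfold_original:
    print(name,
          round(log_posterior(*pfold_original[name]), 4),
          round(posterior_gain(name), 4))
```

               <p>The differences reproduce the final column of each table, and a positive gain for both datasets corresponds to the increase in the posterior probability of the true annotation noted above.</p>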
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have developed a tool, xrate, that combines the power of stochastic grammars, phylogenetic models, and fast automated parameter estimation from training data. The tool combines a novel EM algorithm for estimating rate parameters of the general irreversible substitution model (extending our earlier results for reversible models <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>) with the Forward-Backward and Inside-Outside algorithms familiar from the stochastic grammar literature <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Novel grammars can be designed by the user, trained automatically, and evaluated without the need for writing or compiling any code. Example grammars that we have used with xrate so far include the phylo-HMMs used by Thorne, Goldman and Jones to predict protein secondary structure <abbrgrp><abbr bid="B83">83</abbr></abbrgrp>, the phylo-SCFGs used by Knudsen and Hein to predict ncRNA structure <abbrgrp><abbr bid="B46">46</abbr></abbrgrp> and the DNA phylo-HMMs used by Siepel and Haussler to predict protein-coding genes and find highly-conserved elements <abbrgrp><abbr bid="B81">81</abbr><abbr bid="B80">80</abbr><abbr bid="B39">39</abbr><abbr bid="B79">79</abbr></abbrgrp>.</p>
         <p>There are many useful applications of stochastic grammars in bioinformatics. Past triumphs of HMMs include protein homology detection <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>; prediction of protein-coding genes <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>; transmembrane and signal peptide annotation <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>; and profiles of fragment libraries for <it>de novo </it>protein structure prediction <abbrgrp><abbr bid="B76">76</abbr></abbrgrp>. Applications of "higher-power" stochastic grammars (i.e. grammars that are situated further up the Chomsky hierarchy, such as Tree-Adjoining Grammars <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>) include beta-sheet prediction <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>; RNA genefinding <abbrgrp><abbr bid="B74">74</abbr></abbrgrp>, homology detection <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and structure prediction <abbrgrp><abbr bid="B73">73</abbr></abbrgrp>; and operon prediction <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
         <p>There are also many useful applications of phylogenetic models. These include reconstruction of phylogenetic trees <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, measurement of <it>K</it><sub><it>a</it></sub>/<it>K</it><sub><it>s </it></sub>ratios <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, modeling residue usage <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B31">31</abbr></abbrgrp>, modeling covariation <abbrgrp><abbr bid="B71">71</abbr></abbrgrp>, detection of conserved residues <abbrgrp><abbr bid="B90">90</abbr></abbrgrp> and sequence alignment <abbrgrp><abbr bid="B84">84</abbr><abbr bid="B33">33</abbr><abbr bid="B37">37</abbr></abbrgrp>. Furthermore, there are many applications of probabilistic modeling in sequence analysis, e.g. "evolutionary trace" <abbrgrp><abbr bid="B52">52</abbr></abbrgrp> or prediction of deleterious SNPs <abbrgrp><abbr bid="B65">65</abbr></abbrgrp>, that are either directly related to the above kinds of models or might productively be linked.</p>
         <p>xrate and associated tools comprise an up-to-date, friendly implementation of these models for the advanced user. We believe these are powerful tools with broad utility. Our results show that the performance of xrate is comparable to previously described phylo-HMM and phylo-SCFG implementations customized to specific tasks, and furthermore that the rate estimates produced by xrate can be interpreted in a biologically meaningful way. In releasing this general implementation, our hope is that we and others will use these computational tools to further the application of molecular evolution in biomedical research.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p><b>Project name </b>: xrate</p>
         <p><b>Project home page </b>: <url>http://biowiki.org/dart</url></p>
         <p><b>Operating system(s) </b>: Platform independent</p>
         <p><b>Programming language </b>: C++</p>
         <p><b>Other requirements </b>: gcc version 3.3 or higher; GNU build tools (make, ar)</p>
         <p><b>License </b>: GNU GPL</p>
         <p><b>Restrictions to use </b>: None</p>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p><b>CYK </b>: Cocke-Younger-Kasami</p>
         <p><b>DP </b>: Dynamic Programming</p>
         <p><b>EM </b>: Expectation Maximization</p>
         <p><b>HMM </b>: Hidden Markov Model</p>
         <p><b>ML </b>: Maximum Likelihood</p>
         <p><b>PPV </b>: Positive Predictive Value</p>
         <p><b>SCFG </b>: Stochastic Context-Free Grammar</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>PK implemented the irreversible phylo-EM algorithm and contributed to the supplementary material describing the algorithm. NG and RB developed the bubbleplot code. CK and NG performed the codon benchmark. YB performed the protein secondary structure benchmark. AU performed the RNA secondary structure benchmark. RB and SC performed additional benchmarks and testing of xrate. IH developed the remaining code and drafted the manuscript. IH, CK, NG, YB and AU contributed to the final version of the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Richard Goldstein, Gerton Lunter and Dawn Brooks gave helpful feedback during the development of xrate.</p>
            <p>IH, AU and YB were funded in part by NIH/NHGRI grant 1R01GM076705-01. RB was supported under a National Science Foundation Graduate Research Fellowship. YB was supported in part by the UC Berkeley Graduate Opportunity Fellowship. CK is a member of Wolfson College, University of Cambridge, and was funded by a Wellcome Trust Prize Studentship and an EMBL predoctoral fellowship. NG was partially supported by a Wellcome Trust fellowship.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Prediction of beta-sheet structures using stochastic tree grammars</p>
            </title>
            <aug>
               <au>
                  <snm>Abe</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Mamitsuka</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proceedings Genome Informatics Workshop V</source>
            <publisher>Universal Academy Press</publisher>
            <pubdate>1994</pubdate>
            <fpage>19</fpage>
            <lpage>28</lpage>
         </bibl>
         <bibl id="B2">
            <title>
               <p>SLAM cross-species gene finding and alignment with a generalized pair hidden Markov model</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandersson</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cawley</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>3</issue>
            <fpage>496</fpage>
            <lpage>502</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430255</pubid>
                  <pubid idtype="pmpid" link="fulltext">12618381</pubid>
                  <pubid idtype="doi">10.1101/gr.424203</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Estimation of reversible substitution matrices from multiple pairs of sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Arvestad</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Bruno</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1997</pubdate>
            <volume>45</volume>
            <issue>6</issue>
            <fpage>696</fpage>
            <lpage>703</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/PL00006274</pubid>
                  <pubid idtype="pmpid" link="fulltext">9419247</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes</p>
            </title>
            <aug>
               <au>
                  <snm>Baum</snm>
                  <fnm>LE</fnm>
               </au>
            </aug>
            <source>Inequalities</source>
            <pubdate>1972</pubdate>
            <volume>3</volume>
            <fpage>1</fpage>
            <lpage>8</lpage>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison</p>
            </title>
            <aug>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology</source>
            <publisher>Menlo Park, CA, AAAI Press</publisher>
            <editor>Gaasterland T, Karp P, Karplus K, Ouzounis C, Sander C, Valencia A</editor>
            <pubdate>1997</pubdate>
            <fpage>56</fpage>
            <lpage>64</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Predicting bacterial transcription units using sequence and expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Bockhorst</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Qiu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Glasner</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Blattner</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Craven</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology</source>
            <publisher>Menlo Park, CA, AAAI Press</publisher>
            <pubdate>2003</pubdate>
            <fpage>34</fpage>
            <lpage>43</lpage>
         </bibl>
         <bibl id="B7">
            <aug>
               <au>
                  <snm>Branden</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Tooze</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Introduction to Protein Structure</source>
            <publisher>Garland, New York</publisher>
            <pubdate>1991</pubdate>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Using Dirichlet mixture priors to derive hidden Markov models for protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hughey</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mian</snm>
                  <fnm>IS</fnm>
               </au>
               <au>
                  <snm>Sj&#246;lander</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the First International Conference on Intelligent Systems for Molecular Biology</source>
            <publisher>Menlo Park, CA, AAAI Press</publisher>
            <editor>Hunter L, Searls DB, Shavlik J</editor>
            <pubdate>1993</pubdate>
            <fpage>47</fpage>
            <lpage>55</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Modelling residue usage in aligned protein sequences via maximum likelihood</p>
            </title>
            <aug>
               <au>
                  <snm>Bruno</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1996</pubdate>
            <volume>13</volume>
            <issue>10</issue>
            <fpage>1368</fpage>
            <lpage>1374</lpage>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Prediction of complete gene structures in human genomic DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Burge</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1997</pubdate>
            <volume>268</volume>
            <issue>1</issue>
            <fpage>78</fpage>
            <lpage>94</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1997.0951</pubid>
                  <pubid idtype="pmpid" link="fulltext">9149143</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Stochastic models for heterogeneous DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Bulletin of Mathematical Biology</source>
            <pubdate>1989</pubdate>
            <volume>51</volume>
            <fpage>79</fpage>
            <lpage>94</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2706403</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A model of evolutionary change in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Dayhoff</snm>
                  <fnm>MO</fnm>
               </au>
               <au>
                  <snm>Eck</snm>
                  <fnm>RV</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>CM</fnm>
               </au>
            </aug>
            <source>Atlas of Protein Sequence and Structure</source>
            <publisher>National Biomedical Research Foundation, Washington, DC</publisher>
            <editor>Dayhoff MO</editor>
            <pubdate>1972</pubdate>
            <volume>5</volume>
            <fpage>89</fpage>
            <lpage>99</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>A model of evolutionary change in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Dayhoff</snm>
                  <fnm>MO</fnm>
               </au>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Orcutt</snm>
                  <fnm>BC</fnm>
               </au>
            </aug>
            <source>Atlas of Protein Sequence and Structure</source>
            <publisher>National Biomedical Research Foundation, Washington, DC</publisher>
            <editor>Dayhoff MO</editor>
            <pubdate>1978</pubdate>
            <volume>5</volume>
            <issue>supplement 3</issue>
            <fpage>345</fpage>
            <lpage>352</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Maximum likelihood from incomplete data via the EM algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Dempster</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>Journal of the Royal Statistical Society</source>
            <pubdate>1977</pubdate>
            <volume>B39</volume>
            <fpage>1</fpage>
            <lpage>38</lpage>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Modeling evolution at the protein level using an adjustable amino acid fitness model</p>
            </title>
            <aug>
               <au>
                  <snm>Dimmic</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Mindell</snm>
                  <fnm>DP</fnm>
               </au>
               <au>
                  <snm>Goldstein</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Proceedings of the Fifth Pacific Symposium on Biocomputing</source>
            <pubdate>2000</pubdate>
            <fpage>18</fpage>
            <lpage>29</lpage>
         </bibl>
         <bibl id="B16">
            <aug>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mitchison</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids</source>
            <publisher>Cambridge University Press, Cambridge, UK</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>18</issue>
         </bibl>
         <bibl id="B18">
            <title>
               <p>RNA sequence analysis using covariance models</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1994</pubdate>
            <volume>22</volume>
            <fpage>2079</fpage>
            <lpage>2088</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308124</pubid>
                  <pubid idtype="pmpid">8029015</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Maximum discrimination hidden Markov models of sequence consensus</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Mitchison</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of Computational Biology</source>
            <pubdate>1995</pubdate>
            <volume>2</volume>
            <fpage>9</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7497123</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Protein molecular function prediction by Bayesian phylogenomics</p>
            </title>
            <aug>
               <au>
                  <snm>Engelhardt</snm>
                  <fnm>BE</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
               <au>
                  <snm>Muratore</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>PLoS Computational Biology</source>
            <pubdate>2005</pubdate>
            <volume>1</volume>
            <issue>5</issue>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1246806</pubid>
                  <pubid idtype="pmpid" link="fulltext">16217548</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Evolutionary trees from DNA sequences: a maximum likelihood approach</p>
            </title>
            <aug>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1981</pubdate>
            <volume>17</volume>
            <fpage>368</fpage>
            <lpage>376</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF01734359</pubid>
                  <pubid idtype="pmpid">7288891</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <aug>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Inferring Phylogenies</source>
            <publisher>Sinauer Associates, Inc</publisher>
            <pubdate>2003</pubdate>
            <note>ISBN 0878931775.</note>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A hidden Markov model approach to variation among sites in rate of evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1996</pubdate>
            <volume>13</volume>
            <fpage>93</fpage>
            <lpage>104</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A structural EM algorithm for phylogenetic inference</p>
            </title>
            <aug>
               <au>
                  <snm>Friedman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Ninio</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Pe'er</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pupko</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Journal of Computational Biology</source>
            <pubdate>2002</pubdate>
            <volume>9</volume>
            <fpage>331</fpage>
            <lpage>353</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/10665270252935494</pubid>
                  <pubid idtype="pmpid" link="fulltext">12015885</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <aug>
               <au>
                  <snm>Gilks</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Richardson</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Spiegelhalter</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Markov Chain Monte Carlo in Practice</source>
            <publisher>Chapman &amp; Hall, London, UK</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses</p>
            </title>
            <aug>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Thorne</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1996</pubdate>
            <volume>263</volume>
            <issue>2</issue>
            <fpage>196</fpage>
            <lpage>208</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1996.0569</pubid>
                  <pubid idtype="pmpid" link="fulltext">8913301</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>A codon-based model of nucleotide substitution for protein-coding DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1994</pubdate>
            <volume>11</volume>
            <fpage>725</fpage>
            <lpage>735</lpage>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Exhaustive matching of the entire protein sequence database</p>
            </title>
            <aug>
               <au>
                  <snm>Gonnet</snm>
                  <fnm>GH</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Benner</snm>
                  <fnm>SA</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1992</pubdate>
            <volume>256</volume>
            <issue>5062</issue>
            <fpage>1443</fpage>
            <lpage>1445</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1604319</pubid>
                  <pubid idtype="pmpid" link="fulltext">1604319</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Profile analysis: detection of distantly related proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Gribskov</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences of the USA</source>
            <pubdate>1987</pubdate>
            <volume>84</volume>
            <fpage>4355</fpage>
            <lpage>4358</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">305087</pubid>
                  <pubid idtype="pmpid" link="fulltext">3474607</pubid>
                  <pubid idtype="doi">10.1073/pnas.84.13.4355</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Rfam: an RNA family database</p>
            </title>
            <aug>
               <au>
                  <snm>Griffiths-Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Khanna</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>439</fpage>
            <lpage>441</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165453</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520045</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg006</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies</p>
            </title>
            <aug>
               <au>
                  <snm>Halpern</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Bruno</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1998</pubdate>
            <volume>15</volume>
            <issue>7</issue>
            <fpage>910</fpage>
            <lpage>917</lpage>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Dating the human-ape splitting by a molecular clock of mitochondrial DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Hasegawa</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kishino</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Yano</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1985</pubdate>
            <volume>22</volume>
            <fpage>160</fpage>
            <lpage>174</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF02101694</pubid>
                  <pubid idtype="pmpid">3934395</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>An algorithm for statistical alignment of sequences related by a binary tree</p>
            </title>
            <aug>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Pacific Symposium on Biocomputing</source>
            <publisher>World Scientific, Singapore</publisher>
            <editor>Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE</editor>
            <pubdate>2001</pubdate>
            <fpage>179</fpage>
            <lpage>190</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11262938</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Statistical alignment: computational properties, homology testing and goodness-of-fit</p>
            </title>
            <aug>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wiuf</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Knudsen</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Moller</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Wibling</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2000</pubdate>
            <volume>302</volume>
            <fpage>265</fpage>
            <lpage>279</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2000.4061</pubid>
                  <pubid idtype="pmpid" link="fulltext">10964574</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Statistical inference in evolutionary models of DNA sequences via the EM algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Hobolth</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jensen</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>Statistical Applications in Genetics and Molecular Biology</source>
            <pubdate>2005</pubdate>
            <volume>4</volume>
            <issue>1</issue>
         </bibl>
         <bibl id="B36">
            <title>
               <p>A probabilistic model for the evolution of RNA structure</p>
            </title>
            <aug>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>166</issue>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">534097</pubid>
                  <pubid idtype="pmpid" link="fulltext">15507142</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Evolutionary HMMs: a Bayesian approach to multiple alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Bruno</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>9</issue>
            <fpage>803</fpage>
            <lpage>820</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.9.803</pubid>
                  <pubid idtype="pmpid" link="fulltext">11590097</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>An Expectation Maximization algorithm for training hidden substitution models</p>
            </title>
            <aug>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2002</pubdate>
            <volume>317</volume>
            <issue>5</issue>
            <fpage>757</fpage>
            <lpage>768</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1006/jmbi.2002.5405</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Efficient approximations for learning phylogenetic HMM models from data</p>
            </title>
            <aug>
               <au>
                  <snm>Jojic</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Jojic</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Meek</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Geiger</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Heckerman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>Supplement 1</issue>
            <fpage>161</fpage>
            <lpage>168</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/bioinformatics/bth917</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Tree-adjoining grammars</p>
            </title>
            <aug>
               <au>
                  <snm>Joshi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Schabes</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Evolution of protein molecules</p>
            </title>
            <aug>
               <au>
                  <snm>Jukes</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Cantor</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Mammalian Protein Metabolism</source>
            <publisher>Academic Press, New York</publisher>
            <pubdate>1969</pubdate>
            <fpage>21</fpage>
            <lpage>132</lpage>
         </bibl>
         <bibl id="B42">
            <title>
               <p>A combined transmembrane topology and signal peptide prediction method</p>
            </title>
            <aug>
               <au>
                  <snm>Kall</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2004</pubdate>
            <volume>338</volume>
            <issue>5</issue>
            <fpage>1027</fpage>
            <lpage>1036</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2004.03.016</pubid>
                  <pubid idtype="pmpid" link="fulltext">15111065</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <aug>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>A First Course in Stochastic Processes</source>
            <publisher>Academic Press, San Diego, CA</publisher>
            <pubdate>1975</pubdate>
         </bibl>
         <bibl id="B44">
            <title>
               <p>A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Kimura</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1980</pubdate>
            <volume>16</volume>
            <fpage>111</fpage>
            <lpage>120</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF01731581</pubid>
                  <pubid idtype="pmpid">7463489</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>SCOR: a structural classification of RNA database</p>
            </title>
            <aug>
               <au>
                  <snm>Klosterman</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Tamura</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Holbrook</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>392</fpage>
            <lpage>394</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99131</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752346</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.392</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>RNA secondary structure prediction using stochastic context-free grammars and evolutionary history</p>
            </title>
            <aug>
               <au>
                  <snm>Knudsen</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <issue>6</issue>
            <fpage>446</fpage>
            <lpage>454</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.6.446</pubid>
                  <pubid idtype="pmpid" link="fulltext">10383470</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Pfold: RNA secondary structure prediction using stochastic context-free grammars</p>
            </title>
            <aug>
               <au>
                  <snm>Knudsen</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3423</fpage>
            <lpage>3428</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gkg614</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>Context-dependent optimal substitution matrices</p>
            </title>
            <aug>
               <au>
                  <snm>Koshi</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Goldstein</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Protein Engineering</source>
            <pubdate>1995</pubdate>
            <volume>8</volume>
            <fpage>641</fpage>
            <lpage>645</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8577693</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Hidden Markov models in computational biology: applications to protein modeling</p>
            </title>
            <aug>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mian</snm>
                  <fnm>IS</fnm>
               </au>
               <au>
                  <snm>Sj&#246;lander</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1994</pubdate>
            <volume>235</volume>
            <fpage>1501</fpage>
            <lpage>1531</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1994.1104</pubid>
                  <pubid idtype="pmpid" link="fulltext">8107089</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Factor graphs and the sum-product algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Kschischang</snm>
                  <fnm>FR</fnm>
               </au>
               <au>
                  <snm>Frey</snm>
                  <fnm>BJ</fnm>
               </au>
               <au>
                  <snm>Loeliger</snm>
                  <fnm>H-A</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Information Theory</source>
            <pubdate>2001</pubdate>
            <volume>47</volume>
            <issue>2</issue>
            <fpage>498</fpage>
            <lpage>519</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1109/18.910572</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>The estimation of stochastic context-free grammars using the inside-outside algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Lari</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Young</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>Computer Speech and Language</source>
            <pubdate>1990</pubdate>
            <volume>4</volume>
            <fpage>35</fpage>
            <lpage>56</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0885-2308(90)90022-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B52">
            <title>
               <p>An evolutionary trace method defines binding surfaces common to protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Lichtarge</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>HR</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>FE</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1996</pubdate>
            <volume>257</volume>
            <fpage>342</fpage>
            <lpage>358</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1996.0167</pubid>
                  <pubid idtype="pmpid" link="fulltext">8609628</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B53">
            <title>
               <p>Using protein structural information in evolutionary inference: transmembrane proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Li&#242;</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1999</pubdate>
            <volume>16</volume>
            <fpage>1696</fpage>
            <lpage>1710</lpage>
         </bibl>
         <bibl id="B54">
            <title>
               <p>Genome-wide identification of human functional DNA using a neutral indel model</p>
            </title>
            <aug>
               <au>
                  <snm>Lunter</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>PLoS Computational Biology</source>
            <pubdate>2006</pubdate>
            <volume>2</volume>
            <issue>1</issue>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1326222</pubid>
                  <pubid idtype="pmpid" link="fulltext">16410828</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B55">
            <title>
               <p>A nucleotide substitution model with nearest-neighbour interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Lunter</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>Suppl 1</issue>
            <fpage>I216</fpage>
            <lpage>I223</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth901</pubid>
                  <pubid idtype="pmpid" link="fulltext">15262802</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B56">
            <title>
               <p>Recursive functions of symbolic expressions and their computation by machine</p>
            </title>
            <aug>
               <au>
                  <snm>McCarthy</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>Communications of the ACM</source>
            <pubdate>1960</pubdate>
            <volume>3</volume>
            <issue>4</issue>
            <fpage>184</fpage>
            <lpage>195</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/367177.367199</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B57">
            <aug>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Krishnan</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>The EM Algorithm and Extensions</source>
            <publisher>Wiley Interscience</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B58">
            <title>
               <p>Gene structure conservation aids similarity based gene prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Meyer</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>2</issue>
            <fpage>776</fpage>
            <lpage>783</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">373336</pubid>
                  <pubid idtype="pmpid" link="fulltext">14764925</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh211</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B59">
            <title>
               <p>Estimating rate constants in hidden Markov models by the EM algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Michalek</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Timmer</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Signal Processing</source>
            <pubdate>1999</pubdate>
            <volume>47</volume>
            <fpage>226</fpage>
            <lpage>228</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1109/78.738259</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B60">
            <title>
               <p>A long indel model for evolutionary sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Mikl&#243;s</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lunter</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>2004</pubdate>
            <volume>21</volume>
            <issue>3</issue>
            <fpage>529</fpage>
            <lpage>540</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/molbev/msh043</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B61">
            <title>
               <p>HOMSTRAD: a database of protein structure alignments for homologous families</p>
            </title>
            <aug>
               <au>
                  <snm>Mizuguchi</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Deane</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Blundell</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Overington</snm>
                  <fnm>JP</fnm>
               </au>
            </aug>
            <source>Protein Science</source>
            <pubdate>1998</pubdate>
            <volume>7</volume>
            <fpage>2469</fpage>
            <lpage>2471</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9828015</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B62">
            <title>
               <p>MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model</p>
            </title>
            <aug>
               <au>
                  <snm>Moses</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Chiang</snm>
                  <fnm>DY</fnm>
               </au>
               <au>
                  <snm>Pollard</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VN</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>12</issue>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">545801</pubid>
                  <pubid idtype="pmpid" link="fulltext">15575972</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B63">
            <title>
               <p>Modeling amino acid replacement</p>
            </title>
            <aug>
               <au>
                  <snm>M&#252;ller</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Journal of Computational Biology</source>
            <pubdate>2000</pubdate>
            <volume>7</volume>
            <issue>6</issue>
            <fpage>761</fpage>
            <lpage>776</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/10665270050514918</pubid>
                  <pubid idtype="pmpid" link="fulltext">11382360</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B64">
            <title>
               <p>Molecular studies of evolution: a source of novel statistical problems</p>
            </title>
            <aug>
               <au>
                  <snm>Neyman</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Statistical Decision Theory and Related Topics</source>
            <publisher>Academic Press, New York</publisher>
            <editor>Gupta SS, Yackel J</editor>
            <pubdate>1971</pubdate>
         </bibl>
         <bibl id="B65">
            <title>
               <p>SIFT: Predicting amino acid changes that affect protein function</p>
            </title>
            <aug>
               <au>
                  <snm>Ng</snm>
                  <fnm>PC</fnm>
               </au>
               <au>
                  <snm>Henikoff</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3812</fpage>
            <lpage>3814</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">168916</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824425</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg509</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B66">
            <aug>
               <au>
                  <snm>Pearl</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Probabilistic Reasoning in Intelligent Systems</source>
            <publisher>Morgan Kaufmann Publishers, San Mateo, California</publisher>
            <pubdate>1988</pubdate>
         </bibl>
         <bibl id="B67">
            <title>
               <p>Identification and classification of conserved RNA secondary structures in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rosenbloom</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lindblad-Toh</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>PLoS Computational Biology</source>
            <pubdate>2006</pubdate>
            <volume>2</volume>
            <issue>4</issue>
            <fpage>e33</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1440920</pubid>
                  <pubid idtype="pmpid" link="fulltext">16628248</pubid>
                  <pubid idtype="doi">10.1371/journal.pcbi.0020033</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B68">
            <title>
               <p>Gene finding with a hidden Markov model of genome structure and evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>2</issue>
            <fpage>219</fpage>
            <lpage>227</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/19.2.219</pubid>
                  <pubid idtype="pmpid" link="fulltext">12538242</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B69">
            <title>
               <p>A comparative method for finding and folding RNA secondary structures within protein-coding regions</p>
            </title>
            <aug>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Forsberg</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Simmonds</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>16</issue>
            <fpage>4925</fpage>
            <lpage>4936</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">519121</pubid>
                  <pubid idtype="pmpid" link="fulltext">15448187</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh839</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B70">
            <title>
               <p>An RNA gene expressed during cortical development evolved rapidly in humans</p>
            </title>
            <aug>
               <au>
                  <snm>Pollard</snm>
                  <mi>S</mi>
                  <fnm>Katherine</fnm>
               </au>
               <au>
                  <snm>Salama</snm>
                  <mi>R</mi>
                  <fnm>Sofie</fnm>
               </au>
               <au>
                  <snm>Lambert</snm>
                  <fnm>Nelle</fnm>
               </au>
               <au>
                  <snm>Lambot</snm>
                  <fnm>Marie-Alexandra</fnm>
               </au>
               <au>
                  <snm>Coppens</snm>
                  <fnm>Sandra</fnm>
               </au>
               <au>
                  <snm>Pedersen</snm>
                  <mi>S</mi>
                  <fnm>Jakob</fnm>
               </au>
               <au>
                  <snm>Katzman</snm>
                  <fnm>Sol</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>Bryan</fnm>
               </au>
               <au>
                  <snm>Onodera</snm>
                  <fnm>Courtney</fnm>
               </au>
               <au>
                  <snm>Siepel</snm>
                  <fnm>Adam</fnm>
               </au>
               <au>
                  <snm>Kern</snm>
                  <mi>D</mi>
                  <fnm>Andrew</fnm>
               </au>
               <au>
                  <snm>Dehay</snm>
                  <fnm>Colette</fnm>
               </au>
               <au>
                  <snm>Igel</snm>
                  <fnm>Haller</fnm>
               </au>
               <au>
                  <snm>Ares</snm>
                  <fnm>Manuel</fnm>
                  <suf>Jr</suf>
               </au>
               <au>
                  <snm>Vanderhaeghen</snm>
                  <fnm>Pierre</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>David</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2006</pubdate>
            <volume>443</volume>
            <issue>7108</issue>
            <fpage>167</fpage>
            <lpage>172</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature05113</pubid>
                  <pubid idtype="pmpid" link="fulltext">16915236</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B71">
            <title>
               <p>Coevolving protein residues: maximum likelihood identification and relationship to structure</p>
            </title>
            <aug>
               <au>
                  <snm>Pollock</snm>
                  <fnm>DD</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1999</pubdate>
            <volume>287</volume>
            <issue>1</issue>
            <fpage>187</fpage>
            <lpage>198</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1998.2601</pubid>
                  <pubid idtype="pmpid" link="fulltext">10074416</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B72">
            <title>
               <p>An introduction to hidden Markov models</p>
            </title>
            <aug>
               <au>
                  <snm>Rabiner</snm>
                  <fnm>LR</fnm>
               </au>
               <au>
                  <snm>Juang</snm>
                  <fnm>BH</fnm>
               </au>
            </aug>
            <source>IEEE ASSP Magazine</source>
            <pubdate>1986</pubdate>
            <volume>3</volume>
            <issue>1</issue>
            <fpage>4</fpage>
            <lpage>16</lpage>
         </bibl>
         <bibl id="B73">
            <title>
               <p>The language of RNA: a formal grammar that includes pseudoknots</p>
            </title>
            <aug>
               <au>
                  <snm>Rivas</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>4</issue>
            <fpage>334</fpage>
            <lpage>340</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.4.334</pubid>
                  <pubid idtype="pmpid" link="fulltext">10869031</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B74">
            <title>
               <p>Noncoding RNA gene detection using comparative sequence analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Rivas</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <issue>8</issue>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">64605</pubid>
                  <pubid idtype="pmpid" link="fulltext">11801179</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B75">
            <title>
               <p>S-expressions. Internet Draft</p>
            </title>
            <aug>
               <au>
                  <snm>Rivest</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <pubdate>1997</pubdate>
            <url>http://theory.lcs.mit.edu/~rivest/sexp.txt</url>
         </bibl>
         <bibl id="B76">
            <title>
               <p>Protein structure prediction using Rosetta</p>
            </title>
            <aug>
               <au>
                  <snm>Rohl</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Strauss</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Misura</snm>
                  <fnm>KM</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Methods in Enzymology</source>
            <pubdate>2004</pubdate>
            <volume>383</volume>
            <fpage>66</fpage>
            <lpage>93</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15063647</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B77">
            <title>
               <p>The neighbor-joining method: a new method for reconstructing phylogenetic trees</p>
            </title>
            <aug>
               <au>
                  <snm>Saitou</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Nei</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1987</pubdate>
            <volume>4</volume>
            <fpage>406</fpage>
            <lpage>425</lpage>
         </bibl>
         <bibl id="B78">
            <title>
               <p>Stochastic context-free grammars for tRNA modeling</p>
            </title>
            <aug>
               <au>
                  <snm>Sakakibara</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hughey</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mian</snm>
                  <fnm>IS</fnm>
               </au>
               <au>
                  <snm>Sj&#246;lander</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Underwood</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1994</pubdate>
            <volume>22</volume>
            <fpage>5112</fpage>
            <lpage>5120</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">523785</pubid>
                  <pubid idtype="pmpid">7800507</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B79">
            <title>
               <p>Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Hinrichs</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Hou</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rosenbloom</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Clawson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Spieth</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hillier</snm>
                  <fnm>LW</fnm>
               </au>
               <au>
                  <snm>Richards</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Weinstock</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Wilson</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Gibbs</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <issue>8</issue>
            <fpage>1034</fpage>
            <lpage>1050</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1182216</pubid>
                  <pubid idtype="pmpid" link="fulltext">16024819</pubid>
                  <pubid idtype="doi">10.1101/gr.3715005</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B80">
            <title>
               <p>Combining phylogenetic and hidden Markov models in biosequence analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Journal of Computational Biology</source>
            <pubdate>2004</pubdate>
            <volume>11</volume>
            <issue>2&#8211;3</issue>
            <fpage>413</fpage>
            <lpage>428</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/1066527041410472</pubid>
                  <pubid idtype="pmpid" link="fulltext">15285899</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B81">
            <title>
               <p>Phylogenetic estimation of context-dependent substitution rates by maximum likelihood</p>
            </title>
            <aug>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>2004</pubdate>
            <volume>21</volume>
            <issue>3</issue>
            <fpage>468</fpage>
            <lpage>488</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/molbev/msh039</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B82">
            <title>
               <p>Predicting functional sites in proteins: site-specific evolutionary models and their application to neurotransmitter transporters</p>
            </title>
            <aug>
               <au>
                  <snm>Soyer</snm>
                  <fnm>OS</fnm>
               </au>
               <au>
                  <snm>Goldstein</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2004</pubdate>
            <volume>339</volume>
            <issue>1</issue>
            <fpage>227</fpage>
            <lpage>242</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2004.03.025</pubid>
                  <pubid idtype="pmpid" link="fulltext">15123434</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B83">
            <title>
               <p>Combining protein evolution and secondary structure</p>
            </title>
            <aug>
               <au>
                  <snm>Thorne</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1996</pubdate>
            <volume>13</volume>
            <fpage>666</fpage>
            <lpage>673</lpage>
         </bibl>
         <bibl id="B84">
            <title>
               <p>An evolutionary model for maximum likelihood alignment of DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Thorne</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Kishino</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1991</pubdate>
            <volume>33</volume>
            <fpage>114</fpage>
            <lpage>124</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF02193625</pubid>
                  <pubid idtype="pmpid">1920447</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B85">
            <title>
               <p>Identification of regulatory regions which confer muscle-specific gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>WW</fnm>
               </au>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1998</pubdate>
            <volume>278</volume>
            <issue>1</issue>
            <fpage>167</fpage>
            <lpage>181</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1998.1700</pubid>
                  <pubid idtype="pmpid" link="fulltext">9571041</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B86">
            <title>
               <p>Pandit: a database of protein and associated nucleotide domains with inferred trees</p>
            </title>
            <aug>
               <au>
                  <snm>Whelan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>de Bakker</snm>
                  <fnm>PI</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>12</issue>
            <fpage>1556</fpage>
            <lpage>1563</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg188</pubid>
                  <pubid idtype="pmpid" link="fulltext">12912837</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B87">
            <title>
               <p>A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach</p>
            </title>
            <aug>
               <au>
                  <snm>Whelan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>2001</pubdate>
            <volume>18</volume>
            <issue>5</issue>
            <fpage>691</fpage>
            <lpage>699</lpage>
         </bibl>
         <bibl id="B88">
            <title>
               <p>The xgram file format</p>
            </title>
            <url>http://biowiki.org/XgramFormat</url>
         </bibl>
         <bibl id="B89">
            <title>
               <p>Information on xrate, xgram, xprot, xfold and related tools</p>
            </title>
            <url>http://biowiki.org/XgramSoftware</url>
         </bibl>
         <bibl id="B90">
            <title>
               <p>Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Molecular Biology and Evolution</source>
            <pubdate>1993</pubdate>
            <volume>10</volume>
            <fpage>1396</fpage>
            <lpage>1401</lpage>
         </bibl>
         <bibl id="B91">
            <title>
               <p>Estimating the pattern of nucleotide substitution</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1994</pubdate>
            <volume>39</volume>
            <fpage>105</fpage>
            <lpage>111</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8064867</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B92">
            <title>
               <p>Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1994</pubdate>
            <volume>39</volume>
            <fpage>306</fpage>
            <lpage>314</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF00160154</pubid>
                  <pubid idtype="pmpid">7932792</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B93">
            <title>
               <p>Codon-substitution models for heterogeneous selection pressure at amino acid sites</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Nielsen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>A-M</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2000</pubdate>
            <volume>155</volume>
            <fpage>431</fpage>
            <lpage>449</lpage>
         </bibl>
         <bibl id="B94">
            <title>
               <p>Estimating substitution matrices</p>
            </title>
            <aug>
               <au>
                  <snm>Yap</snm>
                  <fnm>VB</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Statistical Methods in Molecular Evolution</source>
            <publisher>Springer</publisher>
            <pubdate>2005</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
