<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-11-54</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Towards realistic benchmarks for multiple alignments of non-coding sequences</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Kim</snm>
               <fnm>Jaebum</fnm>
               <insr iid="I1"/>
               <email>jkim63@illinois.edu</email>
            </au>
            <au ca="yes" id="A2">
               <snm>Sinha</snm>
               <fnm>Saurabh</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>sinhas@illinois.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA</p>
            </ins>
            <ins id="I2">
               <p>Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2010</pubdate>
         <volume>11</volume>
         <issue>1</issue>
         <fpage>54</fpage>
         <url>http://www.biomedcentral.com/1471-2105/11/54</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">20102627</pubid>
               <pubid idtype="doi">10.1186/1471-2105-11-54</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>7</day>
               <month>8</month>
               <year>2009</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>26</day>
               <month>1</month>
               <year>2010</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>26</day>
               <month>1</month>
               <year>2010</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2010</year>
         <collab>Kim and Sinha; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>
                  <b>Abstract</b>
               </p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the <it>Drosophila </it>group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of <it>Drosophila </it>non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in <it>Drosophila </it>non-coding sequences if provided with the true alignments.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We have developed a method to generate benchmarks for multiple alignments of <it>Drosophila </it>non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification id="endnote" subtype="user_supplied_xml" type="bmc"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The availability of genome sequences of closely related species (such as 18 placental mammal species <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and 12 <it>Drosophila </it>species <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>) has provided opportunities to solve several key biological problems such as the inference of phylogenetic trees, reconstruction of ancestral genomes, estimation of evolutionary rates, identification of conserved and non-conserved regions, and more generally the study of genome structure and evolution. The alignment of multiple sequences, highlighting regions of homology among the sequences and predicting nucleotide level relationships among them, plays a critical role in such analyses. Numerous attempts have been made to develop accurate and efficient methods to solve the multiple sequence alignment problem (reviewed in <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>), offering us much flexibility, as well as difficulty, in choosing the most appropriate tool(s) for the task. Another important task related to multiple alignment is the annotation of insertions and deletions (indels) in the alignment, a task that has received some attention in recent years <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> in light of the realization that indels may be responsible for genomic variation as much as nucleotide substitutions are <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, and that indels may affect regional mutation rates <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
         <p>Given the availability of multiple tools to perform either of these two tasks, a researcher faces two important questions: "Which of the tools should I use for my task?" and "How accurate will the tool be on my data?" Answers to these come from studies that use data sets ("benchmarks") where the true answers are known, to evaluate and compare different tools. The design of benchmarks therefore directly affects the reliability of bioinformatics analyses that use those tools. The two most widely used benchmarking approaches for alignment tools are (i) to make use of biological sequences and their manually curated alignments from databases such as Homstrad <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, BAliBASE <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, and SABmark <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, or (ii) to simulate the evolution of biological sequences by using specialized tools such as Dawg <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, Rose <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> and INDELible <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. The main advantage of the former approach is the use of real biological sequences and alignments that are produced by using protein structure information. This approach does not apply to non-coding DNA sequences, whose alignments form the basis of regulatory comparative genomics. Therefore, simulation-based benchmarks have been widely adopted in this context <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. The simulation approach, however, is highly dependent on its parameters that reflect the underlying evolutionary processes and their rates. 
It is not clear how to choose "correct" settings for these parameters, or how to assess whether the simulated sequences mimic real data well enough for claims about alignment accuracy, both relative (i.e., comparison of tools) and absolute, to generalize from the benchmarks to the real-world setting. We address these questions in this work, whose main contributions are the following.</p>
         <p>1) We present a new simulation-based benchmarking method that is based on the entire spectrum of values of its parameters as inferred from real data. This is in contrast to existing approaches that rely on the average observed values of the parameters.</p>
         <p>2) We quantify the difficulty of aligning a data set by leveraging recent developments <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> on estimating alignment accuracy without requiring the "true" alignments. We reason that if the synthetic data sets truly mimic real orthologous sequences, the difficulty of aligning them ought to match that for the real data. This is the key insight used to determine how realistic a particular benchmark (i.e., collection of data sets) is, and we use this idea to show that the novel simulation method produces far more realistic benchmarks than the existing approach.</p>
         <p>3) Using our new benchmarks, we evaluate and compare the accuracy of six multiple alignment tools (ClustalW <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, Dialign-TX <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, Mafft <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, Mavid <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, Mlagan <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, and Pecan <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>) on <it>Drosophila </it>non-coding sequences. The specific alignment task we consider is that of global alignment of ~1-10 Kbp long sequences, and our conclusions may not apply to the task of local alignment, which was studied in <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. We are able to estimate the accuracy of alignment for specific sets of <it>Drosophila </it>genomes, and find these to be very different from previously reported values. We also evaluate two schemes for annotating insertions and deletions specifically, and find their accuracy to be comparable, and close to optimal.</p>
         <p>4) We find that data sets with an excess of deletions over insertions are more amenable to accurate alignment than those with an excess of insertions, suggesting an implicit bias (in the alignment tools) with respect to their treatment of indels, even though none of the evaluated tools explicitly makes a distinction between insertions and deletions.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Simulation of non-coding sequences by a traditional method</p>
            </st>
            <p>Modeling of DNA sequence evolution has been studied extensively in the past, and state-of-the-art simulation programs <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> draw on various aspects of such models. Simulation of non-coding sequences <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> incorporates current understanding of the architecture of such sequences in terms of regions of evolutionary constraint, for example by stipulating the presence of short (but variable length) subsequences that evolve at a much slower rate than the rest of the sequence. We refer the reader to <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B21">21</abbr></abbrgrp> for a comprehensive description of these approaches, which form the foundation of our own work reported here. These simulation programs rely crucially on the values of their parameters (e.g., substitution rate or frequency of constrained blocks). The parameters serve to fully specify the stochastic processes from which evolutionary events (e.g., substitutions or indels) will be sampled, and prescribe the <it>expected </it>frequency of those events in the generated data sets. Variation in the frequency of these events, which underlie the difficulty of alignment tasks, results from the inherent randomness of the simulation process, i.e., the differences in random choices made from one "run" of the process to another. It is natural to ask if the resulting variability across data sets in a synthetic benchmark is comparable to the corresponding variability observed in real orthologous sequences. The question is particularly relevant due to the heterogeneity of non-coding sequences with respect to the density of functional elements and also motivated by the known variation in evolutionary rates across loci <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>.</p>
            <p>We began by implementing the above-mentioned simulation paradigm, which we call the "traditional" paradigm, by incorporating the "constraint blocks" idea of Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> into the Dawg simulation program <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Parameters, including phylogeny, branch lengths, indel frequency, and various parameters related to conserved blocks were set based on previously published values from the literature <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B37">37</abbr></abbrgrp> or estimated by us from published multiple alignments of <it>Drosophila </it>non-coding sequences (see Methods). A key difference in our implementation was that branch lengths (i.e., average substitution rates) were estimated from non-coding sequences themselves, instead of synonymous substitution rates from coding sequences, as has been done previously. We elaborate on this important issue later in this section.</p>
            <p>We considered the alignments of real <it>Drosophila </it>sequences from eight species (see Methods), computed the sum of branch lengths of the phylogenetic tree estimated from ~1 Kbp segments of alignment, and found the distribution of this statistic to have a large variance across the genome (black bars in Figure <figr fid="F1">1</figr>). The same distribution, when computed from 100 synthetic data sets generated using the traditional simulator described above, and the same alignment program, shows a very sharp peak around the mean (dark gray bars in Figure <figr fid="F1">1</figr>). We note that the means of the two distributions are similar (1.87 in real data and 1.94 in synthetic data), since the benchmark was parameterized by the average substitution rates observed in real data. This is the first clear evidence that existing simulators fall short of representing the <it>range </it>of conservation levels in real data.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Distributions of sum of branch lengths in a phylogenetic tree estimated from real data and synthetic data respectively</p>
               </caption>
               <text>
                  <p><b>Distributions of sum of branch lengths in a phylogenetic tree estimated from real data and synthetic data respectively</b>. Sequences of eight <it>Drosophila </it>species were collected from real data ("Real"), data produced by a traditional simulator ("Traditional"), and data produced by the new simulator based on parameter sampling ("New"). The traditional simulator used the average substitution rates observed in the real data, while the new simulator used the empirical distribution of substitution rates in real data. The branch lengths were estimated by Paml <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>.</p>
               </text>
               <graphic file="1471-2105-11-54-1" hint_layout="single"/>
            </fig>
            <p>Since substitution rates are generally correlated with indel rates, a large variance in the former implies a corresponding variance in indel frequencies, which of course lie at the root of the alignment problem. This suggests that if we could measure the "difficulty of alignment" in any region of the genome (e.g., by having knowledge of the true alignment, and measuring the accuracy of a powerful alignment program), we ought to see a large variability in this measure across the genome. Moreover, if the observed distribution of the alignment difficulty measure is comparable to that in a benchmark, we would be confident in making claims about performance of alignment tools based on that benchmark. The problem is that measuring alignment difficulty on real data requires knowledge of their true alignment, which is unavailable. Recent work by Landan and Graur <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> showed that a reasonable surrogate for the accuracy of an alignment program on a data set can be computed even without the true alignment. They reasoned that good alignments should be invariant to the <it>orientation </it>of the input sequences, and therefore defined the "Heads or Tails (HoT)" alignment quality score as the agreement between two alignments, one generated from original sequences and the other from their reversed versions. Hall <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> later showed that there is a clear positive correlation between HoT alignment quality scores and the real alignment accuracy measured by comparison with the true alignment. This remarkable finding inspired us to formulate the following strategy for quantifying the spectrum of alignment difficulty in data sets. We computed the HoT alignment quality score on the computed alignment of a data set, and used this score as a surrogate for the alignment difficulty of the data set. 
(The alignment was computed using a well-established alignment program called Pecan <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, but other choices would not affect our conclusions.) Low values of the alignment quality score indicate that the data set is particularly hard to align, and high values are suggestive of an "easy" data set. As shown in Figure <figr fid="F2">2A</figr>, the distributions of the score were significantly different between synthetic and real data sets. Alignment quality scores were above 95 for 83% of the synthetic sequences, whereas close to 50% of the real sequences scored below this value. This strongly suggests that by and large the synthetic sequences simulated by the traditional approach are easier to align than real sequences, even though the former were generated with evolutionary parameters mirroring their real data counterparts. In particular, the variance of alignment quality (and presumably of alignment difficulty) is much smaller in synthetic data sets.</p>
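            <p>The agreement between a "heads" alignment and a "tails" alignment can be illustrated with a minimal sketch. It is not the HoT implementation of <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>: the function names are hypothetical, the two alignments are assumed to have been computed already (one from the original sequences, one from their reversed versions), and the score is returned as a fraction rather than on the 0-100 scale used in Figure <figr fid="F2">2</figr>.</p>

```python
from itertools import combinations


def residue_pairs(alignment, reverse=False):
    """Return the set of aligned residue pairs in a gapped alignment.

    alignment: list of equal-length strings, with '-' marking a gap.
    Each pair is ((seq_a, pos_a), (seq_b, pos_b)), where positions are
    0-based indices into the ungapped sequences in original orientation.
    If reverse is True, the rows are assumed to align reversed sequences,
    so positions are mapped back to the original orientation.
    """
    lengths = [len(row.replace("-", "")) for row in alignment]
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                pos = counters[s]
                counters[s] += 1
                if reverse:
                    pos = lengths[s] - 1 - pos
                present.append((s, pos))
        # every pair of residues sharing a column is an aligned pair
        pairs.update(combinations(present, 2))
    return pairs


def hot_sps(heads, tails):
    """Fraction of residue pairs in the heads alignment also found in tails."""
    h = residue_pairs(heads)
    t = residue_pairs(tails, reverse=True)
    return len(h.intersection(t)) / len(h) if h else 1.0
```

            <p>In this sketch, a data set whose two alignments place the same residues in the same columns scores 1.0, and the score falls as the aligner's output becomes more sensitive to sequence orientation.</p>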
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Distributions of alignment quality scores - HoT SPS (A-D) and HoT CS (E-H) - between real and simulated sequences</p>
               </caption>
               <text>
                  <p><b>Distributions of alignment quality scores - HoT SPS (A-D) and HoT CS (E-H) - between real and simulated sequences</b>. Synthetic sequences were simulated by (A, E) a traditional method, (B, F) using a mixture model of evolutionary rates, (C, G) using a mixture model of ratios of substitutions to indels, and (D, H) a novel method that relies on observed genome-wide distributions of its parameters.</p>
               </text>
               <graphic file="1471-2105-11-54-2" hint_layout="double"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Simulation based on a mixture model of parameters</p>
            </st>
            <p>We hypothesized that the above observation about synthetic data sets was due to the use of a single setting of the branch lengths, and the relatively low variability resulting from the randomness of the process itself (Figure <figr fid="F1">1</figr>). If this is true, then one way to alleviate the problem would be to allow for multiple phylogenies for simulation of different data sets, with the variability of branch lengths across phylogenies introducing an additional source of data set variability. We therefore considered a set of <it>K = 10 </it>phylogenies {<it>&#981;</it><sub><it>1</it></sub>, <it>&#981;</it><sub><it>2</it></sub>, ..., <it>&#981;</it><sub><it>K</it></sub>} that are scaled versions of the original phylogeny <it>&#981;</it><sub><it>0</it></sub>, i.e., every branch length in phylogeny <it>&#981;</it><sub><it>i </it></sub>is a constant factor <it>&#964;</it><sub><it>i </it></sub>times the corresponding branch length in <it>&#981;</it><sub><it>0</it></sub>. (We used {<it>&#964;</it><sub><it>i</it></sub>} = {1, 2, ..., 10}.) We modified the simulator to first sample one of the <it>K </it>phylogenies at random, and then simulate according to this setting of branch lengths, with all other parameters being fixed as before. In other words, the distribution of alignment quality scores from the new simulation process is a mixture distribution, with components parameterized by different phylogenies and the probability of sampling any particular phylogeny being the mixture weight. We estimated an upper bound on the agreement between this mixture distribution and the observed distribution of alignment quality scores, by maximum likelihood training of the mixture weights using the expectation-maximization algorithm <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. 
This "best fit" mixture distribution is shown in Figure <figr fid="F2">2B</figr>, along with the real data distribution, and reveals a much stronger agreement between the two distributions, as compared to Figure <figr fid="F2">2A</figr>. The same trend was seen when allowing for a set of values of the "substitution to indel ratio" parameter (with values 10:1, 10:2, ..., 10:5), keeping all other parameters, including the phylogeny, fixed (Figure <figr fid="F2">2C</figr>). These results strongly suggested that the use of a range of parameter values instead of a single value has a great impact on the variability of alignment difficulty in synthetic data sets, and has the potential to lead to the generation of realistic sequences.</p>
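            <p>The fitting of mixture weights with fixed components can be sketched as follows. This is an illustrative reading of the procedure, not the code used in the study: the function name and the input layout are assumptions, and each entry of the input is taken to be the probability of one observed alignment quality score under the score distribution of one scaled phylogeny (for example, a histogram bin frequency from simulated data).</p>

```python
def fit_mixture_weights(likelihoods, n_iter=200):
    """EM for mixture weights when the component distributions are fixed.

    likelihoods[n][k] is the probability of observation n under component k
    (assumed precomputed, e.g. from per-component histograms).
    Returns a list of K weights that sum to 1.
    """
    k = len(likelihoods[0])
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in likelihoods:
            # E-step: responsibility of each component for this observation
            joint = [w * p for w, p in zip(weights, row)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # observation has zero probability under all components
            for j in range(k):
                totals[j] += joint[j] / norm
        denom = sum(totals)
        if denom == 0.0:
            break  # nothing to update; keep the current weights
        # M-step: new weight is the average responsibility
        weights = [t / denom for t in totals]
    return weights
```

            <p>Because only the weights are trained, each iteration reduces to averaging responsibilities, and the likelihood of the observed scores cannot decrease from one iteration to the next.</p>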
         </sec>
         <sec>
            <st>
               <p>Simulation based on parameter sampling</p>
            </st>
            <p>The above results, while encouraging in terms of better reproducing the genomic variability of alignment difficulty, were obtained by fitting parameters of the simulation process so as to best match real data. We next asked if we could achieve the same or better agreement between the synthetic and real data distributions without having seen the real distribution of alignment quality scores. This would then allow us to use the observed agreement as a relatively unbiased assessment of how realistic the benchmark is. Developing the mixture model idea from the previous section, we now computed for each parameter the entire distribution of values observed in real data alignments, just as the traditional approach estimates the average of these values. The simulation process was now made to sample each parameter independently from its empirical distribution, and then generate a data set based on the sampled parameter values. The benchmark thus constructed (comprising 10000 different data sets) was examined for its distribution of alignment quality scores, and as seen in Figure <figr fid="F2">2D</figr>, this distribution was remarkably close to that observed in real sequences. In other words, the newly constructed benchmark meets our pre-specified criterion for a "realistic" benchmark. (It also shows strong agreement, as expected, with real data in terms of estimated branch lengths; Figure <figr fid="F1">1</figr>.)</p>
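            <p>The sampling step described above can be sketched in a few lines. The parameter names in the usage note and the function name are hypothetical, and the sketch only draws parameter settings; the sequence simulation itself would be carried out by the modified simulator.</p>

```python
import random


def sample_parameter_sets(empirical, n_sets, seed=None):
    """Draw n_sets parameter settings, sampling each parameter independently.

    empirical maps a parameter name to the list of values observed in real
    alignments. Drawing uniformly from that list is equivalent to sampling
    from the empirical distribution of the parameter.
    """
    rng = random.Random(seed)
    names = sorted(empirical)
    return [
        {name: rng.choice(empirical[name]) for name in names}
        for _ in range(n_sets)
    ]
```

            <p>For instance, calling the function with a dictionary such as {"tree_scale": [...], "sub_to_indel_ratio": [...]} and n_sets = 10000 would yield one parameter setting per synthetic data set. Independent sampling ignores any correlation between parameters in real data, which is a simplification of this sketch as well as of the procedure it illustrates.</p>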
            <p>The above analysis was performed using the sum-of-pairs score (SPS), which is the simplest of the scores defined in the HoT approach <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. We repeated all analyses with another score, called the HoT column score (CS), and observed the same trends (Figure <figr fid="F2">2E-H</figr>), although the agreement between synthetic and real data distributions was not as strong now as with the SPS (Figure <figr fid="F2">2D</figr>) (also see Discussion).</p>
         </sec>
         <sec>
            <st>
               <p>Assessment of multiple alignment tools</p>
            </st>
            <sec>
               <st>
                  <p>Accuracy of multiple alignments</p>
               </st>
               <p>We used our new benchmark to evaluate and compare six leading multiple alignment tools that are publicly available and can align DNA sequences. These are ClustalW 2.0.5 <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, Dialign-TX 1.0.0 <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, Mafft 6.240 <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, Mavid 2.0 build 4 <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, Mlagan 2.0 <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, and Pecan 0.7 <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. We performed the assessment with varying numbers of species, <it>K </it>= 3, ..., 8. For each choice of <it>K</it>, 10000 sets of sequences corresponding to <it>K </it>different <it>Drosophila </it>species were simulated and the above alignment tools were run with default parameters or with the best setting recommended by their authors. We then compared the resulting alignments to the "true" alignments reported by the simulation program, using the following three commonly used evaluation measures <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>: (i) <it>alignment agreement</it>, which is the fraction of aligned base pairs (or bases aligned to gaps) in the predicted alignment that agree with the true alignment, (ii) <it>alignment sensitivity</it>, which is the fraction of aligned base pairs of the true alignment that agree with the predicted alignment, and (iii) <it>alignment specificity</it>, which is the fraction of aligned base pairs of the predicted alignment that agree with the true alignment. Whereas the alignment agreement score considers aligned base pairs as well as bases aligned to gaps, the sensitivity and specificity scores are calculated <it>only </it>from aligned base pairs. 
The results of our evaluations are shown in Figure <figr fid="F3">3</figr> and Additional files <supplr sid="S1">1</supplr> and <supplr sid="S2">2</supplr> (left panels) (see Additional file <supplr sid="S3">3</supplr> for an example of true and computed alignments by the six alignment programs). The Pecan alignment program was found to be superior by all three measures, across all values of <it>K</it>. Its performance degraded more slowly (with increasing <it>K</it>) than that of the other tools, so the gap between Pecan and the other tools grew as more species were included in the tests. The average alignment agreement in five-species alignments produced by Pecan (the species most divergent from <it>D. melanogaster </it>being <it>D. pseudoobscura</it>) was close to 80%, but degraded to ~67% when aligning all eight species.</p>
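               <p>The sensitivity and specificity measures defined above can be made concrete with a short sketch, under the assumption that an alignment is given as a list of equal-length gapped strings. The function names are hypothetical, and the alignment agreement measure, which also counts bases aligned to gaps, is not covered here.</p>

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of aligned residue pairs ((seq_a, pos_a), (seq_b, pos_b)).

    Positions are 0-based indices into the ungapped sequences; '-' is a gap.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sensitivity_specificity(true_aln, pred_aln):
    """Return (sensitivity, specificity) of a predicted alignment.

    sensitivity: fraction of true aligned pairs recovered by the prediction.
    specificity: fraction of predicted aligned pairs present in the truth.
    """
    true_pairs = aligned_pairs(true_aln)
    pred_pairs = aligned_pairs(pred_aln)
    shared = len(true_pairs.intersection(pred_pairs))
    sens = shared / len(true_pairs) if true_pairs else 1.0
    spec = shared / len(pred_pairs) if pred_pairs else 1.0
    return sens, spec
```

               <p>The two measures differ whenever the predicted alignment contains a different number of aligned pairs than the true alignment, for example when an aligner opens more gaps than actually occurred.</p>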
               <suppl id="S1">
                  <title>
                     <p>Additional file 1</p>
                  </title>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment sensitivity</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <file name="1471-2105-11-54-S1.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S2">
                  <title>
                     <p>Additional file 2</p>
                  </title>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment specificity</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <file name="1471-2105-11-54-S2.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S3">
                  <title>
                     <p>Additional file 3</p>
                  </title>
                  <text>
                     <p>An example data set from the benchmark, shown (in part) with the true alignment (top panel) and the alignments computed by each of the six alignment programs.</p>
                  </text>
                  <file name="1471-2105-11-54-S3.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <fig id="F3">
                  <title>
                     <p>Figure 3</p>
                  </title>
                  <caption>
                     <p>Performance of multiple alignment tools compared by alignment agreement</p>
                  </caption>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment agreement</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <graphic file="1471-2105-11-54-3" hint_layout="double"/>
               </fig>
               <p>We performed the same evaluations by limiting ourselves to those data sets (in the benchmark) that had an excess of insertions over deletions, and separately to those data sets with an excess of deletions (Figure <figr fid="F3">3</figr>, and Additional files <supplr sid="S1">1</supplr> and <supplr sid="S2">2</supplr>; middle and right panels). Surprisingly, we saw a clear difference between these two classes of data sets, with most tools performing significantly worse when there was an excess of insertions in the data set. For example, on data sets with <it>K = 8</it>, ClustalW showed an alignment agreement of 36% when there was an excess of insertions and 46% when there was an excess of deletions. The same trend was seen in terms of the alignment sensitivity and specificity measures. Notably, Pecan was largely unaffected by this dichotomy of data sets. (For additional insights on how alignment accuracy depends on various other descriptive statistics of a data set, e.g., total divergence, indel count, or total indel length, see Additional file <supplr sid="S4">4</supplr>.)</p>
               <suppl id="S4">
                  <title>
                     <p>Additional file 4</p>
                  </title>
                  <text>
                     <p>Dependence of performance (sensitivity (left) and specificity (right)) of each alignment program on various descriptive statistics of the data sets.</p>
                  </text>
                  <file name="1471-2105-11-54-S4.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <p>The evaluation measures used above consider all pairs of species in the <it>K</it>-species alignment and sum the accuracy values obtained from all pairs, without regard to the varying divergences of different pairs. In an attempt to address this issue, we separately measured the alignment accuracy of different pairs of species (e.g., <it>D. melanogaster </it>- <it>D. simulans</it>, <it>D. melanogaster - D. yakuba</it>, etc.), limiting ourselves to the eight-species data sets. All trends reported above were also seen in this alternative view of the results (Figure <figr fid="F4">4</figr>, and Additional files <supplr sid="S5">5</supplr> and <supplr sid="S6">6</supplr>). The alignment agreement, using Pecan, for <it>D. melanogaster </it>with <it>D. yakuba</it>, <it>D. ananassae</it>, <it>D. pseudoobscura </it>and <it>D. willistoni </it>was found to be 96%, 77%, 71% and 60% respectively.</p>
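               <p>The pairwise view described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in this study. It assumes that sensitivity is the fraction of residue pairs aligned in the true alignment that are recovered in the predicted alignment, and that specificity is the fraction of predicted pairs that are also true; the function names, species keys and toy alignments are hypothetical.</p>

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned between two rows.

    row_a and row_b are equal-length gapped strings taken from a multiple
    alignment; columns where either row has a gap contribute no pair.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # ungapped positions in row_a and row_b
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def pairwise_accuracy(true_aln, pred_aln, sp1, sp2):
    """Sensitivity and specificity for one species pair (assumed definitions)."""
    true_pairs = aligned_pairs(true_aln[sp1], true_aln[sp2])
    pred_pairs = aligned_pairs(pred_aln[sp1], pred_aln[sp2])
    shared = len(true_pairs.intersection(pred_pairs))
    sensitivity = shared / len(true_pairs) if true_pairs else 0.0
    specificity = shared / len(pred_pairs) if pred_pairs else 0.0
    return sensitivity, specificity


# Toy example: the predicted alignment misses the two single-base gaps.
true_aln = {"dmel": "ACG-T", "dyak": "AC-GT"}
pred_aln = {"dmel": "ACGT", "dyak": "ACGT"}
sens, spec = pairwise_accuracy(true_aln, pred_aln, "dmel", "dyak")
```

               <p>Projecting each multiple alignment onto one pair of rows in this way allows closely and distantly related pairs to be scored separately rather than pooled.</p>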
               <suppl id="S5">
                  <title>
                     <p>Additional file 5</p>
                  </title>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment sensitivity of pairs of species</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <file name="1471-2105-11-54-S5.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S6">
                  <title>
                     <p>Additional file 6</p>
                  </title>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment specificity of pairs of species</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <file name="1471-2105-11-54-S6.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>Performance of multiple alignment tools compared by alignment agreement of pairs of species</p>
                  </caption>
                  <text>
                     <p><b>Performance of multiple alignment tools compared by alignment agreement of pairs of species</b>. The scores were calculated by using all synthetic data sets (left panel), and by using only data sets where the expected number of insertions is two times more than the number of deletions or vice versa (middle and right panels respectively).</p>
                  </text>
                  <graphic file="1471-2105-11-54-4" hint_layout="double"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Disagreement with estimates based on existing benchmark</p>
               </st>
               <p>We found a substantial disagreement between our performance estimates and those previously reported by Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> using their own benchmark. For instance, the alignment sensitivity for the <it>D. melanogaster </it>- <it>D. pseudoobscura </it>pair comes out to be ~70% in our assessment and ~40% by their estimates, using the Mlagan alignment tool. We observe such gaps (with higher numbers in our benchmark) also for alignment specificity, and for other species pairs and alignment programs as well (Additional files <supplr sid="S7">7</supplr> and <supplr sid="S8">8</supplr>). (We confirmed this by evaluating the alignment programs ourselves on the Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> benchmark, see Methods.) While this discordance could be in part due to the fact that our benchmark employs a spectrum of parameter values to achieve greater and more realistic variability, we believe the major difference here is that even the average substitution rate, a key parameter in both simulation programs, is widely different between their study and ours. The estimate used by Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (~2.4 substitutions per site) is based on silent positions in codons, while our estimate (~0.38 substitutions per site) reflects the average substitution frequency (between these species) seen in non-coding sequences. In light of the results of Figure <figr fid="F2">2D</figr>, where we show that our benchmark accurately mirrors the range of alignment difficulty in real data, the use of non-coding sequences in estimating this key parameter seems better justified. We investigated this issue with additional tests. We collected data sets representing the <it>D. melanogaster - D. pseudoobscura </it>pair from Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, as well as from our benchmark and the real genomes. 
The alignment quality score (HoT SPS) distributions were computed for each type of benchmark, and are shown in Figure <figr fid="F5">5</figr>. We observed a close agreement between our data sets and the real orthologous sequences, while the Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> data sets were harder to align on average, consistent with the greater substitution rate used there. As noted in Methods, the overall substitution frequency observed in non-coding sequences may be viewed as an average of the corresponding frequency in conserved blocks and the much higher frequency outside conserved blocks. This average is determined by two key parameters: <it>&#945;</it>, the fraction of sequence length that falls into conserved blocks, and <it>&#946;</it>, the ratio of the evolutionary rate of conserved blocks to that outside blocks. Given that the divergence estimate used by Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> for these two species is ~2.24 (median) substitutions per site, if we are to treat this value as the neutral rate (i.e., rate outside conserved blocks) in non-coding sequences, what values of <it>&#945; </it>and <it>&#946; </it>would lead to the observed overall substitution frequency of 0.38? We determined that if <it>&#946; </it>= 0.1, as was used by Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (and also by us), <it>&#945; </it>has to be ~0.92, i.e., about 92% of non-coding sequences have to be conserved blocks, which is far higher than most current estimates of this parameter <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B42">42</abbr></abbrgrp>. Similarly, if we are to trust the values of <it>&#945; </it>= 0.2 and <it>&#946; </it>= 0.1, as were used by Pollard et al. 
<abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (and also by us, based on estimates from real data), then the overall divergence, after averaging between conserved blocks and non-blocks, would be ~1.84 substitutions per site, far greater than what is observed (0.38). We therefore concluded that the use of synonymous substitution rates as the neutral rate for non-coding sequence is likely to lead to benchmarks with overly "diverged" sequences that are more difficult to align than real sequences from those species.</p>
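               <p>The arithmetic behind these two figures can be reproduced with a short sketch. It assumes that the overall substitution frequency is a simple length-weighted average of the rate inside conserved blocks and the neutral rate outside them; the function names are ours and are introduced only for illustration.</p>

```python
def overall_rate(neutral_rate, alpha, beta):
    """Length-weighted average substitution rate (assumed averaging model).

    alpha: fraction of sequence length inside conserved blocks
    beta:  ratio of the conserved-block rate to the rate outside blocks
    """
    return neutral_rate * (alpha * beta + (1.0 - alpha))


def alpha_for_target(neutral_rate, beta, target_rate):
    """Solve overall_rate(neutral_rate, alpha, beta) == target_rate for alpha."""
    return (1.0 - target_rate / neutral_rate) / (1.0 - beta)


# Neutral rate 2.24, beta 0.1, observed overall frequency 0.38.
alpha_needed = alpha_for_target(2.24, 0.1, 0.38)
# Neutral rate 2.24 with alpha 0.2 and beta 0.1.
implied_rate = overall_rate(2.24, 0.2, 0.1)
print(round(alpha_needed, 2), round(implied_rate, 2))  # prints "0.92 1.84"
```

               <p>Under this averaging model, neither reading reconciles a neutral rate of ~2.24 with an observed overall frequency of 0.38 for plausible values of the two parameters.</p>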
               <suppl id="S7">
                  <title>
                     <p>Additional file 7</p>
                  </title>
                  <text>
                     <p>Comparison of estimated alignment sensitivity and specificity, using Mlagan or Pecan, as obtained from the Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> benchmark and from our benchmark.</p>
                  </text>
                  <file name="1471-2105-11-54-S7.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S8">
                  <title>
                     <p>Additional file 8</p>
                  </title>
                  <text>
                     <p>Comparison of estimated alignment sensitivity and specificity as obtained from the Pollard et al. benchmark.</p>
                  </text>
                  <file name="1471-2105-11-54-S8.DOC">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Distributions of alignment quality scores of data sets representing <it>D. melanogaster - D. pseudoobscura </it>pair from real genomes, Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, and our benchmark</p>
                  </caption>
                  <text>
                     <p><b>Distributions of alignment quality scores of data sets representing <it>D. melanogaster - D. pseudoobscura </it>pair from real genomes, Pollard et al. </b><abbrgrp><abbr bid="B21">21</abbr></abbrgrp><b>, and our benchmark</b>. The collected data sets from each of the three sources were aligned by Pecan <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> and then their alignment quality scores were calculated by HoT SPS <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> method.</p>
                  </text>
                  <graphic file="1471-2105-11-54-5" hint_layout="single"/>
               </fig>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Assessment of indel annotation schemes</p>
            </st>
            <p>Traditional alignment programs mark the predicted locations of insertions and deletions as "gaps", and do not proceed to annotate these gaps as being insertions or deletions. This latter task has received some attention recently with at least two "indel annotation schemes" being published, based on maximum-parsimony ("sbInfer" <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>) and probabilistic-models ("Indelign" <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>) respectively. We examined the accuracy of these two alignment-related tools on our new benchmark. (Indelign was modified for additional efficiency, see Methods.) We noted that the best alignment agreement score (among all methods, as shown in Figure <figr fid="F4">4</figr>) is ~70% for <it>D. melanogaster - D. pseudoobscura</it>, and decreases to ~60% when a more diverged species (<it>D. willistoni</it>) is added. Reasoning that phylogenies for which computed alignments are largely inaccurate would not be suitable for insertion/deletion annotation in any case, we chose to limit our assessment to the following five <it>Drosophila </it>species: <it>D. melanogaster</it>, <it>D. simulans</it>, <it>D. yakuba</it>, <it>D. ananassae</it>, and <it>D. pseudoobscura </it>(see Additional file <supplr sid="S9">9</supplr> for phylogeny). The "true" alignment (as indicated by the simulation program) was provided to the two indel annotation tools and the insertion/deletion annotations on each of the five terminal branches (leading to the extant species) of the phylogeny were compared to the "true" annotations. 
The following three measures were used for assessment, borrowed from <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>: (i) <it>Indel Count Agreement</it>, which is the agreement of indel counts between true and predicted annotations, (ii) <it>Indel Ratio Agreement</it>, which is the agreement of the ratio of the number of insertions to the total number of indels between the two annotations, and (iii) <it>Indel Annotation Coverage</it>, which is the fraction of indel positions on which the two annotations agree (see Methods). (Both sensitivity and specificity scores were calculated for the Indel Annotation Coverage.)</p>
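            <p>To make the three measures concrete, a minimal sketch is given below. The exact definitions are those given in Methods and in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>; the forms used here are assumptions chosen only to be consistent with the verbal descriptions above and with optimal values of 0, 1 and 1 respectively. All function names and example numbers are hypothetical.</p>

```python
def indel_count_agreement(true_ins, true_del, pred_ins, pred_del):
    """Relative difference in total indel count; 0 is optimal (assumed form)."""
    true_total = true_ins + true_del
    pred_total = pred_ins + pred_del
    return abs(pred_total - true_total) / true_total


def indel_ratio_agreement(true_ins, true_del, pred_ins, pred_del):
    """Predicted over true insertion fraction; 1 is optimal (assumed form)."""
    true_ratio = true_ins / (true_ins + true_del)
    pred_ratio = pred_ins / (pred_ins + pred_del)
    return pred_ratio / true_ratio


def annotation_coverage(true_labels, pred_labels):
    """Sensitivity and specificity over annotated indel positions.

    Each argument maps an alignment position to "ins" or "del".
    """
    agree = sum(1 for pos, lab in true_labels.items()
                if pred_labels.get(pos) == lab)
    return agree / len(true_labels), agree / len(pred_labels)


# A branch where both event types are over-predicted by the same factor:
# the count measure degrades while the ratio measure stays at its optimum.
ica = indel_count_agreement(40, 60, 50, 75)
ira = indel_ratio_agreement(40, 60, 50, 75)

true_labels = {3: "ins", 7: "del", 12: "del", 20: "ins"}
pred_labels = {3: "ins", 7: "del", 12: "ins", 20: "ins", 31: "del"}
iac_sens, iac_spec = annotation_coverage(true_labels, pred_labels)
```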
            <suppl id="S9">
               <title>
                  <p>Additional file 9</p>
               </title>
               <text>
                  <p>Phylogenetic trees and branch lengths in Newick format.</p>
               </text>
               <file name="1471-2105-11-54-S9.TXT">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>As summarized in Table <tblr tid="T1">1</tblr>, Indel Count Agreement scores of the two tools were very similar to each other and close to optimal (0) for all species except <it>D. pseudoobscura</it>, the species with the longest terminal branch in the phylogeny. Indel Ratio Agreement scores of both tools were close to optimal (1) in all five species. While the sensitivity scores of Indel Annotation Coverage of the two tools were above 90% across all five species, the specificity scores were above 90% only for the four species other than <it>D. pseudoobscura</it>. The loss of accuracy on the <it>D. pseudoobscura </it>branch is presumably due to the fact that there is no "outgroup" species to aid disambiguation of insertions and deletions on this branch. We further discuss the implications of these observations in the next section. We also repeated our assessment for sequences with an excess of insertions or of deletions, as above, but no significant difference was observed between these two categories (data not shown).</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Performance of indel annotation tools compared by different measures (ICA, IRA, IAC) on five-species alignments.</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center" cspan="2">
                        <p>
                           <b>ICA<sup>a</sup></b>
                        </p>
                     </c>
                     <c ca="center" cspan="2">
                        <p>
                           <b>IRA<sup>b</sup></b>
                        </p>
                     </c>
                     <c ca="center" cspan="2">
                        <p>
                           <b>IAC<sup>c </sup>(sensitivity)</b>
                        </p>
                     </c>
                     <c ca="center" cspan="2">
                        <p>
                           <b>IAC<sup>c </sup>(specificity)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Species</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Indelign</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>sbInfer</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Indelign</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>sbInfer</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Indelign</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>sbInfer</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Indelign</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>sbInfer</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D. sim</p>
                     </c>
                     <c ca="center">
                        <p>0.06</p>
                     </c>
                     <c ca="center">
                        <p>0.06</p>
                     </c>
                     <c ca="center">
                        <p>1.00</p>
                     </c>
                     <c ca="center">
                        <p>1.01</p>
                     </c>
                     <c ca="center">
                        <p>0.97</p>
                     </c>
                     <c ca="center">
                        <p>0.96</p>
                     </c>
                     <c ca="center">
                        <p>0.99</p>
                     </c>
                     <c ca="center">
                        <p>0.99</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D. mel</p>
                     </c>
                     <c ca="center">
                        <p>0.04</p>
                     </c>
                     <c ca="center">
                        <p>0.04</p>
                     </c>
                     <c ca="center">
                        <p>1.00</p>
                     </c>
                     <c ca="center">
                        <p>1.01</p>
                     </c>
                     <c ca="center">
                        <p>0.99</p>
                     </c>
                     <c ca="center">
                        <p>0.99</p>
                     </c>
                     <c ca="center">
                        <p>0.99</p>
                     </c>
                     <c ca="center">
                        <p>0.98</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D. yak</p>
                     </c>
                     <c ca="center">
                        <p>0.06</p>
                     </c>
                     <c ca="center">
                        <p>0.05</p>
                     </c>
                     <c ca="center">
                        <p>1.00</p>
                     </c>
                     <c ca="center">
                        <p>1.01</p>
                     </c>
                     <c ca="center">
                        <p>0.98</p>
                     </c>
                     <c ca="center">
                        <p>0.97</p>
                     </c>
                     <c ca="center">
                        <p>0.97</p>
                     </c>
                     <c ca="center">
                        <p>0.98</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D. ana</p>
                     </c>
                     <c ca="center">
                        <p>0.08</p>
                     </c>
                     <c ca="center">
                        <p>0.07</p>
                     </c>
                     <c ca="center">
                        <p>1.00</p>
                     </c>
                     <c ca="center">
                        <p>1.00</p>
                     </c>
                     <c ca="center">
                        <p>0.93</p>
                     </c>
                     <c ca="center">
                        <p>0.91</p>
                     </c>
                     <c ca="center">
                        <p>0.93</p>
                     </c>
                     <c ca="center">
                        <p>0.96</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D. pse</p>
                     </c>
                     <c ca="center">
                        <p>0.24</p>
                     </c>
                     <c ca="center">
                        <p>0.27</p>
                     </c>
                     <c ca="center">
                        <p>1.02</p>
                     </c>
                     <c ca="center">
                        <p>1.03</p>
                     </c>
                     <c ca="center">
                        <p>0.94</p>
                     </c>
                     <c ca="center">
                        <p>0.96</p>
                     </c>
                     <c ca="center">
                        <p>0.79</p>
                     </c>
                     <c ca="center">
                        <p>0.79</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a</sup>Indel Count Agreement (optimal value = 0)</p>
                  <p><sup>b</sup>Indel Ratio Agreement (optimal value = 1)</p>
                  <p><sup>c</sup>Indel Annotation Coverage (optimal value = 1)</p>
               </tblfn>
            </tbl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Choosing the most suitable tool for aligning orthologous sequences is essential to studies in comparative genomics and in molecular evolution, making it critical to develop accurate benchmarking methodology. In this study, we propose a novel simulation-based approach to generate realistic data sets mimicking orthologous non-coding sequences from multiple <it>Drosophila </it>species. This new simulation method exploits the spectrum of values of evolutionary statistics (e.g., substitution rate, indel frequency) seen across a genome. We take advantage of an objective "alignment quality" measure to show that the synthetic sequences produced agree with real sequences not only in terms of evolutionary statistics, but are also as easy or hard to align as real data sets. In this sense, our evaluation results are more likely to reflect the actual accuracy values of alignment-related tools on data from <it>Drosophila </it>species. We note that our strategy of sampling parameters (used in evolutionary simulations) from their empirical distribution has parallels with traditional Bayesian inference where one integrates over (i.e., samples from) a prior distribution on parameters, rather than using a single point estimate.</p>
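         <p>The parameter-sampling idea can be illustrated with a short sketch. It assumes that one set of empirical estimates is available per real alignment fragment and that a whole set is drawn at random for each simulated data set; whether parameters are drawn jointly or independently in our pipeline is described in Methods, and the dictionary keys and numerical values below are made up for illustration.</p>

```python
import random


def sample_parameter_sets(empirical_estimates, n_data_sets, seed=0):
    """Draw one empirical parameter set per simulated data set.

    empirical_estimates: list of dicts, one per real alignment fragment
    (hypothetical keys). Drawing whole dicts keeps co-occurring values
    together instead of collapsing them to a single point estimate.
    """
    rng = random.Random(seed)
    return [rng.choice(empirical_estimates) for _ in range(n_data_sets)]


# Made-up values for illustration only; they are not estimates from the study.
empirical = [
    {"subst_rate": 0.31, "ins_freq": 0.010, "del_freq": 0.020},
    {"subst_rate": 0.38, "ins_freq": 0.015, "del_freq": 0.015},
    {"subst_rate": 0.47, "ins_freq": 0.030, "del_freq": 0.012},
]
samples = sample_parameter_sets(empirical, n_data_sets=5)
```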
         <p>A key step in our benchmark construction was the ability to assess the quality of an alignment without access to the corresponding true alignment. This ability has been the result of several recent publications by other authors. Prakash and Tompa <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr></abbrgrp> developed statistical methods to assess if a multiple sequence alignment appears contaminated with one or more unrelated sequences, based on which they identified regions of whole genome alignments as being suspect. The development of the "HoT" method by Landan and Graur <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> then came as a breakthrough to assess the reliability of multiple sequence alignments. Later on, Landan and Graur <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> extended the HoT method to take advantage of co-optimal alternative alignments generated by progressive alignment tools. However, the implementation of this method is too dependent on the specific procedures of a progressive alignment method, making the original HoT score <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> a natural choice for our purpose.</p>
         <p>While our benchmark is shown to be very close to real sequences in terms of the distribution of HoT SPS, we are cautioned by the discrepancy observed between simulated and real sequences in terms of the HoT CS, an alternative alignment quality score from the same authors (Figure <figr fid="F2">2E-H</figr>). This is likely the product of properties of non-coding sequences that are not adequately represented in our simulation process. For example, modeling the functional constraints embedded in non-coding sequences through short conserved blocks (with scaled down phylogenies) is surely an oversimplification of the complexity of genomic architecture. Important progress has been made on this front, in the form of specialized evolutionary simulators that model transcription factor binding site evolution in realistic ways <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>. Each of these simulators makes specific assumptions about <it>cis</it>-regulatory architecture, vis-a-vis the density and evolution of binding sites. However, it is not yet clear which, if any, of these different assumed models of regulatory sequence evolution is most suited to represent the variability in constraint patterns across different regions of the genome. Our simplistic "conserved block" model (borrowed from <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>) seems to be a good approximation that captures the most prominent patterns in orthologous non-coding sequences, in terms of alignment difficulty. We expect that future research on more realistic models of <it>cis</it>-regulatory architecture will lead us to replace the alternating arrangement of conserved blocks and faster evolving segments with a pattern more in line with reality. Future work may also include careful modeling of genomic repeats and repeat generating evolutionary events, since repeat-rich genomes may present additional challenges for the alignment task. 
Our proposed framework of sampling evolutionary parameters before running the simulation process will remain equally important in future benchmarks that implement such sophisticated models.</p>
         <p>Some clarification is in order with respect to our manner of choosing substitution rates for the simulation process, since it marks a significant departure from traditional thinking. The latter, as embodied in the work of Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, prescribes that the "unconstrained" parts of the sequence evolve with a nucleotide substitution rate equal to that inferred from synonymous mutations in the nearby gene (or the average over all genes). This rate (~2.4 substitutions/site for <it>D. melanogaster </it>- <it>D. pseudoobscura</it>) is widely different from the value observed in real non-coding sequence alignments (~0.4 substitutions/site). One could argue that this gap may be offset if we set an appropriate frequency of conserved positions (with very low rates), resulting in an average substitution rate that is close to the empirically observed value. However, this turned out not to be the case for any realistic setting of the frequency of conserved positions (data not shown). We therefore chose to be guided by existing estimates of the frequency and length distribution of conserved blocks, with substitution rates that are some constant <it>&#946; </it>(see Methods) times the "neutral" rate outside of the blocks, and set this neutral rate so that the average rate for the entire sequence matches observed values. Our choice reflects the philosophy that simulated data sets ought to match real data in terms of various evolutionary statistics and net alignment difficulty, and the discordance of the used neutral rate from synonymous substitution rates is ignored for the sake of practicality.</p>
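         <p>The calibration step described in this paragraph amounts to inverting a weighted average. The sketch below assumes a simple length-weighted average of block and non-block rates, and uses the values of the block fraction (0.2) and <it>&#946; </it>(0.1) quoted in Results; the function name is hypothetical and the actual procedure is the one given in Methods.</p>

```python
def calibrated_neutral_rate(observed_rate, alpha, beta):
    """Neutral rate whose length-weighted average equals observed_rate.

    alpha: fraction of sequence length inside conserved blocks
    beta:  ratio of the conserved-block rate to the neutral rate
    """
    return observed_rate / (alpha * beta + (1.0 - alpha))


# Observed overall rate of ~0.4 substitutions/site.
neutral = calibrated_neutral_rate(0.4, alpha=0.2, beta=0.1)
conserved = 0.1 * neutral  # rate inside conserved blocks
```

         <p>With these inputs the calibrated neutral rate stays below 0.5 substitutions per site, far from the synonymous-site estimate of ~2.4.</p>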
         <p>To our knowledge, no previous benchmarking study has evaluated the separate effects of insertions and deletions on the performance of alignment tools. Some studies <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> have used equal frequencies for insertions and deletions and focused on the collective effects of indels. Here, we attempted to elucidate the differing effects of insertions and deletions by separately summarizing results for the two extreme cases where the number of insertions is at least two times the number of deletions and vice versa. The results were surprising, and indicated that most multiple alignment tools find it harder to accurately align data sets with an excess of insertions than those with more deletions (Figures <figr fid="F3">3</figr> and <figr fid="F4">4</figr>). L&#246;ytynoja and Goldman <abbrgrp><abbr bid="B48">48</abbr></abbrgrp> offered valuable insight into a possible source of this asymmetry, pointing out that progressive alignment methods (a category to which all the methods tested here belong) "end up penalizing single insertion events multiple times". We speculate therefore, as they did, that claims about insertion/deletion frequencies along the genome should be preceded by an examination of the alignment method's accuracy in regimes of high insertion frequency.</p>
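         <p>The partition of data sets into the two extreme classes can be sketched as follows. The two-fold threshold follows the text above; applying it to per-data-set counts (rather than, say, to the expected numbers implied by the simulation parameters), the field names and the scores are assumptions made only for illustration.</p>

```python
from collections import defaultdict


def indel_class(n_insertions, n_deletions, fold=2.0):
    """Label a data set by its insertion/deletion balance."""
    if n_insertions > 0 and n_insertions >= fold * n_deletions:
        return "insertion_excess"
    if n_deletions > 0 and n_deletions >= fold * n_insertions:
        return "deletion_excess"
    return "balanced"


def mean_score_by_class(data_sets):
    """Average an accuracy score separately within each class."""
    scores = defaultdict(list)
    for ds in data_sets:
        scores[indel_class(ds["ins"], ds["del"])].append(ds["score"])
    return {label: sum(vals) / len(vals) for label, vals in scores.items()}


# Made-up counts and scores, not results from the benchmark.
data_sets = [
    {"ins": 10, "del": 4, "score": 0.5},
    {"ins": 12, "del": 5, "score": 0.25},
    {"ins": 3, "del": 9, "score": 0.75},
    {"ins": 6, "del": 5, "score": 0.5},
]
summary = mean_score_by_class(data_sets)
```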
         <p>Finally, a note about our findings on insertion/deletion annotation. Indelign <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> is a probabilistic tool that annotates insertions and deletions by maximum likelihood training of an evolutionary model. sbInfer <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> is a greedy algorithm that reconstructs ancestral sequences based on the maximum parsimony principle, and therefore allows us to infer insertion/deletion annotations. To assess these two tools without being confounded by errors of an alignment program, we examined their performance on the true alignments. We found the two programs to have comparable accuracy on our benchmark for the five <it>Drosophila </it>species. While the accuracy was close to optimal on four of the five terminal branches, we observed that both tools over-estimate insertions as well as deletions on the longest branch (leading to <it>D. pseudoobscura</it>), while accurately predicting the ratio of insertions to deletions. We note that the <it>D. pseudoobscura </it>branch in the phylogenetic tree originates from the root of the tree, and we would expect to have better annotation results for this branch if an appropriate outgroup species were used. For studies that intend to use insertion to deletion ratio profiling to identify loci with unusual evolutionary patterns (e.g., <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>) it may be safe to examine all five terminal branches of this tree; however, for the more common requirement of accurately annotating insertion and deletion events, e.g., to study gain and loss patterns of specific classes of transcription factor binding sites <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>, we do not recommend using events on the <it>D. pseudoobscura </it>branch.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>We have presented a novel method for generating benchmarks of non-coding sequence alignments that relies on a spectrum of parameter values reflecting the genome-wide variation of those parameters. We have shown our benchmarks to accurately match the difficulty of aligning real data, by taking advantage of recent developments in the measurement of alignment quality. Benchmark evaluations on <it>Drosophila </it>non-coding sequences suggest a greater accuracy of multiple alignment tools (in this domain) than previously reported, and point to a clear asymmetry in the handling of insertions versus deletions by most alignment tools.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p><it>Drosophila </it>non-coding sequences and alignments</p>
            </st>
            <p>Whole-genome multiple alignments of <it>Drosophila </it>genome sequences (release 5) with 14 insects were downloaded from the UCSC Genome Browser Database <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and all exon positions were masked with the symbol "N". An initial phylogeny was obtained from the AAA Drosophila website <abbrgrp><abbr bid="B40">50</abbr></abbrgrp>. In cases where two sibling species are very close to each other, we chose one of them to include in this analysis, leading to the following set of eight species: <it>D. melanogaster</it>, <it>D. simulans</it>, <it>D. yakuba</it>, <it>D. ananassae</it>, <it>D. pseudoobscura</it>, <it>D. willistoni</it>, <it>D. mojavensis</it>, and <it>D. grimshawi</it>. We extracted fragments of the genome-wide multiple alignments that have sequences for all eight species, are at least 1 Kbp long, and have less than 50% of their length masked (a total of 11867 alignment fragments covering ~17 Mbp of <it>D. melanogaster </it>sequence). The extracted alignments were used to estimate simulation parameter values, as described below. The distribution of HoT alignment quality scores <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> was computed from the sequences in these alignments by realigning them using Pecan <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
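            <p>The three extraction criteria above can be sketched as a simple filter. This is only a minimal illustration, not the code used in this study: the function name, the species labels, and the choice to measure length and masked fraction on the ungapped <it>D. melanogaster </it>row are all assumptions made for the example.</p>
            <p><![CDATA[
```python
def keep_fragment(rows, reference="dmel", n_species=8,
                  min_len=1000, max_masked=0.5):
    """Decide whether one alignment fragment passes the three filters.

    rows maps a species label to its aligned row, a string that may
    contain '-' for gaps and 'N' for masked (exonic) positions.
    Assumption: length and masking are measured on the ungapped
    reference row; the text does not specify this detail.
    """
    # Criterion 1: every one of the selected species must be present.
    if len(rows) != n_species or reference not in rows:
        return False
    ref = rows[reference].replace("-", "")
    # Criterion 2: minimum fragment length of 1 Kbp.
    if len(ref) < min_len:
        return False
    # Criterion 3: less than 50% of the length is masked.
    masked = ref.upper().count("N")
    return masked / len(ref) < max_masked
```
]]></p>
            <p>A fragment whose reference row is 1200 bases long with 700 masked positions would be rejected under this reading, since its masked fraction exceeds one half.</p>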
         </sec>
         <sec>
            <st>
               <p>Non-coding sequence simulation by traditional method</p>
            </st>
            <p>Median branch lengths of a phylogenetic tree for eight <it>Drosophila </it>species were estimated from the multiple alignments described above, using Paml <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. This phylogenetic tree is shown in Additional file <supplr sid="S9">9</supplr>. This tree was provided as input to the Dawg simulation program <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, with the evolutionary model being F81 <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>, the substitution-to-indel ratio set to 10:1 <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and the insertion-to-deletion ratio set to 1:1. We modified the Dawg program to draw indel lengths from a mixture of two geometric distributions, following <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>, with parameters trained from the above multiple alignments and Indelign-based annotation of insertions and deletions. We also modified Dawg to allow it to simulate a sequence that includes so-called "conserved blocks", which are contiguous short segments of varying length, where the evolutionary rate is different from the rest of the sequence. Such conserved blocks were made to cover 20% of the sequence length on average, and their evolutionary rate was 10% of that outside the blocks <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The length distribution of the conserved blocks was obtained from Bergman and Kreitman <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. The length of root sequences in the simulation was 10 Kbp <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and the root sequence was sampled from a random pool of 10 Kbp non-coding segments of the <it>D. melanogaster </it>genome.</p>
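            <p>For illustration, sampling an indel length from a mixture of two geometric distributions might look as follows. This sketch is not the modified Dawg code; the function name and the parameter names (mixture weight and the two success probabilities) are hypothetical, and the numerical values in the usage note are placeholders rather than the trained parameters.</p>
            <p><![CDATA[
```python
import random


def sample_indel_length(weight, p_short, p_long, rng=random):
    """Draw an indel length (at least 1) from a two-component mixture.

    With probability `weight` the length follows Geometric(p_short),
    otherwise Geometric(p_long), where P(L = k) = (1 - p)**(k - 1) * p.
    All three parameters are illustrative assumptions.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    for p in (p_short, p_long):
        if not 0.0 < p <= 1.0:
            raise ValueError("success probabilities must lie in (0, 1]")
    # Pick the mixture component first, then sample from it.
    p = p_short if rng.random() < weight else p_long
    length = 1
    # Keep extending the indel until the first "success".
    while rng.random() >= p:
        length += 1
    return length
```
]]></p>
            <p>For example, calling sample_indel_length(0.7, 0.5, 0.05) repeatedly yields mostly short indels with an occasional long one, and the expected length is weight/p_short + (1 - weight)/p_long.</p>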
            <p>The estimated median branch lengths mentioned above reflect an average of the rates in conserved and non-conserved regions of real non-coding sequences, whereas the phylogeny input to Dawg by definition represents the substitution rate outside of blocks. Therefore, the branch lengths of the phylogeny were adjusted based on the specified coverage of conserved blocks and their evolutionary rates. Let <it>t</it><sub><it>o </it></sub>be the overall evolutionary rate (the estimated branch length), <it>t</it><sub><it>n </it></sub>be the unconstrained evolutionary rate (values provided to the simulation program), <it>&#945; </it>be the fraction of sequence length that falls into conserved blocks, and <it>&#946; </it>be the ratio of the evolutionary rate of conserved blocks to that outside blocks. Then we have:</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-11-54-i1.gif"/>
               </display-formula>
            </p>
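            <p>The published formula is given above as an image. As a hedged illustration only, one reading consistent with the definitions is that the overall rate is the length-weighted average of the rates inside and outside conserved blocks, i.e. t_o = (1 - &#945;) t_n + &#945; &#946; t_n, which gives t_n = t_o / (1 - &#945; + &#945; &#946;). The sketch below implements this assumed relationship; the function name is hypothetical.</p>
            <p><![CDATA[
```python
def unconstrained_branch_length(t_o, alpha, beta):
    """Convert an overall branch length into an unconstrained one.

    Assumes t_o = (1 - alpha) * t_n + alpha * beta * t_n, where alpha is
    the fraction of sequence in conserved blocks and beta is the rate
    inside blocks relative to the rate outside them. This weighted
    average is an assumption of this sketch, not a quoted formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if beta < 0.0:
        raise ValueError("beta must be non-negative")
    denominator = (1.0 - alpha) + alpha * beta
    if denominator == 0.0:
        raise ValueError("rate is undefined when alpha = 1 and beta = 0")
    return t_o / denominator
```
]]></p>
            <p>With the values used in the traditional simulation (&#945; = 0.2, &#946; = 0.1), the denominator is 0.82, so under this assumption each estimated branch length would be scaled up by a factor of about 1.22.</p>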
         </sec>
         <sec>
            <st>
               <p>Distributions of simulation parameter values</p>
            </st>
            <p>The collection of branch lengths estimated from each fragment of multiple alignments described above, using Paml, was used to produce the distribution of branch lengths. As was done in the traditional simulation method, these branch lengths were adjusted by the above formula. The distributions of the ratio of substitutions to indels and the ratio of insertions to deletions were estimated from the above multiple alignments and Indelign-based annotation of insertions and deletions. The length distribution of indels was determined as in the traditional simulation method. To obtain the genome-wide distribution of the fraction of conserved blocks, we collected Phastcons <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> conservation scores from the UCSC Genome Browser Database <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, scanned multiple alignments of <it>Drosophila </it>non-coding sequences and marked consecutive columns as a conserved block if the following two conditions hold: (i) they span at least 10 consecutive non-gapped columns and (ii) the Phastcons scores of all columns are greater than or equal to 0.9 (see Additional file <supplr sid="S10">10</supplr> for the distribution of the fraction of conserved blocks). The relative evolutionary rate of conserved blocks was set to the fixed value of 0.1, as in the traditional simulation. The length of a root sequence was set to 1 Kbp (the average length of non-coding sequences in the extracted fragments of <it>Drosophila </it>alignments) and the root sequence was sampled from the <it>D. melanogaster </it>non-coding genome (see Additional file <supplr sid="S11">11</supplr> for various descriptive statistics of traditional and new benchmarks).</p>
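            <p>The two conditions for marking conserved blocks can be sketched as a single scan over alignment columns. This is a minimal illustration under stated assumptions, not the procedure's actual implementation: the function names are hypothetical, and the inputs are assumed to be one conservation score per column plus a flag saying whether that column contains a gap.</p>
            <p><![CDATA[
```python
def conserved_blocks(scores, gapped, min_len=10, min_score=0.9):
    """Return half-open (start, end) column ranges of conserved blocks.

    scores[i] is the conservation score of column i and gapped[i] is
    True when column i contains at least one gap. A block is a maximal
    run of non-gapped columns that all score at least min_score and
    that spans at least min_len columns.
    """
    if len(scores) != len(gapped):
        raise ValueError("scores and gapped must have the same length")
    blocks = []
    start = None
    for i, (score, has_gap) in enumerate(zip(scores, gapped)):
        qualifies = (not has_gap) and score >= min_score
        if qualifies and start is None:
            start = i  # a new candidate run begins here
        elif not qualifies and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    # Close a run that extends to the last column.
    if start is not None and len(scores) - start >= min_len:
        blocks.append((start, len(scores)))
    return blocks


def conserved_fraction(scores, gapped, min_len=10, min_score=0.9):
    """Fraction of alignment columns that fall inside conserved blocks."""
    if not scores:
        return 0.0
    blocks = conserved_blocks(scores, gapped, min_len, min_score)
    return sum(end - begin for begin, end in blocks) / len(scores)
```
]]></p>
            <p>Applying conserved_fraction to each extracted alignment fragment would give one value per fragment, and the collection of such values is one way to approximate a genome-wide distribution of the kind described above.</p>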
            <suppl id="S10">
               <title>
                  <p>Additional file 10</p>
               </title>
               <text>
                  <p>Genome-wide distribution of the fraction of conserved blocks estimated by using Phastcons conservation scores and multiple alignments of <it>Drosophila </it>non-coding sequences obtained from UCSC Genome Browser Database.</p>
               </text>
               <file name="1471-2105-11-54-S10.DOC">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S11">
               <title>
                  <p>Additional file 11</p>
               </title>
               <text>
                  <p>Descriptive statistics of traditional and new benchmarks.</p>
               </text>
               <file name="1471-2105-11-54-S11.DOC">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Evaluation of alignment programs on Pollard et al. benchmark</p>
            </st>
            <p>The benchmark generated by Pollard et al. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> parameterizes each data set by a single value of the divergence distance (in substitutions per site). They provided an estimate of this parameter for the <it>D. melanogaster </it>and <it>D. pseudoobscura </it>pair (mean 2.4 and median 2.24) to link their simulations to this pair of species. They later updated this value in a new phylogeny <url>http://www.danielpollard.com/trees.html</url>. We used their divergence estimates from the latter phylogeny and the benchmark they prescribed for this level of divergence, and evaluated the alignment programs ourselves on this benchmark.</p>
         </sec>
         <sec>
            <st>
               <p>Evaluation measures for indel annotation schemes</p>
            </st>
            <p>Indel Count Agreement is defined by the following formula, where <it>N</it><sub><it>It </it></sub>and <it>N</it><sub><it>Dt </it></sub>are true numbers of insertions and deletions, and <it>N</it><sub><it>Ie </it></sub>and <it>N</it><sub><it>De </it></sub>are predicted numbers of insertions and deletions.</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-11-54-i2.gif"/>
               </display-formula>
            </p>
            <p>Indel Ratio Agreement is defined by the following formula, with notation as above:</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-11-54-i3.gif"/>
               </display-formula>
            </p>
            <p>Indel Annotation Coverage is the fraction of indel positions on which the two annotations agree.</p>
         </sec>
         <sec>
            <st>
               <p>Modification of Indelign</p>
            </st>
            <p>The time complexity of the Indelign program is exponential in the number of "conditionally dependent blocks" and this prohibits fast annotation of certain data sets with relatively large numbers of species <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. To reduce the time complexity, when there are more conditionally dependent blocks than a predefined threshold, the alignment is heuristically partitioned by a block that has the smallest effect on the final indel annotation. This process is repeated until all dependent blocks with size greater than the threshold are resolved.</p>
         </sec>
         <sec>
            <st>
               <p>Supplementary website</p>
            </st>
            <p>Source code for the modified Dawg and Indelign programs, phylogenetic trees, simulated sequences and their true alignments, and the alignments computed by six alignment tools are available from <url>http://europa.cs.uiuc.edu/RealisticAlignmentBenchmarks/</url>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>Indel: insertion and deletion; SPS: sum-of-pair score; CS: column score</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>JK and SS conceived of the study, participated in its design, performed the analysis, and drafted the manuscript. JK developed the software and performed experiments. Both authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was supported in part by the NSF (CAREER Grant DBI 0746303 to SS) and the NIH (Grant 1R01GM085233-01 to SS). We are thankful to Mathieu Blanchette for sharing the sbInfer software for indel annotation.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>The UCSC Genome Browser Database: 2008 update</p>
            </title>
            <aug>
               <au>
                  <snm>Karolchik</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kuhn</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Baertsch</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Barber</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Clawson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Diekhans</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Giardine</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Harte</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Hinrichs</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Hsu</snm>
                  <fnm>F</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2008</pubdate>
            <volume>36</volume>
            <fpage>D773</fpage>
            <lpage>779</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/gkm966</pubid>
                  <pubid idtype="pmcid">2238835</pubid>
                  <pubid idtype="pmpid" link="fulltext">18086701</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Evolution of genes and genomes on the Drosophila phylogeny</p>
            </title>
            <aug>
               <au>
                  <snm>Clark</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Bergman</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Oliver</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Markow</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Kaufman</snm>
                  <fnm>TC</fnm>
               </au>
               <au>
                  <snm>Kellis</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gelbart</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VN</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2007</pubdate>
            <volume>450</volume>
            <fpage>203</fpage>
            <lpage>218</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature06341</pubid>
                  <pubid idtype="pmpid" link="fulltext">17994087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>An overview of multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Simossis</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Kleinjung</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Curr Protoc Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>Chapter 3</volume>
            <issue>Unit 3</issue>
            <fpage>7</fpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">18428699</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Edgar</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Batzoglou</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>2006</pubdate>
            <volume>16</volume>
            <fpage>368</fpage>
            <lpage>373</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.sbi.2006.04.004</pubid>
                  <pubid idtype="pmpid" link="fulltext">16679011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Recent evolutions of multiple sequence alignment algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Notredame</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2007</pubdate>
            <volume>3</volume>
            <fpage>e123</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1371/journal.pcbi.0030123</pubid>
                  <pubid idtype="pmcid">1963500</pubid>
                  <pubid idtype="pmpid" link="fulltext">17784778</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Pirovano</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Methods Mol Biol</source>
            <pubdate>2008</pubdate>
            <volume>452</volume>
            <fpage>143</fpage>
            <lpage>161</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">18566763</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Reconstructing large regions of an ancestral mammalian genome in silico</p>
            </title>
            <aug>
               <au>
                  <snm>Blanchette</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>2412</fpage>
            <lpage>2423</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.2800104</pubid>
                  <pubid idtype="pmcid">534665</pubid>
                  <pubid idtype="pmpid" link="fulltext">15574820</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>On the inference of parsimonious indel evolutionary scenarios</p>
            </title>
            <aug>
               <au>
                  <snm>Chindelevitch</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Blais</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Blanchette</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Bioinform Comput Biol</source>
            <pubdate>2006</pubdate>
            <volume>4</volume>
            <fpage>721</fpage>
            <lpage>744</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1142/S0219720006002168</pubid>
                  <pubid idtype="pmpid" link="fulltext">16960972</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Phylogenetic profiling of insertions and deletions in vertebrate genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Snir</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Research in Computational Molecular Biology, Proceedings</source>
            <pubdate>2006</pubdate>
            <volume>3909</volume>
            <fpage>265</fpage>
            <lpage>280</lpage>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Transducers: an emerging probabilistic framework for modeling indels on trees</p>
            </title>
            <aug>
               <au>
                  <snm>Bradley</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>3258</fpage>
            <lpage>3262</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm402</pubid>
                  <pubid idtype="pmpid" link="fulltext">17804440</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Exact and heuristic algorithms for the Indel Maximum Likelihood Problem</p>
            </title>
            <aug>
               <au>
                  <snm>Diallo</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>Makarenkov</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Blanchette</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2007</pubdate>
            <volume>14</volume>
            <fpage>446</fpage>
            <lpage>461</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/cmb.2007.A006</pubid>
                  <pubid idtype="pmpid" link="fulltext">17572023</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>289</fpage>
            <lpage>297</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl578</pubid>
                  <pubid idtype="pmpid" link="fulltext">17110370</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila</p>
            </title>
            <aug>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Siggia</snm>
                  <fnm>ED</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2005</pubdate>
            <volume>22</volume>
            <fpage>874</fpage>
            <lpage>885</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msi090</pubid>
                  <pubid idtype="pmpid" link="fulltext">15659554</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Tian</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Araki</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kreitman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Nagylaki</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hudson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bergelson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>JQ</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2008</pubdate>
            <volume>455</volume>
            <fpage>105</fpage>
            <lpage>108</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature07175</pubid>
                  <pubid idtype="pmpid" link="fulltext">18641631</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>HOMSTRAD: a database of protein structure alignments for homologous families</p>
            </title>
            <aug>
               <au>
                  <snm>Mizuguchi</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Deane</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Blundell</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Overington</snm>
                  <fnm>JP</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1998</pubdate>
            <volume>7</volume>
            <fpage>2469</fpage>
            <lpage>2471</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pro.5560071126</pubid>
                  <pubid idtype="pmcid">2143859</pubid>
                  <pubid idtype="pmpid" link="fulltext">9828015</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Koehl</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ripp</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Poch</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>61</volume>
            <fpage>127</fpage>
            <lpage>136</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20527</pubid>
                  <pubid idtype="pmpid" link="fulltext">16044462</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>SABmark--a benchmark for sequence alignment that covers the entire known fold space</p>
            </title>
            <aug>
               <au>
                  <snm>Van Walle</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lasters</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Wyns</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>1267</fpage>
            <lpage>1268</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth493</pubid>
                  <pubid idtype="pmpid" link="fulltext">15333456</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>DNA assembly with gaps (Dawg): simulating sequence evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Cartwright</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>Suppl 3</issue>
            <fpage>iii31</fpage>
            <lpage>38</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti1200</pubid>
                  <pubid idtype="pmpid" link="fulltext">16306390</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Rose: generating sequence families</p>
            </title>
            <aug>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Evers</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1998</pubdate>
            <volume>14</volume>
            <fpage>157</fpage>
            <lpage>163</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/14.2.157</pubid>
                  <pubid idtype="pmpid" link="fulltext">9545448</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>INDELible: a flexible simulator of biological sequence evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Fletcher</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2009</pubdate>
            <volume>26</volume>
            <fpage>1879</fpage>
            <lpage>1888</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msp098</pubid>
                  <pubid idtype="pmcid">2712615</pubid>
                  <pubid idtype="pmpid" link="fulltext">19423664</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Benchmarking tools for the alignment of functional noncoding DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Pollard</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Bergman</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>6</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-5-6</pubid>
                  <pubid idtype="pmcid">344529</pubid>
                  <pubid idtype="pmpid" link="fulltext">14736341</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Multiple sequence alignment accuracy and evolutionary distance estimation</p>
            </title>
            <aug>
               <au>
                  <snm>Rosenberg</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>278</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-6-278</pubid>
                  <pubid idtype="pmcid">1318491</pubid>
                  <pubid idtype="pmpid" link="fulltext">16305750</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Multiple sequence alignment accuracy and phylogenetic inference</p>
            </title>
            <aug>
               <au>
                  <snm>Ogden</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Rosenberg</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>Syst Biol</source>
            <pubdate>2006</pubdate>
            <volume>55</volume>
            <fpage>314</fpage>
            <lpage>328</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1080/10635150500541730</pubid>
                  <pubid idtype="pmpid" link="fulltext">16611602</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Pollard</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Moses</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VN</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>376</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-7-376</pubid>
                  <pubid idtype="pmcid">1613255</pubid>
                  <pubid idtype="pmpid" link="fulltext">16904011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Uncertainty in homology inferences: assessing and improving genomic sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Lunter</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rocco</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mimouni</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Heger</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Caldeira</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2008</pubdate>
            <volume>18</volume>
            <fpage>298</fpage>
            <lpage>309</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.6725608</pubid>
                  <pubid idtype="pmcid">2203628</pubid>
                  <pubid idtype="pmpid" link="fulltext">18073381</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Noisy: identification of problematic columns in multiple sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Dress</snm>
                  <fnm>AW</fnm>
               </au>
               <au>
                  <snm>Flamm</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fritzsch</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Grunewald</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kruspe</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
            </aug>
            <source>Algorithms Mol Biol</source>
            <pubdate>2008</pubdate>
            <volume>3</volume>
            <fpage>7</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1748-7188-3-7</pubid>
                  <pubid idtype="pmcid">2464588</pubid>
                  <pubid idtype="pmpid" link="fulltext">18577231</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Heads or tails: a simple reliability check for multiple sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Landan</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Graur</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2007</pubdate>
            <volume>24</volume>
            <fpage>1380</fpage>
            <lpage>1383</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msm060</pubid>
                  <pubid idtype="pmpid" link="fulltext">17387100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Clustal W and Clustal X version 2.0</p>
            </title>
            <aug>
               <au>
                  <snm>Larkin</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Blackshields</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>NP</fnm>
               </au>
               <au>
                  <snm>Chenna</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>McGettigan</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>McWilliam</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Valentin</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wallace</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Wilm</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <etal/>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>2947</fpage>
            <lpage>2948</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm404</pubid>
                  <pubid idtype="pmpid" link="fulltext">17846036</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Subramanian</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Algorithms Mol Biol</source>
            <pubdate>2008</pubdate>
            <volume>3</volume>
            <fpage>6</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1748-7188-3-6</pubid>
                  <pubid idtype="pmcid">2430965</pubid>
                  <pubid idtype="pmpid" link="fulltext">18505568</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Recent developments in the MAFFT multiple sequence alignment program</p>
            </title>
            <aug>
               <au>
                  <snm>Katoh</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Toh</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>286</fpage>
            <lpage>298</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bib/bbn013</pubid>
                  <pubid idtype="pmpid" link="fulltext">18372315</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>MAVID: constrained ancestral alignment of multiple sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Bray</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>693</fpage>
            <lpage>699</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.1960404</pubid>
                  <pubid idtype="pmcid">383315</pubid>
                  <pubid idtype="pmpid" link="fulltext">15060012</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Brudno</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Do</snm>
                  <fnm>CB</fnm>
               </au>
               <au>
                  <snm>Cooper</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Davydov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Sidow</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Batzoglou</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>721</fpage>
            <lpage>731</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.926603</pubid>
                  <pubid idtype="pmcid">430158</pubid>
                  <pubid idtype="pmpid" link="fulltext">12654723</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Paten</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Herrero</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Beal</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2009</pubdate>
            <volume>25</volume>
            <fpage>295</fpage>
            <lpage>301</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btn630</pubid>
                  <pubid idtype="pmpid" link="fulltext">19056777</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Ultraconserved elements in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Pheasant</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Makunin</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Stephen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Mattick</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2004</pubdate>
            <volume>304</volume>
            <fpage>1321</fpage>
            <lpage>1325</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1098119</pubid>
                  <pubid idtype="pmpid" link="fulltext">15131266</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Siepel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Hinrichs</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Hou</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rosenbloom</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Clawson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Spieth</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hillier</snm>
                  <fnm>LW</fnm>
               </au>
               <au>
                  <snm>Richards</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Res</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>1034</fpage>
            <lpage>1050</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.3715005</pubid>
                  <pubid idtype="pmcid">1182216</pubid>
                  <pubid idtype="pmpid" link="fulltext">16024819</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Genome-wide identification of human functional DNA using a neutral indel model</p>
            </title>
            <aug>
               <au>
                  <snm>Lunter</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Hein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2006</pubdate>
            <volume>2</volume>
            <fpage>e5</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1371/journal.pcbi.0020005</pubid>
                  <pubid idtype="pmcid">1326222</pubid>
                  <pubid idtype="pmpid" link="fulltext">16410828</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Bergman</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Kreitman</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>1335</fpage>
            <lpage>1345</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.178701</pubid>
                  <pubid idtype="pmpid" link="fulltext">11483574</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>How well does the HoT score reflect sequence alignment accuracy?</p>
            </title>
            <aug>
               <au>
                  <snm>Hall</snm>
                  <fnm>BG</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2008</pubdate>
            <volume>25</volume>
            <fpage>1576</fpage>
            <lpage>1580</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msn103</pubid>
                  <pubid idtype="pmpid" link="fulltext">18458029</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Maximum Likelihood from Incomplete Data via the EM Algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Dempster</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>Journal of the Royal Statistical Society Series B (Methodological)</source>
            <pubdate>1977</pubdate>
            <volume>39</volume>
            <fpage>1</fpage>
            <lpage>38</lpage>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Aligning multiple genomic sequences with the threaded blockset aligner</p>
            </title>
            <aug>
               <au>
                  <snm>Blanchette</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Riemer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Elnitski</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Smit</snm>
                  <fnm>AF</fnm>
               </au>
               <au>
                  <snm>Roskin</snm>
                  <fnm>KM</fnm>
               </au>
               <au>
                  <snm>Baertsch</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Rosenbloom</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Clawson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>ED</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>708</fpage>
            <lpage>715</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.1933104</pubid>
                  <pubid idtype="pmcid">383317</pubid>
                  <pubid idtype="pmpid" link="fulltext">15060014</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Fast statistical alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Bradley</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Smoot</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Juvekar</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Do</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dewey</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2009</pubdate>
            <volume>5</volume>
            <fpage>e1000392</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1371/journal.pcbi.1000392</pubid>
                  <pubid idtype="pmcid">2684580</pubid>
                  <pubid idtype="pmpid" link="fulltext">19478997</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>DrosOCB: a high resolution map of conserved non coding sequences in Drosophila</p>
            </title>
            <url>http://arxiv.org/abs/0710.1570</url>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Statistics of local multiple alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Prakash</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>Suppl 1</issue>
            <fpage>i344</fpage>
            <lpage>i350</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti1042</pubid>
                  <pubid idtype="pmpid" link="fulltext">15961477</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Measuring the accuracy of genome-size multiple alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Prakash</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>R124</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/gb-2007-8-6-r124</pubid>
                  <pubid idtype="pmcid">2394773</pubid>
                  <pubid idtype="pmpid" link="fulltext">17594489</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Local reliability measures from sets of co-optimal multiple sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Landan</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Graur</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2008</pubdate>
            <fpage>15</fpage>
            <lpage>24</lpage>
            <xrefbib>
               <pubid idtype="pmpid">18229673</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>Phylogenetic simulation of promoter evolution: estimation and modeling of binding site turnover events and assessment of their impact on alignment tools</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Nevins</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>R225</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/gb-2007-8-10-r225</pubid>
                  <pubid idtype="pmcid">2246299</pubid>
                  <pubid idtype="pmpid" link="fulltext">17956628</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Alignment and prediction of cis-regulatory modules based on a probabilistic model of evolution</p>
            </title>
            <aug>
               <au>
                  <snm>He</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Ling</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2009</pubdate>
            <volume>5</volume>
            <fpage>e1000299</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1371/journal.pcbi.1000299</pubid>
                  <pubid idtype="pmcid">2657044</pubid>
                  <pubid idtype="pmpid" link="fulltext">19293946</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>An algorithm for progressive multiple alignment of sequences with insertions</p>
            </title>
            <aug>
               <au>
                  <snm>Loytynoja</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <fpage>10557</fpage>
            <lpage>10562</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.0409137102</pubid>
                  <pubid idtype="pmcid">1180752</pubid>
                  <pubid idtype="pmpid" link="fulltext">16000407</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Evolution of regulatory sequences in 12 Drosophila species</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Sinha</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PLoS Genet</source>
            <pubdate>2009</pubdate>
            <volume>5</volume>
            <fpage>e1000330</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1371/journal.pgen.1000330</pubid>
                  <pubid idtype="pmcid">2607023</pubid>
                  <pubid idtype="pmpid" link="fulltext">19132088</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>AAA Drosophila website</p>
            </title>
            <url>http://rana.lbl.gov/drosophila/index.html</url>
         </bibl>
         <bibl id="B51">
            <title>
               <p>PAML 4: phylogenetic analysis by maximum likelihood</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2007</pubdate>
            <volume>24</volume>
            <fpage>1586</fpage>
            <lpage>1591</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msm088</pubid>
                  <pubid idtype="pmpid" link="fulltext">17483113</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B52">
            <title>
               <p>Evolutionary trees from DNA sequences: a maximum likelihood approach</p>
            </title>
            <aug>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>1981</pubdate>
            <volume>17</volume>
            <fpage>368</fpage>
            <lpage>376</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF01734359</pubid>
                  <pubid idtype="pmpid">7288891</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>

