<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-4-23</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Importing statistical measures into Artemis enhances gene identification in the <it>Leishmania </it>genome project</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Aggarwal</snm>
               <fnm>Gautam</fnm>
               <insr iid="I1"/>
               <email>gaggarwal@sbri.org</email>
            </au>
            <au id="A2">
               <snm>Worthey</snm>
               <fnm>EA</fnm>
               <insr iid="I1"/>
               <email>lworthey@sbri.org</email>
            </au>
            <au id="A3">
               <snm>McDonagh</snm>
               <mi>D</mi>
               <fnm>Paul</fnm>
               <insr iid="I2"/>
               <email>pmcdonagh@rii.com</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Myler</snm>
               <mi>J</mi>
               <fnm>Peter</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>peter.myler@sbri.org</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109, USA</p>
            </ins>
            <ins id="I2">
               <p>Immunex Corporation, 51 University Street, Seattle, WA 98101, USA</p>
            </ins>
            <ins id="I3">
               <p>Departments of Pathobiology and Medical Education and Biomedical Informatics, University of Washington, Seattle, WA 98195, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>1</issue>
         <fpage>23</fpage>
         <url>http://www.biomedcentral.com/1471-2105/4/23</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">12793912</pubid>
               <pubid idtype="doi">10.1186/1471-2105-4-23</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>19</day>
               <month>2</month>
               <year>2003</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>7</day>
               <month>6</month>
               <year>2003</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>7</day>
               <month>6</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>Aggarwal et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Seattle Biomedical Research Institute (SBRI), as part of the <it>Leishmania </it>Genome Network (LGN), is sequencing chromosomes of the trypanosomatid protozoan species <it>Leishmania major</it>. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with A<smcaps>RTEMIS</smcaps>, an annotation platform with rich and user-friendly interfaces.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (G<smcaps>LIMMER</smcaps>, T<smcaps>ESTCODE</smcaps> and G<smcaps>ENESCAN</smcaps>) into the A<smcaps>RTEMIS</smcaps> sequence viewer and annotation tool. Comparison of these methods, along with the C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> algorithm built into A<smcaps>RTEMIS</smcaps>, shows the importance of combining methods to more accurately annotate the <it>L. major </it>genomic sequence.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the <it>Leishmania </it>genome project, where there is a large proportion of novel genes requiring manual annotation.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>At Seattle Biomedical Research Institute (SBRI), we are involved, as part of the <it>Leishmania </it>Genome Network (LGN), in the sequencing and annotation of the trypanosomatid protozoan species <it>L. major </it>Friedlin (LmjF). Following DNA sequence determination, putative protein-coding regions within the sequence are predicted and functionally classified. Although trypanosomatids are eukaryotes, their gene structure is more similar to that of prokaryotes; they have essentially no introns and small intergenic regions. Two small LmjF chromosomes (chr1 and chr3) have been completely sequenced and annotated. The 79 protein-coding genes predicted from chr1 are organized in two large divergent polycistronic gene clusters of 29 and 50 genes, on the "bottom" and "top" DNA strands, respectively <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>; while chr3 contains two convergent polycistronic clusters of 65 and 29 genes, with a single divergent gene at one telomere and a single tRNA between the two large clusters <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>Presently, a large number of methods exist for <it>in silico </it>prediction of coding regions <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. These computational methods use a range of underlying statistical properties of the coding regions and can be generally classified as consensus (signal sensors) and non-consensus (content sensors) <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. The non-consensus methods can be further classified as trained, which require unbiased sets of coding regions, and untrained, which use statistical properties to discriminate between coding and non-coding regions. Although non-consensus methods have been very successful in identifying genes in most sequencing projects, currently none has 100% specificity and sensitivity. In the absence of such a method, the use of a combination of methods is the next best option <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Since LmjF genes do not contain introns, and the signal sequences for trans-splicing and polyadenylation are poorly defined, consensus methods have little utility for <it>Leishmania </it>gene prediction. In addition, ~70% of the genes have no significant homology to existing genes in sequence databases, so extrinsic content sensing methods are of limited use, leaving only intrinsic content sensing methods for possible use in gene prediction. Given that the number of experimentally confirmed gene predictions in <it>Leishmania </it>is currently small, and many methods use similar statistical approaches <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, the choice of two trained methods (G<smcaps>LIMMER</smcaps><abbrgrp><abbr bid="B14">14</abbr></abbrgrp> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps><abbrgrp><abbr bid="B15">15</abbr></abbrgrp>) and two untrained methods (T<smcaps>ESTCODE</smcaps><abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and G<smcaps>ENESCAN</smcaps><abbrgrp><abbr bid="B17">17</abbr></abbrgrp>), which rely on unrelated statistical measures, should provide substantial power for gene prediction in LmjF.</p>
         <p>The freely available JAVA-based software package A<smcaps>RTEMIS</smcaps><abbrgrp><abbr bid="B18">18</abbr></abbrgrp> was designed specifically as an annotation platform and has a user-friendly graphical interface. It simplifies time-consuming processes such as inter-file format conversion and BLAST analysis <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, and provides a convenient environment for viewing the gene structure and organization of large DNA segments. Here we describe a method for importing data from G<smcaps>LIMMER</smcaps>, T<smcaps>ESTCODE</smcaps>, and G<smcaps>ENESCAN</smcaps> into A<smcaps>RTEMIS</smcaps>, to enhance gene prediction and annotation.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>We have developed a partially automated process for prediction and annotation of LmjF protein-coding genes in which the gene predictions from G<smcaps>LIMMER</smcaps> and the statistical outputs from T<smcaps>ESTCODE</smcaps> and G<smcaps>ENESCAN</smcaps> are imported into A<smcaps>RTEMIS</smcaps> (see <supplr sid="S1">additional file 1</supplr>), where they can be viewed graphically alongside the C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> statistics already built into A<smcaps>RTEMIS</smcaps>. Figure <figr fid="F1">1</figr> shows a panel containing results from each of the four gene-prediction methods for a typical LmjF sequence. The predictions from G<smcaps>LIMMER</smcaps> are imported as CDS features and displayed as colored rectangles in the panel showing ORFs (the vertical bars are the stop codons) in all six reading frames. The window scans from T<smcaps>ESTCODE</smcaps>, G<smcaps>ENESCAN</smcaps> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> are displayed graphically in panels above the G<smcaps>LIMMER</smcaps> predictions. The thresholds used to indicate likely protein-coding ORFs for G<smcaps>ENESCAN</smcaps> and T<smcaps>ESTCODE</smcaps> are 4.0 and 9.7, respectively. This allows visual comparison of the four gene prediction methods and manual alteration of the G<smcaps>LIMMER</smcaps>-predicted CDS features if necessary. The reliance on multiple gene prediction methods increases confidence in the predictions.</p>
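         <p>The window scans described above can be sketched as follows. This is a minimal illustration, not the authors' C++ code: all function names are hypothetical, and the scoring function shown (GC fraction) is only a placeholder for a real statistic such as T<smcaps>ESTCODE</smcaps> or G<smcaps>ENESCAN</smcaps>. The sketch computes one value per 100 nt window sliding by one nt, writes the values one per line, and lists the windows that exceed a chosen threshold. How the values should be aligned to base positions for display is not specified here and would need to be checked against the A<smcaps>RTEMIS</smcaps> documentation.</p>
         <p><![CDATA[
```python
from typing import Callable, List


def sliding_window_scores(seq: str, score_fn: Callable[[str], float],
                          window: int = 100, step: int = 1) -> List[float]:
    """Return one score per window of `window` nt, advancing `step` nt each time."""
    return [score_fn(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def gc_fraction(fragment: str) -> float:
    """Placeholder statistic (NOT TESTCODE or GENESCAN): fraction of G or C."""
    fragment = fragment.upper()
    return (fragment.count("G") + fragment.count("C")) / len(fragment)


def write_plot_file(path: str, scores: List[float]) -> None:
    """Write one value per line (assumed layout of the per-window text file)."""
    with open(path, "w") as handle:
        for value in scores:
            handle.write(f"{value:.3f}\n")


def windows_above(scores: List[float], threshold: float) -> List[int]:
    """Return the 0-based indices of windows whose score exceeds `threshold`."""
    return [i for i, score in enumerate(scores) if score > threshold]


if __name__ == "__main__":
    demo = "AT" * 50 + "GC" * 50          # 200 nt toy sequence
    scores = sliding_window_scores(demo, gc_fraction)
    print(len(scores), windows_above(scores, 0.9)[:3])
```
]]></p>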
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>This panel of A<smcaps>RTEMIS</smcaps> shows the comparison of four different methods used at SBRI for sequence annotation: a) C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> b) G<smcaps>ENESCAN</smcaps> c) T<smcaps>ESTCODE</smcaps> and d) G<smcaps>LIMMER</smcaps></p>
            </caption>
            <text>
               <p>This panel of A<smcaps>RTEMIS</smcaps> shows the comparison of four different methods used at SBRI for sequence annotation: a) C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> b) G<smcaps>ENESCAN</smcaps> c) T<smcaps>ESTCODE</smcaps> and d) G<smcaps>LIMMER</smcaps>. The C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> panel shows results for the three reading frames (shown by different colors) of the top strand; those from the bottom strand are not shown. The panel immediately following the T<smcaps>ESTCODE</smcaps> panel displays the position of all stop codons (with vertical lines) in all six reading frames. The vertical scales in the top three panels refer to the value of the statistic calculated by the corresponding algorithm. The predictions of G<smcaps>LIMMER</smcaps> appear as blue boxes in this panel. The horizontal scale in the center of this panel indicates the nucleotide coordinates of the sequence for this and the three upper panels (and is adjustable on the right hand scroll bar). The bottom panel displays the translated amino acids in six different reading frames. The horizontal scale refers to the nucleotide coordinates for the sequence within this panel.</p>
            </text>
            <graphic file="1471-2105-4-23-1"/>
         </fig>
         <suppl id="S1">
            <title>
               <p>Additional File 1</p>
            </title>
            <text>
               <p>This is a zip file that contains one perl script (glimmer_atremis.pl), two executable files (testcode_unix and testcode_win.exe) and a readme.txt file describing usage and other information relevant to the programs.</p>
            </text>
            <file name="1471-2105-4-23-S1.zip">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>In Table <tblr tid="T1">1</tblr>, we show a comparison of the results of automated gene prediction using the four different programs with the manual annotations for three completely sequenced chromosomes (chr1, chr3 and chr4) from LmjF. The False Positive rate for each individual method was quite high, with G<smcaps>LIMMER</smcaps> being significantly worse than the others. Most of the False Positives were due to prediction of genes on the wrong coding strand. All methods, with the exception of T<smcaps>ESTCODE</smcaps>, showed a low number of False Negatives. The poor performance of T<smcaps>ESTCODE</smcaps> was largely due to use of a high cut-off value (9.7) for the average Fickett statistic of the whole ORF, rather than smaller windows. Thus, individually, each of the automated programs had a high Error Discovery Rate (the number of incorrect predictions, false positives plus false negatives, divided by the number of annotated CDS; Table <tblr tid="T1">1</tblr>), ranging from 0.77 for G<smcaps>ENESCAN</smcaps> to 1.96 for G<smcaps>LIMMER</smcaps>.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Automated gene prediction<sup>a </sup>in <it>Leishmania major</it></p>
            </caption>
            <tblbdy cols="10">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>Annotated CDS<sup>b</sup></p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>G<smcaps>LIMMER</smcaps></p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>G<smcaps>ENESCAN</smcaps></p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>T<smcaps>ESTCODE</smcaps></p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps></p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>FP<sup>c</sup></p>
                  </c>
                  <c ca="center">
                     <p>FN<sup>d</sup></p>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
               </r>
               <r>
                  <c cspan="10">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr1</p>
                  </c>
                  <c ca="center">
                     <p>79</p>
                  </c>
                  <c ca="center">
                     <p>131</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>61</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>68</p>
                  </c>
                  <c ca="center">
                     <p>33</p>
                  </c>
                  <c ca="center">
                     <p>75</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr3</p>
                  </c>
                  <c ca="center">
                     <p>94(1)</p>
                  </c>
                  <c ca="center">
                     <p>116</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>57</p>
                  </c>
                  <c ca="center">
                     <p>5</p>
                  </c>
                  <c ca="center">
                     <p>119</p>
                  </c>
                  <c ca="center">
                     <p>51</p>
                  </c>
                  <c ca="center">
                     <p>108</p>
                  </c>
                  <c ca="center">
                     <p>8</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr4</p>
                  </c>
                  <c ca="center">
                     <p>123</p>
                  </c>
                  <c ca="center">
                     <p>328</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>97</p>
                  </c>
                  <c ca="center">
                     <p>6</p>
                  </c>
                  <c ca="center">
                     <p>130</p>
                  </c>
                  <c ca="center">
                     <p>56</p>
                  </c>
                  <c ca="center">
                     <p>139</p>
                  </c>
                  <c ca="center">
                     <p>9</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>
                        <b>Total</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>295</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>575</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>2</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>215</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>12</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>317</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>180</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>322</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>21</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="2" ca="center">
                     <p>
                        <b>EDR</b>
                        <sup>e</sup>
                     </p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>1.96</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>0.77</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>1.68</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>1.16</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p><sup>a </sup>All possible ORFs (<it>i.e. </it>starting with an ATG and ending with TAA, TAG or TGA) of >300 bp in the three chromosome sequences were scored by each of the programs. G<smcaps>LIMMER</smcaps> predictions (for ORFs > 100 amino acids, with default settings) were taken straight from the trained software. For G<smcaps>ENESCAN</smcaps> and T<smcaps>ESTCODE</smcaps>, ORFs were considered to be positive if the average score for the ORF exceeded thresholds of 4.0 and 9.7, respectively. For overlapping ORFs on the same strand, the one with the highest score was chosen. In the case of C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps>, ORFs were predicted as coding when the average in-frame score was higher than the two out-of-frame scores. <sup>b </sup>The number of CDS of more than 300 bp in GenBank Accession numbers AE001274 (chr1), AC125735 (chr3), AL389894 and AL139794 (chr4). The number of annotated CDS of &lt;300 bp is shown in parentheses. <sup>c </sup>False positives. <sup>d </sup>False negatives. <sup>e </sup>Error Discovery Rate (EDR) = (FN+FP)/(CDS).</p>
            </tblfn>
         </tbl>
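         <p>As a worked check of the EDR definition in the footnote to Table 1, the short sketch below recomputes the EDR row from the totals row (295 annotated CDS). The function and variable names are illustrative only. Because the number of false positives is not bounded by the number of annotated CDS, the ratio can exceed 1.</p>
         <p><![CDATA[
```python
def error_discovery_rate(false_pos: int, false_neg: int, annotated_cds: int) -> float:
    """EDR = (FN + FP) / CDS, as defined in the footnote to Table 1."""
    return (false_pos + false_neg) / annotated_cds


# Totals row of Table 1: method -> (FP, FN).
TABLE1_TOTALS = {
    "GLIMMER": (575, 2),
    "GENESCAN": (215, 12),
    "TESTCODE": (317, 180),
    "CODONUSAGE": (322, 21),
}
ANNOTATED_CDS = 295

if __name__ == "__main__":
    for method, (fp, fn) in TABLE1_TOTALS.items():
        print(f"{method}: {error_discovery_rate(fp, fn, ANNOTATED_CDS):.2f}")
```
]]></p>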
         <p>Combination of the programs improved the Error Discovery Rate, especially in terms of false positives (Table <tblr tid="T2">2</tblr>). When only ORFs predicted by all four programs were considered, the false positive rate was &lt;1%, but the false negative rate was almost 50%. By including ORFs predicted by only three of the four programs, the false negative rate was dramatically lowered to 10%, but the false positive rate rose to >10%. Further relaxation of stringency (two of four programs) resulted in a substantial increase in false positives (76%), with only a modest decrease in false negatives (~5%). Thus, the Error Discovery Rate is lowest (21%) when the consensus prediction of three out of four programs is considered. The use of two trained (G<smcaps>LIMMER</smcaps> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps>) and two non-trained (T<smcaps>ESTCODE</smcaps> and G<smcaps>ENESCAN</smcaps>) algorithms reduced false positives and false negatives.</p>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Automated gene prediction by combination of different methods.</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c ca="center">
                     <p>Chr</p>
                  </c>
                  <c ca="center">
                     <p>Annotated CDS</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>4 methods</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>3 methods</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>2 methods</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="6">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
                  <c ca="center">
                     <p>FP</p>
                  </c>
                  <c ca="center">
                     <p>FN</p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr1</p>
                  </c>
                  <c ca="center">
                     <p>79</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>34</p>
                  </c>
                  <c ca="center">
                     <p>13</p>
                  </c>
                  <c ca="center">
                     <p>5</p>
                  </c>
                  <c ca="center">
                     <p>65</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr3</p>
                  </c>
                  <c ca="center">
                     <p>90</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
                  <c ca="center">
                     <p>7</p>
                  </c>
                  <c ca="center">
                     <p>10</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
                  <c ca="center">
                     <p>5</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Chr4</p>
                  </c>
                  <c ca="center">
                     <p>123</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>58</p>
                  </c>
                  <c ca="center">
                     <p>14</p>
                  </c>
                  <c ca="center">
                     <p>13</p>
                  </c>
                  <c ca="center">
                     <p>109</p>
                  </c>
                  <c ca="center">
                     <p>6</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>
                        <b>Total</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>295</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>2</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>142</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>34</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>28</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>224</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>12</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="2" ca="center">
                     <p>
                        <b>EDR</b>
                     </p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>0.49</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>0.21</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>0.80</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
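         <p>The "n of four programs" rule behind Table 2 can be illustrated with a simple vote count. The sketch below is an assumption about how such a consensus might be computed, not the authors' implementation; the ORF identifiers and the toy prediction sets are invented for illustration and are not data from this study.</p>
         <p><![CDATA[
```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set, Tuple


def consensus_calls(predictions: Dict[str, Iterable[Hashable]],
                    min_methods: int) -> Set[Hashable]:
    """Return the ORFs predicted by at least `min_methods` of the methods."""
    votes = Counter()
    for orfs in predictions.values():
        votes.update(set(orfs))  # each method votes at most once per ORF
    return {orf for orf, n in votes.items() if n >= min_methods}


def count_errors(called: Iterable[Hashable],
                 annotated: Iterable[Hashable]) -> Tuple[int, int]:
    """Return (false positives, false negatives) relative to the annotation."""
    called, annotated = set(called), set(annotated)
    return len(called - annotated), len(annotated - called)


if __name__ == "__main__":
    # Toy data only: hypothetical ORF identifiers.
    toy = {
        "GLIMMER": {"orf1", "orf2", "orf3", "orf4"},
        "GENESCAN": {"orf1", "orf2", "orf3"},
        "TESTCODE": {"orf1", "orf4"},
        "CODONUSAGE": {"orf1", "orf2", "orf4"},
    }
    annotated = {"orf1", "orf2", "orf5"}
    for k in (4, 3, 2):
        print(k, count_errors(consensus_calls(toy, k), annotated))
```
]]></p>
         <p>In the toy data, requiring all four methods gives no false positives but misses annotated ORFs, while relaxing the rule admits more false positives, which mirrors the trade-off seen in Table 2.</p>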
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>The semi-automated comparative analysis clearly shows that some degree of manual annotation is still necessary in projects where there is a large proportion of novel genes. Manual annotation is time-consuming and labor-intensive. The A<smcaps>RTEMIS</smcaps> desktop environment, with importation of trained and non-trained non-consensus gene-prediction algorithms, facilitates easy comparison of the results and allows the user to make more-informed decisions for calling protein-coding genes. Thus, this improvised and powerful software, developed using already existing gene identification methods and an existing annotation platform, is extremely helpful for whole genome sequencing projects.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>G<smcaps>LIMMER</smcaps> 2.0 <url>http://www.tigr.org/software/glimmer/</url><abbrgrp><abbr bid="B14">14</abbr></abbrgrp> was trained using predicted protein-coding genes from LmjF chr1 <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> (manual annotations based on T<smcaps>ESTCODE</smcaps> and C<smcaps>ODON USAGE</smcaps>) and chr4 (manual annotations using H<smcaps>EXAMER</smcaps> and C<smcaps>ODON USAGE</smcaps>: A. Ivens, personal communication) using the default settings. The trained G<smcaps>LIMMER</smcaps> was run on LmjF sequence using the default settings with a minimum gene length of 75 amino acids, and the output was parsed into an EMBL-formatted feature table file. These data were imported into A<smcaps>RTEMIS</smcaps> 4.0 (installed on Intel-based Linux or Windows 2000 machines) using the "Read Features Into" option of the "File" menu. This allows the G<smcaps>LIMMER</smcaps>-predicted genes to be displayed as CDS features. The T<smcaps>ESTCODE</smcaps><abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, G<smcaps>ENESCAN</smcaps> <url>http://202.41.10.146/public_htmlnew/gs.htm</url><abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps><abbrgrp><abbr bid="B15">15</abbr></abbrgrp> algorithms were re-coded in C++ and the statistical results collected in text files with a single value for each sliding window (100 nt windows, sliding in one-nt increments). The T<smcaps>ESTCODE</smcaps> and G<smcaps>ENESCAN</smcaps> data were imported into A<smcaps>RTEMIS</smcaps> <url>http://www.sanger.ac.uk/Software/Artemis/</url><abbrgrp><abbr bid="B18">18</abbr></abbrgrp> using the "Add User Plot" option of the "Display" menu, and displayed graphically. This procedure can be used to import the results of other sliding-window methods.
The C<smcaps>ODON USAGE</smcaps> bias statistic, which is coded as part of A<smcaps>RTEMIS</smcaps>, was calculated for the three reading frames of each DNA strand and displayed in different colors, using the "Add Usage Plot" option of the "Display" menu to import <it>Leishmania </it>C<smcaps>ODON USAGE</smcaps> tables. Figure <figr fid="F1">1</figr> shows a panel containing results from each of the four gene-prediction methods for a typical LmjF sequence.</p>
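         <p>The sliding-window output described above can be sketched as follows. This is a minimal illustration in Python, not the C++ code used in this work: the function names are hypothetical, and a simple G+C fraction stands in for the T<smcaps>ESTCODE</smcaps> or G<smcaps>ENESCAN</smcaps> statistic. It produces one value per 100 nt window, advancing one nt at a time, and formats the values one per line. The exact file layout expected by the "Add User Plot" option should be checked against the A<smcaps>RTEMIS</smcaps> manual for the version in use.</p>

```python
from typing import Callable, List


def gc_fraction(window: str) -> float:
    """Stand-in statistic (an assumption): fraction of G and C in the window."""
    if not window:
        return 0.0
    gc = sum(1 for base in window.upper() if base in "GC")
    return gc / len(window)


def sliding_window_scores(
    seq: str,
    statistic: Callable[[str], float] = gc_fraction,
    window: int = 100,
    step: int = 1,
) -> List[float]:
    """Return one score per window; empty if the sequence is shorter than a window."""
    last_start = len(seq) - window
    return [
        statistic(seq[start:start + window])
        for start in range(0, last_start + 1, step)
    ]


def format_plot_lines(scores: List[float]) -> str:
    """Format scores as text, one value per line."""
    return "\n".join(f"{score:.4f}" for score in scores) + "\n"


if __name__ == "__main__":
    # Toy sequence: 100 nt of GC repeats followed by 100 nt of AT repeats.
    demo = "GC" * 50 + "AT" * 50
    print(format_plot_lines(sliding_window_scores(demo)[:3]), end="")
```

         <p>Any other per-window statistic can be passed as the <it>statistic</it> argument, which mirrors the point that the same import procedure applies to other sliding-window methods.</p>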
         <p>For automated G<smcaps>ENESCAN</smcaps>, T<smcaps>ESTCODE</smcaps> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> predictions, genes were called only for those ORFs larger than 100 amino acids with mean scores (over the entire ORF) above thresholds of 4.0, 9.7, and 0, respectively. For overlapping ORFs (on the same or opposite strands), the one with the highest signal was used.</p>
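         <p>The gene-calling rule above can be illustrated with a short sketch. The thresholds (4.0, 9.7 and 0) and the 100 amino acid minimum are taken from the text; everything else is an assumption made for illustration, including the class and function names, the greedy highest-score-first resolution of overlaps, and the choice to count ORF length as the number of whole codons between the start and end coordinates.</p>

```python
from dataclasses import dataclass
from typing import List

# Mean-score thresholds quoted in the text for each method.
THRESHOLDS = {"GENESCAN": 4.0, "TESTCODE": 9.7, "CODONUSAGE": 0.0}


@dataclass
class Orf:
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str         # "+" or "-"
    mean_score: float   # mean statistic over the entire ORF

    @property
    def length_aa(self) -> int:
        # Assumption: length is the number of whole codons spanned.
        return (self.end - self.start + 1) // 3


def overlaps(a: Orf, b: Orf) -> bool:
    """True if the two ORFs share any position, regardless of strand."""
    return a.end >= b.start and b.end >= a.start


def call_genes(orfs: List[Orf], threshold: float, min_aa: int = 100) -> List[Orf]:
    """Keep ORFs above the length and score cut-offs; among overlapping
    candidates keep the highest-scoring one (greedy, an assumption)."""
    candidates = [
        orf for orf in orfs
        if orf.length_aa > min_aa and orf.mean_score > threshold
    ]
    kept: List[Orf] = []
    for orf in sorted(candidates, key=lambda o: o.mean_score, reverse=True):
        if not any(overlaps(orf, other) for other in kept):
            kept.append(orf)
    return sorted(kept, key=lambda o: o.start)


if __name__ == "__main__":
    demo = [
        Orf(1, 600, "+", 5.0),      # overlaps the next ORF, lower score
        Orf(400, 1000, "-", 6.5),   # highest score in the overlap
        Orf(1200, 1400, "+", 9.0),  # too short (67 codons)
        Orf(2000, 2900, "+", 3.0),  # score below threshold
        Orf(3000, 3900, "+", 4.5),  # passes both cut-offs
    ]
    for gene in call_genes(demo, THRESHOLDS["GENESCAN"]):
        print(gene.start, gene.end, gene.strand, gene.mean_score)
```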
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>GA re-coded the T<smcaps>ESTCODE</smcaps>, G<smcaps>ENESCAN</smcaps> and C<smcaps>ODON</smcaps>U<smcaps>SAGE</smcaps> algorithms in C++ for the UNIX environment and performed the automated combined prediction analysis. PDM coded the wrapper for parsing the G<smcaps>LIMMER</smcaps> predictions. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors thank Kim Rutherford (Wellcome Trust Sanger Institute) for his help and useful discussions. This work was supported by NIH grant AI40599.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p><it>Leishmania major </it>Friedlin chromosome 1 has an unusual distribution of protein-coding genes</p>
            </title>
            <aug>
               <au>
                  <snm>Myler</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Audleman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>deVos</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hixson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Kiser</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lemley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Magness</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Rickell</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Sisk</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Sunkin</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>2902</fpage>
            <lpage>2906</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10077609</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.6.2902</pubid>
                  <pubid idtype="pmcid">15867</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p><it>Leishmania major </it>chromosome 3 contains two long "convergent" polycistronic gene clusters separated by a tRNA gene</p>
            </title>
            <aug>
               <au>
                  <snm>Worthey</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Aggarwal</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Cawthra</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fazelinia</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Fu</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Hassebrock</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hixson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Ivens</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Kiser</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Marsolini</snm>
                  <fnm>F</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucl Acids Res</source>
            <inpress/>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Computational methods for the identification of genes in vertebrate genomic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Claverie</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Hum Mol Genet</source>
            <pubdate>1997</pubdate>
            <volume>6</volume>
            <fpage>1735</fpage>
            <lpage>1744</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/hmg/6.10.1735</pubid>
                  <pubid idtype="pmpid" link="fulltext">9300666</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The gene identification problem: an overview for developers</p>
            </title>
            <aug>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Computers Chem</source>
            <pubdate>1996</pubdate>
            <volume>20</volume>
            <fpage>103</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0097-8485(96)80012-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>DNA composition, codon usage and exon prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Guigo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>In Genetics Databases</source>
            <publisher>San Diego: Academic Press, Inc</publisher>
            <editor>Bishop M</editor>
            <pubdate>1999</pubdate>
            <fpage>53</fpage>
            <lpage>80</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A comparative guide to gene prediction tools for the bioinformatics amateur</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Field</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Risk</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Int J Oncol</source>
            <pubdate>2002</pubdate>
            <volume>20</volume>
            <fpage>697</fpage>
            <lpage>705</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11894112</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Current methods of gene prediction, their strengths and weaknesses</p>
            </title>
            <aug>
               <au>
                  <snm>Mathe</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sagot</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Schiex</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rouze</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>4103</fpage>
            <lpage>4117</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gkf543</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Gene-finding approaches for eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Stormo</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>394</fpage>
            <lpage>397</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10779479</pubid>
                  <pubid idtype="doi">10.1101/gr.10.4.394</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Finding the genes in genomic DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Burge</snm>
                  <fnm>CB</fnm>
               </au>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>346</fpage>
            <lpage>354</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">9666331</pubid>
                  <pubid idtype="doi">10.1016/S0959-440X(98)80069-9</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p><it>Ab initio </it>gene identification: prokaryote genome annotation with Genescan and Glimmer</p>
            </title>
            <aug>
               <au>
                  <snm>Aggarwal</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Ramaswamy</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Biosci</source>
            <pubdate>2002</pubdate>
            <volume>27</volume>
            <fpage>7</fpage>
            <lpage>14</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11927773</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>DIGIT: a novel gene finding program by combining gene-finders</p>
            </title>
            <aug>
               <au>
                  <snm>Yada</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Takagi</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Totoki</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Sakaki</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Takaeda</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2003</pubdate>
            <fpage>375</fpage>
            <lpage>387</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12603043</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A Bayesian framework for combining gene predictions</p>
            </title>
            <aug>
               <au>
                  <snm>Pavlovi&#263;</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Garg</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kasif</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>19</fpage>
            <lpage>27</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11836207</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/18.1.19</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>GAZE: a generic framework for the integration of gene-prediction data by dynamic programming</p>
            </title>
            <aug>
               <au>
                  <snm>Howe</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <fpage>1418</fpage>
            <lpage>1427</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12213779</pubid>
                  <pubid idtype="doi">10.1101/gr.149502</pubid>
                  <pubid idtype="pmcid">186661</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Improved microbial gene identification with GLIMMER</p>
            </title>
            <aug>
               <au>
                  <snm>Delcher</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Harmon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kasif</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>White</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Salzberg</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>4636</fpage>
            <lpage>4641</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/27.23.4636</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Codon preference and its use in identifying protein coding regions in long DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Staden</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>AD</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>1982</pubdate>
            <volume>10</volume>
            <fpage>141</fpage>
            <lpage>156</lpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Recognition of protein coding regions in DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>1982</pubdate>
            <volume>10</volume>
            <fpage>5303</fpage>
            <lpage>5318</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Prediction of probable genes by Fourier analysis of genomic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Tiwari</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ramachandran</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bhattacharya</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bhattacharya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ramaswamy</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1997</pubdate>
            <volume>13</volume>
            <fpage>263</fpage>
            <lpage>270</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9183531</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Artemis: sequence visualisation and annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Rutherford</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Parkhill</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Crook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Horsnell</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rice</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rajandream</snm>
                  <fnm>M-A</fnm>
               </au>
               <au>
                  <snm>Barrell</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <fpage>944</fpage>
            <lpage>945</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11120685</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/16.10.944</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Basic local alignment search tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">2231712</pubid>
                  <pubid idtype="doi">10.1006/jmbi.1990.9999</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
