<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-1</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>on DNA stability</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Kanhere</snm>
               <fnm>Aditi</fnm>
               <insr iid="I1"/>
               <email>aditi@mbu.iisc.ernet.in</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Bansal</snm>
               <fnm>Manju</fnm>
               <insr iid="I1"/>
               <email>mb@mbu.iisc.ernet.in</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>1</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/1</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15631638</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-1</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>02</day>
               <month>9</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>05</day>
               <month>1</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>05</day>
               <month>1</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Kanhere and Bansal; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>In the post-genomic era, correct gene prediction has become one of the biggest challenges in genome annotation. Improved promoter prediction methods can be one step towards developing more reliable <it>ab initio </it>gene prediction methods. This work presents a novel prokaryotic promoter prediction method based on DNA stability.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The promoter region is less stable, and hence more prone to melting, than other genomic regions. Our analysis shows that promoter prediction based on the difference in DNA stability between promoter and non-promoter regions performs much better than existing prokaryotic promoter prediction programs, which are based on sequence motif searches. At present the method works optimally for genomes such as that of <it>Escherichia coli</it>, which have a G+C composition near 50 %, and it also performs satisfactorily for other prokaryotic promoters.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Our analysis shows that the change in DNA stability provides a much better clue than the usual sequence motifs, such as the Pribnow box and the -35 sequence, for differentiating promoter regions from non-promoter regions. To a certain extent, it is more general and is likely to be applicable across organisms. Hence, incorporating such features in addition to the signature motifs can greatly improve the presently available promoter prediction programs.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The accumulation of a huge amount of genome sequence data in recent years, and the task of extracting useful information from it, have given rise to many new challenges. One of the biggest is gene prediction, and several gene prediction programs have been developed to meet this need (for reviews see <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>). Most of these programs require training based on prior knowledge of sequence features such as codon bias, which are in turn organism specific. In such cases, the lack of a large enough sample of known genes, as is typical for a newly sequenced genome, can lead to suboptimal predictions. Other gene prediction methods rely on homology between two or more genomes, but these are of little help for genomes with no known homologues. In addition, most gene prediction programs concentrate on protein-coding regions, so RNA genes, which can amount to as much as 5 % of the total number of protein-coding genes, are neglected. Hence it is important to design <it>ab initio </it>gene prediction programs. One important step towards <it>ab initio </it>gene prediction is the development of better promoter and TSS (transcription start site) prediction methods.</p>
         <p>Although reasonable progress has been achieved in the prediction of coding regions, promoter prediction methods are still far from accurate <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, and there are some obvious reasons for these inaccuracies. One major difficulty is that the regulatory sequence elements in promoters are short and not fully conserved; hence there is a high probability of finding similar sequence elements elsewhere in the genome, outside the promoter regions. This is why most promoter prediction algorithms, which are based on finding these regulatory sequence elements, predict many false positives. Thus the incorporation of additional characteristics that are unique to the promoter region is likely to improve the currently available promoter prediction methods.</p>
         <p>In our earlier analysis, we observed that in bacteria as well as in eukaryotes, various properties of the region immediately upstream of the TSS differ from those of the downstream region <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. There are differences in sequence composition as well as in sequence-dependent properties such as stability, bendability and curvature. The upstream region is less stable, more rigid and more curved than the downstream region. Some of these observations are supported by other studies carried out independently on genomic sequences <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. Among all types of promoters, the most prominent feature is the difference in DNA duplex stability between the upstream and downstream regions. Here, we propose a prokaryotic promoter prediction method based on the stability differences between promoter and non-promoter regions.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Lower stability of promoter regions in bacterial sequences</p>
            </st>
            <p>It is well known that the stability of a DNA fragment is a sequence-dependent property and depends primarily on the sum of the interactions between the constituent dinucleotides. The overall stability of an oligonucleotide can thus be predicted from its sequence, if one knows the relative contribution of each nearest-neighbour interaction in the DNA <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The average stability profiles for three sets of bacterial promoter sequences, calculated on this principle using a 15 nt moving window, are shown in Figure <figr fid="F1">1</figr>. It is interesting that promoters from diverse bacteria, which have quite different genome compositions (A+T composition: <it>E. coli </it>0.49, <it>B. subtilis </it>0.56 and <it>C. glutamicum </it>0.46), show strikingly similar features. Promoters from all three bacteria show a low-stability peak around the -10 region. The second prominent feature in the free energy profiles of all three bacteria is the difference in stability between the upstream and downstream regions. In all three groups of promoter sequences, the average stability of the upstream region is lower than that of the downstream region. However, the three sets of promoter sequences differ in their basal energy level, which seems to depend on the nucleotide composition of the bacterial genome.</p>
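            <p>As a rough illustration of the nearest-neighbour principle described above, the following Python sketch sums dinucleotide free energies over a 15 nt moving window. The function names are hypothetical, and the dinucleotide values are an assumption: they are taken to be unified nearest-neighbour parameters (kcal/mol), which may differ from the values actually used with reference 18, and end corrections are ignored.</p>

```python
# Assumed nearest-neighbour free energies (kcal/mol) for each dinucleotide
# step, read 5'->3' on one strand. Illustrative values, not confirmed by the text.
NN_DG = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}


def step_energies(seq):
    """Free energy of every overlapping dinucleotide step in seq."""
    seq = seq.upper()
    return [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]


def energy_profile(seq, window=15):
    """Free energy of each window-nt stretch (window - 1 steps per stretch)."""
    steps = step_energies(seq)
    n_steps = window - 1
    return [sum(steps[i:i + n_steps]) for i in range(len(steps) - n_steps + 1)]
```

            <p>With these assumed values, an A+T-rich window gives a less negative (less stable) free energy than a G+C-rich window of the same length. Sequences containing characters other than A, C, G and T raise a KeyError in this sketch.</p>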
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Overall free energy profile around bacterial TSS</p>
               </caption>
               <text>
                  <p><b>Overall free energy profile around bacterial TSS </b>The figure shows the average free energy profiles of A) <it>Escherichia coli </it>(227 promoters), B) <it>Bacillus subtilis </it>(89 promoters) and C) <it>Corynebacterium glutamicum </it>(28 promoters). The profiles extend from 500 nt upstream to 500 nt downstream of the transcription start site (positioned at 0, shown as a dashed line). The nucleotide sequence position is shown on the x-axis. More negative values of free energy indicate greater stability.</p>
               </text>
               <graphic file="1471-2105-6-1-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Detailed analysis of <it>E. coli </it>promoter sequences</p>
            </st>
            <p>To gain better insight into the stability feature, we carried out a detailed analysis of <it>E. coli </it>promoter sequences. Our statistical analysis using the "Wilcoxon signed test for equality of medians" (see METHODS) shows that the free energy distribution corresponding to a fragment extending from position -148 to 51 in the <it>E. coli </it>sequences differs appreciably from the energy distribution calculated in randomly selected windows, at a significance level of 0.0001. A comparison of the free energy distribution at position -20 (corresponding to the promoter region) with the distributions at positions -200 (corresponding to the region upstream of the promoter region) and +200 (corresponding to the coding region) is shown in Figure <figr fid="F2">2</figr>. The region immediately upstream of the TSS is clearly much less stable than the other two regions. The average free energy at position -20 is -17.48 kcal/mol, while the average free energies at positions -200 and +200 are -19.42 kcal/mol and -20.19 kcal/mol respectively. The Kolmogorov-Smirnov test also confirms that the free energy distribution at position -20 differs significantly from those at positions -200 and +200 (alpha = 10<sup>-10</sup>).</p>
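            <p>For readers unfamiliar with the Kolmogorov-Smirnov comparison mentioned above, the sketch below computes the two-sample test statistic, i.e. the largest gap between the empirical distribution functions of two sets of window free energies. It is only an illustration with a hypothetical function name; it does not compute a p-value and is not the procedure used to obtain the significance levels quoted here.</p>

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

            <p>A statistic of 0 means the two samples have identical empirical distributions, while a value of 1 means they do not overlap at all. In practice a statistics library would be used to obtain the associated significance level.</p>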
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Histogram showing the free energy distribution corresponding to upstream region (-200), promoter region (-20) and coding region (+200) in <it>E. coli </it>sequences</p>
               </caption>
               <text>
                  <p><b>Histogram showing the free energy distribution corresponding to upstream region (-200), promoter region (-20) and coding region (+200) in <it>E. coli </it>sequences </b>The free energy distribution corresponding to position -20 (calculated for a 15 nt window extending from -20 to -6) is shown as brown bars. Free energy distribution corresponding to positions -200 (calculated for a 15 nt window from -200 to -186, shown in green bars) and +200 (calculated for 15 nt window from +200 to +214, shown in blue bars) are also shown for comparison. Each bar corresponds to 1 kcal/mol. The average free energies corresponding to -20, -200 and +200 positions are -17.48 kcal/mol, -19.42 kcal/mol and -20.19 kcal/mol respectively.</p>
               </text>
               <graphic file="1471-2105-6-1-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Details of methodology</p>
            </st>
            <p>This difference in free energy and the stability of promoter regions as compared to that of coding and other non-coding regions can be used to search for the promoters. Based on this consideration, a new scoring function D(n) is defined, which will look for differences in free energy of the neighbouring regions of position n:</p>
            <p>D(n) = E1(n) - E2(n)</p>
            <p>where,</p>
            <p>
               <graphic file="1471-2105-6-1-i1.gif"/>
            </p>
            <p>Thus, E1(n) and E2(n) represent the average free energy (see METHODS) in the 50 nt region starting from nucleotide n and in the neighbouring 100 nt region starting from nucleotide n+99, respectively. The E1 value represents the basal energy level, which is characteristic of the given bacterial genome (in this case <it>E. coli</it>), and the D value represents the free energy difference between the two neighbouring regions. A stretch of DNA is assigned as a promoter only if the average free energy of that 50 nt region (E1) and the difference in free energy relative to its neighbouring region (D) are both greater than the chosen cut-offs. The protocol followed to calculate the true and false positives, and hence the sensitivity and precision, is presented as a flowchart in Figure <figr fid="F3">3</figr>. Identical sensitivity values can be achieved using different combinations of D and E1 cut-off values, as is evident from the contour plot shown in Figure <figr fid="F4">4A</figr>. Similarly, different combinations of D and E1 cut-offs can lead to similar precisions (Figure <figr fid="F4">4B</figr>). However, different D and E1 cut-offs corresponding to a given sensitivity level result in a wide range of precisions (Figure <figr fid="F5">5</figr>). Hence, to attain a desired level of sensitivity, the D and E1 cut-off values are chosen such that the number of false positives is minimal and the precision is maximal.</p>
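            <p>The scoring rule described in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the input is a list of per-position free energy values (for example, one value per 15 nt window), and the function and parameter names are hypothetical. The window sizes and the offset of 99 follow the description in the text.</p>

```python
def mean(values):
    return sum(values) / len(values)


def region_scores(profile, n, w1=50, w2=100, offset=99):
    """Return (D, E1) at position n of a per-position free energy profile."""
    e1 = mean(profile[n:n + w1])                    # 50 nt region starting at n
    e2 = mean(profile[n + offset:n + offset + w2])  # 100 nt region from n + 99
    return e1 - e2, e1


def predict_promoters(profile, d_cut, e1_cut, w1=50, w2=100, offset=99):
    """Positions n where both D and E1 are greater than their cut-offs."""
    hits = []
    last = len(profile) - (offset + w2)  # last n with a complete E2 window
    for n in range(last + 1):
        d, e1 = region_scores(profile, n, w1, w2, offset)
        if d > d_cut and e1 > e1_cut:
            hits.append(n)
    return hits
```

            <p>Because free energies are negative, a position passes only when its 50 nt region is less stable than the E1 cut-off and also less stable than the neighbouring region by more than the D cut-off. Keeping the window sizes as parameters makes it straightforward to try other window combinations.</p>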
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>A flowchart summarizing our methodology</p>
               </caption>
               <text>
                  <p><b>A flowchart summarizing our methodology </b>* If there is more than one prediction in the 200 nt region (-150 to 50), only the prediction nearest to the TSS is taken as a true prediction. The remaining predictions are counted as false predictions.</p>
               </text>
               <graphic file="1471-2105-6-1-3"/>
            </fig>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Sensitivity and precision contour plots</p>
               </caption>
               <text>
                  <p><b>Sensitivity and precision contour plots </b>The E1 value cut-offs are plotted on x-axis while D value cut-offs are plotted on y-axis. The different A) sensitivity and B) precision levels are shown by colours ranging from dark blue to brown, where dark blue corresponds to lowest value and brown corresponds to highest value.</p>
               </text>
               <graphic file="1471-2105-6-1-4"/>
            </fig>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>A plot showing range of precision values obtained for a given sensitivity</p>
               </caption>
               <text>
                  <p><b>A plot showing range of precision values obtained for a given sensitivity </b>The sensitivity (x-axis) and precision (y-axis) values corresponding to different E1 and D cut-offs are plotted.</p>
               </text>
               <graphic file="1471-2105-6-1-5"/>
            </fig>
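            <p>The counting rule in the note to Figure 3 can be sketched as below. This is an illustrative reading only: it assumes one known TSS per sequence, predictions given as positions relative to that TSS, sensitivity taken as the fraction of sequences with a true prediction, and precision taken as TP / (TP + FP). All function names are hypothetical.</p>

```python
def classify(predictions, lo=-150, hi=50):
    """Return (true_hit, n_false) for the predictions on one sequence.

    true_hit is the prediction nearest to the TSS (position 0) inside
    the lo..hi region, or None; every other prediction counts as false.
    """
    inside = [p for p in predictions if hi >= p >= lo]
    true_hit = min(inside, key=abs) if inside else None
    n_true = 1 if inside else 0
    return true_hit, len(predictions) - n_true


def sensitivity_precision(per_sequence_predictions):
    """Sensitivity and precision over a list of per-sequence prediction lists."""
    tp = fp = 0
    for preds in per_sequence_predictions:
        true_hit, n_false = classify(preds)
        tp += 0 if true_hit is None else 1
        fp += n_false
    sensitivity = tp / len(per_sequence_predictions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

            <p>Note that a second prediction inside the -150 to 50 region is counted as a false positive, exactly like a prediction outside it.</p>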
            <p>Initially, we divided the <it>E. coli </it>sequence data into two sets. The E1 and D cut-off values corresponding to different sensitivity levels were obtained for 100 randomly selected sequences (1<sup>st </sup>set). These cut-off values were then applied to a second set consisting of the remaining 127 sequences. The sensitivity and precision values calculated for the first and second sets match very well. We also found that very similar results are obtained when the whole dataset is used (Figure <figr fid="F6">6</figr>). Hence, we present the results for the whole dataset rather than separately for the two sets. The D and E1 cut-offs and the number of false positives corresponding to different levels of sensitivity are given in Table <tblr tid="T1">1</tblr>. To confirm the validity of our choice, we used another set of 1000 nt long sequences extracted from the centre of ORFs that were more than 2000 nt long. The results for this set of control fragments are also given in Table <tblr tid="T1">1</tblr> and show very few false positives.</p>
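            <p>A minimal sketch of the split-and-select procedure described in this paragraph is given below. It assumes that each candidate pair of cut-offs has already been evaluated on the first set and is available as a tuple (D cut-off, E1 cut-off, sensitivity, precision). The function names, the fixed random seed and the tuple layout are hypothetical.</p>

```python
import random


def split_dataset(items, n_first=100, seed=0):
    """Randomly split items into a first set of n_first and the remainder."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_first], shuffled[n_first:]


def best_cutoffs(grid, target_sensitivity):
    """Among cut-offs reaching the target sensitivity, pick maximal precision.

    grid: iterable of (d_cut, e1_cut, sensitivity, precision) tuples.
    """
    eligible = [g for g in grid if g[2] >= target_sensitivity]
    if not eligible:
        return None
    return max(eligible, key=lambda g: g[3])
```

            <p>Cut-offs selected on the first set can then be applied unchanged to the second set, so that the sensitivity and precision obtained on the two sets can be compared.</p>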
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>The comparison of sensitivity and precision values from test and 'training' sets</p>
               </caption>
               <text>
                  <p><b>The comparison of sensitivity and precision values from test and 'training' sets </b>The sensitivity (x-axis) and precision (y-axis) values corresponding to 1) the test set (filled circles), 2) the training set (open circles) and 3) the whole <it>E. coli </it>dataset (red) are shown. The sensitivity and precision values for the test set were calculated using E1 and D cut-offs derived from the training set.</p>
               </text>
               <graphic file="1471-2105-6-1-6"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The number of false positives obtained for different levels of sensitivity.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="center">
                        <p>Sensitivity</p>
                     </c>
                     <c ca="center">
                        <p>Cut-off for D (kcal/mol)</p>
                     </c>
                     <c ca="center">
                        <p>Cut-off for E1 (kcal/mol)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Frequency of false positives</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>FP (1/nt)<sup>a</sup></p>
                     </c>
                     <c ca="center">
                        <p>FP (1/nt)<sup>b</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.13</p>
                     </c>
                     <c ca="center">
                        <p>3.4</p>
                     </c>
                     <c ca="center">
                        <p>-15.99</p>
                     </c>
                     <c ca="center">
                        <p>1/16214</p>
                     </c>
                     <c ca="center">
                        <p>1/261000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.22</p>
                     </c>
                     <c ca="center">
                        <p>3.4</p>
                     </c>
                     <c ca="center">
                        <p>-16.7</p>
                     </c>
                     <c ca="center">
                        <p>1/11350</p>
                     </c>
                     <c ca="center">
                        <p>1/130500</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.32</p>
                     </c>
                     <c ca="center">
                        <p>3.3</p>
                     </c>
                     <c ca="center">
                        <p>-17.1</p>
                     </c>
                     <c ca="center">
                        <p>1/8407</p>
                     </c>
                     <c ca="center">
                        <p>1/65250</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.40</p>
                     </c>
                     <c ca="center">
                        <p>3.3</p>
                     </c>
                     <c ca="center">
                        <p>-17.55</p>
                     </c>
                     <c ca="center">
                        <p>1/6486</p>
                     </c>
                     <c ca="center">
                        <p>1/29000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.50</p>
                     </c>
                     <c ca="center">
                        <p>2.76</p>
                     </c>
                     <c ca="center">
                        <p>-17.53</p>
                     </c>
                     <c ca="center">
                        <p>1/3914</p>
                     </c>
                     <c ca="center">
                        <p>1/13737</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.60</p>
                     </c>
                     <c ca="center">
                        <p>2.45</p>
                     </c>
                     <c ca="center">
                        <p>-17.64</p>
                     </c>
                     <c ca="center">
                        <p>1/2467</p>
                     </c>
                     <c ca="center">
                        <p>1/7250</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.70</p>
                     </c>
                     <c ca="center">
                        <p>2.35</p>
                     </c>
                     <c ca="center">
                        <p>-18.07</p>
                     </c>
                     <c ca="center">
                        <p>1/1621</p>
                     </c>
                     <c ca="center">
                        <p>1/2747</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.81</p>
                     </c>
                     <c ca="center">
                        <p>1.9</p>
                     </c>
                     <c ca="center">
                        <p>-18.15</p>
                     </c>
                     <c ca="center">
                        <p>1/1086</p>
                     </c>
                     <c ca="center">
                        <p>1/1878</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                     <c ca="center">
                        <p>0.97</p>
                     </c>
                     <c ca="center">
                        <p>-18.37</p>
                     </c>
                     <c ca="center">
                        <p>1/572</p>
                     </c>
                     <c ca="center">
                        <p>1/967</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a </sup>The false positives in the 1000 nt fragments, with TSS at the centre (-500 to +500).</p>
                  <p><sup>b </sup>The false positives in the 1000 nt fragments extracted from the centre of ORFs with length more than 2000 nt.</p>
               </tblfn>
            </tbl>
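            <p>The FP (1/nt) columns of Table 1 express false positives as one per so many nucleotides scanned. A minimal sketch of that arithmetic follows, assuming the frequency is the total length of all fragments divided by the number of false positives. The function name is hypothetical, and the count of 14 false positives used in the note below is an illustrative figure, not a value reported in the text.</p>

```python
def nt_per_false_positive(n_false_positives, n_fragments, fragment_length=1000):
    """Average number of nucleotides scanned per false positive."""
    if n_false_positives == 0:
        return float("inf")  # no false positives in the scanned fragments
    return n_fragments * fragment_length / n_false_positives
```

            <p>For example, a hypothetical 14 false positives over the 227 <it>E. coli </it>fragments of 1000 nt each would give round(nt_per_false_positive(14, 227)) = 16214, i.e. about one false positive per 16214 nt.</p>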
            <p>In principle, D can also be calculated using equal-sized windows, i.e. 50 nt, for both E1 and E2 instead of a 50 nt window for E1 and a 100 nt window for E2. However, our calculations show that using equal-sized windows for both E1 and E2 gives slightly lower precision than using a 100 nt window for E2 (Figure <figr fid="F7">7</figr>). Hence, in our promoter predictions, we chose a 100 nt window for the E2 calculations.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Change in precision with the use of different sized windows for E2 calculation</p>
               </caption>
               <text>
                  <p><b>Change in precision with the use of different sized windows for E2 calculation </b>The sensitivity (x-axis) and precision (y-axis) values corresponding to the use of 1) 50 nt window (black) and 2) 100 nt window (red) for E2 calculation.</p>
               </text>
               <graphic file="1471-2105-6-1-7"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Comparison with other promoter prediction programs</p>
            </st>
            <p>A large number of promoter prediction programs have been developed for eukaryotic sequences and are easily accessible, while NNPP <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> is the only available prokaryotic promoter prediction program. It is a neural network based method in which the predictions for each sequence element constituting the promoter sequence are combined in time-delay neural networks for a complete promoter site prediction. Some other prokaryotic promoter prediction methods are based on weight matrix pattern searches <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>. One representative weight matrix method, proposed by Staden <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, uses three weight matrices corresponding to the -35 sequence, the -10 sequence and the transcription start site. It also takes into account the spacing between the -35 and -10 motifs, as well as the distance between the -10 motif and the transcription start site. A brief comparison of the results obtained by our method and the other two methods (the Staden method and the NNPP program) is given in Table <tblr tid="T2">2</tblr>. Table <tblr tid="T2">2</tblr> shows that, for the same number of true positives, our program gives considerably fewer false positives than the other two programs. It is pertinent to mention that our method differs from the other two in one major respect: our method tries to find a promoter region, while the other two programs try to pinpoint the transcription start site. It may be argued that the smaller number of false positives in our prediction method, as compared to the other two algorithms, is due to this difference. But even after taking this difference into consideration, the number of false positives predicted by our protocol turns out to be smaller than that predicted by the other two methods.
For example, Figure <figr fid="F8">8</figr> shows the case of the argI and argF genes, where the NNPP program predicts a few extra TSSs, whereas our method correctly picks up a region in the vicinity of the TSS. A combination of both methods can therefore help in reducing false predictions in the upstream and downstream regions. In principle, by restricting the pattern recognition of NNPP and Staden's method to the promoter region located initially with the help of our method, one can reduce the number of false positives. This composite approach will also help in pinpointing the TSS, which is not possible using our method alone. At the same time, it should be noted that both types of prediction fail to identify some of the promoters (Figure <figr fid="F8">8</figr>). For the csiE gene, for example, our program correctly predicted the promoter region but the NNPP program could not locate it. On the other hand, our program failed to find the promoter region for the gyrA gene, while NNPP positioned it correctly. In the case of the ilvA gene, neither program identified the promoter region.</p>
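            <p>The composite approach suggested above can be sketched as a simple filter: TSS candidates from a motif-based program are kept only if they fall inside a promoter region located by the stability-based method. The sketch is purely illustrative. The function name and the representation of regions as (start, end) pairs are hypothetical, and no output format of NNPP or of Staden's method is implied.</p>

```python
def filter_tss_by_regions(tss_candidates, promoter_regions):
    """Keep TSS candidates lying inside any predicted promoter region.

    tss_candidates: positions proposed by a motif-based program.
    promoter_regions: (start, end) pairs, inclusive, from the stability method.
    """
    return [
        t for t in tss_candidates
        if any(end >= t >= start for start, end in promoter_regions)
    ]
```

            <p>Candidates outside every predicted region are discarded, which is how the combination would be expected to reduce false positives, while a promoter missed by the stability-based step would also be lost.</p>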
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Comparison of our method with other prokaryotic prediction algorithms vis-&#224;-vis <it>Escherichia coli </it>promoters.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>TP</p>
                     </c>
                     <c ca="center">
                        <p>FP(1/nt)<sup>a</sup></p>
                     </c>
                     <c ca="center">
                        <p>FP(1/nt)<sup>b</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Our Program</p>
                     </c>
                     <c ca="center">
                        <p>195</p>
                     </c>
                     <c ca="center">
                        <p>1/780</p>
                     </c>
                     <c ca="center">
                        <p>1/1474</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Neural Network [19]</p>
                     </c>
                     <c ca="center">
                        <p>195</p>
                     </c>
                     <c ca="center">
                        <p>1/233</p>
                     </c>
                     <c ca="center">
                        <p>1/514</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Staden's method [21]</p>
                     </c>
                     <c ca="center">
                        <p>195</p>
                     </c>
                     <c ca="center">
                        <p>1/65</p>
                     </c>
                     <c ca="center">
                        <p>1/233</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>a </sup>The false positives in the 1000 nt fragments with TSS at the centre (-500 to +500).</p>
                  <p><sup>b </sup>The false positives in the 1000 nt fragments extracted from the centre of ORFs with length more than 2000 nt.</p>
               </tblfn>
            </tbl>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Examples illustrating the predictions with our method as well as NNPP</p>
               </caption>
               <text>
                  <p><b>Examples illustrating the predictions with our method as well as NNPP </b>The promoter predictions for the argF, argI, csiE, gyrA and ilvA genes by our method (red) and by NNPP (blue) in the 1000 nt fragments (-500 to +500) with the TSS at the centre. The figure was generated using the FEATURE MAP program [39].</p>
               </text>
               <graphic file="1471-2105-6-1-8"/>
            </fig>
            <p>A recent study <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> improved NNPP prediction (TLS-NNPP) by combining it with additional information, such as the distance between the TSS and the translation start site (TLS). Using this information, Burden <it>et al. </it>significantly increased the precision of NNPP. The TLS-NNPP method was tested on 510 <it>E. coli </it>sequences of length 500 bp. At comparable sensitivity levels, TLS-NNPP achieved a precision of 0.188 (sensitivity = 0.452), compared with 0.109 (sensitivity = 0.443) for NNPP. At similar sensitivity levels, the precision achieved by our method (~0.7) is higher than that of both TLS-NNPP and NNPP (Figure <figr fid="F9">9</figr>).</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Prediction accuracy of our method in case of promoters from different organisms</p>
               </caption>
               <text>
                  <p><b>Prediction accuracy of our method in case of promoters from different organisms </b>The precision (y-axis) of our method in predicting promoter region in different organisms <it>viz. Escherichia coli </it>(red), <it>Bacillus subtilis </it>(blue) and <it>Corynebacterium glutamicum </it>(black) is plotted against various levels of sensitivity (x-axis).</p>
               </text>
               <graphic file="1471-2105-6-1-9"/>
            </fig>
            <p>The high density of promoter-like signals in the region upstream of the TSS may be one reason why pattern-matching programs achieve low precision. This was shown recently by a systematic analysis of sigma70 promoters from <it>E. coli </it><abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. In that study, a number of weight matrices were generated from 599 experimentally verified promoters and tested on the 250 bp region upstream of the gene start site. Each 250 bp region was found to contain, on average, 38 promoter-like signals. The study also presented a more rigorous pattern-searching method for locating promoters. With this method the authors reached a sensitivity of 0.86, but the corresponding precision was only ~0.2. With our method, a sensitivity of 0.9 corresponds to a precision of 0.35 (Figure <figr fid="F9">9</figr>).</p>
            <p>Recently Bockhorst <it>et al. </it><abbrgrp><abbr bid="B26">26</abbr></abbrgrp> proposed a very accurate method for predicting operons, promoters and terminators in <it>E. coli</it>. This method is based on sequence as well as expression data, but requires prior knowledge of coordinates of every ORF in the genome. We would like to emphasize here that our method is different from other methods in that it is independent of any such prior knowledge about the test gene or the organism and hence holds promise as being useful for promoter prediction in a newly sequenced genome.</p>
            <p>The eukaryotic promoter prediction method proposed by Ohler <it>et al. </it><abbrgrp><abbr bid="B27">27</abbr></abbrgrp> is also worth mentioning here. Ohler <it>et al. </it>showed that a 30 % reduction in false positives can be achieved by using physical properties, such as DNA bendability, in addition to other sequence properties of promoters. Interestingly, our method, which also uses a physical property, gives far fewer false positives than Ohler <it>et al.</it>'s method (for similar sensitivity, the rate of false predictions is 1/4740 nt for Ohler <it>et al.</it>'s method and 1/8407 nt for our method).</p>
            <p>Another vertebrate promoter prediction program, 'Promfind' <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, identifies differences in hexanucleotide frequencies between promoter and coding regions and is algorithmically quite similar to our method. However, Promfind differs from our method in two important respects. First, Promfind was developed mainly for vertebrate promoters. Second, it assumes that a promoter is always present in a given sequence and merely predicts its location. This need not be the case, as some sequences may not contain any promoter at all. Our program predicts a promoter only when the sequence satisfies certain criteria and is therefore much more appropriate for genome-scale analysis.</p>
         </sec>
         <sec>
            <st>
               <p>Promoter predictions in case of RNA genes</p>
            </st>
            <p>In addition to protein-coding genes, genomes contain genes for non-coding RNAs (ncRNAs), which play structural, regulatory and catalytic roles. Identifying ncRNA genes in a genome is difficult because, unlike protein-coding regions, they lack open reading frames and are generally smaller. Sequence homology searches are also difficult, as only the structure of an ncRNA is conserved and not its sequence. Around 156 <it>E. coli </it>RNA genes are reported on the NCBI site <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, and many more small RNA genes are known to exist. Argaman <it>et al. </it><abbrgrp><abbr bid="B30">30</abbr></abbrgrp> recently identified 14 novel sRNA genes by applying a heuristic approach to search for transcriptional signals. We checked the performance of our algorithm on the 42 RNA transcription units (TUs) reported in the EcoCyc database. Our method picked up around 57 % of the RNA TUs (24 of 42) at a cut-off corresponding to 60 % sensitivity. The program works much better for rRNA operons than for tRNA transcription units. We correctly picked up promoter regions in 6 out of 7 rRNA transcription units, 17 out of 33 tRNA TUs and 1 out of the 2 remaining RNA types.</p>
         </sec>
         <sec>
            <st>
               <p>Promoter prediction in <it>Bacillus subtilis </it>and <it>Corynebacterium glutamicum</it></p>
            </st>
            <p>Finally, it is important to see whether the method works equally well for other organisms whose genome compositions differ substantially from that of <it>Escherichia coli</it>. Hence, we also tested our method using promoter sequences from 1) the A+T-rich bacterium <it>Bacillus subtilis </it>and 2) the G+C-rich bacterium <it>Corynebacterium glutamicum</it>. Figure <figr fid="F9">9</figr> summarizes the predictions for the <it>Bacillus </it>and <it>Corynebacterium </it>promoters, along with those for <it>Escherichia coli</it>. At present, our method performs best for the <it>Escherichia coli </it>promoters and also performs quite well for <it>Bacillus subtilis</it>. The prediction accuracy for <it>Corynebacterium glutamicum </it>promoters is not as good as that for the other two classes of promoters. However, it should be noted that the number of experimentally determined <it>Corynebacterium </it>promoters is much smaller than for the other two bacteria, and a larger dataset is required to arrive at any firm conclusion.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>It has often been suggested that properties of promoters other than sequence motifs, which can distinguish promoters from other genomic regions, could significantly improve gene prediction methods. Although the lower stability of promoter regions compared with non-promoter regions has been reported previously, this observation had not been incorporated into a promoter prediction program. We have successfully used the differential stability of promoter sequences to predict promoter regions. Our method performs better than currently available prokaryotic prediction methods and is also moderately successful in predicting RNA and <it>Bacillus </it>promoter regions. The method still needs to be improved to reduce the number of false positives. This can be achieved by combining the approach presented here with the earlier reported sequence analysis methods. Such a composite method will also help in pinpointing the TSS within the promoter region identified by our method.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Promoter sequence sets</p>
            </st>
            <p>All the promoter sequences used in this study are 1000 nt long, starting 500 nt upstream (position -500) and extending up to 500 nt downstream (position +500) of the TSS. In order to avoid having multiple TSS in a given 1000 nt sequence, we have excluded all the transcription start sites which are less than 500 nt apart. Our promoter set has 227 <it>E. coli </it>promoters, 89 <it>B. subtilis </it>promoters and 28 <it>C. glutamicum </it>promoters.</p>
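            <p>The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that every TSS with any neighbour closer than 500 nt is dropped, ignores strand and genome circularity, and uses 0-based positions so that the -500 to +500 window becomes a half-open slice of 1000 nt. The function names are hypothetical.</p>
            <p><![CDATA[
```python
def isolated_tss(positions, min_gap=500):
    """Return the TSS positions that have no neighbour closer than min_gap."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        near_prev = i > 0 and pos - ordered[i - 1] < min_gap
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos < min_gap
        if not (near_prev or near_next):
            kept.append(pos)
    return kept


def window_bounds(tss, flank=500):
    """Half-open (start, end) slice bounds for a 2 * flank nt window around a TSS.

    Assumes 0-based coordinates on the forward strand (an assumption, not
    something the text specifies).
    """
    return tss - flank, tss + flank
```
]]></p>
            <p>Under this reading, two start sites that are exactly 500 nt apart are both kept, because the text excludes only sites that are less than 500 nt apart.</p>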
            <sec>
               <st>
                  <p>a) <it>Escherichia coli </it>promoter sequences</p>
               </st>
               <p>We tested our algorithm using the <it>Escherichia coli </it>promoter sequences, which were taken from the PromEC dataset <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. The PromEC dataset provides a compilation of 471 experimentally identified transcriptional start sites. As mentioned above, after excluding all the transcription start sites which are less than 500 nt apart, the dataset contains 227 promoters. With the help of TSS information, promoter sequences were extracted from <it>Escherichia coli </it>genome sequence (NCBI accession no: NC_000913).</p>
            </sec>
            <sec>
               <st>
                  <p>b) <it>Bacillus subtilis </it>promoter sequences</p>
               </st>
               <p>The transcription start sites for <it>Bacillus subtilis </it>promoters were obtained from the DBTBS database <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Sequences of the required length around the transcription start sites were extracted from the <it>Bacillus </it>genome sequence (NCBI accession no: NC_000964).</p>
            </sec>
            <sec>
               <st>
                  <p>c) <it>Corynebacterium glutamicum </it>promoter sequences</p>
               </st>
               <p>Analysis of <it>Corynebacterium glutamicum </it>promoters was carried out on a set of promoters compiled by P&#225;tek <it>et al. </it><abbrgrp><abbr bid="B33">33</abbr></abbrgrp> based on experimentally determined transcription start sites.</p>
            </sec>
            <sec>
               <st>
                  <p>d) RNA promoter sequences</p>
               </st>
               <p>The transcription start positions of RNA transcription units were obtained from the EcoCyc database. This set includes both computer-predicted and experimentally determined transcription start sites. In total, we have 7 rRNA TUs, 33 tRNA TUs and 2 TUs of other RNAs.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Free energy calculation</p>
            </st>
            <p>The stability of a DNA molecule can be expressed in terms of free energy. The standard free energy change (&#916;G<sup>o</sup><sub>37</sub>) corresponding to the melting transition of a DNA molecule 'n' nucleotides (or 'n-1' dinucleotides) long, from double strand to single strand, is calculated as follows:</p>
            <p>
               <graphic file="1471-2105-6-1-i2.gif"/>
            </p>
            <p>where,</p>
            <p>&#916;G<sup>o</sup><sub>ini </sub>is the free energy of duplex initiation.</p>
            <p>&#916;G<sup>o</sup><sub>sym </sub>equals +0.43 kcal/mol and is applicable if the duplex is self-complementary.</p>
            <p>&#916;G<sup>o</sup><sub>i,j </sub>is the standard free energy change for the dinucleotide of type ij.</p>
            <p>Since our analysis involves long continuous stretches of DNA molecules, in our calculation we did not consider the two terms, &#916;G<sup>o</sup><sub>ini </sub>and &#916;G<sup>o</sup><sub>sym</sub>, which are more relevant for oligonucleotides. In the present calculation, each promoter sequence is divided into overlapping windows of 15 base pairs (or 14 dinucleotide steps). For each window, the free energy is calculated as given in the above equation and the energy value is assigned to the first base pair in the window. The energy values corresponding to the 10 unique dinucleotide sequences are taken from the unified parameters proposed recently <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>.</p>
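            <p>The windowed calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' program. The table holds nearest-neighbour values (kcal/mol) as commonly quoted for the unified parameter set and should be checked against the cited sources before use. The function names are hypothetical, and the sign convention simply follows the table.</p>
            <p><![CDATA[
```python
# Free energies (kcal/mol, 37 C) for the 10 unique dinucleotide steps.
# ASSUMPTION: values as commonly quoted for the unified nearest-neighbour
# set; verify against the cited parameter papers before relying on them.
UNIQUE_STEP_DG = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def step_energy(step):
    """Energy of a dinucleotide step; the other 6 steps map to a reverse complement."""
    if step in UNIQUE_STEP_DG:
        return UNIQUE_STEP_DG[step]
    reverse_complement = COMPLEMENT[step[1]] + COMPLEMENT[step[0]]
    return UNIQUE_STEP_DG[reverse_complement]


def window_free_energy(sequence, window=15):
    """Sum of step energies for each overlapping window of `window` base pairs.

    Element i of the result belongs to the window starting at base pair i.
    Initiation and symmetry terms are omitted, as in the text. Bases other
    than A, C, G, T raise KeyError.
    """
    seq = sequence.upper()
    profile = []
    for i in range(len(seq) - window + 1):
        # A window of `window` base pairs contains window - 1 dinucleotide steps.
        profile.append(sum(step_energy(seq[j:j + 2]) for j in range(i, i + window - 1)))
    return profile
```
]]></p>
            <p>A 1000 nt sequence gives 986 values with the default window of 15 base pairs, and a sequence shorter than the window gives an empty profile.</p>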
         </sec>
         <sec>
            <st>
               <p>Statistical tests</p>
            </st>
            <sec>
               <st>
                  <p>a) Wilcoxon signed test for equality of medians</p>
               </st>
               <p>The free energy distribution at a given position in the 1000 nt <it>E. coli </it>sequences, ranging from -500 to +500, was compared with the distribution in a randomly selected set. For this comparison, we followed a procedure similar to that adopted by Margalit <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The random set was constructed by selecting one energy value per sequence arbitrarily, independent of its position in the sequence. The energy distributions were compared using the Wilcoxon signed test for equality of medians, a nonparametric test of whether two samples have equal medians.</p>
            </sec>
            <sec>
               <st>
                  <p>b) Two-sample Kolmogorov-Smirnov test</p>
               </st>
               <p>We compared the free energy distribution at position -20 (with respect to the TSS) with the distributions at positions -200 and +200 using the two-sample Kolmogorov-Smirnov test <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
               <p>All the calculations related to the statistical tests were carried out using MATLAB 6.0<sup>&#174;</sup>.</p>
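               <p>For readers without access to MATLAB, the two-sample Kolmogorov-Smirnov statistic can be computed with a few lines of Python. The sketch below is not the authors' code. It returns only the statistic D, the largest absolute difference between the two empirical distribution functions, and does not compute a p-value. The function name is hypothetical.</p>
               <p><![CDATA[
```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D (no p-value)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap between
    # them is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```
]]></p>
               <p>Identical samples give D = 0 and completely separated samples give D = 1. In the setting described above, the two samples would be the free energy values at two positions across the set of promoter sequences.</p>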
            </sec>
         </sec>
         <sec>
            <st>
               <p>Implementation and scoring of NNPP and Staden's method</p>
            </st>
            <p>Promoter predictions were also carried out using two other methods, <it>viz. </it>NNPP and Staden's method. The NNPP program is available at <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. All NNPP predictions were carried out at a score cut-off of 0.80.</p>
            <p>The implementation of Staden's method was carried out as described in <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B37">37</abbr></abbrgrp>. The weight matrix search was carried out with the help of PATSER program <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>.</p>
            <p>For NNPP as well as Staden's method, true and false positives were scored in the same way as for our method (Figure <figr fid="F3">3</figr>), with a prediction in the -150 to +50 region being considered a true prediction.</p>
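            <p>A minimal sketch of this scoring rule is given below. The details are assumptions made for illustration, since the text fixes only the -150 to +50 window: a sequence counts as correctly predicted if at least one prediction falls inside the window, and every prediction outside the window counts as one false positive. The function name and the input format (positions relative to the TSS) are hypothetical.</p>
            <p><![CDATA[
```python
def score_sequence(predicted_positions, window_start=-150, window_end=50):
    """Score the predictions made on one 1000 nt sequence.

    Returns (is_true_positive, false_positive_count). Positions are relative
    to the TSS and the window bounds are inclusive (an assumption).
    """
    inside = [p for p in predicted_positions if window_start <= p <= window_end]
    outside = [p for p in predicted_positions if not (window_start <= p <= window_end)]
    return bool(inside), len(outside)
```
]]></p>
            <p>For hypothetical predictions at -400, -30, 20 and 300, the sequence would be scored as a true positive with two false positives.</p>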
         </sec>
         <sec>
            <st>
               <p>Sensitivity and precision</p>
            </st>
            <p>The sensitivity and precision for the predictions are calculated using the following formulae:</p>
            <p>
               <graphic file="1471-2105-6-1-i3.gif"/>
            </p>
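            <p>Assuming the formulae follow the standard definitions, sensitivity = TP/(TP + FN) and precision = TP/(TP + FP), they can be computed as in the sketch below. The function names are hypothetical, and the counts in the note that follows are illustrative only, not results from this study.</p>
            <p><![CDATA[
```python
def sensitivity(true_positives, false_negatives):
    """Fraction of known promoters that were predicted: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives, false_positives):
    """Fraction of predictions that were correct: TP / (TP + FP)."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```
]]></p>
            <p>For example, 90 true positives with 10 false negatives and 30 false positives would give a sensitivity of 0.9 and a precision of 0.75.</p>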
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>AK performed the analysis, evaluated the results, and drafted the manuscript. MB suggested the problem, helped with the evaluation of the results and the manuscript, and provided mentorship. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>During the study, AK was supported by the University Grants Commission and the Council of Scientific and Industrial Research. We thank Prof. N. V. Joshi for his valuable comments. We also thank Dr Miroslav P&#225;tek for the <it>Corynebacterium </it>promoter sequences. We are grateful to the two anonymous referees for their suggestions.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>The gene identification problem: An overview for developers</p>
            </title>
            <aug>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Comput Chem</source>
            <pubdate>1996</pubdate>
            <volume>20</volume>
            <fpage>103</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">16749184</pubid>
                  <pubid idtype="doi">10.1016/S0097-8485(96)80012-X</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Computational methods for the identification of genes in vertebrate genomic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Claverie</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Hum Mol Genet</source>
            <pubdate>1997</pubdate>
            <volume>6</volume>
            <fpage>1735</fpage>
            <lpage>1744</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/hmg/6.10.1735</pubid>
                  <pubid idtype="pmpid" link="fulltext">9300666</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Gene-finding approaches for eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Stormo</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>394</fpage>
            <lpage>397</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.10.4.394</pubid>
                  <pubid idtype="pmpid" link="fulltext">10779479</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Current methods of gene prediction, their strength and weaknesses</p>
            </title>
            <aug>
               <au>
                  <snm>Math&#233;</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sagot</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Schiex</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rouz&#233;</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>4103</fpage>
            <lpage>4117</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">140543</pubid>
                  <pubid idtype="pmpid" link="fulltext">12364589</pubid>
                  <pubid idtype="doi">10.1093/nar/gkf543</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Computational prediction of eukaryotic protein-coding genes</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>MQ</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>698</fpage>
            <lpage>709</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg890</pubid>
                  <pubid idtype="pmpid" link="fulltext">12209144</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Eukaryotic promoter recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Hatzigeorgiou</snm>
                  <fnm>AG</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1997</pubdate>
            <volume>7</volume>
            <fpage>861</fpage>
            <lpage>878</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9314492</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Computational approaches to identify promoters and cis-regulatory elements in plant genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Rombauts</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Florquin</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lescot</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marchal</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Rouze</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>2003</pubdate>
            <volume>132</volume>
            <fpage>1162</fpage>
            <lpage>1176</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">167057</pubid>
                  <pubid idtype="pmpid" link="fulltext">12857799</pubid>
                  <pubid idtype="doi">10.1104/pp.102.017715</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>The state of the art of mammalian promoter recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Werner</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>22</fpage>
            <lpage>30</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-4-22</pubid>
                  <pubid idtype="pmpid" link="fulltext">12715831</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The biology of eukaryotic promoter prediction &#8211; a review</p>
            </title>
            <aug>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Chauvin</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Comput Chem</source>
            <pubdate>1999</pubdate>
            <volume>23</volume>
            <fpage>191</fpage>
            <lpage>207</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0097-8485(99)00015-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">10404615</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Identification of additional 'punctuation marks' in genomic DNA [abstract]</p>
            </title>
            <aug>
               <au>
                  <snm>Kanhere</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bansal</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>In proceedings of 10th congress of FAOBMB: Bangalore</source>
            <fpage>139</fpage>
            <note>7&#8211;11 December 2003</note>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Helix stability in prokaryotic promoter regions</p>
            </title>
            <aug>
               <au>
                  <snm>Margalit</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shapiro</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Nussinov</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Owens</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jernigan</snm>
                  <fnm>RL</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>1988</pubdate>
            <volume>27</volume>
            <fpage>5179</fpage>
            <lpage>5188</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi00414a035</pubid>
                  <pubid idtype="pmpid">3167040</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A DNA structural atlas for Escherichia coli</p>
            </title>
            <aug>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Jensen</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Staerfeldt</snm>
                  <fnm>HH</fnm>
               </au>
               <au>
                  <snm>Ussery</snm>
                  <fnm>DW</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2000</pubdate>
            <volume>299</volume>
            <fpage>907</fpage>
            <lpage>930</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2000.3787</pubid>
                  <pubid idtype="pmpid" link="fulltext">10843847</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>DNA dynamically directs its own transcription initiation</p>
            </title>
            <aug>
               <au>
                  <snm>Choi</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Kalosakas</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rasmussen</snm>
                  <fnm>KO</fnm>
               </au>
               <au>
                  <snm>Hiromura</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bishop</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Usheva</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>1584</fpage>
            <lpage>1590</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">390311</pubid>
                  <pubid idtype="pmpid" link="fulltext">15004245</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh335</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Computer analysis and recognition of Drosophila melanogaster gene promoters</p>
            </title>
            <aug>
               <au>
                  <snm>Levitskii</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Katokhin</snm>
                  <fnm>AV</fnm>
               </au>
            </aug>
            <source>Mol Biol (Mosk)</source>
            <pubdate>2001</pubdate>
            <volume>35</volume>
            <fpage>970</fpage>
            <lpage>978</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11771144</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Determination of common structural features in <it>Escherichia coli </it>promoters by computer analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Lisser</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Margalit</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Eur J Biochem</source>
            <pubdate>1994</pubdate>
            <volume>223</volume>
            <fpage>823</fpage>
            <lpage>830</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1432-1033.1994.tb19058.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">8055959</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Discriminant analysis of promoter regions in <it>Escherichia coli </it>sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Nakata</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kanehisa</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Maizel</snm>
                  <fnm>JV</fnm>
                  <suf>Jr</suf>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1988</pubdate>
            <volume>4</volume>
            <fpage>367</fpage>
            <lpage>371</lpage>
            <xrefbib>
               <pubid idtype="pmpid">3046714</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A relationship between DNA helix stability and recognition sites for RNA polymerase</p>
            </title>
            <aug>
               <au>
                  <snm>Vollenweider</snm>
                  <fnm>HJ</fnm>
               </au>
               <au>
                  <snm>Fiandt</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Szybalski</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1979</pubdate>
            <volume>205</volume>
            <fpage>508</fpage>
            <lpage>511</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.377494</pubid>
                  <pubid idtype="pmpid" link="fulltext">377494</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Predicting DNA duplex stability from the base sequence</p>
            </title>
            <aug>
               <au>
                  <snm>Breslauer</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Blocker</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Marky</snm>
                  <fnm>LA</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1986</pubdate>
            <volume>83</volume>
            <fpage>3746</fpage>
            <lpage>3750</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">323600</pubid>
                  <pubid idtype="pmpid" link="fulltext">3459152</pubid>
                  <pubid idtype="doi">10.1073/pnas.83.11.3746</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Application of a time-delay neural network to promoter annotation in the <it>Drosophila melanogaster </it>genome</p>
            </title>
            <aug>
               <au>
                  <snm>Reese</snm>
                  <fnm>MG</fnm>
               </au>
            </aug>
            <source>Comput Chem</source>
            <pubdate>2001</pubdate>
            <volume>26</volume>
            <fpage>51</fpage>
            <lpage>56</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0097-8485(01)00099-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">11765852</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>NNPP</p>
            </title>
            <url>http://www.fruitfly.org/seq_tools/promoter.html</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Computer methods to locate signals in nucleic acid sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Staden</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1984</pubdate>
            <volume>12</volume>
            <fpage>505</fpage>
            <lpage>519</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">321067</pubid>
                  <pubid idtype="pmpid" link="fulltext">6364039</pubid>
                  <pubid idtype="doi">10.1093/nar/12.1Part2.505</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p><it>Escherichia coli </it>promoter sequences predict in vitro RNA polymerase selectivity</p>
            </title>
            <aug>
               <au>
                  <snm>Mulligan</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Hawley</snm>
                  <fnm>DK</fnm>
               </au>
               <au>
                  <snm>Entriken</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>McClure</snm>
                  <fnm>WR</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1984</pubdate>
            <volume>12</volume>
            <fpage>789</fpage>
            <lpage>800</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">321093</pubid>
                  <pubid idtype="pmpid" link="fulltext">6364042</pubid>
                  <pubid idtype="doi">10.1093/nar/12.1Part2.789</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Application of a new method of pattern recognition in DNA sequence analysis: a study of <it>E. coli </it>promoters</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>NN</fnm>
               </au>
               <au>
                  <snm>Mironov</snm>
                  <fnm>AA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1990</pubdate>
            <volume>18</volume>
            <fpage>1847</fpage>
            <lpage>1852</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">330605</pubid>
                  <pubid idtype="pmpid" link="fulltext">2186368</pubid>
                  <pubid idtype="doi">10.1093/nar/18.7.1847</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Sigma70 promoters in <it>Escherichia coli</it>: specific transcription in dense regions of overlapping promoter-like signals</p>
            </title>
            <aug>
               <au>
                  <snm>Huerta</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Collado-Vides</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2003</pubdate>
            <volume>333</volume>
            <fpage>261</fpage>
            <lpage>278</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2003.07.017</pubid>
                  <pubid idtype="pmpid" link="fulltext">14529615</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Improving promoter prediction for the NNPP2.2 algorithm: a case study using <it>E. coli </it>DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Burden</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>YX</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <inpress/>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15454410</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Predicting bacterial transcription units using sequence and expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Bockhorst</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Qiu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Glasner</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Blattner</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Craven</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>Suppl 1</issue>
            <fpage>i34</fpage>
            <lpage>43</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg1003</pubid>
                  <pubid idtype="pmpid" link="fulltext">12855435</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Niemann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Liao</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>Suppl 1</issue>
            <fpage>S199</fpage>
            <lpage>206</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11473010</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>The prediction of vertebrate promoter regions using differential hexamer frequency analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Hutchinson</snm>
                  <fnm>GB</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1996</pubdate>
            <volume>12</volume>
            <fpage>391</fpage>
            <lpage>398</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8996787</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p><it>Escherichia coli </it>RNA genes at NCBI</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/genomes/rnatab.cgi?gi=115&amp;db=Genome</url>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Novel small RNA-encoding genes in the intergenic regions of <it>Escherichia coli</it></p>
            </title>
            <aug>
               <au>
                  <snm>Argaman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Hershberg</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Vogel</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>EG</fnm>
               </au>
               <au>
                  <snm>Margalit</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Altuvia</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Biol</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>941</fpage>
            <lpage>950</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0960-9822(01)00270-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">11448770</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>PromEC: An updated database of <it>Escherichia coli </it>mRNA promoters with experimentally identified transcriptional start sites</p>
            </title>
            <aug>
               <au>
                  <snm>Hershberg</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bejerano</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Santos-Zavaleta</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Margalit</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>277</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29777</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125111</pubid>
                  <pubid idtype="doi">10.1093/nar/29.1.277</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>DBTBS: database of transcriptional regulation in <it>Bacillus subtilis </it>and its contribution to comparative genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Makita</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nakao</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ogasawara</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Nakai</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>Database</issue>
            <fpage>D75</fpage>
            <lpage>77</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14681362</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh074</pubid>
                  <pubid idtype="pmcid">308808</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Promoters of <it>Corynebacterium glutamicum</it></p>
            </title>
            <aug>
               <au>
                  <snm>P&#225;tek</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Nesvera</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Guyonvarch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Reyes</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Leblon</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Biotechnol</source>
            <pubdate>2003</pubdate>
            <volume>104</volume>
            <fpage>311</fpage>
            <lpage>323</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-1656(03)00155-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">12948648</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics</p>
            </title>
            <aug>
               <au>
                  <snm>SantaLucia</snm>
                  <fnm>J</fnm>
                  <suf>Jr</suf>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>1460</fpage>
            <lpage>1465</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">19045</pubid>
                  <pubid idtype="pmpid" link="fulltext">9465037</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.4.1460</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Thermodynamics and NMR of internal G&#183;T mismatches in DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Allawi</snm>
                  <fnm>HT</fnm>
               </au>
               <au>
                  <snm>SantaLucia</snm>
                  <fnm>J</fnm>
                  <suf>Jr</suf>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>1997</pubdate>
            <volume>36</volume>
            <fpage>10581</fpage>
            <lpage>10594</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi962590c</pubid>
                  <pubid idtype="pmpid" link="fulltext">9265640</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Proof without prejudice: use of the Kolmogorov-Smirnov test for the analysis of histograms from flow systems and other sources</p>
            </title>
            <aug>
               <au>
                  <snm>Young</snm>
                  <fnm>IT</fnm>
               </au>
            </aug>
            <source>J Histochem Cytochem</source>
            <pubdate>1977</pubdate>
            <volume>25</volume>
            <fpage>935</fpage>
            <lpage>941</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">894009</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p><it>Escherichia coli </it>promoter sequences: analysis and prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Hertz</snm>
                  <fnm>GZ</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Methods Enzymol</source>
            <pubdate>1996</pubdate>
            <volume>273</volume>
            <fpage>30</fpage>
            <lpage>42</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8791597</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Identifying DNA and protein patterns with statistically significant alignments of multiple sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Hertz</snm>
                  <fnm>GZ</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>563</fpage>
            <lpage>577</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.7.563</pubid>
                  <pubid idtype="pmpid" link="fulltext">10487864</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Regulatory sequence analysis tools</p>
            </title>
            <aug>
               <au>
                  <snm>van Helden</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>3593</fpage>
            <lpage>3596</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">168973</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824373</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg567</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
