<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-109</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the <it>Drosophila </it>genome: the fluffy-tail test</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Abnizova</snm>
               <fnm>Irina</fnm>
               <insr iid="I1"/>
               <email>irina.abnizova@mrc-bsu.cam.ac.uk</email>
            </au>
            <au id="A2">
               <snm>te Boekhorst</snm>
               <fnm>Rene</fnm>
               <insr iid="I2"/>
               <email>r.teboekhorst@herts.ac.uk</email>
            </au>
            <au id="A3">
               <snm>Walter</snm>
               <fnm>Klaudia</fnm>
               <insr iid="I1"/>
               <email>klaudia.walter@mrc-bsu.cam.ac.uk</email>
            </au>
            <au id="A4">
               <snm>Gilks</snm>
               <mi>R</mi>
               <fnm>Walter</fnm>
               <insr iid="I1"/>
               <email>wally.gilks@mrc-bsu.cam.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK</p>
            </ins>
            <ins id="I2">
               <p>Computer Science Department, University of Hertfordshire, College Lane, AL10 92BA, Hatfield Campus, UK</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>109</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/109</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15857505</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-109</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>17</day>
               <month>12</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>27</day>
               <month>4</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>27</day>
               <month>4</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Abnizova et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, <it>per se </it>. Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The transcription rate of genes is dictated primarily by interactions between DNA-binding transcription factors. Comparatively short sequences (several hundred to several thousand base pairs, depending on the species) upstream or downstream of the transcription start site often play a major role in the regulation of gene expression. Specific sites within such regions are recognized by regulatory proteins (transcription factors), which act upon binding as transcriptional repressors or activators, controlling the rate of transcription. The identification of regulatory regions, which are generally composed of dense clusters of target transcription factor binding sites, forms an essential step in understanding the regulatory interactions that govern the spatial and temporal expression of individual genes (see for example <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>) and genetic regulatory networks (see for example <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>).</p>
         <p>Ultimately, this task is accomplished experimentally using techniques such as empirical deletion analysis, direct binding measurements, and co-precipitation of protein-DNA complexes. However, experimental verification is expensive and time-consuming. Therefore, to address the growing volumes of available genomic sequence, a number of algorithms have been successfully developed that identify putative cis-regulatory modules and transcription factor binding sites using evolutionary comparisons, whole-genome data, and known descriptions of transcription factor binding sites. Regulatory regions of higher eukaryotes can be subdivided into proximal regulatory units &#8211; promoters &#8211; which are located close to and upstream of the gene, and distal transcription regulatory units called enhancers or cis-regulatory modules. The latter may be located far upstream or downstream of the target gene, and are much more difficult to recognise. In our work we focus on recognition of enhancers.</p>
         <p>Methods for recognising regulatory DNA may be divided into the following approaches:</p>
         <p>1. Recognition of regulatory DNA regions based on description of known transcription factor binding sites (TFBS). This approach exploits the clustering of known, cooperatively-acting transcription factors (TFs). Extracting clustered recognition motifs is one of the most reliable techniques, but is limited to the recognition of similarly regulated cis-regulatory regions. Among the most popular representatives of search by known TFBS are <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>2. Recognition of regulatory DNA based on phylogenetic foot-printing <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. Methods of this type assume that regulatory regions are highly conserved in cross-genomic comparison, and that conserved segments can be extracted from evolutionarily related genomes. Performance of phylogenetic foot-printing depends on the evolutionary distance between the given species and on the conservation level of individual genes. This is an actively progressing area, as more and more sequenced genomes appear. However, such an approach offers little information as to the specific function of the conserved sequences. Furthermore, it is still an open question as to how many genomes are sufficient for reliable extraction of regulatory regions.</p>
         <p>3. Methods based on the difference of local nucleotide composition between regulatory and non-regulatory DNA <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. It is assumed that this difference is due to the presence of multiple transcription signals, such as binding motifs for TFs in regulatory regions. The works <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp> are based on constructing a global interpolated Markov model, applied to promoter recognition only.</p>
         <p>In our method, we assume that the abundance of regulatory motifs within regulatory regions leaves a distinct "signature" in nucleotide composition, and that it is possible to capture this "signature" statistically. More specifically, we hypothesize that it takes the form of an over-representation of "similar words" (which are not simple repeats).</p>
         <p>The approach of looking for over-occurrence of words has also been widely used in motif discovery, but this is not our aim here. This over-representation of similar words should appear as outliers in the right tail of the distribution of similar word lists of variable length. The "fluffy-tail test", proposed in this paper, is designed to identify such outliers and is a useful technique when data from multiple genes and genomes are lacking. It may also be used as a complementary tool when such data are available.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>In this section, we first present our new statistical "fluffy-tail" test for measuring the overrepresentation of similar words, and then show its performance on experimentally verified sequence data.</p>
         <sec>
            <st>
               <p>Test bed</p>
            </st>
            <p>To demonstrate the power of our test, we need a positive, experimentally verified training set of regulatory sequence data, and also negative training sets of non-regulatory sequence data. We use three test beds. The positive training set is a collection of 60 experimentally verified functional <it>Drosophila melanogaster </it>regulatory regions <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. This set consists of cis-regulatory modules located far from gene coding sequences and transcription start sites. It contains many binding sites (and site clusters), the best known of which are bicoid, hunchback, Kruppel, knirps and caudal: sites involved in the regulation of developmental genes. The positive training set comprises about 68 kb of sequence data in total, and contains 58 clusters of the same type of TFBS (homotypic). The two negative training sets are: (i) 60 randomly picked <it>Drosophila </it>exons, and (ii) 60 randomly picked <it>Drosophila </it>non-coding, non-regulatory DNA sequences, for which we excluded exons and regions of length 1 kb upstream and downstream of genes, using the Ensembl Genome Browser <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Each training set contains 68 kb of sequence in total.</p>
         </sec>
         <sec>
            <st>
               <p>Estimation of distributions of similar words</p>
            </st>
            <p>To construct the distribution of similar words, we first need to specify the length of words under consideration. We try to mimic the TF core, which is the less variable part of a binding motif. Because the core of TFBSs is relatively short (around 3&#8211;5 bp) we considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary Materials [see Additional files <supplr sid="S1">1</supplr>, <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>, <supplr sid="S5">5</supplr>, <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr>, <supplr sid="S8">8</supplr>, <supplr sid="S9">9</supplr>, <supplr sid="S10">10</supplr>, <supplr sid="S11">11</supplr>, <supplr sid="S12">12</supplr>]). For each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length, so that each word is the "seed" of a list of similar words. Next, the number of (non-disjoint) lists containing n words is counted, for n = 1, 2, 3, ...</p>
            <p>(See Methods section for further details). As an example, the histogram of the distribution of similar 5-mer words is plotted in Figure <figr fid="F1">1</figr>. In this histogram, the Y axis represents the number of lists containing 1, 2, ..., n words and the X axis shows the number n of similar words in the list.</p>
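            <p>As an illustration of the counting step described above, the following Python sketch builds the list-size histogram for a single sequence. It is not the authors' implementation: the function names (hamming, similar_list_sizes, list_size_histogram) are hypothetical, and the sketch assumes that every overlapping window of the sequence is a seed, that a seed counts itself as a member of its own list, and that two words are "similar" when they differ by substitutions in at most the allowed number of positions. The Methods section gives the actual definitions.</p>
            <p><![CDATA[
```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_list_sizes(seq, k=5, max_mismatch=1):
    """For every k-mer window (seed), count the windows within max_mismatch of it.

    Assumptions: overlapping windows are used and each seed counts itself.
    """
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    occurrences = Counter(words)  # identical words share one neighbourhood count
    size_of = {
        w: sum(c for v, c in occurrences.items() if hamming(w, v) <= max_mismatch)
        for w in occurrences
    }
    return [size_of[w] for w in words]


def list_size_histogram(seq, k=5, max_mismatch=1):
    """Map list size n to the number of (non-disjoint) lists containing n words."""
    return Counter(similar_list_sizes(seq, k, max_mismatch))


if __name__ == "__main__":
    toy = "ACGTACGTTACGAACGTTTGCA"  # made-up sequence, for illustration only
    sizes = similar_list_sizes(toy)
    print("largest list size:", max(sizes))
    print(sorted(list_size_histogram(toy).items()))
```
]]></p>
            <p>Grouping identical words with a counter before comparing them keeps the number of pairwise comparisons bounded by the number of distinct words (at most 1024 for 5-mers) rather than by the sequence length.</p>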
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p>Contains short introduction and notation for Supplementary Material</p>
               </text>
               <file name="1471-2105-6-109-S1.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S2">
               <title>
                  <p>Additional File 2</p>
               </title>
               <text>
                  <p>Contains Supplementary Table 1 with results of the fluffy-tail test and coefficients of variation for some more experimentally verified regulatory regions from species other than fruit fly.</p>
               </text>
               <file name="1471-2105-6-109-S2.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S3">
               <title>
                  <p>Additional File 3</p>
               </title>
               <text>
                  <p>Contains a visual example of F dependence on the number of randomisations r.</p>
               </text>
               <file name="1471-2105-6-109-S3.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S4">
               <title>
                  <p>Additional File 4</p>
               </title>
               <text>
                  <p>Gives some more details about spatial clustering threshold</p>
               </text>
               <file name="1471-2105-6-109-S4.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S5">
               <title>
                  <p>Additional File 5</p>
               </title>
               <text>
                  <p>Shows some examples of the consistency of fluffiness for different word lengths, in table form.</p>
               </text>
               <file name="1471-2105-6-109-S5.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S6">
               <title>
                  <p>Additional File 6</p>
               </title>
               <text>
                  <p>Shows some examples of the consistency of fluffiness for different word lengths, in histogram form.</p>
               </text>
               <file name="1471-2105-6-109-S6.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S7">
               <title>
                  <p>Additional File 7</p>
               </title>
               <text>
                  <p>Consistent fluffiness and coefficient of variation for spatial cluster size for some example sequences.</p>
               </text>
               <file name="1471-2105-6-109-S7.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S8">
               <title>
                  <p>Additional File 8</p>
               </title>
               <text>
                  <p>Contains the Figures showing fluffiness and spatial clustering of similar words for NCNR 3L4 region.</p>
               </text>
               <file name="1471-2105-6-109-S8.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S9">
               <title>
                  <p>Additional File 9</p>
               </title>
               <text>
                  <p>Contains the Figures showing fluffiness and spatial clustering of similar words for NCNR repeat-masked 3L4 region.</p>
               </text>
               <file name="1471-2105-6-109-S9.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S10">
               <title>
                  <p>Additional File 10</p>
               </title>
               <text>
                  <p>Contains the Figures showing fluffiness and spatial clustering of similar words for knirps regulatory region</p>
               </text>
               <file name="1471-2105-6-109-S10.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S11">
               <title>
                  <p>Additional File 11</p>
               </title>
               <text>
                  <p>Contains the Figures showing fluffiness and spatial clustering of similar words for abdominal-A regulatory region.</p>
               </text>
               <file name="1471-2105-6-109-S11.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S12">
               <title>
                  <p>Additional File 12</p>
               </title>
               <text>
                  <p>Contains the Figures showing fluffiness and spatial clustering of similar words for internal exon 2r4.</p>
               </text>
               <file name="1471-2105-6-109-S12.doc">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Histogram of similar words for the knirps cis-regulatory module</p>
               </caption>
               <text>
                  <p><b>Histogram of similar words for the knirps cis-regulatory module</b>. An example of a distribution of similar 5-mer words for the knirps cis-regulatory module of <it>Drosophila melanogaster</it>. Note that the sequence contains an exceptionally large number (37) of lists with an exceptionally large number (137) of similar words. The Y axis shows the number of lists; the X axis shows list size.</p>
               </text>
               <graphic file="1471-2105-6-109-1"/>
            </fig>
            <p>From this plot it can be seen that most lists contain 10 to 40 words, but there are outliers: some very large lists form a long, "fluffy" tail. We call a list having the largest size the <ul>maximal similar word list</ul> (MSWL). If the original sequence is characterized by the presence of an unusually high number of over-represented words, we expect it to contain more long lists in comparison to a random sequence.</p>
            <p>To sample such a random distribution we shuffled the given sequence of original data 50 times. For each randomisation we assessed the frequency distribution of similar words. Figure <figr fid="F2">2</figr> shows a typical example of the distribution of similar words for one of the randomly shuffled sequences of the same (knirps) cis-regulatory module as in Figure <figr fid="F1">1</figr>. Compared with the distribution of the original data (Figure <figr fid="F1">1</figr>), the randomised sequence in Figure <figr fid="F2">2</figr> lacks a heavy, "fluffy" right tail. Figure <figr fid="F3">3</figr> shows the difference between original and randomised similar word distributions in cumulative form. The difference between the two curves reflects the fluffy right tail of the original data.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Histogram of similar words for the knirps cis-regulatory module, after shuffling</p>
               </caption>
               <text>
                  <p><b>Histogram of similar words for the knirps cis-regulatory module, after shuffling</b>. The frequency distribution of similar words for one randomly shuffled version of the knirps cis-regulatory region, <b><it>Drosophila melanogaster </it></b>. The Y axis shows the number of lists, the X axis is for list size.</p>
               </text>
               <graphic file="1471-2105-6-109-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Cumulative histograms</p>
               </caption>
               <text>
                  <p><b>Cumulative histograms</b>. Cumulative histograms for the data in Figures 1 and 2: solid line: original data from Figure 1, dotted line: randomised data from Figure 2. The X axis shows the size of lists of similar words, the Y axis is the number of lists.</p>
               </text>
               <graphic file="1471-2105-6-109-3"/>
            </fig>
            <p>In Figure <figr fid="F4">4</figr>, ten randomised sequences are plotted as dotted contours together with the histogram of the original regulatory knirps data (solid). The cumulative histogram for original (solid) and randomised (dotted) sequences is shown in Figure <figr fid="F4">4</figr> (right). All dotted tails are shorter than the solid one, indicating the statistical significance of the solid tail.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Fluffy-tailed knirps distribution</p>
               </caption>
               <text>
                  <p><b>Fluffy-tailed knirps distribution</b>. (Left) The distribution of the original regulatory knirps sequence: (solid line); the distribution of 10 randomised sequences (dotted lines). (Right) The same distributions in cumulative form. The X axis shows the size of lists of similar words, the Y axis is the number of lists.</p>
               </text>
               <graphic file="1471-2105-6-109-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Definition of the fluffiness coefficient F</p>
            </st>
            <p>To measure how strongly the distribution of similar words in regulatory regions deviates from randomness, we introduce a "fluffiness" coefficient F:</p>
            <p><graphic file="1471-2105-6-109-i1.gif"/></p>
            <p>where <it>L </it><sub>max,<it>original </it></sub>is the number of words in the maximal similar word list (MSWL) in the original sequence, and <graphic file="1471-2105-6-109-i2.gif"/> and &#963;<sub><it>r </it></sub>are the mean and standard deviation of the MSWL size over the <it>r </it>shuffled sequences. Here we call a sequence "random" if it is obtained from the original sequence by shuffling it, preserving its single-nucleotide composition. We will omit the subscript <it>r </it>from <it>F</it><sub><it>r </it></sub>in the rest of the paper for simplicity.</p>
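            <p>The equation itself is available here only as an image (1471-2105-6-109-i1.gif). Read together with the symbol definitions that accompany it, and with the description of F as a measure of signal against noise, it presumably has the following form. This is a reconstruction from the surrounding text, not a transcription of the image, and the bar notation for the mean is an assumption.</p>
            <p><![CDATA[
```latex
F_r = \frac{L_{\max,\mathrm{original}} - \bar{L}_{\max,r}}{\sigma_r}
```
]]></p>
            <p>As a rough consistency check, taking the largest list size of 137 reported for knirps in the caption of Figure 1 and the F and &#963;<sub><it>r </it></sub>values in Table 1, this form implies shuffled means of roughly 58 to 61 words for all three choices of r, assuming Table 1 uses the same 5-mer, 1-mismatch setting.</p>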
            <p>One can regard F as measuring the difference between signal and noise, where the signal is taken from the original sequence, and the noise from the randomised sequences with the same composition and length. Because each original sequence is compared only with shuffled sequences of the same length and composition, the fluffiness coefficient is normalised for both, and one can therefore compare the fluffiness F of sequences that differ in base composition and length.</p>
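            <p>The signal-against-noise reading of F can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it takes F to be the MSWL size of the original sequence minus the mean MSWL size of the shuffled sequences, divided by their standard deviation; it uses the sample standard deviation (the text does not say which estimator was used); and the names mswl_size and fluffiness are hypothetical. Shuffling is done by permuting the letters, which keeps length and single-nucleotide composition fixed.</p>
            <p><![CDATA[
```python
import random
import statistics
from collections import Counter


def mswl_size(seq, k=5, max_mismatch=1):
    """Size of the largest list of k-mer windows lying within max_mismatch
    substitutions of a single seed word (the seed counts itself)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    best = 0
    for seed in counts:
        size = sum(
            c for word, c in counts.items()
            if sum(a != b for a, b in zip(seed, word)) <= max_mismatch
        )
        best = max(best, size)
    return best


def fluffiness(seq, r=50, k=5, max_mismatch=1, seed=None):
    """Return (F, mean, sd) for seq compared with r shuffled versions of itself."""
    rng = random.Random(seed)
    original = mswl_size(seq, k, max_mismatch)
    shuffled_sizes = []
    for _ in range(r):
        letters = list(seq)
        rng.shuffle(letters)  # a permutation keeps length and base counts fixed
        shuffled_sizes.append(mswl_size("".join(letters), k, max_mismatch))
    mean = statistics.mean(shuffled_sizes)
    sd = statistics.stdev(shuffled_sizes)  # sample SD: an assumption
    if sd == 0:
        raise ValueError("shuffled MSWL sizes do not vary; F is undefined")
    return (original - mean) / sd, mean, sd


if __name__ == "__main__":
    toy = "TTGACCTAATCCGGATTAGCATAATCCTTGCAGATTACG"  # made-up sequence
    f, mean, sd = fluffiness(toy, r=50, seed=0)
    print(f"F = {f:.2f} (shuffled mean {mean:.2f}, sd {sd:.2f})")
```
]]></p>
            <p>Fixing the random seed makes a run repeatable, which is convenient when checking how F changes with the number of randomisations r.</p>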
         </sec>
         <sec>
            <st>
               <p>Results for regulatory regions</p>
            </st>
            <p>Figure <figr fid="F5">5</figr> shows the distribution of fluffiness coefficient F for regulatory, coding and non-coding non-regulatory (NCNR) DNA. For each sequence we generated r = 50 shuffled versions to calculate F. One can see that F = 2 distinguishes regulatory DNA from other types of DNA, so we use F = 2 as a threshold: a sequence with F>2 is declared to have a "fluffy" tail. Moreover, we found that for each regulatory region having F>2, all the randomised sequences had a shorter tail. This threshold is reasonably robust: if we vary it a little around F = 2, we still get a fair separation.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Histograms for regulatory (green), coding (cyan) and NCNR (magenta) sequences</p>
               </caption>
               <text>
                  <p><b>Histograms for regulatory (green), coding (cyan) and NCNR (magenta) sequences</b>. The word length is 5, mismatch is 1, r is 50. The X axis shows the fluffiness coefficient F, the Y axis is the number of sequences in the set with this F.</p>
               </text>
               <graphic file="1471-2105-6-109-5"/>
            </fig>
            <p>Our choice of r = 50 shuffled versions for each sequence allows us to obtain reliable estimates of the fluffiness coefficient F while keeping the computation time reasonable. Table <tblr tid="T1">1</tblr> shows that F is somewhat unstable for smaller r for the knirps regulatory region. However, for each choice of r, F clearly exceeds the threshold value 2 in this example. See Supplementary Materials for more detailed descriptions [see Additional files <supplr sid="S1">1</supplr>, <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>, <supplr sid="S5">5</supplr>, <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr>, <supplr sid="S8">8</supplr>, <supplr sid="S9">9</supplr>, <supplr sid="S10">10</supplr>, <supplr sid="S11">11</supplr>, <supplr sid="S12">12</supplr>].</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Sensitivity of F to choice of r, the number of randomisations, for the knirps regulatory region.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>r</p>
                     </c>
                     <c ca="left">
                        <p>F</p>
                     </c>
                     <c ca="left">
                        <p>&#963;<sub><it>r </it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>25</p>
                     </c>
                     <c ca="left">
                        <p>14.7</p>
                     </c>
                     <c ca="left">
                        <p>5.39</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>50</p>
                     </c>
                     <c ca="left">
                        <p>8.65</p>
                     </c>
                     <c ca="left">
                        <p>8.77</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>100</p>
                     </c>
                     <c ca="left">
                        <p>10.22</p>
                     </c>
                     <c ca="left">
                        <p>7.56</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>Using the methodology described above, we found that 51 out of 60 regulatory regions (85%) analysed in our positive training set exhibit a significant "fluffy-tail" pattern (see Table <tblr tid="T2">2</tblr>). The non-detection of the remaining "non-fluffy" regulatory regions could perhaps be partly due to the limited power of experimental deletion analyses to correctly distinguish the boundaries of the cis-regulatory modules.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>"Fluffiness" predictions for three types of functional region, showing the number of fluffy (F>2) sequences, the number of non-fluffy (F&lt;2) sequences and the corresponding positive and negative prediction rates, for each type of region.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Functional type</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Fluffy tails (F>2)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>No fluffy tails (F&lt;2)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Positive rate</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Negative rate</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Regulatory regions</p>
                     </c>
                     <c ca="left">
                        <p>51</p>
                     </c>
                     <c ca="left">
                        <p>9</p>
                     </c>
                     <c ca="left">
                        <p>85 %</p>
                     </c>
                     <c ca="left">
                        <p>15 %</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Exons</p>
                     </c>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>59</p>
                     </c>
                     <c ca="left">
                        <p>1.6 %</p>
                     </c>
                     <c ca="left">
                        <p>98.4 %</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-coding presumed non-regulatory</p>
                     </c>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p>50</p>
                     </c>
                     <c ca="left">
                        <p>16 %</p>
                     </c>
                     <c ca="left">
                        <p>84 %</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>We calculated the distribution of the fluffiness coefficient F for our one positive and two negative training sets, and used it to quantify the separation of regulatory DNA from coding and non-coding, non-regulatory DNA. A Kruskal-Wallis test showed that these regions differ significantly in the magnitude of F (H = 132.81, N = 180, df = 2, p = 0.00001), with exons and non-coding non-regulatory DNA having much lower F-values than regulatory regions (see Fig. <figr fid="F6">6</figr>).</p>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Non-coding presumed non-regulatory sequence before and after repeat-masking</p>
               </caption>
               <text>
                  <p><b>Non-coding presumed non-regulatory sequence before and after repeat-masking</b>. Results for a non-coding, non-regulatory sequence randomly picked from chromosome 3L. Panels (a,b,c) show results before repeat-masking; panels (d,e,f) show results after repeat-masking. Panels (a,d) show histograms of similar words (solid: original data; dotted: after random shuffling) as in Figure 1; panels (b,e) show the same data in cumulative form as in Figure 3; panels (c,f) show start locations of similar words as in Figure 7.</p>
               </text>
               <graphic file="1471-2105-6-109-11"/>
            </fig>
            <p>We now turn to examine the location of similar words in the MSWL for a given sequence.</p>
            <p>When the start positions of each of the words in the MSWL are plotted, they tend to be fairly uniformly scattered along the length of the sequence, as illustrated in Figure <figr fid="F7">7</figr>.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Spatial distribution of similar words in MSWL</p>
               </caption>
               <text>
                  <p><b>Spatial distribution of similar words in MSWL</b>. Fairly uniform spatial distribution of start locations for words in the MSWL (n = 137, see Fig. 1) of the knirps cis-regulatory region of <it>Drosophila melanogaster</it>. The X axis shows the start position of each word in the sequence; the Y axis shows the rank of this position in the list.</p>
               </text>
               <graphic file="1471-2105-6-109-7"/>
            </fig>
            <p>We now examine the relationship between the MSWL and predicted TFBS. Most MSWLs were significantly enriched in TFBS occurrences recorded in databases: when submitted to the Transfac and Jaspar TFBS databases, the "seed" words for MSWLs exhibited 10&#8211;20 fold enrichment with putative TFBS in comparison with all 5-mer words within the given regulatory region. Thus, for the most part, these "seed" words turned out to be instances of known TFBS (results not shown here).</p>
         </sec>
         <sec>
            <st>
               <p>Results for exons</p>
            </st>
            <p>We repeated the fluffy tail test for randomly picked <it>Drosophila </it>exons, and found that the distributions of over-represented words of the original sequences did not differ statistically from those of their randomised versions (see Table <tblr tid="T2">2</tblr>). Note the absence of a "fluffy tail" in Figure <figr fid="F8">8</figr> (left) and the lack of distinction in the cumulative distribution (Figure <figr fid="F8">8</figr> right).</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Histogram for exon <it>cg3201 </it>3</p>
               </caption>
               <text>
                  <p><b>Histogram for exon <it>cg3201 </it>3</b>. Distribution of similar words for the exon <it>cg3201 </it>3 of <it>Drosophila </it>(solid line) compared to the histograms of the randomly shuffled versions (dotted lines) in direct (left) and cumulative (right) forms. The X axis shows the size of lists of similar words, the Y axis is the number of lists.</p>
               </text>
               <graphic file="1471-2105-6-109-8"/>
            </fig>
            <p>Thus we have established a statistical difference between exons and regulatory DNA. Next we compare regulatory DNA with non-coding non-regulatory DNA.</p>
         </sec>
         <sec>
            <st>
               <p>Results for non-coding, presumed non-regulatory DNA</p>
            </st>
            <p>The similar words distribution for non-coding non-regulatory DNA typically shows one of two patterns: (1) no significant tail, as for exons, or (2) a significant tail (Figure <figr fid="F9">9</figr>). In the second case &#8211; in contrast to the regulatory sequences &#8211; the spatial locations of over-represented words are typically clustered (Figure <figr fid="F11">11c</figr>).</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Histogram for non-coding presumed non-regulatory sequence</p>
               </caption>
               <text>
                  <p><b>Histogram for non-coding presumed non-regulatory sequence</b>. The distribution of similar words for a non-coding, non-regulatory sequence, randomly picked from chromosome 3L, has a significant tail because of simple repeats. The X axis shows the size of lists of similar words, the Y axis is the number of lists.</p>
               </text>
               <graphic file="1471-2105-6-109-9"/>
            </fig>
            <p>To deal with this, we developed a measure of spatial clustering of similar words. We say that two words <it>w </it><sub>1 </sub>and <it>w </it><sub>2 </sub>belong to the same cluster, if their genomic start positions <it>s </it><sub>1 </sub>and <it>s </it><sub>2 </sub>satisfy |<it>s </it><sub>1 </sub>- <it>s </it><sub>2</sub>| &#8804; <it>m</it>&#183;<it>k </it>, where m is the word length, and k is a constant. We examined the following choices for k: 1; 1.5; 2; 2.5; 3.</p>
            <p>The size of a cluster is defined as the number of words in the cluster. For each MSWL we computed the coefficient of variation (CV) in cluster sizes, where CV is the standard deviation divided by the mean cluster size. We used analysis of variance to test for differences in coefficients of variation among four types of functional DNA: exons, non-fluffy NCNR, fluffy NCNR and regulatory regions. The assumptions for ANOVA (homogeneity of variance (CV), no correlation between means and standard deviations of the samples) were satisfied. The results show a strongly significant difference between the four types: see Figure <figr fid="F10">10</figr>. Thus we can use the cluster size CV to distinguish fluffy NCNR from regulatory DNA. CVs for fluffy NCNR are almost always greater than 1 for k from 1 to 3, and differ significantly from CVs for regulatory DNA.</p>
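            <p>The cluster-size CV described above can be illustrated with a minimal Python sketch. It is not the authors' software. It rests on two assumptions that the text does not settle: clusters are formed by chaining neighbouring start positions (sorted along the sequence) whose gap is at most m&#183;k, and the standard deviation is the population standard deviation. The function names cluster_sizes and cluster_size_cv are hypothetical.</p>

```python
from statistics import mean, pstdev


def cluster_sizes(starts, m=5, k=1.0):
    """Return the sizes of spatial clusters of word start positions.

    Assumption: neighbouring starts (after sorting) are chained into one
    cluster whenever their gap does not exceed m * k.
    """
    threshold = m * k
    ordered = sorted(starts)
    if not ordered:
        return []
    sizes = [1]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            sizes.append(1)    # gap too large: open a new cluster
        else:
            sizes[-1] += 1     # close enough: extend the current cluster
    return sizes


def cluster_size_cv(starts, m=5, k=1.0):
    """Coefficient of variation (standard deviation / mean) of cluster sizes."""
    sizes = cluster_sizes(starts, m, k)
    if not sizes:
        raise ValueError("no start positions given")
    # Assumption: population standard deviation, so one cluster gives CV 0.
    return pstdev(sizes) / mean(sizes)


# Evenly scattered words form singleton clusters, so the CV is zero.
scattered_cv = cluster_size_cv([0, 100, 200, 300])
# One tight run of words next to isolated words gives a larger CV.
clumped_cv = cluster_size_cv([0, 3, 7, 10, 14, 200, 400])
```

            <p>In this toy input the scattered positions yield a CV of zero, whereas the clumped positions yield a CV above zero, which is the direction of the contrast between regulatory DNA and fluffy NCNR reported above.</p>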
            <p>We found that large clusters of adjacent over-represented words in fluffy NCNR DNA disappear after repeat-masking <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, thus revealing their identity as non-perfect simple repeats (Figure <figr fid="F11">11</figr>: compare panels a,b,c with d,e,f).</p>
            <p>For details about spatial clustering and illustration of coefficient of variation robustness to choice of <b>k </b>and <b>m</b>, see Supplementary Materials [see Additional files <supplr sid="S1">1</supplr>, <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>, <supplr sid="S5">5</supplr>, <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr>, <supplr sid="S8">8</supplr>, <supplr sid="S9">9</supplr>, <supplr sid="S10">10</supplr>, <supplr sid="S11">11</supplr>, <supplr sid="S12">12</supplr>].</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Our method allows us to distinguish regulatory DNA from non-regulatory DNA. In effect, our method aggregates many small signals contained in the region, and makes an internal comparison with background, represented by shuffled sequences.</p>
         <p>We would like to extend the application of our method to larger sets of experimentally verified regulatory regions, from <it>Drosophila </it>or any other species. Unfortunately, few experimentally (not computationally!) verified sets are available. We managed to extend our positive training set a little, including a few experimentally verified regulatory regions from human, chicken, sea urchin, fruit fly and yeast (see Supplementary Materials [see Additional files <supplr sid="S1">1</supplr>, <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>, <supplr sid="S5">5</supplr>, <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr>, <supplr sid="S8">8</supplr>, <supplr sid="S9">9</supplr>, <supplr sid="S10">10</supplr>, <supplr sid="S11">11</supplr>, <supplr sid="S12">12</supplr>]), but the set remains small.</p>
         <p>We would also like to explore the correlation between the genomic positions of words in MSWL (most abundant words), and positions of known regulatory elements. This may allow us to utilise our method as a kind of motif discovery algorithm. Unfortunately, again, the lack of reliably annotated regulatory regions with regulatory elements makes this step difficult.</p>
         <p>Phylogenetic foot-printing is an important and rapidly developing branch of motif discovery methodology. It would be very interesting to compare genomic positions of words in MSWL with conserved sequences from phylogenetic foot-printing analyses. This would reveal whether such words are conserved, and therefore of functional significance.</p>
         <p>In a similar vein, we would like to compare the results of fluffiness analysis across multiple species. We could then ask whether cross-species conserved regions have "fluffy" regulatory region properties, and thus infer their putative function.</p>
         <p>We are keen to compare the results of our fluffy-tail analysis with those of recognition methods based on descriptions of known TFBS, such as <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The authors of <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> also analysed developmental genes of <it>Drosophila melanogaster </it>containing approximately the same clusters of transcription factors.</p>
         <p>The work <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> is closely related to our study. However, it is likely that their method is unable to distinguish non-perfect simple tandem repeat sequences from truly regulatory DNA. We implemented their method as far as we could understand it, and found that its separation of positive (cis-regulatory modules) and negative (coding and non-coding non-regulatory DNA) training sets based on local word frequency seems to be less clear than our separation based on the "fluffiness" coefficient F (see Figure <figr fid="F6">6</figr>).</p>
         <p>There may be other regulatory mechanisms apart from transcription factor binding. In some specific cases, the local 3D structure of DNA in the nucleus (chromatin) may be the principal factor determining gene expression, with regulatory modules playing little or no role <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Thus one of the next steps in our work will be the incorporation of nucleosome position information.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We present a novel statistical approach that allows regulatory DNA to be distinguished from coding and non-coding non-regulatory regions according to its "fluffiness" values. This method is based on the presence of an unusually high number of short runs of over-represented scattered words in the given DNA sequence.</p>
         <p>The performance of the method on experimentally verified sequence data shows that the method allows us to predict whether a sequence may be regulatory.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Description of fluffy tail test</p>
            </st>
            <p>The fluffy tail test essentially consists of comparing the similar-word distribution of the original sequence with those of a number of shuffled versions of that sequence. By construction, the shuffled sequences have the same length and single nucleotide composition as the original one.</p>
            <p>The test proceeds in the following two steps:</p>
            <p>(1). First, obtain the distribution of similar words for a given DNA stretch (as described in detail below under "Distribution of similar words"). Then randomise the original sequence many times, and obtain a distribution of similar words for each shuffled sequence. These randomised sequences represent the null model (or the background model). The distributions of similar words obtained for the randomised sequences are compared with the corresponding distribution for the original sequence. If there are no statistical differences, we conclude that the sequence probably is an exon (related results are in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>) or a homogeneous non-coding non-regulatory region.</p>
            <p>However, if the given sequence does contain many similar words, these will show up in its distribution as a longer right tail that may even have a second mode. Such "fluffy" tails are seldom found in the distributions of the shuffled sequences, therefore suggesting the sequence is not exonic or homogeneous non-coding, non-regulatory DNA.</p>
            <p>(2). To rule out "fluffy" tails due to non-perfect simple tandem repeats, we check whether a) the similar words are spatially clustered and b) the tails disappear after repeat-masking the sequence (using the on-line tool available at <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>) and then repeating procedure (1).</p>
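            <p>The null model in step (1) can be sketched in Python as follows. This is an illustration only, not the authors' implementation. The statistic used here (the count of the most frequent exact 5-mer) is a deliberately simple stand-in chosen for brevity; it is neither the similar-word distribution nor the fluffiness coefficient F. The names shuffled_versions, top_word_count and null_distribution, the number of shuffles and the random seed are all hypothetical.</p>

```python
import random
from collections import Counter


def shuffled_versions(seq, n_shuffles=100, seed=0):
    """Return shuffled copies of seq with the same length and composition."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    versions = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)           # permute the nucleotides in place
        versions.append("".join(letters))
    return versions


def top_word_count(seq, m=5):
    """Stand-in statistic: occurrences of the most frequent exact m-mer."""
    words = Counter(seq[i:i + m] for i in range(len(seq) - m + 1))
    return max(words.values())


def null_distribution(seq, statistic, n_shuffles=100, seed=0):
    """Evaluate statistic on each shuffled version of seq."""
    return [statistic(s) for s in shuffled_versions(seq, n_shuffles, seed)]


# The example stretch from "Distribution of similar words", without markup.
seq = "accgggtgtaaaccgacctgatacccggtcgcccggttttaac"
observed = top_word_count(seq)
null = null_distribution(seq, top_word_count)
# Number of shuffled sequences whose statistic reaches the observed value.
exceed = sum(1 for value in null if value >= observed)
```

            <p>A small value of exceed relative to the number of shuffles would indicate that the original sequence departs from the background; in the actual test the comparison is made on the whole distribution of similar words rather than on a single number.</p>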
         </sec>
         <sec>
            <st>
               <p>Distribution of similar words</p>
            </st>
            <p>We considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary materials [see Additional files <supplr sid="S1">1</supplr>, <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>, <supplr sid="S5">5</supplr>, <supplr sid="S6">6</supplr>, <supplr sid="S7">7</supplr>, <supplr sid="S8">8</supplr>, <supplr sid="S9">9</supplr>, <supplr sid="S10">10</supplr>, <supplr sid="S11">11</supplr>, <supplr sid="S12">12</supplr>]). Thus, for each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length. Each word is the "seed" for a list of similar words.</p>
            <p>As an example, consider a stretch of DNA:</p>
            <p><ul>accgg</ul>gtgtaa<ul>accgacctg</ul>at<ul>acccg</ul>gtcg<ul>cccgg</ul>ttttaac...</p>
            <p>The first "seed" 5-word 'accgg' forms the following list of similar words:</p>
            <p>accgg, accga, acctg, acccg, cccgg,</p>
            <p>which we have underscored in the above sequence.</p>
            <p>The second 5-word 'ccggg' forms another list of similar words:</p>
            <p>ccggg, ccggt, ccggt</p>
            <p>etc. The first 5-word has the longest list of similar words here. The lists may intersect: e.g. the list for the 'accga'- seed word contains some words from the 'accgg'-seed word list.</p>
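            <p>The construction of similar-word lists can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' software: every position in the sequence provides a seed word, a list counts every occurrence within the allowed number of mismatches (including overlapping occurrences and the seed itself), and similarity is measured by Hamming distance. The function names hamming, similar_word_list_sizes and size_histogram are hypothetical.</p>

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(1 for x, y in zip(a, b) if x != y)


def similar_word_list_sizes(seq, m=5, max_mismatch=1):
    """For each m-mer seed in seq, return the size n of its similar-word list.

    Assumption: a list counts every occurrence (overlapping ones included)
    of a word within max_mismatch mismatches of the seed, seed included.
    """
    words = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    sizes = []
    for seed in words:
        n = sum(1 for w in words if max_mismatch >= hamming(seed, w))
        sizes.append(n)
    return sizes


def size_histogram(sizes):
    """Map list size to the number of lists of that size."""
    return dict(Counter(sizes))


# Toy input: with no mismatches allowed, only the repeated word "acgt"
# has a list of size 2; every other seed has a list of size 1.
toy_sizes = similar_word_list_sizes("acgtacgt", m=4, max_mismatch=0)
toy_hist = size_histogram(toy_sizes)
```

            <p>The histogram maps the size of a list to the number of lists of that size, which corresponds to the axes of the similar-word histograms shown in the figures. The quadratic comparison of all word pairs keeps the sketch short; it is adequate for short regions but is not optimised.</p>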
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>WRG contributed to development of methodology, KW did numerical comparison with other related methods, RtB statistically processed the data, IA contributed to development of methodology, collected the data and wrote the software. All authors read and approved the final manuscript.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Separation of regulatory DNA</p>
            </caption>
            <text>
               <p><b>Separation of regulatory DNA</b>. Box-plot of the fluffiness coefficient F (Y-axis) for the three functional regions, showing the separation of regulatory DNA (column 2) from coding (column 1) and non-coding, non-regulatory DNA (column 3).</p>
            </text>
            <graphic file="1471-2105-6-109-6"/>
         </fig>
         <fig id="F10">
            <title>
               <p>Figure 10</p>
            </title>
            <caption>
               <p>Coefficient of variation in spatial cluster size for four types of DNA</p>
            </caption>
            <text>
               <p><b>Coefficient of variation in spatial cluster size for four types of DNA: </b>exons (1), non-fluffy NCNR (2), fluffy NCNR (3), regulatory regions (4); vertical bars denote 95% confidence intervals. The Y axis shows the coefficient of variation; the X axis shows the four DNA types. CV was calculated with spatial clustering coefficient k = 1.</p>
            </text>
            <graphic file="1471-2105-6-109-10"/>
         </fig>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We would like to acknowledge Yvonne Edwards, Tanya Vavouri, Adam Woolfe, Krys Kelly, Gayle McEwen, Greg Elgar, Carlo Berzuini, Tom Nye, Lorenz Wernisch, and Kenneth Evans for valuable discussions and support.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Genomic cis-regulatory logic: functional analysis and computational model of a sea urchin gene control system</p>
            </title>
            <aug>
               <au>
                  <snm>Yuh</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bolouri</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Davidson</snm>
                  <fnm>EH</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>279</volume>
            <fpage>1896</fpage>
            <lpage>902</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.279.5358.1896</pubid>
                  <pubid idtype="pmpid" link="fulltext">9506933</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Cis-regulatory logic in the <it>endo </it>16 gene: switching from a specification to a differentiation mode of control</p>
            </title>
            <aug>
               <au>
                  <snm>Yuh</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bolouri</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Davidson</snm>
                  <fnm>EH</fnm>
               </au>
            </aug>
            <source>Development</source>
            <pubdate>2001</pubdate>
            <volume>128</volume>
            <fpage>617</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11171388</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Genomic Regulatory Systems</p>
            </title>
            <aug>
               <au>
                  <snm>Davidson</snm>
                  <fnm>EH</fnm>
               </au>
            </aug>
            <publisher>Academic Press</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Exploiting TFBS clustering to identify CRM involved in pattern formation in <it>Drosophila </it>genome</p>
            </title>
            <aug>
               <au>
                  <snm>Berman</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Nibu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Pfeiffer</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tomancak</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <issue>2</issue>
            <fpage>757</fpage>
            <lpage>62</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">117378</pubid>
                  <pubid idtype="pmpid" link="fulltext">11805330</pubid>
                  <pubid idtype="doi">10.1073/pnas.231608898</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A computational genomics approach to the identification of gene networks</p>
            </title>
            <aug>
               <au>
                  <snm>Wagner</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <issue>18</issue>
            <fpage>3594</fpage>
            <lpage>604</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146952</pubid>
                  <pubid idtype="pmpid" link="fulltext">9278479</pubid>
                  <pubid idtype="doi">10.1093/nar/25.18.3594</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo</p>
            </title>
            <aug>
               <au>
                  <snm>Markstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Markstein</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Markstein</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <issue>2</issue>
            <fpage>763</fpage>
            <lpage>68</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">117379</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752406</pubid>
                  <pubid idtype="doi">10.1073/pnas.012591199</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Johansson</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Alkema</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>WW</fnm>
               </au>
               <au>
                  <snm>Lagergren</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>Suppl 1</issue>
            <fpage>I169</fpage>
            <lpage>I176</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg1021</pubid>
                  <pubid idtype="pmpid" link="fulltext">12855453</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Homotypic regulatory clusters in Drosophila</p>
            </title>
            <aug>
               <au>
                  <snm>Lifanov</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Makeev</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Nazina</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Papatsenko</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>4</issue>
            <fpage>579</fpage>
            <lpage>88</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430164</pubid>
                  <pubid idtype="pmpid" link="fulltext">12670999</pubid>
                  <pubid idtype="doi">10.1101/gr.668403</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo</p>
            </title>
            <aug>
               <au>
                  <snm>Rajewsky</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Vergassola</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gaul</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Siggia</snm>
                  <fnm>ED</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>1</issue>
            <fpage>30</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">139975</pubid>
                  <pubid idtype="pmpid" link="fulltext">12398796</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-3-30</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Searching for regulatory elements in human non coding sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Duret</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>1997</pubdate>
            <volume>7</volume>
            <fpage>399</fpage>
            <lpage>406</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(97)80058-9</pubid>
                  <pubid idtype="pmpid">9204283</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Algorithms for phylogenetic footprinting</p>
            </title>
            <aug>
               <au>
                  <snm>Blanchette</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schwikowski</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <fpage>11</fpage>
            <lpage>23</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Strategies and tools for whole-genome alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Couronne</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Poliakov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bray</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Ishkhanov</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ryaboy</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Dubchak</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>73</fpage>
            <lpage>80</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430965</pubid>
                  <pubid idtype="pmpid" link="fulltext">12529308</pubid>
                  <pubid idtype="doi">10.1101/gr.762503</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Phylogenetic shadowing of primate sequences to find functional regions of the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Boffelli</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>McAuliffe</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ovcharenko</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>KD</fnm>
               </au>
               <au>
                  <snm>Ovcharenko</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pachter</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2003</pubdate>
            <volume>299</volume>
            <fpage>1391</fpage>
            <lpage>4</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1126/science.1081331</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Distinguishing regulatory DNA from neutral sites</p>
            </title>
            <aug>
               <au>
                  <snm>Elnitski</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Hardison</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kolbe</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Eswara</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>O'Connor</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Chiaromonte</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>64</fpage>
            <lpage>72</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430974</pubid>
                  <pubid idtype="pmpid" link="fulltext">12529307</pubid>
                  <pubid idtype="doi">10.1101/gr.817703</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Interpolated Markov chains for eukaryotic promoter recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Harbeck</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Niemann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Noth</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Reese</snm>
                  <fnm>MG</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>362</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.5.362</pubid>
                  <pubid idtype="pmpid" link="fulltext">10366656</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Promoter prediction on a genomic scale--the Adh experience</p>
            </title>
            <aug>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>539</fpage>
            <lpage>42</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310866</pubid>
                  <pubid idtype="pmpid" link="fulltext">10779494</pubid>
                  <pubid idtype="doi">10.1101/gr.10.4.539</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Joint modelling of DNA sequence and physical properties to improve eukaryotic promoter recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Niemann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Liao</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>S199</fpage>
            <lpage>206</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11473010</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency</p>
            </title>
            <aug>
               <au>
                  <snm>Nazina</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Papatsenko</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>65</fpage>
            <lpage>78</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">341902</pubid>
                  <pubid idtype="pmpid" link="fulltext">14690551</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-4-65</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>RepeatMasker</p>
            </title>
            <url>http://www.repeatmasker.org/</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Ensembl Genome Browser</p>
            </title>
            <url>http://www.ensembl.org/</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes</p>
            </title>
            <aug>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Vaillant</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Arneodo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>d'Aubenton-Carafa</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Thermes</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2002</pubdate>
            <volume>316</volume>
            <fpage>903</fpage>
            <lpage>18</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2001.5363</pubid>
                  <pubid idtype="pmpid" link="fulltext">11884131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Complexity: an internet resource for analysis of DNA sequence complexity</p>
            </title>
            <aug>
               <au>
                  <snm>Orlov</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Potapov</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>W628</fpage>
            <lpage>W633</lpage>
            <note>on-line.</note>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">441604</pubid>
                  <pubid idtype="pmpid" link="fulltext">15215465</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
