<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-541</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>How repetitive are genomes?</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Haubold</snm>
               <fnm>Bernhard</fnm>
               <insr iid="I1"/>
               <email>bernhard.haubold@fh-weihenstephan.de</email>
            </au>
            <au id="A2">
               <snm>Wiehe</snm>
               <fnm>Thomas</fnm>
               <insr iid="I2"/>
               <email>twiehe@uni-koeln.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Biotechnology &amp; Bioinformatics, University of Applied Sciences Weihenstephan, Freising, Germany</p>
            </ins>
            <ins id="I2">
               <p>Institute of Genetics, Universit&#228;t zu K&#246;ln, Cologne, Germany</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>541</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/541</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17187668</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-541</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>26</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>22</day>
               <month>12</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>22</day>
               <month>12</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Haubold and Wiehe; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Genome sequences vary strongly in their repetitiveness and the causes for this are still debated. Here we propose a novel measure of genome repetitiveness, the index of repetitiveness, <it>I</it><sub>r</sub>, which can be computed in time proportional to the length of the sequences analyzed. We apply it to 336 genomes from all three domains of life.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The expected value of <it>I</it><sub>r </sub>is zero for random sequences of any G/C content and greater than zero for sequences with excess repeats. We find that the <it>I</it><sub>r </sub>of archaea is significantly smaller than that of eubacteria, which in turn is smaller than that of eukaryotes. Mouse chromosomes have a significantly higher <it>I</it><sub>r </sub>than human chromosomes and within each genome the Y chromosome is most repetitive. A sliding window analysis reveals that the human <it>HOXA </it>cluster and two surrounding genes are characterized by local minima in <it>I</it><sub>r</sub>. A program for calculating the <it>I</it><sub>r </sub>is freely available at <url>http://adenine.biz.fh-weihenstephan.de/ir/</url>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The general measure of DNA repetitiveness proposed in this paper can be efficiently computed on a genomic scale. This reveals a broad spectrum of repetitiveness among diverse genomes which agrees qualitatively with previous studies of repeat content. A sliding window analysis helps to analyze the intragenomic distribution of repeats.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Repeat sequences are a common feature of prokaryote and eukaryote genomes <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp> and in both types of organisms the selective neutrality or otherwise of extra copies of sequences has been debated for decades <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Since the start of the genomics era in the mid-1990s, the unexpectedly large amount of repetitive sequence found in bacteria, which may account for more than 10% of the total genome, has prompted a flurry of investigations of the functional and evolutionary significance of these elements <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. More recently, Aras <it>et al</it>. surveyed 51 bacterial genomes to quantify the effect repeat sequences might have on genome plasticity due to intragenomic recombination <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The authors conclude that in bacteria repeats might be selected for their positive effect on the adaptability of their host <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. In another <it>in silico </it>survey of 58 completely sequenced bacteria, Achaz <it>et al</it>. noted that inverted repeats are underrepresented in bacterial genomes due to their destabilizing effect on genome structure <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
         <p>In eukaryotes the discrepancy between DNA content and apparent organismic complexity had been noted even before the discovery of the double helix leading to the conclusion that "The relationship between DNA and the size or number of genes is obscure" [<abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, p. 462]. In the 1960s DNA reannealing studies uncovered that eukaryotic genomes contain a highly variable fraction of repetitive DNA. Since the sequencing of complex genomes these observations have been made precise: approximately 50% of the human genome is made up of repetitive sequences <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. However, the term "repetitive sequences" encompasses a rather heterogeneous set of elements: 45% of the human genome is covered by transposons, 3% are repeats of less than a hundred base pairs (microsatellites and minisatellites), and 5% consist of recent duplications of large segments of DNA. Broadly similar observations have been made in other mammalian genomes <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. The human genome contains low, but appreciable, genetic variation caused by transposable elements, indicating that transposable elements have been active over the short time span since humans diverged from their last common ancestor <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. However, the decline of transposon activity in the hominoid lineage contrasts with more recent insertions in mouse, where new spontaneous mutations are 60 times more likely to be caused by transposition than in human <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>The hypothesis that transposable elements are molecular parasites was originally designed to explain the apparently excessive DNA baggage of eukaryotes <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. A number of contemporary observations support this view. Transposon-derived sequences are rare close to transcription start sites and inside coding regions, suggesting that insertions are usually deleterious <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Moreover, the four human <it>HOX </it>clusters and other highly regulated genomic regions contain very few transposable elements <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Direct deletion of megabase-sized regions devoid of known genes also seems to have no effect on mice, even though these regions contain elements that have been conserved since the emergence of mammals <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. There is no contradiction between these observations and the fact that occasionally transposable elements can give rise to beneficial structures including novel gene regulatory regions <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> and the V(D)J recombination mechanism that generates the antibody diversity expressed by vertebrate B cells <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
         <p>Since the publication of whole genome data, the quantification and classification of repeat elements has become a major area of research in computational biology <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. Perhaps the best-known program for the detection of repeat elements is repeatmasker <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, which looks for two things: (1) tandem repeats of a few nucleotides, and (2) homology to known repetitive elements. This approach has the advantage of dealing with elements of known origin. Its disadvantage is that the presence of hitherto unknown repetitive elements might be missed. The program repeatfinder implements a highly efficient and more generic approach based on suffix trees that makes no assumptions about the type of repeat present <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Such methods can be used to compute, for example, the percentage of a given DNA sequence covered by repeats and most methods provide a means of checking the statistical significance of the repeats returned. Suffix trees allow the efficient detection of all exact repeats in a sequence. In contrast, the widely used relative simplicity factor (RSF) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> is based on the local density of repeat motifs up to four bases long compared to their density in a shuffled version of the input sequence <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Application of the RSF to diverse genomes revealed that eukaryotes are characterized by an elevated "micro-repetitiveness" compared to prokaryotes <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
         <p>What is lacking, though, is an all-inclusive measure of repetitiveness. Under the RSF repetitiveness is defined as a quantity that is minimized by shuffling the investigated sequence. As suggested by the term <it>simplicity </it>factor, studies of repetitiveness are related to investigations of complexity <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> &#8211; if repetitiveness is high, complexity is low, though the converse is not always true. For example, the "linguistic complexity" of a string <it>S </it>is defined as the number of substrings of lengths 2, 3, ..., |<it>S</it>| observed in <it>S </it>compared to the maximum number of substrings of these lengths <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. A random DNA sequence with G/C content 0.5 has maximal complexity and minimal repetitiveness. However, a random DNA sequence with a G/C content of, say, 0.1 does not have maximal complexity, while its repetitiveness should still be minimal.</p>
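         <p>The definition of linguistic complexity given above can be illustrated with a minimal sketch in Python. It assumes that "compared to" means the ratio between the summed observed and the summed maximal substring counts over lengths 2, 3, ..., |<it>S</it>|; other formulations exist. The function name and the brute-force enumeration are illustrative only and are suited to short strings rather than genomes.</p>
         <p>
```python
def linguistic_complexity(s, alphabet_size=4):
    """Ratio of observed to maximal distinct substring counts for lengths 2..len(s).

    Assumption: the counts are summed over all lengths before dividing.
    """
    n = len(s)
    observed = 0
    maximal = 0
    for k in range(2, n + 1):
        windows = n - k + 1
        # Distinct substrings of length k that actually occur in s.
        observed += len({s[i:i + k] for i in range(windows)})
        # At most alphabet_size**k different words exist, and at most
        # `windows` of them fit into a string of length n.
        maximal += min(alphabet_size ** k, windows)
    return observed / maximal if maximal else 0.0


if __name__ == "__main__":
    print(linguistic_complexity("ACGT"))  # every substring is distinct
    print(linguistic_complexity("AAAA"))  # one distinct substring per length
```
         </p>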
         <p>In this paper we propose a novel measure of repetitiveness which considers repeats of any length, takes into account G/C content, and does not necessitate shuffling for its computation. As explained in detail in the Methods Section, our index of repetitiveness, <it>I</it><sub>r</sub>, is expected to be zero in random DNA sequences of any G/C content and length, and can be computed in time proportional to sequence length. We apply the <it>I</it><sub>r </sub>to 303 sequenced bacterial genomes, 27 archaebacteria, and six model eukaryotes: baker's yeast (<it>Saccharomyces cerevisiae</it>), nematode worm (<it>Caenorhabditis elegans</it>), thale cress (<it>Arabidopsis thaliana</it>), fruit fly (<it>Drosophila melanogaster</it>), mouse (<it>Mus musculus</it>), and human (<it>Homo sapiens</it>).</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>Our first goal was to establish the null distribution of <it>I</it><sub>r</sub>. This can be obtained by shuffling a genomic sequence. As an example we repeatedly randomized the genome of bacteriophage &#955;, which consists of 48,502 bp of DNA, and calculated the <it>I</it><sub>r </sub>from these "repeatless" sequences. Figure <figr fid="F1">1</figr> shows the resulting histogram, which is symmetrically distributed around a mean close to the expected zero (mean = 0.0004, sd = 0.0008). Further analysis of this distribution using the Shapiro-Wilk test <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> revealed that deviation from normality increased as more replicates were added (not shown). The reason for this was an increase in kurtosis (2.972 in Figure <figr fid="F1">1</figr>), while the skewness (0.078 in Figure <figr fid="F1">1</figr>) decreased with higher replication. Notice also that the <it>I</it><sub>r </sub>of the unshuffled &#955; genome is significantly greater than that of its randomized versions. This is not surprising, as biological sequences are no more random sequences of residues than prose is a random sequence of letters.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>The null distribution of <it>I</it><sub>r</sub></p>
            </caption>
            <text>
               <p><b>The null distribution of <it>I</it><sub>r</sub></b>. The genome of bacteriophage &#955; was shuffled 1000 times and the <it>I</it><sub>r </sub>computed; mean = 0.0004, sd = 0.0008.</p>
            </text>
            <graphic file="1471-2105-7-541-1"/>
         </fig>
         <sec>
            <st>
               <p>Survey of <it>I</it><sub>r </sub>values</p>
            </st>
            <p>We calculated <it>I</it><sub>r </sub>values for 330 completely sequenced prokaryote genomes, as well as for representative eukaryotic model organisms: baker's yeast (<it>Saccharomyces cerevisiae</it>; 12 Mb surveyed), nematode worm (<it>Caenorhabditis elegans</it>; 100 Mb surveyed), thale cress (<it>Arabidopsis thaliana</it>; 119 Mb surveyed), and fruit fly (<it>Drosophila melanogaster</it>; 123 Mb surveyed). Figure <figr fid="F2">2A</figr> displays the <it>I</it><sub>r </sub>values of eubacteria as a function of the log genome size [see <supplr sid="S1">Additional file 1</supplr> for a complete listing of prokaryote results]. In this domain of life <it>I</it><sub>r </sub>was not correlated with log genome size (Pearson correlation = 0.046; <it>P </it>= 0.425). The average <it>I</it><sub>r </sub>of eubacteria was 1.048. 94.7% of bacteria had an <it>I</it><sub>r </sub>&#8804; 2. On the other hand, there were 7 bacteria where <it>I</it><sub>r </sub>> 3, with the highest value found in <it>Methylobacillus flagellatus </it>KT (6.337; Figure <figr fid="F2">2A</figr>). The other members of this group were <it>Streptococcus agalactiae </it>NEM316 (<it>I</it><sub>r </sub>= 4.842), <it>Dehalococcoides ethenogenes </it>195 (4.026), <it>Francisella tularensis </it>subsp. tularensis SCHU S4 (3.950), <it>Neisseria meningitidis </it>MC58 (3.842), <it>Francisella tularensis </it>subsp. holarctica (3.723), and <it>Escherichia coli </it>O157:H7 EDL933 (3.521; Figure <figr fid="F2">2A</figr>).</p>
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p>Supplementary Material. <it>I</it><sub>r </sub>values for 330 completely sequenced prokaryote genomes sorted by <it>I</it><sub>r </sub>or organism.</p>
               </text>
               <file name="1471-2105-7-541-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p><it>I</it><sub>r </sub>values of 334 completely sequenced genomes taken from the three domains of life</p>
               </caption>
               <text>
                  <p><b><it>I</it><sub>r </sub>values of 334 completely sequenced genomes taken from the three domains of life</b>. <it>I</it><sub>r </sub>values shown as a function of their log genome size; dashed lines delineate organisms with <it>I</it><sub>r </sub>> 3. <b>A</b>: Eubacteria, circled values correspond to the genomes subjected to sliding window analysis in Figure 3; Mf: <it>Methylobacillus flagellatus </it>KT; Sa: <it>Streptococcus agalactiae </it>NEM316; De: <it>Dehalococcoides ethenogenes </it>195; Ftt: <it>Francisella tularensis </it>subsp. tularensis SCHU S4; Nm: <it>Neisseria meningitidis </it>MC58; Fth: <it>Francisella tularensis </it>subsp. holarctica; Ec: <it>Escherichia coli </it>O157:H7 EDL933; Oy: Onion yellows phytoplasma OY-M. <b>B</b>: archaebacteria and eukaryotes; Sc: <it>Saccharomyces cerevisiae</it>; Ce: <it>Caenorhabditis elegans</it>; At: <it>Arabidopsis thaliana</it>; Dm: <it>Drosophila melanogaster</it>.</p>
               </text>
               <graphic file="1471-2105-7-541-2"/>
            </fig>
            <p>At the other extreme of the distribution, <it>Buchnera aphidicola </it>str. Bp had the smallest <it>I</it><sub>r </sub>value (0.019), which was even smaller than that observed in phage &#955; (<it>I</it><sub>r </sub>= 0.024; Figure <figr fid="F1">1</figr>). With one exception the ten eubacteria with the lowest <it>I</it><sub>r </sub>values comprised only intracellular organisms sampled from the genera <it>Buchnera, Chlamydophila, Candidatus, Neorickettsia</it>, and <it>Rickettsia</it>. The exception was the highly abundant photosynthetic bacterium <it>Prochlorococcus marinus </it>subsp. marinus str. CCMP1375 [see <supplr sid="S1">Additional file 1</supplr>].</p>
            <p>Figure <figr fid="F2">2B</figr> displays the <it>I</it><sub>r </sub>values of archaebacteria and eukaryotes. In archaebacteria <it>I</it><sub>r </sub>was significantly correlated with log genome size (Pearson correlation = 0.562; <it>P </it>= 0.002), while in eukaryotes the correlation was not significant (Pearson correlation = 0.485; <it>P </it>= 0.515). The average <it>I</it><sub>r </sub>of archaebacteria was 0.467, which is significantly smaller than that of eubacteria (Wilcoxon test, <it>P </it>= 3.15 &#215; 10<sup>-6</sup>). The average <it>I</it><sub>r </sub>of eukaryotes was 2.103, which is in turn significantly greater than either that of eubacteria (<it>P </it>= 4.3 &#215; 10<sup>-3</sup>) or archaebacteria (<it>P </it>= 6.36 &#215; 10<sup>-5</sup>). Among eukaryotes only <it>Drosophila melanogaster </it>had an <it>I</it><sub>r </sub>> 3.</p>
            <p>In order to further investigate some of the extreme <it>I</it><sub>r </sub>values observed in eubacteria (Figure <figr fid="F2">2A</figr>), we subjected them to sliding window analyses. Figure <figr fid="F3">3A</figr> shows such an analysis for <it>M. flagellatus </it>KT and reveals that its global <it>I</it><sub>r </sub>value (Figure <figr fid="F2">2A</figr>, Mf) was caused by two large peaks of local <it>I</it><sub>r </sub>indicating the presence of a very long exact repeat (Figure <figr fid="F3">3A</figr>). This turned out to be a tandem repeat comprising an astonishing 143,034 bp. Removal of one copy of this duplication led to a much deflated <it>I</it><sub>r </sub>of 0.657. However, not all large <it>I</it><sub>r </sub>values among eubacteria were caused by single exact repeats. Figure <figr fid="F3">3B</figr> displays a sliding window analysis of the genome of Onion yellows phytoplasma OY-M, which had a global <it>I</it><sub>r </sub>value of 2.348 (Figure <figr fid="F2">2A</figr>, Oy). A scan of its local <it>I</it><sub>r </sub>values indicated the presence of numerous regions of significant repetitiveness (Figure <figr fid="F3">3B</figr>).</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Sliding window analyses of two bacterial genomes</p>
               </caption>
               <text>
                  <p><b>Sliding window analyses of two bacterial genomes</b>. <b>A</b>: <it>Methylobacillus flagellatus </it>KT with tandem repeat comprising 143 kb (boxes); <b>B</b>: Onion yellows phytoplasma OY-M. The global <it>I</it><sub>r</sub>-values of these two bacteria are circled in Figure 2A.</p>
               </text>
               <graphic file="1471-2105-7-541-3"/>
            </fig>
            <p>The bacterium with the second highest global <it>I</it><sub>r</sub>-value, <it>Streptococcus agalactiae </it>NEM316 (<it>I</it><sub>r </sub>= 4.842; Figure <figr fid="F2">2A</figr>), was an outlier among the other 14 streptococci investigated, which had an average <it>I</it><sub>r </sub>of 1.665 [see <supplr sid="S1">Additional file 1</supplr>]. Window analysis of <it>S. agalactiae </it>NEM316 revealed three exact repeats of 47 kb (not shown) and their removal resulted in an <it>I</it><sub>r </sub>of 1.756. Similarly, <it>Escherichia coli </it>O157:H7 EDL933 had an exceptionally high <it>I</it><sub>r </sub>of 3.521 (Figure <figr fid="F2">2A</figr>) compared to the other five strains of <it>E. coli </it>sampled (average <it>I</it><sub>r</sub>: 1.049; cf. <supplr sid="S1">Additional file 1</supplr>). In this case window analysis of <it>E. coli </it>O157:H7 EDL933 (not shown) highlighted a repeat region of approximately 100 kb located at positions 1,050,000&#8211;1,150,000 and 1,450,000&#8211;1,550,000, which contained several long exact repeats with the longest spanning over 41 kb. Removal of one copy of the 100 kb repeat region reduced the <it>I</it><sub>r </sub>to 1.756.</p>
         </sec>
         <sec>
            <st>
               <p>Mouse and human chromosomes</p>
            </st>
            <p>The average <it>I</it><sub>r </sub>for human chromosomes was 0.985 and values for individual chromosomes ranged from 0.229 in chromosome 21 to 4.313 in the Y chromosome (Figure <figr fid="F4">4A</figr>). The Y chromosome was the only human chromosome with <it>I</it><sub>r </sub>> 3, which agrees with the view that it has the highest DNA turnover in the genome <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p><it>I</it><sub>r </sub>values as a function of the number of nucleotides surveyed in human (A) and mouse (B) chromosomes</p>
               </caption>
               <text>
                  <p><b><it>I</it><sub>r </sub>values as a function of the number of nucleotides surveyed in human (A) and mouse (B) chromosomes</b>. Dashed line delineates chromosomes with <it>I</it><sub>r </sub>> 3.</p>
               </text>
               <graphic file="1471-2105-7-541-4"/>
            </fig>
            <p>The average <it>I</it><sub>r </sub>for mouse chromosomes was 1.773 (Figure <figr fid="F4">4B</figr>), which is significantly larger than that of human chromosomes (Wilcoxon test, <it>P </it>= 1.4 &#215; 10<sup>-3</sup>). This agrees with the observation that the rodent lineage has experienced a higher rate of retro-transposition than hominoids <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Individual mouse chromosomes had <it>I</it><sub>r </sub>values ranging from 0.7 in chromosome 19 to 3.654 in the Y chromosome. As in the human genome, the Y chromosome from mouse was characterized by the largest <it>I</it><sub>r</sub>. In addition, chromosomes 7 and X had <it>I</it><sub>r </sub>values > 3 (Figure <figr fid="F4">4B</figr>).</p>
         </sec>
         <sec>
            <st>
               <p>HOX genes in human and D. melanogaster</p>
            </st>
            <p>The <it>HOX </it>genes encode transcription factors that function as fundamental developmental switches in all animals. In human the four clusters of <it>HOX </it>genes contain very few insertion sequences <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. To assess the effect of this on the landscape of human <it>I</it><sub>r </sub>values, we carried out a sliding window analysis of 1 Mb around the <it>HOXA </it>cluster on chromosome 7. Figure <figr fid="F5">5A</figr> displays the conspicuous footprint of low <it>I</it><sub>r </sub>values that coincides with the location of the <it>HOXA </it>cluster. In order to make this eye-ball analysis more quantitative, we searched the fragment of chromosome 7 displayed in Figure <figr fid="F5">5A</figr> for runs of <it>I</it><sub>r </sub>&#8804; 0 that extended for at least 2 kb. This uncovered 13 intervals ranging in size from 2.1 to 4.1 kb (arrows in Figure <figr fid="F5">5</figr>). Ten of these intervals were located within the <it>HOXA </it>cluster. The remaining three arrows are marked by stars in Figure <figr fid="F5">5</figr>. Two of the corresponding regions with low <it>I</it><sub>r </sub>values intersected with <it>SCAP2</it>, a <it>src </it>family associated phosphoprotein involved in signal transduction leading to T cell activation <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The last region of low <it>I</it><sub>r </sub>outside of the <it>HOXA </it>region intersected with <it>EVX1</it>. This is a homologue of the even-skipped homeobox gene originally discovered in <it>D. melanogaster</it>. In vertebrates it is involved in eye development <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Human <it>EVX1 </it>is located just 42.73 kb upstream from the most 5' of the <it>HOXA </it>genes, <it>HOXA13 </it>(Figure <figr fid="F5">5</figr>).</p>
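            <p>The run search described above can be sketched in Python as follows. The sketch assumes that the sliding window analysis yields one <it>I</it><sub>r </sub>value per window, that windows are listed in ascending order at a fixed step, and that a run extends from the start of its first window to the start of the first window whose value exceeds zero. The function and parameter names are hypothetical and do not describe the program used for Figure 5.</p>
            <p>
```python
def low_ir_runs(starts, values, step, min_length=2000):
    """Return (begin, end) intervals of consecutive windows with a value of at most 0.

    starts: window start coordinates, ascending and spaced by `step` (assumption)
    values: one local index value per window
    min_length: shortest run to report, in the same units as `starts`
    """
    runs = []
    begin = None
    for start, value in zip(starts, values):
        if 0 >= value:
            if begin is None:
                begin = start  # a new run opens at this window
        elif begin is not None:
            # The run ends where the first window above zero begins.
            if start - begin >= min_length:
                runs.append((begin, start))
            begin = None
    if begin is not None:
        # The run reaches the end of the scanned region.
        end = starts[-1] + step
        if end - begin >= min_length:
            runs.append((begin, end))
    return runs


if __name__ == "__main__":
    starts = [0, 500, 1000, 1500, 2000, 2500, 3000]
    values = [0.5, -0.1, 0.0, -0.2, -0.3, 0.4, -0.1]
    print(low_ir_runs(starts, values, step=500))
```
            </p>
            <p>With overlapping windows, intervals defined by window start coordinates remain usable, but the reported end of a run then understates the sequence actually covered by its last window.</p>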
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Sliding window analysis of <it>HOX </it>genes</p>
               </caption>
               <text>
                  <p><b>Sliding window analysis of <it>HOX </it>genes</b>. <b>A</b>: 1 Mb of human chromosome 7 containing the <it>HOXA </it>cluster. Arrows indicate runs of <it>I</it><sub>r </sub>&#8804; 0 longer than 2 kb; starred arrows point to regions outside of the <it>HOXA </it>cluster, which consists of 13 individual genes. <b>B</b>: 1 Mb of chromosome 3R from <it>D. melanogaster </it>containing the antennapedia complex.</p>
               </text>
               <graphic file="1471-2105-7-541-5"/>
            </fig>
            <p>A sliding window analysis of the <it>antennapedia </it>complex in <it>D. melanogaster</it>, which is homologous to part of the human <it>HOXA </it>cluster, revealed a very different topology of repetitiveness (Figure <figr fid="F5">5B</figr>). On a background of <it>I</it><sub>r </sub>&#8776; 0, large peaks marked the presence of long exact repeats and the <it>antennapedia </it>cluster was not characterized by a conspicuous change in <it>I</it><sub>r </sub>values.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>"At this point we do not know what most of the DNA in eukaryotes is doing" [<abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, p. 253]. Today, thirty-five years later, the function of apparently excess DNA in both eukaryotes and prokaryotes remains a topic of intense research activity <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Our method to quantify this excess DNA, the index of repetitiveness, is close in spirit to the investigation of linguistic complexity based on suffix trees <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Linguistic complexity is maximized in random sequences with equiprobable residues. Deviations from equiprobability lead to a reduction in complexity even if the sequence remains completely random. In contrast, in this paper we were interested in quantifying repetitiveness with respect to genome composition and in making this measure comparable across genomes. Our starting point was an investigation of the complement of repeats, the unique sequences. These are trivially easy to find; for example, a sequence is always unique with respect to itself. For this reason we have concentrated on <it>shortest </it>unique substrings. A shortest unique substring occurs only once in its parent string and cannot be reduced in length without losing its uniqueness. A genome with many long repeats contains many excessively long shortest unique substrings, while its shuffled version contains only the shortest unique substrings expected to be there by chance alone (cf. Methods). Since we have derived the latter quantity analytically <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, the <it>I</it><sub>r </sub>is constructed as the logarithm of the ratio between the observed and expected aggregate number of nucleotides found in shortest unique substrings. At the cost of ignoring homology relationships, this measure has the advantage that it can be computed for any double-stranded DNA sequence and that its expectation for random sequences is always zero. It is also possible to estimate an <it>I</it><sub>r </sub>value for sequences over alphabets other than the four nucleotides. In this case the quantity <it>A</it><sub>e </sub>defined in Equation (2) can be estimated by shuffling the input sequence. For example, the <it>I</it><sub>r </sub>of this paper is approximately 0.7.</p>
         <p>Since the construction of the underlying suffix tree takes only time proportional to the length of the sequence analyzed, the <it>I</it><sub>r </sub>can be computed in time proportional to the length of the input sequence. In contrast, traditional repeat analysis such as implemented in the program repeatmasker <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> runs in time proportional to the product of the length of query and subject sequence.</p>
         <p>Like most suffix tree implementations, the suffix tree on which our analysis is based is kept entirely in the main memory (RAM) of the computer <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. This has the advantage of being relatively easy to implement. The disadvantage of this approach is that the amount of sequence data that can be analyzed in a single run of the program is limited by the available RAM rather than by the much cheaper hard disk space. We are currently studying advances in disk-based suffix tree construction <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> in order to break through the RAM barrier.</p>
         <p>It may come as a surprise that the <it>I</it><sub>r </sub>values for human and mouse chromosomes were within the range of <it>I</it><sub>r </sub>values observed for less complex eubacterial genomes (Figure <figr fid="F2">2</figr>). However, this does not contradict the well-known fact that mammalian genomes are full of interspersed repeats, while bacteria usually contain fewer of these elements. The apparent paradox is due to the fact that the effect of interspersed repeats on the excess amount of exact repeats in a given genome &#8211; which is what the <it>I</it><sub>r </sub>measures &#8211; depends not only on the fraction of sequence covered by repetitive elements; equally important is the number of mutations accumulated since the divergence of an interspersed repeat from its most recent ancestor. As a result of the mutation process, ancient repetitive elements may not contain longer motifs repeated elsewhere than the rest of the genome. The presence of such elements would leave the <it>I</it><sub>r </sub>unchanged compared to the identical genome without them.</p>
         <p>A similar argument applies to the interpretation of the high <it>I</it><sub>r </sub>values found in the Y chromosomes of human and mouse. The two factors determining the accumulation of sequence polymorphisms, time to the most recent common ancestor and mutation rate, cannot be separated. In addition, the effective mutation rate differs between autosomes and the Y chromosome. Under neutrality the number of SNPs expected for a pair of homologous sequences is <it>&#952; </it>= 4<it>N</it><sub>e</sub><it>&#956;</it>, where <it>N</it><sub>e </sub>is the effective population size and <it>&#956; </it>the rate of mutation. Since the effective population size of mammalian Y chromosomes is only one quarter that of autosomes, repeat pairs on the Y chromosome are broken up more slowly by mutations than elsewhere in the genome, contributing to higher <it>I</it><sub>r </sub>values.</p>
         <p>It should be noted that neither the mouse nor the human genome has been completely sequenced to date. If new sequence data come predominantly from regions that are difficult to sequence due to their repetitiveness, future editions of the human and mouse genomes are expected to have higher <it>I</it><sub>r </sub>values.</p>
         <p>The <it>I</it><sub>r </sub>values found in our whole genome analyses (Figure <figr fid="F2">2</figr>) correlate well with the relative simplicity factors (RSFs) reported previously <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> (Pearson correlation = 0.552, <it>P </it>= 3.3 &#215; 10<sup>-4</sup>). This correlation is not perfect due to the fact that the RSF measures the local excess of short repeats, while the <it>I</it><sub>r </sub>measures the excess of all repeats throughout the sequence. Moreover, no significant correlation between archaebacterial genome size and RSF was observed by Hancock <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, in contrast to our finding. This effect, however, is simply due to differences in sampling; if we reduce our sample of 27 archaebacterial genomes to the nine investigated by Hancock, the correlation between <it>I</it><sub>r </sub>and log genome size also vanishes. In contrast, a tenfold increase in the number of bacterial genomes investigated between Hancock's and our study only confirmed the earlier diagnosis of no correlation between RSF and genome size.</p>
         <p>The average <it>I</it><sub>r </sub>for eubacteria was 1.048. However, a few extreme <it>I</it><sub>r </sub>values clearly inflate this average (Figure <figr fid="F2">2A</figr>). The largest <it>I</it><sub>r </sub>for bacteria (or for any other organism) was found in <it>Methylobacillus flagellatus </it>KT (6.337). This value was the most extreme of a set of seven organisms with <it>I</it><sub>r </sub>&gt; 3 that also included the human pathogens <it>Neisseria meningitidis </it>MC58 and <it>Escherichia coli </it>O157:H7 EDL933 (Figure <figr fid="F2">2</figr>). In a previous survey of 58 bacteria, <it>Neisseria meningitidis </it>was already singled out as having a highly repetitive genome <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The low <it>I</it><sub>r </sub>values we found among obligately host-associated bacteria also agree with a known lack of repeats in these genomes <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. While other bacteria appear to harbor repeats to increase genome plasticity <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, we speculate that intracellular symbionts and pathogens are less dependent on genome shuffling for their survival as they live in more stable environments.</p>
         <p>Our sliding window analyses revealed that computing a single <it>I</it><sub>r </sub>value for an entire genome averages out sharp regional fluctuations in <it>I</it><sub>r </sub>(Figures <figr fid="F3">3</figr> and <figr fid="F5">5</figr>). In bacteria a high <it>I</it><sub>r </sub>value may be caused by a few extreme duplications, as was the case for <it>M. flagellatus </it>KT (Figure <figr fid="F3">3A</figr>) and <it>S. agalactiae </it>NEM316. In the human genome the 13 genes making up the <it>HOXA </it>cluster were characterized by a 100 kb footprint of low <it>I</it><sub>r </sub>values (Figure <figr fid="F5">5A</figr>). Because additional runs of low <it>I</it><sub>r </sub>outside the <it>HOXA </it>cluster also coincided with known genes, we are currently searching the entire human genome for further regions of low <it>I</it><sub>r</sub>.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Investigations of repetitiveness are traditionally carried out using some form of alignment algorithm. Such algorithms tend to run in time proportional to the product of the length of the query and subject sequence. In this paper we present an approach that runs in time linear in the length of the input sequence. It is based on a comparison between the observed and expected sums of the lengths of shortest unique substrings. We apply the resulting index of repetitiveness, <it>I</it><sub>r</sub>, to prokaryote and eukaryote genomes. Our global repetitiveness measures agree qualitatively with current knowledge about genome structure. However, a more detailed picture emerges by subjecting the genomes to window analyses. In the human genome the highly regulated <it>HOXA </it>cluster is known to lack insertion sequences. Accordingly, it is characterized by a footprint of low <it>I</it><sub>r</sub>. This suggests that in mammalian genomes regions of low <it>I</it><sub>r </sub>may be due to strong selection against mutagenesis by insertion sequences. If this is the case, scanning mammalian genomes for further intervals of low <it>I</it><sub>r </sub>may reveal tracts under strong purifying selection.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Measuring repetitiveness</p>
            </st>
            <p>In the following we derive a generic measure of repetitiveness in DNA sequences, the index of repetitiveness, <it>I</it><sub>r</sub>. Consider a genome, <it>S</it>, whose forward and reverse strands together comprise 2<it>l </it>nucleotides. At each position <it>i </it>along this genome we can determine the length of the shortest unique substring starting at that position, <it>x</it><sub><it>i</it></sub>. Such a shortest unique substring has the property that the substring <it>S </it>[<it>i</it>..<it>i </it>+ <it>x</it><sub><it>i </it></sub>- 1] is unique, while <it>S </it>[<it>i</it>..<it>i </it>+ <it>x</it><sub><it>i </it></sub>- 2] is not. Figure <figr fid="F6">6</figr> shows the example sequence <it>S </it>= CGGT and the lengths of all the corresponding shortest unique substrings. Notice that no unique substring starts at either of the two most 3' positions of the reverse strand, because every substring starting there also occurs on the forward strand. In such cases we assign the suffix length plus one as the shortest unique substring length (bold numbers in Figure <figr fid="F6">6</figr>). In other words, we pretend that each string is terminated by a unique "sentinel" character.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Shortest unique substring lengths for the DNA sequence CGGT and its complement</p>
               </caption>
               <text>
                  <p><b>Shortest unique substring lengths for the DNA sequence CGGT and its complement</b>. Starting from, say, the first nucleotide, three steps in the 3' direction are necessary to generate a unique substring. The numbers in bold correspond to suffix length plus one; see text for details.</p>
               </text>
               <graphic file="1471-2105-7-541-6"/>
            </fig>
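            <p>The definition above can be checked by brute force on the example sequence. The following Python sketch is an illustration only and is not the suffix tree method used by the program ir; the function names and the choice of '#' and '$' as sentinel characters are assumptions made for this example. Each strand is terminated by its own sentinel, so a substring that runs into a sentinel is unique by construction, which reproduces the "suffix length plus one" rule.</p>
            <p><![CDATA[
```python
def reverse_complement(seq):
    """Return the reverse strand of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shustring_lengths(seq):
    """Shortest unique substring lengths x_i for both strands of seq.

    Assumption: '#' and '$' do not occur in seq and serve as the
    unique sentinels terminating the forward and reverse strands.
    """
    forward = seq
    reverse = reverse_complement(seq)
    text = forward + "#" + reverse + "$"
    offset = len(forward) + 1  # index of the first reverse-strand base
    starts = list(range(len(forward)))
    starts += list(range(offset, offset + len(reverse)))
    lengths = []
    for i in starts:
        x = 1
        # Extend the substring until it occurs exactly once; a sentinel
        # guarantees termination at suffix length plus one.
        while count_occurrences(text, text[i:i + x]) > 1:
            x += 1
        lengths.append(x)
    return lengths[:len(forward)], lengths[len(forward):]


forward_lengths, reverse_lengths = shustring_lengths("CGGT")
print(forward_lengths, reverse_lengths)
```
]]></p>
            <p>For CGGT this sketch gives the lengths 3, 2, 2, 1 along the forward strand and 1, 2, 3, 2 along the reverse strand ACCG; the last two reverse-strand values are the sentinel cases. The quadratic running time makes the approach suitable for toy examples only.</p>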
            <p>We have used suffix trees <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> to detect shortest unique substrings in genomic sequences. Figure <figr fid="F7">7</figr> shows the suffix tree that corresponds to our example sequence. This tree is read as follows: the concatenated labels along a path leading from the root at the top to a leaf yield a suffix of the input string starting at the position indicated by the label of the leaf. Suffix trees have the useful property that any string starting at the root and ending somewhere on an internal branch is a repeated substring. For example, substring CG occurs at position 1 in <it>T</it><sub>1 </sub>(the forward strand) and at position 3 in <it>T</it><sub>2 </sub>(the reverse strand; Figure <figr fid="F7">7</figr>). Conversely, a string starting at the root and ending anywhere on an external branch, e.g. CGG, is a unique substring (cf. bold edge labels in Figure <figr fid="F7">7</figr>). Given a suffix tree, it is therefore easy to locate the <it>shortest </it>unique substring starting at any position <it>i </it>in the genome by looking up the length of the path label from the root to the parent of the leaf referring to position <it>i</it>. This length is known as the <it>string depth </it>of a node, <it>s</it>. The desired length of the shortest unique substring starting at <it>i </it>is then simply <it>x</it><sub><it>i </it></sub>= <it>s </it>+ 1.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Suffix tree corresponding to the forward and reverse strands of the example sequence CGGT (cf. Figure 6)</p>
               </caption>
               <text>
                  <p><b>Suffix tree corresponding to the forward and reverse strands of the example sequence CGGT (cf. Figure 6)</b>. Leaf labels consist of a string identifier, followed by the starting position of the suffix read from the root to the leaf. For example, the suffix GGT$ starts at position 2 in string #1. Any string starting at the root of the tree and ending on a terminal branch, e.g. the substring CGG shown in bold, is unique. CGG is also <it>shortest </it>unique because it extends only for one character on the external branch.</p>
               </text>
               <graphic file="1471-2105-7-541-7"/>
            </fig>
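            <p>The relation <it>x</it><sub><it>i </it></sub>= <it>s </it>+ 1 can be illustrated without building a suffix tree. The string depth of the parent of the leaf for suffix <it>i </it>equals the longest prefix that this suffix shares with any other suffix, and that maximum is always attained by one of its two neighbours in lexicographic order. The Python sketch below uses this substitute technique (naively sorted suffixes instead of a suffix tree); it is not linear in time, the function names are hypothetical, and '#' and '$' are assumed sentinel characters.</p>
            <p><![CDATA[
```python
def reverse_complement(seq):
    """Return the reverse strand of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def common_prefix_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        n += 1
    return n


def shustring_lengths_from_sorted_suffixes(seq):
    """Return x_i = s + 1 for every base on both strands of seq.

    s is the longest prefix that suffix i shares with a neighbouring
    suffix in sorted order, which corresponds to the string depth of
    the parent of leaf i in the suffix tree.
    """
    text = seq + "#" + reverse_complement(seq) + "$"
    # Naive suffix sorting; adequate for short illustrative inputs only.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    depth = [0] * len(text)
    for a, b in zip(order, order[1:]):
        shared = common_prefix_length(text[a:], text[b:])
        depth[a] = max(depth[a], shared)
        depth[b] = max(depth[b], shared)
    # Skip the two sentinel positions themselves.
    return [depth[i] + 1 for i in range(len(text)) if text[i] not in "#$"]


print(shustring_lengths_from_sorted_suffixes("CGGT"))
```
]]></p>
            <p>For CGGT the first value is 3, corresponding to the shortest unique substring CGG discussed in the caption of Figure 7: the suffix starting at forward position 1 shares the prefix CG with the suffix CG$ of the reverse strand, so <it>s </it>= 2.</p>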
            <p>Figure <figr fid="F8">8A</figr> shows the value of <it>x</it><sub><it>i </it></sub>along 2 kb of genomic sequence from the human pathogen <it>Mycoplasma genitalium</it>. The spikes in this curve correspond to unusually long shortest unique substrings, which are caused by repeats that are longer than expected by chance alone. We define the observed aggregate length of shortest unique substrings as</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Lengths of shortest unique substrings (shulen) along 2 kb of the genome of human pathogen <it>Mycoplasma genitalium</it></p>
               </caption>
               <text>
                  <p><b>Lengths of shortest unique substrings (shulen) along 2 kb of the genome of human pathogen <it>Mycoplasma genitalium</it></b>. <b>A</b>: Original sequence; <b>B</b>: shuffled sequence.</p>
               </text>
               <graphic file="1471-2105-7-541-8"/>
            </fig>
            <p>
               <m:math name="1471-2105-7-541-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>A</m:mi>
                           <m:mtext>o</m:mtext>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>2</m:mn>
                                 <m:mi>l</m:mi>
                              </m:mrow>
                           </m:munderover>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>x</m:mi>
                                 <m:mi>i</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                  </m:semantics>
               </m:math>
            </p>
            <p>The quantity <it>A</it><sub>o </sub>corresponds to the area under the curve shown in Figure <figr fid="F8">8A</figr>.</p>
            <p>We have previously derived an exact expression for the number of shortest unique substrings of length <it>x </it>expected in a completely shuffled genome of a given length and G/C content, <it>N</it><sub><it>x </it></sub><abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. It is therefore convenient to define the expected aggregate length of shortest unique substrings as</p>
            <p>
               <m:math name="1471-2105-7-541-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>A</m:mi>
                           <m:mtext>e</m:mtext>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munder>
                              <m:mo>&#8721;</m:mo>
                              <m:mi>x</m:mi>
                           </m:munder>
                           <m:mrow>
                              <m:mi>x</m:mi>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>x</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>2</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                  </m:semantics>
               </m:math>
            </p>
            <p>Figure <figr fid="F8">8B</figr> shows the length of shortest unique substrings at each position along a shuffled version of the 2 kb fragment from the genome of <it>M. genitalium</it>. Notice that all the spikes indicating long repeats contained in the original sequence data (Figure <figr fid="F8">8A</figr>) have vanished, leaving a narrow baseline of shortest unique substring lengths. The quantity <it>A</it><sub>e </sub>is the expectation of the area under this baseline curve.</p>
            <p>The index of repetitiveness, <it>I</it><sub>r</sub>, is now defined as the logarithm of the ratio of the observed aggregate shortest unique substring length and its theoretical expectation:</p>
            <p>
               <m:math name="1471-2105-7-541-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>I</m:mi>
                           <m:mtext>r</m:mtext>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mi>log</m:mi>
                        <m:mo>&#8289;</m:mo>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mrow>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>A</m:mi>
                                       <m:mtext>o</m:mtext>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>A</m:mi>
                                       <m:mtext>e</m:mtext>
                                    </m:msub>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                           <m:mo>)</m:mo>
                        </m:mrow>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>3</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                  </m:semantics>
               </m:math>
            </p>
            <p>For genomes devoid of excess repeat sequences <it>I</it><sub>r </sub>&#8776; 0, while for sequences with an excess of repeats <it>I</it><sub>r </sub>> 0. We have written the program ir for calculating <it>I</it><sub>r</sub>. The software is accessible using any standard web browser <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
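            <p>Equations (1) to (3) can be combined in a few lines. The Python sketch below is a rough illustration under stated assumptions and not the implementation in ir: <it>A</it><sub>e </sub>is approximated by averaging over shuffled copies of the input rather than by the exact expression of reference 30, the natural logarithm is assumed because the base is not stated here, and all function names, the number of replicates and the random seed are hypothetical.</p>
            <p><![CDATA[
```python
import math
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def aggregate_shustring_length(seq):
    """Observed aggregate A_o (equation 1) over both strands of seq."""
    # '#' and '$' are assumed sentinels that never occur in seq.
    text = seq + "#" + seq.translate(COMPLEMENT)[::-1] + "$"
    order = sorted(range(len(text)), key=lambda i: text[i:])
    longest_repeat = [0] * len(text)
    for a, b in zip(order, order[1:]):
        shared = 0
        for char_a, char_b in zip(text[a:], text[b:]):
            if char_a != char_b:
                break
            shared += 1
        longest_repeat[a] = max(longest_repeat[a], shared)
        longest_repeat[b] = max(longest_repeat[b], shared)
    return sum(longest_repeat[i] + 1
               for i in range(len(text)) if text[i] not in "#$")


def expected_aggregate_by_shuffling(seq, replicates=20, seed=1):
    """Monte Carlo stand-in for A_e (equation 2).

    Shuffling preserves length and base composition; the paper instead
    uses an exact analytical expectation, which is not reproduced here.
    """
    rng = random.Random(seed)
    letters = list(seq)
    total = 0
    for _ in range(replicates):
        rng.shuffle(letters)
        total += aggregate_shustring_length("".join(letters))
    return total / replicates


def index_of_repetitiveness(a_o, a_e):
    """I_r = log(A_o / A_e) (equation 3), natural logarithm assumed."""
    return math.log(a_o / a_e)


unit = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGACCATGGTACGATCAG"
repeated = unit * 4  # an artificial sequence made of four exact copies
a_o = aggregate_shustring_length(repeated)
a_e = expected_aggregate_by_shuffling(repeated)
print(index_of_repetitiveness(a_o, a_e))
```
]]></p>
            <p>Because the artificial input consists of four exact copies of the same unit, its observed aggregate length far exceeds the shuffled baseline and the resulting index is positive.</p>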
         </sec>
         <sec>
            <st>
               <p>Sequence data</p>
            </st>
            <p>All 330 completely sequenced prokaryote genomes contained in RefSeq <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> at the time of analysis were downloaded from the NCBI ftp-site (<url>ftp://ftp.ncbi.nih.gov</url>). Their accession numbers and <it>I</it><sub>r </sub>values are provided in <supplr sid="S1">Additional file 1</supplr>. Table <tblr tid="T1">1</tblr> summarizes the sources of the six eukaryotic genomes analyzed in this study.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The sources of the eukaryotic genomes analyzed in this study.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Organism</p>
                     </c>
                     <c ca="left">
                        <p>Source</p>
                     </c>
                     <c ca="left">
                        <p>Version</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>A. thaliana</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://ftp.ncbi.nih.gov</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>n/a</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>C. elegans</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://www.ucsc.edu</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>ce2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. melanogaster</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://www.ucsc.edu</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>dm2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>H. sapiens</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://www.ensembl.org</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>M. musculus</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://www.ucsc.edu</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>mm8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>S. cerevisiae</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <url>http://www.ucsc.edu</url>
                        </p>
                     </c>
                     <c ca="left">
                        <p>SacCer1</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p><it>I</it><sub>r </sub>calculations and statistical analysis</p>
            </st>
            <p>All <it>I</it><sub>r </sub>values presented in Figure <figr fid="F2">2</figr> were computed from the complete genome data available. Unsequenced regions marked by Ns were removed to prevent artificial inflation of <it>I</it><sub>r</sub>. The human and mouse genomes were too large for complete analysis with the computing equipment available to us. We therefore analyzed only individual chromosomes (Figure <figr fid="F4">4</figr>). With the exception of human and mouse chromosomes 1 and 2, all sequences were analyzed on their reverse and forward strands. Due to their sizes, only the forward strands of human and mouse chromosomes 1 and 2 were included in the computation of <it>I</it><sub>r</sub>.</p>
            <p>For the sliding window analyses (Figures <figr fid="F3">3</figr> and <figr fid="F5">5</figr>) <it>A</it><sub>o </sub>is computed as the sum of shortest unique substring lengths starting inside an interval of 1000 bp. Similarly, <it>A</it><sub>e </sub>is a function of the local G/C content and window length (1000 in our case). The window is then moved by a tenth of its length, i.e. 100 bp, and the <it>I</it><sub>r </sub>is recomputed.</p>
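            <p>The window scheme can be sketched as follows. This Python fragment is an assumption-laden illustration rather than the code of ir: it takes a precomputed list of shortest unique substring lengths, one per position, and a hypothetical callable <it>expected_aggregate(start, window)</it> standing in for the dependence of <it>A</it><sub>e </sub>on local G/C content and window length. Only complete windows are reported and the natural logarithm is assumed.</p>
            <p><![CDATA[
```python
import math
from itertools import accumulate


def window_sums(shulens, window=1000, step=100):
    """Return (start, A_o) for each complete window.

    A_o is the sum of the shortest unique substring lengths that
    start inside the window; prefix sums keep the pass linear.
    """
    prefix = [0] + list(accumulate(shulens))
    return [(start, prefix[start + window] - prefix[start])
            for start in range(0, len(shulens) - window + 1, step)]


def window_ir(shulens, expected_aggregate, window=1000, step=100):
    """Return (start, I_r) per window.

    expected_aggregate(start, window) is a hypothetical callable that
    supplies A_e for the window beginning at start.
    """
    return [(start, math.log(a_o / expected_aggregate(start, window)))
            for start, a_o in window_sums(shulens, window, step)]


# Toy usage: every position has length 2 and the assumed expectation is
# 2 per position, so I_r is 0 in every window.
toy = window_ir([2] * 25, lambda start, window: 2 * window, window=10, step=5)
print(toy)
```
]]></p>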
            <p>The significance of differences between average values computed from sets of <it>I</it><sub>r </sub>values was tested using the two-sample Wilcoxon test as implemented in the statistics software R <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>We have implemented <it>I</it><sub>r </sub>computations in the program ir, which can be accessed via a web-interface at</p>
         <p>
            <url>http://adenine.biz.fh-weihenstephan.de/ir/</url>
         </p>
         <p>The C source code of a stand-alone version of the program is also freely available from this web site under the terms of the GNU General Public License.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>BH designed and implemented the software, performed data analysis and contributed to the writing of the manuscript. TW initiated the study of shortest unique substrings, derived the null distribution of their lengths, and contributed to the writing of the manuscript. Both authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank A. B&#246;rsch-Haubold, P. Pfaffelhuber, and C. Schl&#246;tterer for constructive criticism. BH is supported financially by Dehner Gartencenter GmbH and the Stifterverband der Deutschen Wissenschaft.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Repeated sequences in DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Britten</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Kohne</snm>
                  <fnm>DE</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1968</pubdate>
            <volume>161</volume>
            <fpage>529</fpage>
            <lpage>540</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.161.3841.529</pubid>
                  <pubid idtype="pmpid" link="fulltext">4874239</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Functional and evolutionary roles of long repeats in prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Rocha</snm>
                  <fnm>EPC</fnm>
               </au>
               <au>
                  <snm>Danchin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Viari</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Research in Microbiology</source>
            <pubdate>1999</pubdate>
            <volume>150</volume>
            <fpage>725</fpage>
            <lpage>733</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0923-2508(99)00120-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">10673010</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Synergy between sequence and size in large-scale genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Gregory</snm>
                  <fnm>TR</fnm>
               </au>
            </aug>
            <source>Nature Reviews Genetics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>699</fpage>
            <lpage>708</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg1674</pubid>
                  <pubid idtype="pmpid" link="fulltext">16151375</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Introduction</p>
            </title>
            <aug>
               <au>
                  <snm>Hofnung</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shapiro</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Research in Microbiology</source>
            <pubdate>1999</pubdate>
            <volume>150</volume>
            <fpage>577</fpage>
            <lpage>578</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0923-2508(99)00133-3</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Extensive repetitive DNA facilitates prokaryotic genome plasticity</p>
            </title>
            <aug>
               <au>
                  <snm>Aras</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Kang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Tschumi</snm>
                  <fnm>AI</fnm>
               </au>
               <au>
                  <snm>Harasaki</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Blaser</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences, USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <fpage>13579</fpage>
            <lpage>13584</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.1735481100</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Associations between inverted repeats and the structural evolution of bacterial genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Achaz</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Coissac</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Netter</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rocha</snm>
                  <fnm>EPC</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2003</pubdate>
            <volume>164</volume>
            <fpage>1279</fpage>
            <lpage>1289</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1462642</pubid>
                  <pubid idtype="pmpid" link="fulltext">12930739</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The desoxyribonucleic acid content of animal cells and its evolutionary significance</p>
            </title>
            <aug>
               <au>
                  <snm>Mirsky</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Ris</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>The Journal of General Physiology</source>
            <pubdate>1951</pubdate>
            <volume>34</volume>
            <fpage>451</fpage>
            <lpage>462</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1085/jgp.34.4.451</pubid>
                  <pubid idtype="pmpid" link="fulltext">14824511</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Initial sequencing and analysis of the human genome</p>
            </title>
            <aug>
               <au>
                  <cnm>International Human Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>860</fpage>
            <lpage>921</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057062</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Initial sequencing and comparative analysis of the mouse genome</p>
            </title>
            <aug>
               <au>
                  <cnm>Mouse Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <fpage>520</fpage>
            <lpage>561</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01262</pubid>
                  <pubid idtype="pmpid" link="fulltext">12466850</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Genome sequence of the brown Norway rat yields insights into mammalian evolution</p>
            </title>
            <aug>
               <au>
                  <cnm>Rat Genome Sequencing Project Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>428</volume>
            <fpage>493</fpage>
            <lpage>521</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature02426</pubid>
                  <pubid idtype="pmpid" link="fulltext">15057822</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Initial sequence of the chimpanzee genome and comparison with the human genome</p>
            </title>
            <aug>
               <au>
                  <cnm>The Chimpanzee Sequencing and Analysis Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>437</volume>
            <fpage>69</fpage>
            <lpage>87</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature04072</pubid>
                  <pubid idtype="pmpid" link="fulltext">16136131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Natural genetic variation caused by transposable elements in humans</p>
            </title>
            <aug>
               <au>
                  <snm>Bennett</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Coleman</snm>
                  <fnm>LE</fnm>
               </au>
               <au>
                  <snm>Tsui</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Pittard</snm>
                  <fnm>WS</fnm>
               </au>
               <au>
                  <snm>Devine</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2004</pubdate>
            <volume>168</volume>
            <fpage>933</fpage>
            <lpage>951</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1448813</pubid>
                  <pubid idtype="pmpid" link="fulltext">15514065</pubid>
                  <pubid idtype="doi">10.1534/genetics.104.031757</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Selfish DNA: the ultimate parasite</p>
            </title>
            <aug>
               <au>
                  <snm>Orgel</snm>
                  <fnm>LE</fnm>
               </au>
               <au>
                  <snm>Crick</snm>
                  <fnm>FHC</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1980</pubdate>
            <volume>284</volume>
            <fpage>604</fpage>
            <lpage>607</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/284604a0</pubid>
                  <pubid idtype="pmpid">7366731</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Selfish genes, the phenotype paradigm and genome evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Doolittle</snm>
                  <fnm>WF</fnm>
               </au>
               <au>
                  <snm>Sapienza</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1980</pubdate>
            <volume>284</volume>
            <fpage>601</fpage>
            <lpage>603</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/284601a0</pubid>
                  <pubid idtype="pmpid">6245369</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Origin of a substantial fraction of human regulatory sequences from transposable elements</p>
            </title>
            <aug>
               <au>
                  <snm>Jordan</snm>
                  <fnm>IK</fnm>
               </au>
               <au>
                  <snm>Rogozin</snm>
                  <fnm>IB</fnm>
               </au>
               <au>
                  <snm>Glazko</snm>
                  <fnm>GV</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Trends in Genetics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>68</fpage>
            <lpage>72</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(02)00006-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">12547512</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Megabase deletions of gene deserts result in viable mice</p>
            </title>
            <aug>
               <au>
                  <snm>N&#243;brega</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Plajzer-Frick</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Afzal</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>431</volume>
            <fpage>988</fpage>
            <lpage>993</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03022</pubid>
                  <pubid idtype="pmpid" link="fulltext">15496924</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Transposition of <it>hAT</it> elements links transposable elements and V(D)J recombination</p>
            </title>
            <aug>
               <au>
                  <snm>Zhou</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Atkinson</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Hickman</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>Dyda</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Craig</snm>
                  <fnm>NL</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>432</volume>
            <fpage>995</fpage>
            <lpage>1001</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03157</pubid>
                  <pubid idtype="pmpid" link="fulltext">15616554</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>REPuter &#8211; fast computation of maximal repeats in complete genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Kurtz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Schleiermacher</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>426</fpage>
            <lpage>427</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.5.426</pubid>
                  <pubid idtype="pmpid" link="fulltext">10366664</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>A clustering method for repeat analysis in DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Volfovsky</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Haas</snm>
                  <fnm>BJ</fnm>
               </au>
               <au>
                  <snm>Salzberg</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>0027.1</fpage>
            <lpage>0027.11</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/gb-2001-2-8-research0027</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>RepeatMasker</p>
            </title>
            <url>http://www.repeatmasker.org</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>The contribution of slippage-like processes to genome evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Evolution</source>
            <pubdate>1995</pubdate>
            <volume>41</volume>
            <fpage>1038</fpage>
            <lpage>1047</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF00173185</pubid>
                  <pubid idtype="pmpid">8587102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Cryptic simplicity in DNA is a major source of genetic variation</p>
            </title>
            <aug>
               <au>
                  <snm>Tautz</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Trick</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dover</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1986</pubdate>
            <volume>322</volume>
            <fpage>652</fpage>
            <lpage>656</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/322652a0</pubid>
                  <pubid idtype="pmpid" link="fulltext">3748144</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Genetica</source>
            <pubdate>2002</pubdate>
            <volume>115</volume>
            <fpage>93</fpage>
            <lpage>103</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1016028332006</pubid>
                  <pubid idtype="pmpid" link="fulltext">12188051</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Complexity: an internet resource for analysis of DNA sequence complexity</p>
            </title>
            <aug>
               <au>
                  <snm>Orlov</snm>
                  <fnm>YL</fnm>
               </au>
               <au>
                  <snm>Potapov</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>W628</fpage>
            <lpage>W633</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">441604</pubid>
                  <pubid idtype="pmpid" link="fulltext">15215465</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity</p>
            </title>
            <aug>
               <au>
                  <snm>Troyanskaya</snm>
                  <fnm>OG</fnm>
               </au>
               <au>
                  <snm>Arbell</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Koren</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Landau</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Bolshoy</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>679</fpage>
            <lpage>688</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.5.679</pubid>
                  <pubid idtype="pmpid" link="fulltext">12050064</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>An analysis of variance test for normality (complete samples)</p>
            </title>
            <aug>
               <au>
                  <snm>Shapiro</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Wilk</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>Biometrika</source>
            <pubdate>1965</pubdate>
            <volume>52</volume>
            <fpage>591</fpage>
            <lpage>611</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2333709</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>FYB (FYN binding protein) serves as a binding partner for lymphoid protein and FYN kinase substrate SKAP55 and a SKAP55-related protein in T cells</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Raab</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>da Silva</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Kraeft</snm>
                  <fnm>SK</fnm>
               </au>
               <au>
                  <snm>Rudd</snm>
                  <fnm>CE</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences, USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>8779</fpage>
            <lpage>8784</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.95.15.8779</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Isolation and mapping of EVX1, a human homeobox gene homologous to <it>even-skipped</it>, localized at the 5' end of HOX1 locus on chromosome 7</p>
            </title>
            <aug>
               <au>
                  <snm>Faiella</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>D'Esposito</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rambaldi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Acampora</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Balsofiore</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Stornaiuolo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mallamaci</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Migliaccio</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Gulisano</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Simeone</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Boncinelli</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1991</pubdate>
            <volume>19</volume>
            <fpage>6541</fpage>
            <lpage>6545</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">329215</pubid>
                  <pubid idtype="pmpid" link="fulltext">1684419</pubid>
                  <pubid idtype="doi">10.1093/nar/19.23.6541</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>The genetic organization of chromosomes</p>
            </title>
            <aug>
               <au>
                  <snm>Thomas Jr</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Annual Review of Genetics</source>
            <pubdate>1971</pubdate>
            <volume>5</volume>
            <fpage>237</fpage>
            <lpage>256</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1146/annurev.ge.05.120171.001321</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Genome comparison without alignment using shortest unique substrings</p>
            </title>
            <aug>
               <au>
                  <snm>Haubold</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Pierstorff</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>M&#246;ller</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wiehe</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>123</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1166540</pubid>
                  <pubid idtype="pmpid" link="fulltext">15910684</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-123</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <aug>
               <au>
                  <snm>Gusfield</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</source>
            <publisher>Cambridge: Cambridge University Press</publisher>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Practical methods for constructing suffix trees</p>
            </title>
            <aug>
               <au>
                  <snm>Tian</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Tata</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hankins</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Patel</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>The VLDB Journal</source>
            <pubdate>2005</pubdate>
            <volume>14</volume>
            <fpage>281</fpage>
            <lpage>299</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s00778-005-0154-8</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Calculate the Repetitiveness of DNA Sequences</p>
            </title>
            <url>http://adenine.biz.fh-weihenstephan.de/ir/</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Pruitt</snm>
                  <fnm>KD</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Maglott</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Database issue</issue>
            <fpage>D501</fpage>
            <lpage>D504</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">539979</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608248</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <aug>
               <au>
                  <cnm>R Development Core Team</cnm>
               </au>
            </aug>
            <source>R: A Language and Environment for Statistical Computing</source>
            <publisher>R Foundation for Statistical Computing, Vienna, Austria</publisher>
            <pubdate>2004</pubdate>
            <url>http://www.R-project.org</url>
         </bibl>
      </refgrp>
   </bm>
</art>
