<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-3-6</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Universal sequence map (USM) of arbitrary discrete sequences</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Almeida</snm>
               <mi>S</mi>
               <fnm>Jonas</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>almeidaj@musc.edu</email>
            </au>
            <au id="A2">
               <snm>Vinga</snm>
               <fnm>Susana</fnm>
               <insr iid="I2"/>
               <email>svinga@itqb.unl.pt</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Dept Biometry &amp; Epidemiology, Medical Univ South Carolina, 135 Cannon Street, Suite 303, PO Box 250835, Charleston SC 29425, USA</p>
            </ins>
            <ins id="I2">
               <p>Inst. Tecnologia Qu&#237;mica e Biol&#243;gica Univ. Nova Lisboa, Av. da Rep&#250;blica (EAN), PO Box 127, 2781-901 Oeiras, Portugal</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2002</pubdate>
         <volume>3</volume>
         <issue>1</issue>
         <fpage>6</fpage>
         <url>http://www.biomedcentral.com/1471-2105/3/6</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/1471-2105-3-6</pubid>
               <pubid idtype="pmpid">11895567</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>02</day>
               <month>11</month>
               <year>2001</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>05</day>
               <month>2</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>05</day>
               <month>2</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Almeida and Vinga; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space conserving all its statistical properties. Ideally, such a representation would allow scale-independent sequence analysis &#8211; without the context of fixed memory length. A simple example would be the ability to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have successfully identified such an iterative function for bijective mapping &#968; of discrete sequences into objects of continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences with an arbitrary length and arbitrary number of unique units and generates a representation where map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: <url>http://bioinformatics.musc.edu/~jonas/usm/</url>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>USM is shown to enable a statistical mechanics approach to sequence analysis. The scale independent representation frees sequence analysis from the need to assume a memory length in the investigation of syntactic rules.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but not been fully realized <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. The basic idea is that sequences of symbols, such as nucleotides in genomes, amino acids in proteomes, repeated sequences in MLST [Multi Locus Sequence Typing, 4], words in languages or letters in words, would define trajectories in this continuous space conserving the statistical properties of the original sequences <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Accordingly, the coordinate position of each unit would uniquely encode both its identity and its context, i.e. the identity of its neighbors <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Ideally, the position should be scale-independent, such that the extraction of the encompassing sequence can be performed with any resolution, leading to an oligomer of arbitrary length. The pioneering work by Jeffrey published in 1990 <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> achieved this for genomic sequences by using the Chaos Game Representation technique (CGR), defining a unit-square where each corner corresponds to one of the 4 possible nucleotides. Subsequent work further explored the properties of CGR of biological sequences, but two main obstacles prevented the realization of its early promise &#8211; lack of scalability with regard to the number of possible unique units and inability to represent succession schemes. Meanwhile, Markov Chain theory already offered a solid foundation for the identification of discrete spaces to represent sequences as cross-tabulated conditional probabilities &#8211; Markov transition tables.
This Bayesian technique is widely explored in bioinformatic applications seeking to measure homology and align sequences <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. In a recent report <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> we have shown that, for genomic sequences, Markov tables are in fact a special case of CGR, contrary to what had been suggested previously <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. This raised the prospect of an advantageous use of iterative maps as state spaces not only for representation of sequences but also to identify scale independent stochastic models of the succession scheme. That work <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> is hereby extended and further generalized to be applicable to sequences with arbitrary numbers of unique component units, without sacrificing the inverse correlation between distance in the map and sequence similarity independent of position. Accordingly, the technique is named Universal Sequence Map (USM).</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>The Results are divided into two sections. The first section presents the foundations for identifying an iterative function with the desired properties. The second section describes the implementation of the algorithm, illustrated with a sample data set. Both sections are best understood by using the accompanying web-based tool (see Abstract for address) where the different steps of the procedure can be verified and reproduced with the test data or the reader's own data.</p>
         <sec>
            <st>
               <p>Conceptual foundations</p>
            </st>
            <p>The USM generalization proposed here is achieved by observing two stipulations: A &#8211; alternative units in the iterative map are positioned in distinct corners of <it>unit block structures</it>, and B &#8211; sequence processing is bi-directional.</p>
            <p>Basis for USM generalization:</p>
            <p>A. Each unique unit is referenced in the map for positions that are at equal <it>n-distances</it> from each other, and possibly, but not necessarily, defining a complete <it>block structure</it><abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. <it>n-distances</it> are defined as the maximum distance along any dimension, e.g. <it>n-distance</it> between [<it>a</it><sub>1</sub>, <it>a</it><sub>2</sub>, ...,<it>a</it><sub><it>n</it></sub>] and [<it>b</it><sub>1</sub>, <it>b</it><sub>2</sub>, ...,<it>b</it><sub><it>n</it></sub>] is <it>max(|b</it><sub>1</sub> - <it>a</it><sub>1</sub>|, |<it>b</it><sub>2</sub>- <it>a</it><sub>2</sub>|, ..., |<it>b</it><sub><it>n</it></sub> - <it>a</it><sub><it>n</it></sub>|), see also Equation 3. It will be shown that this stipulation leads to the definition of spaces where distance is inversely proportional to sequence similarity, independent of position. In this respect, USM departs from previous attempts to generalize Chaos Game Representation that conserve the bi-dimensionality of the original CGR representation <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>.</p>
            <p>B. The iterative positioning is performed in both directions. Therefore, there will be two sets of coordinates, the result of forward and backward iterative operations. It will be shown that, by adding backward and forward map distances between two positions, the number of identical units in the encompassing sequences can be extracted directly from the USM coordinates. As a consequence, two arbitrary positions can be compared, and the number of contiguous similar units is extracted by an algebraic operation that relies solely on the USM coordinates of those very two positions.</p>
         </sec>
         <sec>
            <st>
               <p>Implementation of USM algorithm</p>
            </st>
            <p>The algorithm will first be illustrated for the first and last stanzas of Wendy Cope's poem "The Uncertainty of the Poet" (14), respectively, "I am a poet. I am very fond of bananas." and "I am of very fond bananas. Am I a poet?". The procedure includes four steps:</p>
            <p>1. Identification of unique sequence units &#8211; e.g. these two stanzas have 19 unique characters, (table <tblr tid="T1">1</tblr>), i.e. <it>uu = 19.</it></p>
            <p>2. Replacement of each unique unit (in this case units are alphabetic characters) by a unique binary number &#8211; e.g. in Table <tblr tid="T1">1</tblr> each of the 19 unique units is replaced by its rank order minus one, represented as a binary number. Other arrangements are possible leading to the same final result as discussed below. The minimum number of dimensions necessary to accommodate <it>uu</it> unique units, <it>n,</it> is the binary logarithm of <it>uu</it> rounded up to the next integer: <it>n = ceil(log</it><sub>2</sub><it>(uu)).</it> For W. Cope's stanzas, <it>n = ceil(log</it><sub>2</sub><it>(19)) = 5.</it> The binary reference coordinates for the unique units are defined by the numerals of the binary code &#8211; for example, <it>a</it> will be assigned to the position <it>U</it><sub>'<it>a</it>'</sub> = <it>[0,0,1,0,1].</it> Each symbol is represented as a corner in an n-dimensional cube (Table <tblr tid="T1">1</tblr>). The purpose of these first two steps is to guarantee that the reference positions for each unique sequence unit component are equidistant (stipulation A) in the <it>n-metric</it> defined above. Any other procedure resulting in equidistant unique positions will lead to the same final results independently of the actual binary numbers used or the number of dimensions used to contain them.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Binary codes for the 19 possible units occurring in the two stanzas. The first unit is a space character " ".</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Unit</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Bin. Code</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>00000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>.</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00001</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>?</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00010</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>A</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00011</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>a</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00101</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>b</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00110</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>d</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00111</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>e</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>f</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01001</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>I</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>00100</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>m</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01010</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>n</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01011</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>o</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01100</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>p</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01101</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>r</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01110</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>s</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>01111</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>t</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>v</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10001</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>y</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10010</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
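            <p>Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the code of the accompanying web tool; the function name <it>binary_codes</it> is hypothetical, and ranking the units by character code point is an assumption that happens to reproduce the code given above for <it>a</it>.</p>

```python
from math import ceil, log2


def binary_codes(*sequences):
    """Map each unique unit to a corner of the n-dimensional unit cube.

    Units are ranked in sorted order and coded as (rank - 1) in binary.
    Sorting by code point is an assumption of this sketch.
    """
    units = sorted(set("".join(sequences)))
    uu = len(units)
    n = max(1, ceil(log2(uu)))  # n = ceil(log2(uu)), at least one dimension
    codes = {}
    for rank_minus_one, unit in enumerate(units):
        bits = format(rank_minus_one, "0{}b".format(n))
        codes[unit] = [int(b) for b in bits]
    return codes, n


stanza1 = "I am a poet. I am very fond of bananas."
stanza2 = "I am of very fond bananas. Am I a poet?"
codes, n = binary_codes(stanza1, stanza2)
print(len(codes), n)  # 19 5
print(codes["a"])     # [0, 0, 1, 0, 1]
```

            <p>Any other assignment of units to distinct corners would serve equally well, as noted in step 2.</p>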
            <p>3. The CGR procedure <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> (Eq. 1) is applied independently to each coordinate, <it>j = 1,2, ...,n,</it> for each unit, <it>i,</it> in the sequence of length <it>k, u</it><sub><it>j</it></sub><sup><it>(i)</it></sup> with <it>i = 1,2, ...,k,</it> and starting with a random map position taken from a uniform distribution in [0,1]<sup><it>n</it></sup>, i.e. <it>Unif([0,1]</it><sup><it>n</it></sup>). The random seed is not fundamentally different from using the middle position in the map as is conventional in CGR and it has the added feature that it prevents the invalidation of the inverse logarithmic proportionality of <it>n-distance</it> to sequence similarity <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> for sequences that start or end with the same motif.</p>
            <p>For a sequence with <it>k</it> units, the USM positions <it>i = 1,..., k</it> for the <it>j = 1,..., n</it> dimensions are determined as follows:</p>
            <p>
               <graphic file="1471-2105-3-6-i1.gif"/>
            </p>
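            <p>Because Eq. 1 is reproduced here only as an image, the sketch below assumes it is the usual CGR update, in which each coordinate moves halfway from its previous value towards the reference corner of the current unit. The function name <it>forward_usm</it> and the corner assignment for DNA are hypothetical.</p>

```python
import random


def forward_usm(unit_coords, seed=None):
    """Forward iterated map: each position moves halfway to its unit's corner.

    unit_coords: list of k reference corners, each a list of n values in {0, 1}.
    Returns k positions in [0, 1]^n; the start is drawn from Unif([0,1]^n).
    """
    rng = random.Random(seed)
    n = len(unit_coords[0])
    position = [rng.random() for _ in range(n)]
    trajectory = []
    for corner in unit_coords:
        # Assumed form of Eq. 1: new = old + 0.5 * (corner - old)
        position = [x + 0.5 * (u - x) for x, u in zip(position, corner)]
        trajectory.append(position)
    return trajectory


# DNA example on the unit square (corner assignment is an assumption).
corners = {"A": [0, 0], "C": [0, 1], "G": [1, 0], "T": [1, 1]}
traj = forward_usm([corners[u] for u in "GATTACA"], seed=1)
print(traj[-1])
```

            <p>Under this update, each coordinate of a position lies in the half of [0,1] that contains the corner of the unit just processed.</p>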
            <p>4. The previous step generated <it>k</it> positions in an <it>n</it>-dimensional space by processing the sequence forward (Eq. 1). This subsequent step adds an additional set of <it>n</it> dimensions by implementing the same procedure backward (Eq. 2), again starting at random positions for each coordinate. Consequently the first <it>n</it> dimensions of USM will be referred to as defining a <it>forward map</it> and the second set of <it>n</it> dimensions will define a <it>backward map.</it> Put together, the bidirectional USM map defines a <it>2n-unit block structure.</it></p>
            <p>The <it>n</it> additional backward coordinates are determined as follows:</p>
            <p>
               <graphic file="1471-2105-3-6-i2.gif"/>
            </p>
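            <p>A bidirectional map can be sketched by running the same half-step iteration over the reversed sequence and re-aligning the result. As with Eq. 1, the exact form of Eq. 2 is assumed rather than taken from the image, and the names <it>iterated_map</it> and <it>bidirectional_usm</it> are hypothetical.</p>

```python
import random


def iterated_map(corners, rng):
    """Half-step iterated map over a list of corners, from a random start."""
    position = [rng.random() for _ in corners[0]]
    out = []
    for corner in corners:
        position = [x + 0.5 * (u - x) for x, u in zip(position, corner)]
        out.append(position)
    return out


def bidirectional_usm(corners, seed=None):
    """Return k positions with 2n coordinates: forward map, then backward map."""
    rng = random.Random(seed)
    forward = iterated_map(corners, rng)
    # Backward map: iterate from the last unit, then restore sequence order.
    backward = iterated_map(corners[::-1], rng)[::-1]
    return [f + b for f, b in zip(forward, backward)]


codes = {"A": [0, 0], "C": [0, 1], "G": [1, 0], "T": [1, 1]}  # assumed corners
usm = bidirectional_usm([codes[u] for u in "GATTACA"], seed=7)
print(len(usm), len(usm[0]))  # 7 4
```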
            <p>The forward USM map for genomic sequences, where <it>uu = 4,</it> and, consequently, <it>n = 2,</it> is the same as the result generated by CGR. However, by freeing the iterative map from the bi-dimensional constraint of conventional CGR, the USM forward map alone achieves the goal of producing a scale-independent representation of sequences with an arbitrary number of unique units. These properties will be briefly illustrated with W. Cope's example. The 16<sup>th</sup> unit of the first stanza, "I am a poet. I am very fond of bananas.", has USM coordinates <it>USM</it><sub><it>[1,...,2n]</it></sub><sup><it>(16)</it></sup> = <it>[0.02 0.01 0.63 0.00 0.53 0.07 0.30 0.52 0.27 0.57].</it> The first <it>n = 5</it> coordinates, the position in the forward map, can now be used, by reversing equation 1 <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>, not only to extract the identity of the unit <it>i = 16</it> but also the identity of the preceding units:</p>
            <p>- using forward coordinates alone <it>[0.0156 0.0138 0.6314 0.0001 0.5338]</it></p>
            <p>
               <graphic file="1471-2105-3-6-i12.gif"/>
            </p>
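            <p>The reversal can be sketched as repeated doubling: the leading binary digit of each coordinate gives the code of the current unit, and doubling the coordinate after removing that digit exposes the preceding unit. The sketch assumes the half-step form of Eq. 1; the function name <it>decode</it> is hypothetical and the lookup table lists only the Table 1 codes needed here.</p>

```python
def decode(position, steps):
    """Read back `steps` units from one position by inverting the half-step."""
    position = list(position)
    recovered = []
    for _ in range(steps):
        bits = [1 if x >= 0.5 else 0 for x in position]
        recovered.append(bits)
        # Undo one iteration: old = 2 * new - corner
        position = [2 * x - b for x, b in zip(position, bits)]
    return recovered


# Forward coordinates of unit 16 of the first stanza, as quoted above.
forward = [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
table = {"00101": "a", "00000": " ", "00100": "I", "00001": "."}  # subset of Table 1
keys = ["".join(str(b) for b in bits) for bits in decode(forward, 5)]
decoded = [table[key] for key in keys]
print(decoded)  # ['a', ' ', 'I', ' ', '.']
```

            <p>The units come out most recent first, so the result reads ". I a" backwards. With only four printed decimals, further steps cannot be trusted.</p>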
            <p>The same procedure can be applied to the remaining <it>n = 5</it> coordinates, the position in the backward map, to extract the identity of the succeeding units, now ordered backwards.</p>
            <p>- using backward coordinates alone <it>[0.0703 0.3004 0.5169 0.2742 0.5652]</it></p>
            <p>
               <graphic file="1471-2105-3-6-i13.gif"/>
            </p>
            <p>The length of the sequence that can be recovered from a position in the CGR or USM space is only as long as the resolution, in bits, of the coordinates themselves. In addition, the relevance of these iterative techniques is not associated with the property of recovering sequences as much as with the ability to recover the succession schemes, e.g. the Markov probability tables. It has been recognized for almost a decade that the density of positions in unidirectional, bi-dimensional, iterated CGR maps (e.g. of genomic sequences, <it>uu = 4</it> &#8594; <it>n = 2</it>) defines a Markov table <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. The complete accommodation of Markov chains in unidirectional USM (i.e. either forward or backward, which is equivalent to a multidimensional solution for CGR) can be quickly established by noting that the identity of a quadrant is set by its middle coordinates <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. In order to extract the Markov format, for an arbitrary integer order <it>ord,</it> each of the two n-unit hypercubes, the set of <it>n</it> forward or backward coordinates, would be divided into <it>q = 2</it><sup><it>n.(ord+1)</it></sup> equal quadrants and the quadrant frequencies rearranged <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. The use of <it>quadrant</it> to designate what is in fact a sub-unit hypercube is in consonance with the preceding work on bidimensional CGR maps <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, where it was shown that since any number of subdivisions can be considered in a continuous domain, the density distribution becomes an order-free Markov Table that accommodates both integer and fractal memory lengths. The extraction of Markov chain transition tables from USM representations, both forward and backward, is included in the accompanying web-based application (see Abstract).</p>
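            <p>The quadrant counting can be sketched for a short DNA sequence as follows. The sketch assumes the half-step form of Eq. 1 and a hypothetical corner assignment; all function names are illustrative. It returns counts of words of length <it>ord</it> + 1, from which transition probabilities follow by normalizing over the last unit.</p>

```python
import random
from collections import Counter

CODES = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}  # assumed corners
UNITS = {code: unit for unit, code in CODES.items()}


def forward_map(seq, rng):
    position = [rng.random() for _ in range(2)]
    out = []
    for unit in seq:
        position = [x + 0.5 * (u - x) for x, u in zip(position, CODES[unit])]
        out.append(tuple(position))
    return out


def quadrant_counts(positions, order):
    """Count positions per quadrant with 2**(order + 1) cuts per dimension.

    The first `order` positions are skipped: they lack a full context.
    """
    cuts = 2 ** (order + 1)
    counts = Counter()
    for position in positions[order:]:
        # min() guards the edge case of a coordinate equal to 1.0
        cell = tuple(min(int(x * cuts), cuts - 1) for x in position)
        counts[cell] += 1
    return counts


def cell_to_word(cell, order):
    """Translate a quadrant index into the (order + 1)-unit word it encodes."""
    word = []
    for level in range(order, -1, -1):  # highest bit is the current unit
        code = tuple((c >> level) % 2 for c in cell)
        word.append(UNITS[code])
    return "".join(reversed(word))      # oldest unit first


seq = "GATTACAGATTACA"
counts = quadrant_counts(forward_map(seq, random.Random(0)), order=1)
table = {cell_to_word(cell, 1): c for cell, c in counts.items()}
print(table)  # dinucleotide counts of seq
```

            <p>Floating-point coordinates hold only about 53 bits, so this sketch is reliable only for modest values of <it>ord</it> and sequences without very long runs of one unit.</p>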
            <p>Above, the USM procedure was shown to allow for the representation of sequences as multidimensional objects without loss of identity or context. These objects can now be analyzed to characterize the sequences for quantities such as similarity between segments or entropy <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> within the sequence. In figure <figr fid="F1">1</figr> the 10-dimensional object defined by the USM positions of the two stanzas was projected onto 3 dimensions by principal component analysis. The dimensionality reduction by principal factor extraction serves visualization purposes only. As established above, the minimum necessary dimensionality of the USM state space is set by the binary logarithm of the number of unique units. Nevertheless, the sequence variance associated with each component is provided in the figure legend. In figure <figr fid="F1">1a</figr>, the segments " <it>very fond</it> of" in the two stanzas are linked by solid lines to highlight the fact that sequence similarity is reflected by spatial proximity of USM coordinates. The representation is repeated in Figure <figr fid="F1">1b</figr> with solid lining of the segment " <it>bananas</it>". The matching of the two segments of the second stanza (light) to the similar segments of the first stanza (dark) is, again, visually apparent.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Representation of the USM of the two stanzas.</p>
               </caption>
               <text>
                  <p>Representation of the USM of the two stanzas, respectively dark and light spheres connected by dashed lines, in a reduced 3-dimension space obtained using the first three principal components, PC<sub>1,2,3</sub>. In a) the units corresponding to the segment "<it>very fond of</it>" in both stanzas are connected by solid lines. The procedure is repeated in b) for the segment "<it>bananas</it>". These figures illustrate the property that similar segments converge in the USM representation, which is reflected by the docking of homologous units. The factorization for dimensionality reduction serves visualization purposes only. The variance represented by each of the three principal components is 40%, 13% and 11%, respectively.</p>
               </text>
               <graphic file="1471-2105-3-6-1"/>
            </fig>
            <p>The USM algorithm determines that similar sequences, or segments of sequences, will have converging iterated trajectories: the distance will be cut in half for every consecutive similar unit. This property was noticed before for CGR of genomic sequences <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, and will be further explored here for the USM generalization. In that preceding work it was shown that the number of similar consecutive units can be approximated by a symmetrical logarithmic transformation of the maximum distance between two positions in either of the dimensions (<it>n-distance</it>), <it>d.</it></p>
            <p><it>d</it> = -log<sub>2</sub> (Max|&#916;<it>USM</it><sub><it>unidirectional</it></sub>|) &#8195;&#8195;&#8195; (Eq. 3)</p>
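            <p>Eq. 3 translates directly into code. In the sketch below, the function names are hypothetical and the half-step update used in the demonstration is the assumed form of Eq. 1; it shows two trajectories that start at <it>n-distance</it> 1 and then share three units.</p>

```python
from math import inf, log2


def n_distance(p, q):
    """Maximum absolute difference along any dimension."""
    return max(abs(a - b) for a, b in zip(p, q))


def similarity(p, q):
    """Eq. 3: d = -log2 of the n-distance between two unidirectional positions."""
    r = n_distance(p, q)
    return inf if r == 0 else -log2(r)


p, q = [0.0, 1.0], [1.0, 0.0]            # opposite corners: n-distance 1, d = 0
for corner in ([1, 0], [1, 1], [0, 1]):  # three units shared by both sequences
    p = [x + 0.5 * (u - x) for x, u in zip(p, corner)]
    q = [x + 0.5 * (u - x) for x, u in zip(q, corner)]
print(similarity(p, q))  # 3.0
```

            <p>Each shared unit halves the distance, so <it>d</it> grows by one per unit.</p>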
            <p>Since the USM coordinates include two CGR iterations per dimension, one forward and another backward, two distances can be extracted. The first 1,..., <it>n</it> coordinates define a forward similarity estimate, <it>d</it><sub><it>f</it></sub>, and the second <it>n+1,..., 2n</it> coordinates can be used to estimate backward similarity, <it>d</it><sub><it>b</it></sub>. The former measures similarity with regard to the units preceding the one being compared and the latter does the same for those succeeding that same unit. Therefore, the forward and backward distances between the positions <it>i</it> and <it>j</it> of two sequences, <it>a</it> and <it>b,</it> with a length of <it>k</it><sub><it>a</it></sub> and <it>k</it><sub><it>b</it></sub>, respectively, would be calculated as described by equation 4, defining two rectangular matrices, <it>d</it><sub><it>f</it></sub> and <it>d</it><sub><it>b</it></sub>, of size <it>k</it><sub><it>a</it></sub> &#215; <it>k</it><sub><it>b</it></sub> (Fig. <figr fid="F2">2a,2b</figr>).</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Cross-tabulation of similarity between positions of the two stanzas.</p>
               </caption>
               <text>
                  <p>Cross-tabulation of similarity between positions of the two stanzas. The figures can be reproduced using the accompanying web-based USM tool (see Abstract for URL address; test data also included): a) forward distance, <it>d</it><sub><it>f</it></sub> (Eq. 4); b) backward distance, <it>d</it><sub><it>b</it></sub> (Eq. 4); c) bi-directional similarity, <it>D,</it> compensated for &#966;<sub><it>P</it>3 = 0.55,<it>n</it> = 4.25</sub> (Eq. 11). Notice that the values of diagonals between similar segments estimate the number of units in the segments, although each <it>D</it> value is computed solely from a single pairwise comparison of USM coordinates; d) compounded similarity, <it>dc,</it> with a maximum for the mid-position of the similar segments (Eq. 12).</p>
               </text>
               <graphic file="1471-2105-3-6-2"/>
            </fig>
            <p>
               <graphic file="1471-2105-3-6-i3.gif"/>
            </p>
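            <p>The two matrices can be sketched as below, assuming Eq. 4 applies the logarithmic transformation of Eq. 3 separately to the first <it>n</it> and the last <it>n</it> coordinates of every pair of positions. The function name and the toy coordinates are hypothetical.</p>

```python
from math import inf, log2


def directional_distances(usm_a, usm_b, n):
    """Return (d_f, d_b), two k_a x k_b matrices as nested lists.

    usm_a, usm_b: lists of positions with 2n coordinates each
    (forward map first, backward map second).
    """
    def d(p, q):
        r = max(abs(x - y) for x, y in zip(p, q))
        return inf if r == 0 else -log2(r)

    d_f = [[d(pa[:n], pb[:n]) for pb in usm_b] for pa in usm_a]
    d_b = [[d(pa[n:], pb[n:]) for pb in usm_b] for pa in usm_a]
    return d_f, d_b


# Toy positions with n = 1 (one forward and one backward coordinate each).
usm_a = [[0.25, 0.5], [0.75, 0.5]]
usm_b = [[0.5, 0.75], [0.25, 0.5], [0.875, 0.25]]
d_f, d_b = directional_distances(usm_a, usm_b, n=1)
print(len(d_f), len(d_f[0]))  # 2 3
print(d_f[0][0], d_b[0][0])   # 2.0 2.0
```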
            <p>However, the values of <it>d</it> necessarily overestimate the number of similar contiguous units preceding (<it>d</it><sub><it>f</it></sub>, illustration for stanza comparison in Fig. <figr fid="F2">2a</figr>) or succeeding (<it>d</it><sub><it>b</it></sub>, illustration for stanza comparison in Fig. <figr fid="F2">2b</figr>) the positions being compared. The value of <it>d</it> would be the exact number of contiguous similar units, <it>h,</it> if the starting positions for the similar segments were at an <it>n-distance</it> of 1, e.g. if they were in different corners of the unit hyper-dimensional USM cube. Since the initial distance is always somewhat smaller, the homology, <it>h,</it> measured as the number of consecutive similar units, will be smaller than <it>d</it> (Eq. 5).</p>
            <p>
               <graphic file="1471-2105-3-6-i4.gif"/>
            </p>
            <p>The contribution of &#966; to the similarity distance, <it>d,</it> can be estimated from the distribution of positions in the USM map of a random sequence. A uniformly random sequence <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> will occupy the USM space uniformly, and, for that matter, so will the random seed of forward and backward iterative mapping, respectively equations 1 and 2. Therefore, a uniform distribution is an appropriate starting point to estimate the effect of &#966;, the over-determination of <it>h</it> by <it>d</it> (Eq. 5). Accordingly, for a given <it>x&#8712; [0,1],</it> the probability, <it>P</it><sub><it>o</it></sub>, that any two coordinates, <it>x</it><sub>1</sub> and <it>x</it><sub>2</sub>, are located within a radius <it>r &#8712; (0,1)</it> is given by Equation 6.</p>
            <p>
               <graphic file="1471-2105-3-6-i5.gif"/>
            </p>
            <p>Since <it>P</it><sub><it>o</it></sub>(<it>r</it>) is the probability of two points chosen randomly from a uniform distribution <it>Unif([0,1])</it> being at a distance less than <it>r</it> from each other, for any set of <it>n</it> coordinates in the USM, the likelihood of finding another position within a block distance of <it>r</it> is described by raising equation 6 to the power <it>n</it>. Finally, recalling from equation 3 that sequence similarity can be obtained by a logarithmic transformation of <it>r</it>, the probability that the unidirectional coordinates of two random sequences are at a similar length <it>d > &#966;</it> is described by equation 7. The simplicity of the expansion for higher dimensions highlights the order-statistics properties <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> of the <it>n-metric</it> introduced above (Eq. 3). It is noteworthy that the model for the likelihood of over-determination is the null model, i.e. the comparison of actual sequences is evaluated against the hypothesis that the similarity observed happened by chance alone.</p>
            <p>
               <graphic file="1471-2105-3-6-i6.gif"/>
            </p>
            <p>Finally, it is also relevant to recall that the null model for <it>d</it> (Eq. 7 for unidirectional comparisons; bi-directional null models are derived below) allows the generalization to non-integer dimensions. For example, the 19 unique units found in the two stanzas (Table <tblr tid="T1">1</tblr>) define forward and backward USM maps in 5 dimensions each. However, the 5<sup>th</sup> dimension is not fully utilized, as that would require <it>2</it><sup>5</sup> = <it>32</it> unique units. Therefore, if there is no requirement for an integer result, the effective value of <it>n</it> for the two stanzas can be refined as being <it>n = log</it><sub>2</sub><it>(19) = 4.25.</it></p>
            <p>An estimation of bi-directional similarity will now be introduced that adds the forward and backward distance measures <it>d</it><sub><it>f</it></sub> and <it>d</it><sub><it>b</it></sub>. The motivation for this new estimate is the determination of the length of the entire similar segment between two sequences solely by comparing any two homologous units. Accordingly, since <it>d</it><sub><it>f</it></sub> is an estimate of preceding similarity and <it>d</it><sub><it>b</it></sub> provides the succeeding similarity equivalent, the sum of the two similar distances, <it>D</it> (Eq. 8), will estimate the bi-directional similarity, i.e. the length of the similar segment, <it>H.</it></p>
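            <p>The unidirectional null model with a non-integer <it>n</it> can be sketched numerically. Equations 6 and 7 are reproduced here only as images, so the closed forms used below are assumptions reconstructed from the surrounding text: a per-dimension probability of 2r - r^2 for two uniform coordinates lying within <it>r</it> of each other, raised to the power <it>n</it>, with r = 2^(-&#966;). The function names are hypothetical and are not part of the authors' MATLAB library.</p>
```python
import math

def p1_exceeds(phi, n):
    """Probability that two random positions in n dimensions give d > phi.

    Assumed forms (reconstructed, not copied from the article):
    Po(r) = 2r - r**2 for one dimension, raised to the power n,
    with the radius r related to phi by r = 2**(-phi).
    """
    r = 2.0 ** (-phi)
    return (2.0 * r - r * r) ** n

def phi_at(p, n):
    """Invert p1_exceeds analytically: the phi at which P1 equals p."""
    po = p ** (1.0 / n)               # per-dimension probability
    r = 1.0 - math.sqrt(1.0 - po)     # root of 2r - r**2 = po lying in (0, 1)
    return -math.log2(r)

n_eff = math.log2(19)                 # effective dimensionality of the two stanzas
phi_median = phi_at(0.5, n_eff)       # median unidirectional over-determination
print(round(n_eff, 2), round(phi_median, 2))
```
            <p>Under these assumptions the median over-determination for <it>n = log</it><sub>2</sub><it>(19)</it> comes out at about 0.71 units, which agrees with the value quoted for the two stanzas elsewhere in this article.</p>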
            <p>
               <graphic file="1471-2105-3-6-i7.gif"/>
            </p>
            <p>As illustrated later in the implementation, for pairwise comparisons of homologous units of similar segments, all values of <it>D</it> and, consequently, of &#966;, are exactly the same. This result could possibly have been anticipated from the preceding work <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> by noting that the values of <it>d</it> for two adjacent homologous units differ by exactly one unit. However, this result was in fact a surprise, and one with far-reaching fundamental and practical implications.</p>
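            <p>The constancy of <it>D</it> along a shared segment can be checked with a minimal sketch. It is not the authors' MATLAB implementation: the binary corner encoding, the half-step iteration, the base-2 logarithm of the largest coordinate difference and, in particular, the fixed seed of 0.5 (the article uses a random seed) are assumptions made here for a reproducible toy example, and all function names are hypothetical.</p>
```python
import math

def encode(alphabet):
    """Assign each unique unit a corner of the unit hypercube (binary code)."""
    n = max(1, math.ceil(math.log2(len(alphabet))))
    return {u: [(i // 2 ** k) % 2 for k in range(n)]
            for i, u in enumerate(sorted(alphabet))}

def usm(seq, corners, seed=0.5):
    """Iterate half-way toward the corner of each successive unit."""
    n = len(next(iter(corners.values())))
    point = [seed] * n                      # assumed fixed seed, for repeatability
    coords = []
    for unit in seq:
        point = [(x + c) / 2.0 for x, c in zip(point, corners[unit])]
        coords.append(point)
    return coords

def d(p, q):
    """Similarity estimate: -log2 of the largest coordinate difference."""
    gap = max(abs(x - y) for x, y in zip(p, q))
    return math.inf if gap == 0 else -math.log2(gap)

def bidirectional(a, b):
    """Matrix of D = d_f + d_b for every pair of positions in a and b."""
    corners = encode(set(a) | set(b))
    fa, fb = usm(a, corners), usm(b, corners)
    # Backward maps: iterate over the reversed sequence, then restore order.
    ba = usm(a[::-1], corners)[::-1]
    bb = usm(b[::-1], corners)[::-1]
    return [[d(fa[i], fb[j]) + d(ba[i], bb[j]) for j in range(len(b))]
            for i in range(len(a))]

# Two toy sequences sharing the segment GATTACA at positions 2 to 8.
a = "CCGATTACATT"
b = "TAGATTACAGC"
D = bidirectional(a, b)
print([D[i][i] for i in range(2, 9)])
```
            <p>For this toy input the seven aligned positions of the shared segment all return the same value, <it>D</it> = 10, that is, the segment length of 7 plus an over-determination of 3. The forward and backward terms individually change by one unit from one position to the next, and the changes cancel in the sum.</p>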
            <p>Similarly to unidirectional similarity estimation, <it>d,</it> the bi-directional estimate, <it>D,</it> being the sum of two overestimates, is also overestimated by a quantity to be defined, &#966; (Eq. 8). The derivation of an expression for the bi-directional overestimation will require the decomposition of <it>P</it><sub>1</sub> (Eq. 7) for two cases, comparison between unidirectional coordinates of similar quadrants, <it>P</it><sub>1<it>a</it></sub>, and of opposite quadrants, <it>P</it><sub>1<it>b</it></sub>, as described in equation 9. Recalling from equation 2, positions in the same quadrant correspond to sequence units with the same identity, and positions in opposite quadrants correspond to comparison between coordinates of units with a different identity.</p>
            <p>
               <graphic file="1471-2105-3-6-i8.gif"/>
            </p>
            <p>The need for the distinction between same and opposite quadrant comparison, which is to say between similar and between dissimilar sequence units, is caused by the fact that same quadrant comparisons are more likely to lead to higher values of <it>d.</it> As illustrated above for the 16<sup>th</sup> unit of the first stanza, the forward and backward coordinates must fall in the same quadrant. Consequently, the similar pattern of same and opposite quadrant comparisons for each dimension will be reflected as a bias in the bi-directional overestimation. The determination of probability, <it>P</it><sub>2</sub>, of over-determination between sums of independent unidirectional similarity estimates is derived in equation 10.</p>
            <p>
               <graphic file="1471-2105-3-6-i9.gif"/>
            </p>
            <p>The probability of bi-directional over-determination can now be established by using the same and opposite unidirectional comparison expressions presented in Equation 9. The resulting expression for similarity over-determination by the distance between bidirectional USM coordinates, <it>P</it><sub>3</sub>, is presented in equation 11.</p>
            <p>
               <graphic file="1471-2105-3-6-i10.gif"/>
            </p>
            <p>In figure <figr fid="F3">3</figr>, the probability distributions for both unidirectional (<it>P</it><sub>1</sub>, in gray) and bidirectional (<it>P</it><sub>3</sub>, in black) comparisons are represented for different dimensions, <it>n.</it> It is clearly apparent that the over-determination becomes much less significant as dimensionality increases. From a practical point of view, the over-determination is of little consequence because the computational load of comparing sequences corresponds mostly to the identification of candidate pairing combinations. The fact that the n-metric unidirectional distances, <it>d</it><sub><it>f</it></sub> and <it>d</it><sub><it>b</it></sub>, defined in Equation 4, and bidirectional <it>D,</it> defined in Eq. 8, are over-determined implies that the identification of similar segments between two sequences will include false positives but will not generate false negatives. The false positive identifications can be readily recognized by comparing the sequences extracted from the coordinates, as demonstrated above for the 16<sup>th</sup> unit of the first stanza. Nevertheless, since over-determination will necessarily occur, its probability distribution was identified (Eq. 11, Fig. <figr fid="F3">3</figr>). This can also be done for individual values by solving Eq. 11 for the value of &#966; observed. For example, for the conditions of the two stanzas, the value of &#966;<sub><it>P</it>1 = 0.5, <it>n</it> = 4.25</sub> is 0.71 sequence units, which is the expected median unidirectional over-determination of <it>d</it><sub><it>f</it></sub> and <it>d</it><sub><it>b</it></sub> under <it>P</it><sub>1</sub> (Eqs. 5, 7). The corresponding median bi-directional over-determination, obtained from <it>P</it><sub>3</sub>, should be somewhat above twice that value. Using equation 11, the value obtained is 1.67 similar units. Finally, it is worth stressing that the expressions for calculating the likelihood of arbitrary levels of over-determination (Eqs. 5&#8211;11) can be inverted to anticipate the level of over-determination for arbitrary probability levels. This use of the null random model is also included in the accompanying online tool (see Abstract for URL).</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Probability distribution of similarity estimates for the uniformly random sequence null model.</p>
               </caption>
               <text>
                  <p>Probability distribution of similarity estimates for the uniformly random sequence null model &#8211; i.e. experimental values deviating from this model would indicate real homology, as in Fig. <figr fid="F4">4</figr>. The dark lines represent the numerical solution for the bi-directional over-determination, <it>P</it><sub>3</sub> (Equation 11), for different dimensionalities, <it>n,</it> identified by numbers in the plot. The gray lines represent the numerical solution for the same values of <it>n,</it> for the uni-directional over-determination, <it>P</it><sub>1</sub> (Equation 9). The solution for the dimensionality of the two stanzas, <it>n = log</it><sub>2</sub><it>(19) = 4.25,</it> is highlighted by a thick line, for both <it>P</it><sub>3</sub> (thick dark line) and <it>P</it><sub>1</sub> (thick gray line).</p>
               </text>
               <graphic file="1471-2105-3-6-3"/>
            </fig>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Cumulative distribution of bi-directional similarity.</p>
               </caption>
               <text>
                  <p>Cumulative distribution of bi-directional similarity, <it>D,</it> between the two stanzas and comparison of genomic and proteomic sequences of <it>E. coli</it> threonine gene A, <it>thrA</it> (2463 base pairs for the genomic sequence and 820 amino acids for the proteomic sequence), with B, <it>thrB</it> (933 base pairs for the genomic sequence and 310 amino acids for the proteomic sequence). The null model expectation, that of uniform random distribution of units, is represented by dashed lines, obtained using Eq. 11 for <it>n = 2</it> (half dimensionality of USM state space for DNA) and <it>n = 4.3</it> (half dimensionality of USM state space for proteins, <it>n = 4.32,</it> and for the two stanzas, <it>n = 4.25</it>). The solid lines represent the actual cumulative distribution of <it>D</it> values.</p>
               </text>
               <graphic file="1471-2105-3-6-4"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p><it>H</it> is the number of contiguous units that are similar between the two sequences aligned at the positions being compared (Eq. 8). This value is estimated by <it>D,</it> which is the sum of the overestimated number of preceding, <it>d</it><sub><it>f</it></sub>, and succeeding, <it>d</it><sub><it>b</it></sub>, homologous units (Eqs. 4, 5 and 8). The determination of these similarity estimates, <it>d</it><sub><it>f</it></sub> and <it>d</it><sub><it>b</it></sub>, was illustrated for the two stanzas in figures <figr fid="F2">2a,2b</figr>. The same values compensated for over-determination at <it>P</it><sub>3</sub> = 0.5 are represented in Fig. <figr fid="F2">2c</figr>. The striking property of bi-directional similarity (<it>H</it>, Eq. 8) is that the <it>D</it> values obtained for any homologous pair from similar segments are exactly the same. That value is an estimator of the length of the entire similar segment, <it>H</it> (Eq. 11). This is further illustrated in figure <figr fid="F5">5</figr> for the comparison of genomic sequences, where it is also observed that the values of the distances between similar segments are constant and estimate the similar length. This was a somewhat unexpected property of enormous practical value, since the length of the similar segment can be determined by a single pair-wise comparison between any of the analogous positions. Consequently, when comparing two sequences of length <it>k</it><sub><it>a</it></sub> and <it>k</it><sub><it>b</it></sub> to identify all similar segments of length <it>w</it> or above, <it>k</it><sub><it>a</it></sub><it>k</it><sub><it>b</it></sub> /<it>w</it> pair-wise comparisons will suffice. In addition, each pair-wise comparison is now achievable with a single algebraic operation (Eq. 8) rather than requiring the conventional dynamic programming approach <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The computational effort of positioning database sequences in the USM state space occurs at the level of database indexing. As a result, search algorithms based on the USM state space representation will necessarily lead to speedier implementations. In order to facilitate the comparison with dynamic programming, the software library of functions for the determination of USM coordinates, in MATLAB format (MathWorks Inc.), is also provided <url>http://bioinformatics.musc.edu/~jonas/usm/</url>.</p>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Comparison of uni-directional and bi-directional USM implementations for DNA sequences.</p>
            </caption>
            <text>
               <p>Comparison of uni-directional and bi-directional USM implementations for DNA sequences. The similarity matrices for, respectively, <it>d</it><sub><it>f</it></sub> and <it>D</it> values between two portions of <it>E. coli</it> K-12 MG1655 threonine gene A (<it>thrA,</it> genome positions 337&#8211;2799) and threonine gene B (<it>thrB,</it> genome positions 2801&#8211;3733) are presented. The numbers on the axes identify the position in the gene. Actual values of <it>d</it><sub><it>f</it></sub> and <it>D</it> are shown for the framed region in the table to the right. a) The <it>d</it><sub><it>f</it></sub> values were obtained by a unidirectional implementation of the USM procedure (Eq. 4). By comparing this figure with a similar analysis reported previously <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> for the same sequences (Fig. 10 of that report) it can be seen that they are nearly indistinguishable, even if the exact values vary. The equivalence between unidirectional USM for <it>n = 2</it> and CGR highlights the property that CGR is a special case of USM. The fact that the latter can be implemented for any value of <it>n</it> or any number of unique units justifies the Universal naming; b) In this plot the same sequences were compared using bidirectional USM, and, accordingly, generate a matrix of <it>D</it> values (Eqs. 8, 11). It is clearly apparent, as already noted for Figure <figr fid="F2">2c</figr>, that the <it>D</it>-similarity between any two homologous units is an estimate of the length of the entire homologous segment.</p>
            </text>
            <graphic file="1471-2105-3-6-5"/>
         </fig>
         <p>Additional measures of similarity can be derived for specific practical purposes using bi-directional and unidirectional <it>d</it> values. For example, the use of docking algorithms to align sequences would benefit from a measure with a maximum value in the center of the similar segments. This could be provided by defining a compounded similarity measure, <it>Hc,</it> as suggested in equation 12. The behavior of <it>Hc,</it> which would be obtained by the overestimated value of <it>dc,</it> is illustrated for the two test stanzas in Figure <figr fid="F2">2d</figr>.</p>
         <p>
            <graphic file="1471-2105-3-6-i11.gif"/>
         </p>
         <p>The detection of similar segments in arbitrary sequences using <it>D</it> becomes very effective as the length of the similar segment increases. This was clear in the distribution of over-determination in Fig. <figr fid="F3">3</figr> but it is even more so when the distances between sequences with homologous segments are represented. In figure <figr fid="F4">4</figr> the distances between the two stanzas are represented alongside the distances to be expected if no homology existed, apart from the coincidental (random null model, using Eq. 11). It can be observed for the comparison of the two stanzas (Fig. <figr fid="F4">4</figr>, gray lines) that H values above 4 units occur with higher frequency than allowed by the random distribution model, reflecting the presence of real homologous segments (similar words).</p>
         <sec>
            <st>
               <p>USM of biological sequences</p>
            </st>
            <p>The representation of biological information as discrete sequences is dominated by the fact that genomes are sequences of discrete units and so are the products of their transcription and translation. However, not all biological sequences are composed of units that are functionally equally distinct from each other, as is the case of proteomic data and multi-locus sequence typing (MLST) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. To avoid the issue of unit inequality and highlight the general applicability of the USM procedure, stanzas of a poem were used to illustrate the implementation instead. Nevertheless, the original motivation of analyzing biological sequences is now recalled.</p>
            <p>In the preceding report the authors illustrated the properties of unidirectional <it>n-metric</it> estimation of similarity for the threonine operon of <it>E. coli</it><abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. The same two regions of the <it>thrA</it> and <it>thrB</it> sequences of <it>E. coli</it> K-12 MG1655 are compared in Figure <figr fid="F5">5</figr> to highlight the advancement achieved by USM. It should be recalled that the particular dimensionality of DNA sequences, <it>n = 2,</it> allows a very convenient unidirectional bi-dimensional representation, which is in fact the Chaos Game Representation procedure (CGR) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Consequently, CGR is a particular case of USM, obtained when <it>n = 2</it> and only the forward coordinates are determined. This can also be verified by comparing Figure <figr fid="F5">5a</figr> with a similar representation reported before <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, obtained with the same data using CGR [Fig. 10 of that report]. The advantageous properties of full (bi-directional) USM become apparent when Fig. <figr fid="F5">5a</figr> is compared with Fig. <figr fid="F5">5b</figr>. It is clearly apparent for bi-directional USM (Fig. <figr fid="F5">5b</figr>) that all pair-wise comparisons of units of identical segments now have the same <it>D</it> values. This converts any individual homologous pair-wise comparison into an estimation of the length of the entire similar segment. The conservation of statistical properties by the distances obtained, <it>D,</it> can also be confirmed by comparing observed values with the corresponding null models (Fig. <figr fid="F4">4</figr>). For the analysis of this figure it is worth recalling that the statistical properties of prokaryote DNA are often indistinguishable from uniform randomness <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. The genomic sequence of the first gene of the threonine operon of <it>E. coli, thrA,</it> is compared with that of the second, <it>thrB.</it> The distribution of the resulting <it>D</it> values is represented in figure <figr fid="F4">4</figr> (solid black line), alongside the null model for that dimensionality (Eq. 11, with <it>n = log</it><sub>2</sub><it>(4) = 2</it>, gray dotted line). The genomic sequences of <it>thrA</it> and <it>thrB</it> were translated into proteomic sequences using SwissProt's online translator, applied to the 5'3' first frame <url>http://www.expasy.ch/tools/dna.html</url>. Similarly, the distribution of <it>D</it> values for the comparison of the proteomic <it>thrA</it> and <it>thrB</it> sequences is also represented in Figure <figr fid="F4">4</figr>, alongside the null model, Eq. 11, for its dimensionality (<it>n = log</it><sub>2</sub>(<it>uu = 20 possible amino acids) = 4.32</it>), which is graphically nearly indistinguishable from that of the comparison between the stanzas, with <it>n = log</it><sub>2</sub>(<it>uu = 19 possible letters) = 4.25</it> (dotted gray line for the rounded value, <it>n = 4.3).</it> Both the genomic and the proteomic distributions of <it>D</it> values are observed to be contained by the null model, unlike the comparison between the stanzas discussed above, where the existence of structure is clearly reflected by its distribution. The genomic and proteomic sequences of <it>thrA</it> and <it>thrB,</it> used to illustrate this discussion, are provided with the web-based implementation of USM (see Methods for URL).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>The mounting quantity and complexity of biological sequence data being produced <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> calls for the investigation of new approaches to sequence analysis. In particular, the need for scale-independent methodologies becomes even more pressing as the limitations of conventional Markov chains are increasingly noted <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. These limitations are bound to become overwhelming when signals such as succession schemes of the expression of over 30,000 human genes <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> become available. This particular signal would be conveniently packaged within a 30-dimension USM unit block (<it>n = ceil(log</it><sub>2</sub><it>(3 &#215; 10</it><sup>4</sup><it>)) = 15</it> for each of the forward and backward maps).</p>
         <p>In addition, the advances in statistical mechanics for the study of complex systems, particularly in non-linear dynamics, have not been fully utilizable for the analysis of sequences due to the missing formal link between discrete sequences and trajectories in continuous spaces. The properties of USM reported above suggest that this may indeed be such a bridge. For example, the embedding of dimensions, a technique at the foundation of many time-series analysis techniques, offers a good example of the completeness of the USM representation of sequences. By embedding the forward and backward coordinates separately, at the relevant memory length, the resulting embedded USM is exactly what would be obtained by applying the USM technique to the embedded dimeric sequence itself.</p>
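         <p>The dimensionalities quoted in this article follow from the number of unique units alone, which a few lines of arithmetic make explicit. The helper below is a hypothetical illustration, not part of the authors' library. It assumes that the integer dimension per direction is the ceiling of the base-2 logarithm of the number of unique units, that a full USM block holds the forward and backward maps together, and that the effective dimension is the unrounded logarithm.</p>
```python
import math

def usm_dimensions(unique_units):
    """Return (n per direction, forward + backward total, effective n)."""
    n = math.ceil(math.log2(unique_units))
    return n, 2 * n, math.log2(unique_units)

cases = [("DNA bases", 4), ("stanza letters", 19),
         ("amino acids", 20), ("human genes", 30000)]
for label, uu in cases:
    n, total, n_eff = usm_dimensions(uu)
    print(f"{label}: n = {n}, block = {total} dimensions, effective n = {n_eff:.2f}")
```
         <p>For roughly 30,000 units this gives 15 dimensions per direction and a 30-dimension block, and for the 19 stanza letters it gives 5 dimensions per direction with an effective value of 4.25.</p>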
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Computation</p>
            </st>
            <p>The algorithms described in this manuscript were coded using MATLAB&#8482; 6.0 language (Release 12), licensed by The MathWorks Inc <url>http://www.mathworks.com</url>. An internet interface was also developed to make them freely accessible through user-friendly web-pages <url>http://bioinformatics.musc.edu/~jonas/usm/</url>.</p>
         </sec>
         <sec>
            <st>
               <p>Source code</p>
            </st>
            <p>In order to facilitate the development of sequence analysis applications based on the USM state space, the software library of functions written to calculate the USM coordinates is provided with the web-based implementation (see address above). The code is provided in MATLAB format, which is general enough to be easily ported to other environments. These functions process sequences provided as text files in FASTA format. In addition to the functions, the test datasets and a brief readme.txt documentation file are also included.</p>
         </sec>
         <sec>
            <st>
               <p>Test data</p>
            </st>
            <p>The USM mapping proposed is applicable to any discrete sequence, even if the primary goal is the analysis of biological sequences. For ease of illustration and to emphasize USM's general validity, the test dataset used to describe the implementation of the algorithm consists of two stanzas of a poem by Wendy Cope, "The Uncertainty of the Poet" <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. In the Discussion section, USM was also applied to the DNA sequence of the threonine operon of <it>Escherichia coli</it> K-12 MG1655, obtained from the University of Wisconsin <it>E. coli</it> Genome Project <url>http://www.genetics.wisc.edu</url>, and to its 5'3' first frame proteomic translation obtained by using the SwissProt online translator <url>http://www.expasy.ch/tools/dna.html</url>. The three test sequence datasets are also included in the web-based USM application.</p>
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgments</p>
            </st>
            <p>The authors thank Dr Santosh Mishra, Eli Lilly Co., for the insightful suggestions about the applicability of USM, and John H. Schwacke, at the Department of Biometry and Epidemiology of the Medical University of South Carolina for revising the coherence of mathematical deduction. The authors thankfully acknowledge financial support by grant SFRH/BD/3134/2000 to S. Vinga and project SAPIENS-34794/99 of Funda&#231;&#227;o para a Ci&#234;ncia e Tecnologia of the Portuguese Ministry of Science and Technology.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Application of information theory to DNA sequence analysis: a review.</p>
            </title>
            <aug>
               <au>
                  <snm>Rom&#225;n-Rold&#225;n</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bernaola-Galv&#225;n</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Oliver</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>Pattern Recognition</source>
            <pubdate>1996</pubdate>
            <volume>29</volume>
            <fpage>1187</fpage>
            <lpage>1194</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0031-3203(95)00145-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Recent investigations into global characteristics of long DNA sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Nandy</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Indian Journal of Biochemistry and Biophysics</source>
            <pubdate>1994</pubdate>
            <volume>31</volume>
            <fpage>149</fpage>
            <lpage>155</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Spatial representation of symbolic sequences through iterative function systems.</p>
            </title>
            <aug>
               <au>
                  <snm>Ti&#328;o</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Systems, Man, and Cybernetics &#8211; Part A: Systems and Humans</source>
            <pubdate>1999</pubdate>
            <volume>29</volume>
            <fpage>386</fpage>
            <lpage>393</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1109/3468.769757</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Multilocus sequence typing of <it>Streptococcus pneumoniae</it> directly from cerebrospinal fluid.</p>
            </title>
            <aug>
               <au>
                  <snm>Enright</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Knox</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Griffiths</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Crook</snm>
                  <fnm>DWM</fnm>
               </au>
               <au>
                  <snm>Spratt</snm>
                  <fnm>BG</fnm>
               </au>
            </aug>
            <source>Eur. J. Clin. Microbiol. Infect. Dis.</source>
            <pubdate>2001</pubdate>
            <volume>19</volume>
            <fpage>627</fpage>
            <lpage>630</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s100960000321</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Chaos game representation of gene structure.</p>
            </title>
            <aug>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>HJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res.</source>
            <pubdate>1990</pubdate>
            <volume>18</volume>
            <fpage>2163</fpage>
            <lpage>2170</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2336393</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>The evolution of species-type specificity in the global DNA sequence organization of mitochondrial genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Hill</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Singh</snm>
                  <fnm>SM</fnm>
               </au>
            </aug>
            <source>Genome</source>
            <pubdate>1997</pubdate>
            <volume>40</volume>
            <fpage>342</fpage>
            <lpage>356</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.1996.4574</pubid>
                  <pubid idtype="pmpid">9202414</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>"Chaos games" for iterated function systems with grey level maps.</p>
            </title>
            <aug>
               <au>
                  <snm>Forte</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Mendivil</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Vrscay</snm>
                  <fnm>ER</fnm>
               </au>
            </aug>
            <source>SIAM J. Math. Anal.</source>
            <pubdate>1998</pubdate>
            <volume>29</volume>
            <fpage>878</fpage>
            <lpage>890</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1137/S0036141096306911</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Chaos game representation of protein structures.</p>
            </title>
            <aug>
               <au>
                  <snm>Fiser</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tusn&#225;dy</snm>
                  <fnm>GE</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>J. Mol. Graphics</source>
            <pubdate>1994</pubdate>
            <volume>12</volume>
            <fpage>302</fpage>
            <lpage>304</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0263-7855(94)80109-6</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Genomic signature: characterization and classification of species assessed by chaos game representation of sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Deschavanne</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Giron</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vilain</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fagot</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Fertil</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Mol. Biol. Evol.</source>
            <pubdate>1999</pubdate>
            <volume>16</volume>
            <fpage>1391</fpage>
            <lpage>1399</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10563018</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Novel techniques of graphical representation and analysis of DNA sequences &#8211; a review.</p>
            </title>
            <aug>
               <au>
                  <snm>Roy</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Raychaudhury</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Nandy</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J. Biosci.</source>
            <pubdate>1998</pubdate>
            <volume>23</volume>
            <fpage>55</fpage>
            <lpage>71</lpage>
         </bibl>
         <bibl id="B11">
            <aug>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Krogh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mitchison</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Biological Sequence Analysis &#8211; Probabilistic Models of Proteins and Nucleic Acids</source>
            <publisher>Cambridge University Press, Cambridge</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Analysis of genomic sequences by chaos game representation.</p>
            </title>
            <aug>
               <au>
                  <snm>Almeida</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Carri&#231;o</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Maretzek</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Noble</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>Fletcher</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>429</fpage>
            <lpage>437</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/bioinformatics/17.5.429</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Goldman</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res.</source>
            <pubdate>1993</pubdate>
            <volume>21</volume>
            <fpage>2487</fpage>
            <lpage>2491</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8506142</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <aug>
               <au>
                  <snm>Cope</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Serious Concerns</source>
            <publisher>Faber &amp; Faber Inc</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Chaos game representation of proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Basu</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pan</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Dutta</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Das</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J. Mol. Graph. Model.</source>
            <pubdate>1997</pubdate>
            <volume>15</volume>
            <fpage>279</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1093-3263(97)00106-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">9640559</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Representation of amino acid sequences as two-dimensional point patterns.</p>
            </title>
            <aug>
               <au>
                  <snm>Plei&#223;ner</snm>
                  <fnm>KP</fnm>
               </au>
               <au>
                  <snm>Wernisch</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Oswald</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Fleck</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Electrophoresis</source>
            <pubdate>1997</pubdate>
            <volume>18</volume>
            <fpage>2709</fpage>
            <lpage>2713</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9504802</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A new approach for the classification of functional regions of DNA sequences based on fractal representation.</p>
            </title>
            <aug>
               <au>
                  <snm>Solovyev</snm>
                  <fnm>VV</fnm>
               </au>
               <au>
                  <snm>Korolev</snm>
                  <fnm>SV</fnm>
               </au>
               <au>
                  <snm>Lim</snm>
                  <fnm>HA</fnm>
               </au>
            </aug>
            <source>Int. J. Genom. Res.</source>
            <pubdate>1993</pubdate>
            <volume>1</volume>
            <fpage>109</fpage>
            <lpage>128</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Entropic feature for sequence pattern through iterated function systems.</p>
            </title>
            <aug>
               <au>
                  <snm>Rom&#225;n-Rold&#225;n</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bernaola-Galv&#225;n</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Oliver</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>Pattern Recognition Letters</source>
            <pubdate>1994</pubdate>
            <volume>15</volume>
            <fpage>567</fpage>
            <lpage>573</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0167-8655(94)90017-5</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Entropic profiles of DNA sequences through chaos-game-derived images.</p>
            </title>
            <aug>
               <au>
                  <snm>Oliver</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Bernaola-Galv&#225;n</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Guerrero-Garcia</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rom&#225;n-Rold&#225;n</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J. Theor. Biol.</source>
            <pubdate>1993</pubdate>
            <volume>160</volume>
            <fpage>457</fpage>
            <lpage>470</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jtbi.1993.1030</pubid>
                  <pubid idtype="pmpid" link="fulltext">8501918</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Visualization of random sequences using the chaos game algorithm.</p>
            </title>
            <aug>
               <au>
                  <snm>Mata-Toledo</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Willis</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J. Systems Software</source>
            <pubdate>1997</pubdate>
            <volume>39</volume>
            <fpage>3</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0164-1212(96)00158-6</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>A first course in order statistics.</p>
            </title>
            <aug>
               <au>
                  <snm>Arnold</snm>
                  <fnm>BC</fnm>
               </au>
               <au>
                  <snm>Balakrishnan</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Nagaraja</snm>
                  <fnm>HN</fnm>
               </au>
            </aug>
            <source>Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics</source>
            <publisher>John Wiley &amp; Sons, Inc., New York</publisher>
            <pubdate>1992</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Bioinformatics &#8211; trying to swim in a sea of data.</p>
            </title>
            <aug>
               <au>
                  <snm>Roos</snm>
                  <fnm>DS</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291</volume>
            <fpage>1260</fpage>
            <lpage>1261</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.291.5507.1260</pubid>
                  <pubid idtype="pmpid" link="fulltext">11233452</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The sequence of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291</volume>
            <fpage>1304</fpage>
            <lpage>1351</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1058040</pubid>
                  <pubid idtype="pmpid" link="fulltext">11181995</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
