<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-382</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>XSTREAM: A practical algorithm for identification and architecture modeling of tandem repeats in protein sequences</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Newman</snm>
               <mi>M</mi>
               <fnm>Aaron</fnm>
               <insr iid="I1"/>
               <email>a_newman@lifesci.ucsb.edu</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Cooper</snm>
               <mi>B</mi>
               <fnm>James</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>jcooper@lifesci.ucsb.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Biomolecular Science and Engineering Program, University of California, Santa Barbara, CA 93106, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Molecular, Cellular, and Developmental Biology, University of California, Santa Barbara, CA 93106, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>382</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/382</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17931424</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-382</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>23</day>
               <month>5</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>11</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>11</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Newman and Cooper; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Biological sequence repeats arranged in tandem patterns are widespread in DNA and proteins. While many software tools have been designed to detect DNA tandem repeats (TRs), useful algorithms for identifying protein TRs with varied levels of degeneracy are still needed.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>To address limitations of current repeat identification methods, and to provide an efficient and flexible algorithm for the detection and analysis of TRs in protein sequences, we designed and implemented a new computational method called XSTREAM. Running time tests confirm the practicality of XSTREAM for analyses of multi-genome datasets. Each of the key capabilities of XSTREAM (e.g., merging, nesting, long-period detection, and TR architecture modeling) is demonstrated using anecdotal examples, and the utility of XSTREAM for identifying TR proteins was validated using data from a recently published paper.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy. Because of these useful features, XSTREAM has significant potential for the discovery of naturally-evolved modular proteins with applications for engineering novel biostructural and biomimetic materials, and identifying new vaccine and diagnostic targets.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Repeated sequences, often organized as extended tandem arrays, abound in biology, and computational approaches have been critical for the identification and analysis of such sequence elements from genomic data. Tandem Repeats (TRs) are formally defined as two identical copies of finite non-empty words with no intervening characters <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Since biological sequences evolve naturally by mutation, both by base substitutions and insertions/deletions (indels), a biological TR is defined as two or more <it>sufficiently similar </it>biological words lacking intervening characters, where sufficiency is arbitrarily defined. The work described in this paper focuses exclusively on non-evolutionary TRs (for evolutionary TR detection, see <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>), each of which has three important properties: <it>consensus sequence</it>, a word representing the TR pattern, <it>period</it>, the number of characters in the consensus sequence, and <it>copy number</it>, the number of words in the entire TR domain.</p>
         <p>Bioinformatics studies of TRs have primarily focused on DNA. DNA TRs are traditionally classified on the basis of increasing period into microsatellites, minisatellites, and large-scale duplications. In some human TR loci, copy number changes are associated with triplet-repeat expansion diseases that include Huntington's disease and Fragile X Syndrome <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Because genomic TR loci are often highly polymorphic, even expanding and contracting from generation to generation, DNA TRs have forensic and biomedical applications, and may play important roles in genome evolution <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>Nucleotide repeats occurring in protein coding genes can result in protein sequences containing repetitive elements. Though less studied than DNA repeats, peptide repeats are likewise known to be widespread in nature <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Peptide TRs impart a modular architecture to proteins and are found in important structural proteins such as animal collagens and keratins, insect and spider silks, plant cell wall extensins, and the proteins that form adhesive plaques and byssal threads of bivalve mussels <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. TR domains are also found in other modular proteins, including prion proteins, ice nucleation and antifreeze proteins, FG-rich proteins in nuclear pore complexes, surface antigens of microbial pathogens and parasites, histones, and zinc-finger transcription factors <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. Peptide TRs may provide an evolutionary shortcut for the modular construction of new proteins through recombination and copy number adjustment <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>. To understand both the evolutionary diversity and functional significance of protein TRs, facile methods for the <it>a priori </it>identification and analysis of TRs from protein sequence databases will be critical.</p>
         <p>Numerous bioinformatics tools have been developed for <it>de novo </it>repeat detection in DNA and protein sequences. One class of tools utilizes sequence self-alignment (SSA) <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. Importantly, SSA approaches allow for the substitutions and indels in repeat sequences that often arise in biology. Because protein repeat detection tools that use SSA (RADAR, TRUST, Pellegrini et al. method) detect all repeated sequences, not only TRs, these algorithms may incorrectly characterize TR domains as non-TRs. With &#937;(<it>n</it><sup>2</sup>) time complexity (where <it>n </it>= length of input sequence), SSA algorithms are less than ideal for long protein sequences and repeat detection in large multi-genome datasets. An alternative strategy implemented for <it>a priori </it>peptide repeat detection is based on a sliding window (SW) approach <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>. In general, SW algorithms are simple to implement, but do not readily accommodate indels and are thus likely to miss many degenerate TRs. The &#937;(<it>n</it><sup>3</sup>) time complexity of SW algorithms used to detect repeats of all periods also renders this strategy inappropriate for analysis of long sequences.</p>
         <p>An efficient heuristic employed for detecting DNA TRs in whole genome data relies on seed extension (SE) <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Seed extension algorithms have &#937;(<it>n</it>) time complexity for repeat detection, and depending on implementation, can approximate O(<it>n</it>) time complexity, making them fast enough for analyses of large sequence databases. Furthermore, since SE allows for both indels and substitutions, this method is very appropriate for repeat finding applications in naturally evolving biological sequences.</p>
         <p>To complement and improve upon current software tools for peptide repeat detection, we implemented an SE algorithm to explicitly locate exact and degenerate (with substitutions and indels) TRs of all periods in protein sequences. This new tool, called XSTREAM for Variable ('X') Sequence Tandem Repeats Extraction and Architecture Modeling, was designed to efficiently mine large genomic datasets for TRs of any period, to effectively characterize degenerate TR domains, and to produce concise TR output. Important features of XSTREAM include novel heuristics that achieve 1) practical running time without period limitations, 2) effective reduction of TR output redundancy, 3) merging of discontinuous degenerate TR domains, 4) identification of nested TR architectures, and 5) TR domain clustering. Though developed specifically for analyzing TR protein sequences, XSTREAM works equally well to extract TR patterns in DNA sequences, or for that matter, TRs in any ASCII string of characters. The practical utility of XSTREAM is demonstrated through testing and validation using publicly available genome sequence data.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>The XSTREAM program implements an SE approach that includes heuristics to efficiently and effectively detect exact and degenerate TRs of any period from large input sequence datasets. The program utilizes two important strategies in addition to SE to achieve practical running times without period limitations: a user-modifiable sequence alignment method called Gap-Restricted Dynamic Programming (GRDP), and a new long-period TR filter (both described in the Appendix). In addition, XSTREAM applies several strategies, including the use of <it>irreducible </it>repeats, to effectively combat the redundancy in TR detection inherent in biological TR sequences. Other novel features incorporated into XSTREAM include merging of degenerate TR domains and modeling of nested TR architectures. XSTREAM provides non-redundant output of TRs meeting a suite of user-defined criteria for attributes such as minimum and maximum period, minimum copy number, minimum domain length, minimum % input sequence coverage, and maximum character mismatch.</p>
         <sec>
            <st>
               <p>Algorithm</p>
            </st>
            <p>The primary functionalities of XSTREAM, as shown in Figure <figr fid="F1">1</figr>, can be divided into five high level stages: Pre-Processing, TR Detection, TR Characterization, Post-Processing, and Output. For a technical description of the algorithm, presented within the same organizational context, refer to the Appendix section.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>XSTREAM Program Flow Chart</p>
               </caption>
               <text>
                  <p><b>XSTREAM Program Flow Chart</b>. Activity Diagram of XSTREAM modeled using Enterprise Architect version 4.10.739 (Sparx Systems).</p>
               </text>
               <graphic file="1471-2105-8-382-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Pre-Processing</p>
            </st>
            <p>For processing by XSTREAM, input sequences must be in FASTA format. Valid sequences are sent to the seed detection module. XSTREAM searches the input sequence for short exact substring repeats, or seeds, of two or three sizes, depending on the input length (see <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> for an excellent example of the use of seeds, or <it>k</it>-tuple <it>probes </it>in TR detection). Seed pairs are used to provide starting points and potential periods for TR detection. The use of seeds allows XSTREAM to rapidly identify putative TRs. For every adjacent pair of matching seeds, XSTREAM records both the sequence distance between them and the sequence index of the leftmost seed. Each distance is a potential TR period.</p>
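            <p>The seed detection step described above can be sketched as follows. This is a minimal illustration and not the XSTREAM implementation: the function name <it>find_seed_pairs</it> and the single fixed seed length <it>k</it> are assumptions, whereas XSTREAM uses two or three seed sizes chosen from the input length.</p>
<![CDATA[
```python
from collections import defaultdict


def find_seed_pairs(sequence: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (left_index, distance) for each adjacent pair of identical k-mers.

    Hypothetical helper: k is fixed here, whereas XSTREAM picks its seed
    sizes from the input length.
    """
    # Record every start position of every k-mer in the sequence.
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)

    pairs = []
    for occurrences in positions.values():
        # Consecutive occurrences of the same k-mer form adjacent seed pairs.
        for left, right in zip(occurrences, occurrences[1:]):
            pairs.append((left, right - left))  # distance = candidate period
    return sorted(pairs)
```
]]>
            <p>Under these assumptions, <it>find_seed_pairs("ABCDABCDABCD")</it> yields six seed pairs, each with distance 4, so 4 becomes the only candidate period for that input.</p>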
         </sec>
         <sec>
            <st>
               <p>TR Detection</p>
            </st>
            <p>Following seed detection, XSTREAM attempts to extend each seed pair. Two sequence iterators move downstream from each seed in a parallel manner, returning characters for comparison. Running totals of character matches and mismatches are kept. We define <it>i </it>as the minimum fraction of character matches required between two tandemly arranged words for them to be designated a TR. For example, if <it>i </it>is set to 0.8, then at least 80% of the aligned characters between two words at a given period must be identical. For any seed pair, seed extension always stops when the iterator for the leftmost seed collides with the rightmost seed. If at any point during the procedure the character mismatch count divided by the current potential period equals or exceeds 1 - <it>i</it>, seed extension is aborted, thereby reducing running time. Similarly, seed extension is terminated early if the match count becomes sufficiently high. To include indels during seed extension, we use a novel heuristic, which is presented in the Appendix section.</p>
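            <p>A substitution-only version of seed extension might look like the sketch below. The function name and signature are hypothetical, and the indel heuristic from the Appendix is not modelled. Both stopping rules are expressed with integer counts, an interpretation chosen so that a word pair at exactly identity <it>i </it>is still accepted.</p>
<![CDATA[
```python
import math


def extend_seed_pair(sequence: str, left: int, period: int, i: float = 0.7) -> bool:
    """Compare the word starting at `left` with the word one period downstream.

    Returns True when at least a fraction `i` of the aligned characters match.
    Substitutions only; indels are not handled in this sketch.
    """
    if left + 2 * period > len(sequence):
        return False  # the second word would run past the end of the input
    # Matches needed for success; the small epsilon guards against float noise.
    needed = math.ceil(i * period - 1e-9)
    allowed_mismatches = period - needed
    matches = mismatches = 0
    for offset in range(period):  # the left iterator stops at the right seed
        if sequence[left + offset] == sequence[left + period + offset]:
            matches += 1
        else:
            mismatches += 1
        if matches >= needed:
            return True   # early success: enough matches already
        if mismatches > allowed_mismatches:
            return False  # early abort: identity i is out of reach
    return False
```
]]>
            <p>With <it>i </it>= 0.8 and a period of 5, one mismatch is tolerated and the second mismatch aborts the extension before the remaining characters are read.</p>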
            <p>Each candidate TR resulting from successful seed extension is subjected to further expansion using the same basic mechanism as seed extension. XSTREAM examines sequence space both downstream and upstream of the current candidate domain using increments equal to the TR period. Potential repeat copies are evaluated by comparing new sequence space with the reference repeat, which is the leftmost repeat resulting from the initial seed extension. If indels are allowed and if domain expansion using seed extension fails to agree with <it>i</it>, we invoke a second strategy. The second approach, termed GRDP (see Appendix), can more accurately perform a subsequence pairwise comparison at the expense of slightly increased running time. A novel feature of our implementation is the user's ability to limit the maximum width of the dynamic programming (DP) matrix (parameter <it>g</it>), resulting in &#920;(<it>n</it>) time and space complexities for global pairwise alignments.</p>
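            <p>The effect of limiting the width of the DP matrix can be illustrated with a generic banded global alignment. The sketch below is a stand-in and not GRDP itself, whose scoring and gap handling are described in the Appendix. It assumes unit costs (edit distance) and treats <it>g </it>as the half-width of the diagonal band. Each row touches at most 2<it>g </it>+ 1 cells, so for a fixed <it>g </it>the work grows linearly with sequence length.</p>
<![CDATA[
```python
def banded_edit_distance(a: str, b: str, g: int = 3):
    """Global edit distance restricted to a diagonal band of half-width g.

    Returns None when no alignment fits inside the band. Generic banded
    dynamic programming with unit costs, used here only as an illustration.
    """
    if abs(len(a) - len(b)) > g:
        return None  # the end cell (len(a), len(b)) lies outside the band
    inf = float("inf")
    # prev maps column -> cost for the previous row; only band cells are kept.
    prev = {col: col for col in range(min(len(b), g) + 1)}
    for row in range(1, len(a) + 1):
        curr = {}
        for col in range(max(0, row - g), min(len(b), row + g) + 1):
            if col == 0:
                curr[col] = row  # aligning a[:row] against the empty prefix
                continue
            substitution = prev.get(col - 1, inf) + (a[row - 1] != b[col - 1])
            deletion = prev.get(col, inf) + 1       # gap in b
            insertion = curr.get(col - 1, inf) + 1  # gap in a
            curr[col] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(b)]
```
]]>
            <p>Setting <it>g </it>= 0 in this sketch reduces the comparison to a position-by-position mismatch count, which mirrors the ungapped mode of operation.</p>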
            <p>Following domain expansion, we instantiate a procedure called maximality. Employing a user-adjustable scoring scheme, maximality finds the longest stretch of characters both downstream and upstream that can legitimately be added to each candidate TR. This procedure is invoked because TRs in nature do not always occur in integer copy numbers and XSTREAM's TR domain expansion method is limited to integer copies.</p>
            <p>Finally, XSTREAM masks input sequence space corresponding to each maximally extended candidate TR. Sequence masking prevents further seed extensions in sequence regions that constitute TR domains, thus functioning to prevent output redundancy as well as reduce running time. For details of sequence masking, refer to Redundancy Elimination I as well as Two-stage TR detection in the Appendix.</p>
         </sec>
         <sec>
            <st>
               <p>TR Characterization</p>
            </st>
            <p>To further refine each candidate TR, XSTREAM segments every TR domain into its component copies. Parsing can be accomplished by a trivial subdivision of the TR domain using the current period, an optimal subdivision using wrap-around dynamic programming (WDP, <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>), or a heuristic subdivision using GRDP. For details about implementation and when each method is invoked, refer to the Appendix section.</p>
            <p>Following TR parsing, each TR undergoes a multiple alignment of its copies. A procedure identical in concept to STAR Alignment is used when indels are allowed. Because practical running time is emphasized in our implementation, pairwise sequence comparisons during STAR Alignment may be computed in a non-optimal manner using GRDP.</p>
            <p>Following multiple alignment of each TR, a consensus sequence is computed. Each consensus is democratically derived using the majority rule. In addition, XSTREAM computes an error associated with the consensus &#8211; the lower the error, the stronger the agreement between the consensus and its represented domain. We define <it>I </it>as the minimum allowable matching between the consensus and the aligned TR for the TR to be reported to the user. For example, if <it>I </it>equals 0.8, then the consensus error cannot exceed 0.2 or 20% disagreement.</p>
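            <p>The majority-rule consensus and its error can be sketched as below for copies that are already aligned to equal length. The error is taken here as the fraction of aligned characters that disagree with the consensus. That definition, the function name, and the tie-breaking rule (first residue seen in a column wins) are assumptions made for illustration.</p>
<![CDATA[
```python
from collections import Counter


def consensus_and_error(copies: list[str]) -> tuple[str, float]:
    """Majority-rule consensus of equal-length aligned copies, plus its error.

    Error = fraction of aligned characters that differ from the consensus
    (an assumed definition, not taken from the XSTREAM source).
    """
    if not copies or not copies[0] or len(set(map(len, copies))) != 1:
        raise ValueError("copies must be non-empty and of equal length")
    consensus = []
    disagreements = 0
    for column in zip(*copies):
        # most_common(1) returns the majority residue and its count.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        disagreements += len(column) - count
    total = len(copies) * len(copies[0])
    return "".join(consensus), disagreements / total
```
]]>
            <p>For the four copies GPGQQ, GPGQQ, GPGKQ and GAGQQ, this sketch returns the consensus GPGQQ with an error of 0.1, which would pass a threshold of <it>I </it>= 0.8.</p>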
            <p>Next, XSTREAM inspects the edges of each aligned TR domain (with TR copy number greater than 2) for accordance with the consensus. If either edge mismatches with the consensus, that edge is truncated. Since all TRs must have at least 2 copies, edge trimming is not performed on TR domains with TR copy number = 2.</p>
            <p>Occasionally, because of matching considerations, TR domains are identified with periods that are reducible. Therefore, the last step of TR Characterization functions to reduce overestimated TR periods (see Redundancy Elimination II in the Appendix).</p>
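            <p>For an exact consensus, period reduction amounts to finding the shortest word whose repetition spells the consensus. The sketch below handles only that exact case and uses a hypothetical function name. The procedure actually used by XSTREAM is described under Redundancy Elimination II in the Appendix and may differ.</p>
<![CDATA[
```python
def reduce_period(consensus: str) -> str:
    """Return the shortest word whose exact repetition spells `consensus`."""
    length = len(consensus)
    for period in range(1, length + 1):
        # Only divisors of the length can tile the consensus exactly.
        if length % period == 0:
            if consensus[:period] * (length // period) == consensus:
                return consensus[:period]
    return consensus
```
]]>
            <p>In this sketch, a consensus of PGPGPG with period 6 is reduced to PG with period 2, whereas GPGQQ is already irreducible and is returned unchanged.</p>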
         </sec>
         <sec>
            <st>
               <p>Post-Processing</p>
            </st>
            <p>XSTREAM attempts to merge <it>sufficiently similar </it>TRs that either overlap in the input sequence or are in close enough proximity to one another. To compute sufficient similarity, XSTREAM invokes the concept of cyclical permutations, which enables effective consensus sequence comparison (see <it>Merging </it>and <it>Consensus Comparison </it>in the Appendix). As a result, XSTREAM can identify TR domains with large regions of indels and/or substitutions that, without merging, would be reported as separate TRs. This procedure is thus important for detecting rapidly evolving TR sequences.</p>
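            <p>Cyclical permutations matter because the same TR can be reported in different phases, for example GPGQQ and QQGPG. A minimal sketch of a rotation-aware comparison of two equal-length consensus sequences follows. Both function names are hypothetical, and the scoring (best fraction of identical positions over all rotations) is an assumption and not necessarily the measure XSTREAM uses.</p>
<![CDATA[
```python
def is_cyclic_permutation(first: str, second: str) -> bool:
    """True when `second` is an exact rotation of `first`."""
    return len(first) == len(second) and second in first + first


def best_rotation_identity(first: str, second: str) -> float:
    """Highest fraction of matching positions over all rotations of `second`."""
    if len(first) != len(second) or not first:
        return 0.0
    best = 0
    for shift in range(len(second)):
        rotated = second[shift:] + second[:shift]
        best = max(best, sum(x == y for x, y in zip(first, rotated)))
    return best / len(first)
```
]]>
            <p>Two consensus sequences could then be treated as sufficiently similar when the best rotation identity reaches a chosen threshold.</p>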
            <p>Following merging, XSTREAM invokes a series of finalizing functions called finishing touches, which serve to fine-tune the characterization of each TR domain as well as remove TRs that are insufficiently fit for output. TR characterization refinement involves rerunning maximality, redoing multiple alignment, rerunning reducibility, and looking for nested TRs (see Appendix). After additional characterization, finishing touches removes TRs with unacceptable amounts of overlap (see Redundancy Elimination III in the Appendix). Finally, remaining TRs are tested for agreement with user-defined filtration criteria.</p>
            <p>All TRs that satisfy the output criteria are sent to the consensus comparison (CC) module. CC clusters TRs on the basis of consensus similarity. By ordering TRs by consensus sequence homology in the output, XSTREAM reduces output redundancy while facilitating the identification of TR families from the input dataset. Related TRs may reflect structural or functional homology of their corresponding protein sequences. The current implementation of CC only compares TRs of equal period.</p>
         </sec>
         <sec>
            <st>
               <p>Output</p>
            </st>
            <p>XSTREAM automatically generates HTML files in a format similar to the output from Tandem Repeats Finder (TRF) <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. HTML output 1 contains a TR summary table and list of TR information, including sequence positions, period, and copy number. The range of sequence positions for each TR is hyperlinked to HTML output 2, which displays TR multiple alignments and consensus sequences. In the case of a multiple sequence input, XSTREAM generates HTML output 3, which reports a list of all input sequences containing reported TRs. An additional output option is a colored TR schematic, in PNG or HTML format, that represents the modular architectures of TR-containing sequences. The main user-definable output parameters of XSTREAM are presented in Table <tblr tid="T1">1</tblr>. A list of all user-defined parameters can be found on the XSTREAM webserver <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>User-defined parameters</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Definition</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Default Value</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Minimum character identity <it>i</it></p>
                     </c>
                     <c ca="center">
                        <p>0.7 for proteins</p>
                        <p>0.8 for nucleotides</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Minimum consensus matching <it>I</it></p>
                     </c>
                     <c ca="center">
                        <p>0.8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Minimum copy number <it>MinC</it></p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Minimum period <it>MinP</it></p>
                     </c>
                     <c ca="center">
                        <p>3 for proteins</p>
                        <p>10 for nucleotides</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Maximum period <it>MaxP</it></p>
                     </c>
                     <c ca="center">
                        <p>Half of input sequence length</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Maximum consecutive gaps <it>g </it>(see Appendix)</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Maximum indel error (see Appendix)</p>
                     </c>
                     <c ca="center">
                        <p>0.5</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Shown in this table are seven important user-adjustable parameters used by XSTREAM. These parameters function to limit the extent of TR degeneracy as well as to restrict the TR period and copy number of reported TRs. Default parameter values were empirically chosen to preferentially identify and model long degenerate repeat regions rather than shorter repetitive regions with higher sequence identity (e.g., where <it>I </it>= 1.0 and <it>g </it>= 0). We acknowledge that alternative architectures may exist for some complex repetitive domains. By including these and additional modifiable parameters, XSTREAM provides considerable user control over TR degeneracy and output filtration.</p>
               </tblfn>
            </tbl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>XSTREAM was coded using Java Standard Edition 5.0. To evaluate our implementation, we demonstrated and validated key features of XSTREAM using a variety of input datasets. First, a run time analysis shows the practicality of XSTREAM for TR detection in whole genomic sequence data. Second, multiple sequence alignments, merging, and nesting are demonstrated using anecdotal output examples. Third, the ability of XSTREAM to detect protein TR domains is validated using published results from five protozoan parasite genomes. Finally, we present schematic diagrams illustrating the utility of XSTREAM for graphically depicting modular architectures of TR proteins. In all cases, default parameter values were used unless stated otherwise (see Table <tblr tid="T1">1</tblr>). All tests and data collection were carried out using a Windows XP PC with a 64-bit AMD Athlon dual core 1.8 GHz processor and 2 GB of RAM.</p>
         <p>A principal attribute of XSTREAM is practical running time for large sequence datasets. To measure how running time varies with differing input sequence lengths and parameter values, we used XSTREAM to analyze DNA sequences. We chose DNA over protein sequences simply because DNA sequences cover a substantially larger range of sequence lengths than proteins, thus enabling a more accurate assessment of running time. XSTREAM was run on DNA sequences ranging from 0.23 Mbp to 202 Mbp, either with gaps (<it>g </it>= 3) or without gaps (<it>g </it>= 0). For these analyses, sequences were examined in two sets. Shorter sequences, &lt; 10 Mbp, were processed with minimum TR domain length <it>minD </it>= 20 and minimum period <it>MinP </it>= 1, and no period restrictions. For longer sequences, we used <it>minD </it>= 50 and <it>MinP </it>= 10, and due to memory limitations, maximum period was set to 100 kbp. In addition, for periods 10 &#8211; 999 we used a divide-and-conquer approach (see Appendix) with fragment length = 1 Mbp. As shown in Table <tblr tid="T2">2</tblr>, running time increased approximately linearly with increasing sequence length for all DNA sequences with or without gaps (R<sup>2 </sup>> 0.99). Next, the effect of increasing dataset size on running time was examined by analyzing four Swiss-Prot datasets ranging in size from 40,292 to 230,150 non-redundant protein sequences, and setting <it>minD </it>= 10 and <it>MinP </it>= 1. As expected, since XSTREAM processes each protein sequence individually, running time scaled linearly (R<sup>2 </sup>> 0.998), as indicated in Table <tblr tid="T2">2</tblr>. A running time of less than 7.5 min for the detection of degenerate TRs (using <it>g </it>= 3) from the Swiss-Prot 50.5 dataset clearly demonstrates the practicality of XSTREAM for multi-genome data mining.</p>
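         <p>As a rough check of the linear trend, the coefficient of determination can be recomputed from the six shorter-sequence rows (lengths under 10 Mbp) of Table <tblr tid="T2">2</tblr>. The sketch below uses only those rows, so it does not reproduce the full analysis, which also covers longer sequences. The helper name <it>r_squared</it> is hypothetical.</p>
<![CDATA[
```python
def r_squared(xs: list[float], ys: list[float]) -> float:
    """Coefficient of determination for a least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


# Shorter-sequence rows of Table 2: length in Mbp and running time in minutes.
lengths = [0.23, 0.56, 0.68, 1.0, 4.9, 9.8]
minutes_gapped = [0.25, 0.58, 0.77, 1.2, 6.4, 13.5]    # g = 3
minutes_ungapped = [0.12, 0.29, 0.36, 0.49, 2.2, 4.7]  # g = 0

fit_gapped = r_squared(lengths, minutes_gapped)
fit_ungapped = r_squared(lengths, minutes_ungapped)
```
]]>
         <p>Both values computed this way exceed 0.99 for these six rows, in line with the approximately linear scaling reported for the full set of DNA sequences.</p>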
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Running Time Analysis</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="center">
                     <p>
                        <b>Source</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Length, Mbp</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 3</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 0</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Longest period</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>S. cerevisiae </it>Chr. I</p>
                  </c>
                  <c ca="center">
                     <p>0.23</p>
                  </c>
                  <c ca="center">
                     <p>0.25</p>
                  </c>
                  <c ca="center">
                     <p>0.12</p>
                  </c>
                  <c ca="center">
                     <p>135 (17.9)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>S. cerevisiae </it>Chr. VIII</p>
                  </c>
                  <c ca="center">
                     <p>0.56</p>
                  </c>
                  <c ca="center">
                     <p>0.58</p>
                  </c>
                  <c ca="center">
                     <p>0.29</p>
                  </c>
                  <c ca="center">
                     <p>1998 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>H. sapiens </it>&#946; TCR</p>
                  </c>
                  <c ca="center">
                     <p>0.68</p>
                  </c>
                  <c ca="center">
                     <p>0.77</p>
                  </c>
                  <c ca="center">
                     <p>0.36</p>
                  </c>
                  <c ca="center">
                     <p>340 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>S. cerevisiae </it>Chr. XII</p>
                  </c>
                  <c ca="center">
                     <p>1.0</p>
                  </c>
                  <c ca="center">
                     <p>1.2</p>
                  </c>
                  <c ca="center">
                     <p>0.49</p>
                  </c>
                  <c ca="center">
                     <p>9137 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>M. magneticum </it>AMB-1</p>
                  </c>
                  <c ca="center">
                     <p>4.9</p>
                  </c>
                  <c ca="center">
                     <p>6.4</p>
                  </c>
                  <c ca="center">
                     <p>2.2</p>
                  </c>
                  <c ca="center">
                     <p>1158 (4.2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>H. sapiens </it>Chr. I contig</p>
                  </c>
                  <c ca="center">
                     <p>9.8</p>
                  </c>
                  <c ca="center">
                     <p>13.5</p>
                  </c>
                  <c ca="center">
                     <p>4.7</p>
                  </c>
                  <c ca="center">
                     <p>18557 (2.1)</p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>
                        <b>Source</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Length, Mbp</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 3</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 0</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Longest period</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>H. sapiens </it>Chr. XXI</p>
                  </c>
                  <c ca="center">
                     <p>33.0</p>
                  </c>
                  <c ca="center">
                     <p>34.4</p>
                  </c>
                  <c ca="center">
                     <p>16.4</p>
                  </c>
                  <c ca="center">
                     <p>3379 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>R. norvegicus</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>80.7</p>
                  </c>
                  <c ca="center">
                     <p>86.7</p>
                  </c>
                  <c ca="center">
                     <p>39.1</p>
                  </c>
                  <c ca="center">
                     <p>2715 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>H. sapiens </it>Chr. X</p>
                  </c>
                  <c ca="center">
                     <p>127.6</p>
                  </c>
                  <c ca="center">
                     <p>134.7</p>
                  </c>
                  <c ca="center">
                     <p>64.1</p>
                  </c>
                  <c ca="center">
                     <p>4863 (2)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>M. musculus </it>Chr I</p>
                  </c>
                  <c ca="center">
                     <p>202.5</p>
                  </c>
                  <c ca="center">
                     <p>239.1</p>
                  </c>
                  <c ca="center">
                     <p>90.0</p>
                  </c>
                  <c ca="center">
                     <p>3773 (2)</p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>
                        <b>Source</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>No. of Proteins</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 3</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Time, min <it>g </it>= 0</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b># TRs (# TRPs)</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Swiss-Prot v.30</p>
                  </c>
                  <c ca="center">
                     <p>40292</p>
                  </c>
                  <c ca="center">
                     <p>1.5</p>
                  </c>
                  <c ca="center">
                     <p>0.55</p>
                  </c>
                  <c ca="center">
                     <p>2428 (3771)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Swiss-Prot v.38</p>
                  </c>
                  <c ca="center">
                     <p>80000</p>
                  </c>
                  <c ca="center">
                     <p>2.6</p>
                  </c>
                  <c ca="center">
                     <p>1.1</p>
                  </c>
                  <c ca="center">
                     <p>3762 (7012)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Swiss-Prot v.45</p>
                  </c>
                  <c ca="center">
                     <p>163633</p>
                  </c>
                  <c ca="center">
                     <p>5.4</p>
                  </c>
                  <c ca="center">
                     <p>2.4</p>
                  </c>
                  <c ca="center">
                     <p>5302 (12359)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Swiss-Prot v.50.5</p>
                  </c>
                  <c ca="center">
                     <p>230150</p>
                  </c>
                  <c ca="center">
                     <p>7.3</p>
                  </c>
                  <c ca="center">
                     <p>3.5</p>
                  </c>
                  <c ca="center">
                     <p>6444 (17097)</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Running times for the analysis of different input sequence datasets are shown, with the gap parameter <it>g </it>= 3 or <it>g </it>= 0. The following DNA sequences were downloaded from NCBI: <it>S. cerevisiae </it>Chromosomes I (gi 85666109), VIII (gi 82795252), and XII (gi 85666119), <it>H. sapiens </it>Chromosomes X (gi 89033689) and XXI (gi 89058287), Chromosome I contig (gi 29789880), and the &#946; T-cell receptor locus (gi 114841177), <it>R. norvegicus </it>Chromosome XVI (gi 109504251), <it>M. musculus </it>Chromosome I (gi 83274080), and the <it>M. magneticum </it>AMB-1 (gi 82943940) genome. Sequences at the top (0.23 &#8211; 9.8 Mbp) were run with <it>minD </it>= 20, <it>minP </it>= 1, and all possible maximum periods. Longer DNA sequences (33 &#8211; 202.5 Mbp) were run with <it>minD </it>= 50, <it>minP </it>= 10, and (due to memory limitations) maximum period = 100 kbp; divide-and-conquer (see Appendix) was used for periods &lt; 1000 (fragment length = 1 Mbp). For each longest period found, the copy number is shown in parentheses. These data show a linear relationship between running time and input sequence length (R<sup>2 </sup>> 0.99). Running times for the analysis of four Swiss-Prot datasets, using <it>minD </it>= 10 and <it>minP </it>= 1, are shown at the bottom, together with the number of TRs detected (using consensus comparison, see Appendix) and the number of TR-containing proteins found (in parentheses). XSTREAM running time scaled linearly with increasing Swiss-Prot dataset size (R<sup>2 </sup>> 0.998).</p>
            </tblfn>
         </tbl>
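         <p>The linearity reported in the footnote of Table 2 can be checked with an ordinary least-squares fit. The sketch below is only an illustration: it recomputes R<sup>2 </sup>from the four Swiss-Prot rows of Table 2, and the helper name <it>r_squared </it>is hypothetical and not part of XSTREAM.</p>
```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Values copied from the Swiss-Prot rows of Table 2.
proteins = [40292, 80000, 163633, 230150]
minutes_g3 = [1.5, 2.6, 5.4, 7.3]
minutes_g0 = [0.55, 1.1, 2.4, 3.5]

print(round(r_squared(proteins, minutes_g3), 4))
print(round(r_squared(proteins, minutes_g0), 4))
```
         <p>For these rows both fits come out above 0.998, consistent with the value quoted in the footnote.</p>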
         <p>In addition to efficient TR detection, other important capabilities of XSTREAM are demonstrated with the data shown in Figures <figr fid="F2">2</figr>, <figr fid="F3">3</figr>, <figr fid="F4">4</figr> and Table <tblr tid="T3">3</tblr>. A multiple alignment of a degenerate TR domain found in the <it>C. elegans </it>hypothetical protein CE22309 is presented in Figure <figr fid="F2">2</figr>. Shown above the alignment are the standard numerical properties reported by XSTREAM for each TR domain: sequence position, period, copy number, and consensus error. Each alignment is additionally described by a consensus sequence (below the dashed double line) and a consensus error string (below the consensus).</p>
         <tbl id="T3">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>Extreme examples of DNA TRs detected by XSTREAM</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="center">
                     <p>
                        <b>Genomic Sequence</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Period</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Copy#</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Consensus Error</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Position</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>CE Chr III gi 86563600</p>
                  </c>
                  <c ca="center">
                     <p>94</p>
                  </c>
                  <c ca="center">
                     <p>403.6</p>
                  </c>
                  <c ca="center">
                     <p>0.05</p>
                  </c>
                  <c ca="center">
                     <p>7405280&#8211;7443237</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>At Chr I gi 42592260</p>
                  </c>
                  <c ca="center">
                     <p>158</p>
                  </c>
                  <c ca="center">
                     <p>453.7</p>
                  </c>
                  <c ca="center">
                     <p>0.1</p>
                  </c>
                  <c ca="center">
                     <p>14929399&#8211;15001291</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>At Chr I gi 42592260</p>
                  </c>
                  <c ca="center">
                     <p>45653</p>
                  </c>
                  <c ca="center">
                     <p>2.0</p>
                  </c>
                  <c ca="center">
                     <p>0.05</p>
                  </c>
                  <c ca="center">
                     <p>14346314&#8211;14437643</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>3415</p>
                  </c>
                  <c ca="center">
                     <p>8.5</p>
                  </c>
                  <c ca="center">
                     <p>0.01</p>
                  </c>
                  <c ca="center">
                     <p>12767448&#8211;12796444</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Anecdotal examples of very high copy number and very long period DNA TRs from chromosome I of <it>A. thaliana </it>and chromosome III of <it>C. elegans </it>are shown.</p>
            </tblfn>
         </tbl>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Multiple Alignment of TR domain from <it>C. elegans</it></p>
            </caption>
            <text>
               <p><b>Multiple Alignment of TR domain from <it>C. elegans</it></b>. Standard TR properties are shown above the multiple alignment of a proline/glycine-rich TR domain in the <it>C. elegans </it>hypothetical protein sequence CE22309 from wormpep173 <url>http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/</url>. 'Positions' denotes the corresponding input sequence index range of this TR domain and 'Copy N' denotes copy number. The consensus error is 0.13 because <it>nG </it>= 99, <it>cG </it>= 29, <it>mG </it>= 583, and <it>tot </it>= 1595 (see <it>Consensus Building </it>in Appendix). Gap characters are shown in red to emphasize the high indel content of this TR. Below the dashed double line is the consensus sequence followed by the consensus error string shown in blue. Columns of the alignment with 100% character identity have no symbol in the consensus error string. The symbols ':' and '*' denote a column with greater than or equal to 50% character identity and a column with less than 50% character identity respectively.</p>
            </text>
            <graphic file="1471-2105-8-382-2"/>
         </fig>
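         <p>The consensus error string described in the legend of Figure 2 can be sketched as a per-column classification. This is a minimal illustration and not XSTREAM's code: it assumes that column identity is the frequency of the most common character in the column (gap characters counted like any other), and the function name <it>error_string </it>and the toy alignment are hypothetical.</p>
```python
from collections import Counter


def error_string(rows):
    """Mark each alignment column: ' ' for 100% identity, ':' for 50% or more, '*' otherwise.

    Assumption: identity is the share of the most frequent character in the column.
    """
    marks = []
    for column in zip(*rows):
        top_count = Counter(column).most_common(1)[0][1]
        identity = top_count / len(column)
        if identity == 1.0:
            marks.append(" ")
        elif identity >= 0.5:
            marks.append(":")
        else:
            marks.append("*")
    return "".join(marks)


# Toy alignment of four repeat copies (made-up residues, for illustration only).
toy_alignment = ["PGQ", "PGA", "PAT", "PGS"]
print(repr(error_string(toy_alignment)))
```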
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Discontinuous Domain Merging of TR from <it>A. thaliana</it></p>
            </caption>
            <text>
               <p><b>Discontinuous Domain Merging of TR from <it>A. thaliana</it></b>. Successful merging of non-overlapping TR regions is shown by a TR domain from <it>A. thaliana </it>predicted gene product gi 9293925. Characters in the intervening degenerate sequence space that do not match the consensus are each represented by 'x'. This TR has a period of 9, a copy number of 8.67, a consensus error of 0.09 [<it>nG </it>= 6, <it>cG </it>= 1, <it>mG </it>= 9, <it>tot </it>= 88 (95-7 x's) (see <it>Consensus Building </it>and <it>Merging </it>in Appendix)], and is located at sequence positions 1 &#8211; 85.</p>
            </text>
            <graphic file="1471-2105-8-382-3"/>
         </fig>
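         <p>The consensus error is defined in the Appendix (<it>Consensus Building</it>) and is not restated in the figure legends. One expression that reproduces both reported values is (<it>nG </it>+ <it>cG</it>)/(<it>tot </it>&#8211; <it>mG</it>). The sketch below uses that expression as an assumption inferred from the numbers in Figures 2 and 3; it should not be read as the definition given in the Appendix.</p>
```python
def consensus_error(n_g, c_g, m_g, tot):
    """Assumed formula (nG + cG) / (tot - mG), inferred from the figure legends only."""
    return (n_g + c_g) / (tot - m_g)


# Counts copied from the legends of Figures 2 and 3.
print(round(consensus_error(99, 29, 583, 1595), 2))  # Figure 2, reported as 0.13
print(round(consensus_error(6, 1, 9, 88), 2))        # Figure 3, reported as 0.09
```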
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Example of a Nested TR Architecture</p>
            </caption>
            <text>
               <p><b>Example of a Nested TR Architecture</b>. A nested TR of two hierarchical levels is illustrated with an example from <it>T. brucei </it>(copy number = 7.78, period = 138, positions = 651 &#8211; 1738). Since a nested TR is, by definition, a TR within another TR, the nesting depth corresponds to the number of TR domains that encapsulate a particular nested TR. This example shows nested TRs in two representations: the compressed consensus sequence with nested TRs denoted within brackets, and a graphical depiction of the hierarchical structure and distribution of nested TRs, with the consensus represented by the brown bottom bar, and increasing levels of nesting represented by additional bars moving upward.</p>
            </text>
            <graphic file="1471-2105-8-382-4"/>
         </fig>
         <p>The TR example shown in Figure <figr fid="F2">2</figr> also highlights the utility of the merging feature of XSTREAM when applied to overlapping domains with different periods. Without merging, this TR domain would be reported as several distinct TR fragments. The merging of two non-overlapping TR domains from an <it>A. thaliana </it>hypothetical protein (gi 9293925) is illustrated in Figure <figr fid="F3">3</figr>. This example illustrates the utility of incorporating a highly degenerate intervening sequence to define a larger TR domain that, without merging, would have been divided into two discontinuous regions (x's denote non-matching characters). As in proteins, DNA TRs may also contain extensive degeneracy. The high copy number TR domains shown in Table <tblr tid="T3">3</tblr> represent additional successful applications of XSTREAM's merging feature. Taken together, the merging of overlapping and non-overlapping TR regions allows XSTREAM to model the architectures of TR domains that have accumulated extensive substitution and/or indel mutations, or that have arisen through convergent evolutionary mechanisms.</p>
         <p>In addition to extensive degeneracy, TRs may have very long periods and nested architectures. XSTREAM implements a novel long-period filtering procedure (see Appendix) to find TRs with periods &#8805;1000. The utility of this method is demonstrated by some of the DNA examples in Table <tblr tid="T2">2</tblr> and by the long-period <it>A. thaliana </it>DNA repeats in Table <tblr tid="T3">3</tblr>. XSTREAM also incorporates a strategy to find and describe nested TR architectures, represented by the regular expression [<it>x</it>,<it>n</it>], with <it>n </it>denoting the number of tandem copies of substring <it>x</it>. An example of TR nesting that shows two levels of nesting is presented in Figure <figr fid="F4">4</figr>. Included in the figure is a block diagram illustrating the hierarchical patterning that epitomizes nested TRs. Taken together, these merging, long-period filtration, and nesting features make XSTREAM a useful tool for detection and architecture modeling of TR domains in both nucleotide and protein sequences.</p>
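         <p>The bracket notation [<it>x</it>,<it>n</it>] can be made concrete by expanding a compressed consensus back into its full sequence. The sketch below is an illustration only, not XSTREAM's implementation: it assumes that <it>n </it>is a whole number and that the characters '[', ',' and ']' never occur in the sequence itself, and the function name <it>expand </it>is hypothetical.</p>
```python
def expand(compressed: str) -> str:
    """Expand nested [x,n] groups, where x is a substring and n its tandem copy count."""

    def parse(i: int, inside_group: bool):
        out = []
        while i != len(compressed):
            ch = compressed[i]
            if ch == "[":
                # Parse the repeated unit; i then points at the copy count.
                unit, i = parse(i + 1, True)
                j = compressed.index("]", i)
                out.append(unit * int(compressed[i:j]))
                i = j + 1
            elif ch == "," and inside_group:
                # End of the unit of the current group.
                return "".join(out), i + 1
            else:
                out.append(ch)
                i += 1
        if inside_group:
            raise ValueError("unclosed '[' in compressed consensus")
        return "".join(out), i

    return parse(0, False)[0]


print(expand("[AB,3]C"))     # one level of nesting
print(expand("[[A,2]B,2]"))  # a TR nested within another TR
```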
         <p>To validate the utility of XSTREAM for detecting TR-containing proteins, we analyzed the predicted proteomes of five protozoan parasites and compared our output to the TR proteins identified in these same genomes by TRF <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Protein sequence datasets for these parasites were downloaded <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> and processed using <it>minP </it>= 1, <it>minD </it>= 90 and minimum copy number <it>minC </it>= 2 or 3. These parameter values were chosen to emulate the TR criteria used in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> to find TR domains in gene sequences of at least ~250 bp. Setting <it>minD </it>= 90 amino acids for XSTREAM corresponds to a slightly more stringent 270 bp minimum. Table <tblr tid="T4">4</tblr> summarizes the TRs found by XSTREAM, using <it>minC </it>= 3 or <it>minC </it>= 2, and by TRF <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Using <it>minC </it>= 3, XSTREAM identified more TR-containing proteins in all parasites except <it>T. annulata</it>. In <it>L. infantum</it>, the causative agent of leishmaniasis and the focus of the Goto et al. studies <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>, XSTREAM found seven TR proteins that they did not identify, while three of the TR proteins found by TRF were not detected by XSTREAM. Upon closer examination of the three "missed" proteins, each was found to have a TR domain with copy number less than 3, which would not be reported by XSTREAM using <it>minC </it>= 3. When XSTREAM was rerun with <it>minC </it>= 2, all 64 of the previously identified <it>L. infantum </it>TR proteins <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> were found, along with 14 additional TR-containing proteins. These 14 proteins are diagrammed schematically in Figure <figr fid="F5">5</figr> to illustrate the diversity of their TR domain architectures.</p>
         <tbl id="T4">
            <title>
               <p>Table 4</p>
            </title>
            <caption>
               <p>Number of TR proteins detected in protozoan parasite genomes by XSTREAM and TRF</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="left">
                     <p>
                        <b>Species</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>XSTREAM <it>minC </it>= 3</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>XSTREAM <it>minC </it>= 2</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>TRF</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>L. infantum</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>68 (3, 7)</p>
                  </c>
                  <c ca="center">
                     <p>78 (0, 14)</p>
                  </c>
                  <c ca="center">
                     <p>64</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>L. major</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>65</p>
                  </c>
                  <c ca="center">
                     <p>74</p>
                  </c>
                  <c ca="center">
                     <p>59</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>T. brucei</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>115</p>
                  </c>
                  <c ca="center">
                     <p>135</p>
                  </c>
                  <c ca="center">
                     <p>73</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>P. falciparum</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>252</p>
                  </c>
                  <c ca="center">
                     <p>263</p>
                  </c>
                  <c ca="center">
                     <p>169</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>T. annulata</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>10</p>
                  </c>
                  <c ca="center">
                     <p>20</p>
                  </c>
                  <c ca="center">
                     <p>11</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Numbers in each column represent the number of different TR-containing proteins detected using <it>minP </it>= 1 and <it>minD </it>= 90 amino acids for XSTREAM, and a minimum score of 500 for TRF. Within the parentheses, the number on the left represents the number of genes identified in [18] that were not identified by XSTREAM and the number on the right represents the number of genes identified by XSTREAM that were not identified by [18]. Comparison of output on an individual protein basis was only possible for <it>L. infantum </it>as Goto et al. (2007) did not report identified proteins for the other parasites.</p>
            </tblfn>
         </tbl>
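         <p>The parenthesized values in the <it>L. infantum </it>row of Table 4 can be cross-checked against the totals: the XSTREAM count should equal the TRF count, minus the proteins XSTREAM missed, plus the proteins only XSTREAM found. The short sketch below performs that arithmetic; the function name <it>xstream_count </it>is hypothetical.</p>
```python
def xstream_count(trf_count, missed_by_xstream, found_only_by_xstream):
    """Total XSTREAM proteins implied by the TRF total and the two difference counts."""
    shared = trf_count - missed_by_xstream
    return shared + found_only_by_xstream


# L. infantum row of Table 4: TRF = 64; minC = 3 gives (3, 7); minC = 2 gives (0, 14).
print(xstream_count(64, 3, 7))   # expected to match the minC = 3 column
print(xstream_count(64, 0, 14))  # expected to match the minC = 2 column
```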
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>14 <it>L. infantum </it>TR Proteins Found by XSTREAM</p>
            </caption>
            <text>
               <p><b>14 <it>L. infantum </it>TR Proteins Found by XSTREAM. </b>A colored repeat distribution schematic generated by XSTREAM showing 14 <it>L. infantum </it>TR-containing proteins found by XSTREAM and not by Goto et al. [18]. All protein sequence lengths are normalized, and shown from top to bottom in order of decreasing TR period. TR copies are separated by a vertical black line. Each color corresponds to a specific TR domain. In cases where TR domains of adjacent protein sequences share the same color, such TRs were grouped into the same class by the consensus comparison function (see Appendix).</p>
            </text>
            <graphic file="1471-2105-8-382-5"/>
         </fig>
         <p>Since TR domains can constitute variable fractions of the parent protein sequence (Figure <figr fid="F5">5</figr>), XSTREAM incorporates the simple concept of <it>TR Content</it>, defined as the ratio of the TR domain length to the input sequence length, as an additional metric for comparing modular proteins. Use of this metric allows XSTREAM to filter output at any chosen level of TR content, a feature that is illustrated using the protein sequence dataset from <it>A. thaliana </it>(TAIR6_pep_20060907). The Arabidopsis proteome was analyzed using parameter values <it>minP </it>= 1 and TR Content &#8805; 0.7. The 57 proteins with &#8805;70% TR content identified by this analysis are schematically depicted in Figure <figr fid="F6">6</figr>. This output clearly reveals the modular architectures of two large, well-described <it>A. thaliana </it>protein families (polyubiquitins with period = 76, and proline-rich extensin-like proteins with period = 25) along with that of additional TR proteins.</p>
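         <p>The TR Content filter can be sketched directly from its definition as the ratio of TR domain length to input sequence length. The code below is a minimal illustration under stated assumptions: positions are taken to be 1-based and inclusive, each record holds a single TR domain, and the record layout, function names and toy values are hypothetical and not XSTREAM's own.</p>
```python
def tr_content(tr_start, tr_end, sequence_length):
    """Fraction of the input sequence covered by one TR domain (1-based, inclusive)."""
    return (tr_end - tr_start + 1) / sequence_length


def filter_by_content(records, minimum=0.7):
    """Keep only records whose TR content reaches the chosen minimum."""
    return [
        r for r in records
        if tr_content(r["start"], r["end"], r["length"]) >= minimum
    ]


# Toy records with made-up coordinates, for illustration only.
toy_records = [
    {"name": "protein_a", "start": 1, "end": 380, "length": 400},   # content 0.95
    {"name": "protein_b", "start": 50, "end": 149, "length": 500},  # content 0.20
]
print([r["name"] for r in filter_by_content(toy_records)])
```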
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>TR Proteins from <it>A. thaliana</it></p>
            </caption>
            <text>
               <p><b>TR Proteins from <it>A. thaliana</it></b>. A colored repeat distribution schematic generated by XSTREAM showing the 57 TR-containing proteins from <it>A. thaliana </it>(TAIR6_pep_20060907) with <it>minP </it>= 1 and minimum TR content = 0.7. These protein sequences are ordered by decreasing period from top to bottom. The longest period is shown in the top left panel and the shortest is shown in the bottom right panel. Notice two large classes of protein sequences (polyubiquitins and proline-rich extensin-like family proteins) as determined by grouping their TR domains with the consensus comparison module (see Appendix).</p>
            </text>
            <graphic file="1471-2105-8-382-6"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>The use of <it>a priori </it>computational methods to search genome databases for repetitive elements has revealed an abundance of both DNA and peptide repeats in nature, many of which occur in tandemly repeated patterns <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B8">8</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. The detection and analysis of repeated peptide sequences has received considerable attention in recent years, including the recent publication of a large protein repeats database <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Despite the potential importance of such repetitive sequences, the available repeat detection software suffers from both time complexity and output redundancy problems. To address these issues, and to facilitate the detection and modeling of TR structures in general, we developed a new software tool called XSTREAM.</p>
         <p>Testing and validation demonstrated that XSTREAM detects degenerate tandem repeats efficiently in large input sequence datasets. In practice, XSTREAM running time scaled linearly with both increasing sequence lengths (up to 202.5 Mbp of DNA sequence) and increasing dataset sizes (up to 230,150 protein sequences) (Table <tblr tid="T2">2</tblr>). XSTREAM imposes no period limitations and can thus detect TRs with very long periods, as illustrated by the ~45 kbp tandem duplication identified in chromosome I of <it>A. thaliana </it>(Table <tblr tid="T3">3</tblr>). With the implemented merging heuristic, XSTREAM can also identify TR domains with intermittent regions of high degeneracy, such as the TR from <it>C. elegans </it>chromosome III with period 94 and copy number >400 (Table <tblr tid="T3">3</tblr>), and the proline/glycine-rich protein from <it>C. elegans </it>shown in Figure <figr fid="F2">2</figr>. In addition, by searching for nested TR structures, XSTREAM detects TRs within TRs (Figure <figr fid="F4">4</figr>), a useful feature for gaining insights into the evolution of complex TR architectures.</p>
         <p>Output redundancy is a problem inherent in repeat detection that has often been ignored. For example, using a SW approach, Katti et al. <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> searched Swiss-Prot 38 for TRs with periods between 1 and 20, and compiled the TRIPS database of TRs and their corresponding protein sequence identifiers <url>http://www.ncl-india.org/trips</url>. In many cases, TRs with different periods were reported that occupy the same protein sequence space. The output of another repeat finding tool <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> also demonstrates the importance of redundancy removal. The ProtRepeatsDB tool <url>http://bioinfo.icgeb.res.in/repeats</url> was designed for comparing repeated peptides from many organisms. Though aware of redundancy problems, the strategy implemented by Kalita et al. falls short of providing concise repeat output in numerous cases. For example, ProtRepeatsDB reported 1312 and 568 distinct perfect peptide repeats in the UBQ3 and UBQ12 polyubiquitin sequences from <it>A. thaliana</it>, respectively. Unexpectedly, the canonical period 76 TRs known to characterize polyubiquitins were absent. Such highly redundant outputs illustrate the importance of the redundancy removal tactics incorporated into XSTREAM. By invoking several strategies (see Redundancy Elimination in Appendix), including the use of <it>irreducible </it>TR periods <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, XSTREAM produces non-redundant TR output. Analysis of the <it>A. thaliana </it>proteome by XSTREAM, for example, reports the UBQ3 and UBQ12 sequences only once, with an irreducible, period 76 TR covering virtually the entire protein sequences.</p>
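         <p>The idea behind an irreducible period can be illustrated on exact repeats: a repeat unit is reducible when it is itself a whole number of copies of a shorter unit, so a period 4 unit such as ABAB reduces to period 2. The sketch below applies this test to exact strings only. It is an assumption-laden illustration rather than XSTREAM's procedure, which must also handle degenerate copies (see Redundancy Elimination in Appendix), and the function name <it>irreducible_period </it>is hypothetical.</p>
```python
def irreducible_period(unit: str) -> int:
    """Length of the shortest substring whose exact tandem copies rebuild the unit."""
    n = len(unit)
    for p in range(1, n + 1):
        # p must divide n, and repeating the first p characters must give the unit back.
        if n % p == 0 and unit[:p] * (n // p) == unit:
            return p
    return n  # only reached for the empty string


print(irreducible_period("ABAB"))  # reducible: the period 4 unit is two copies of AB
print(irreducible_period("ABCA"))  # already irreducible
```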
         <p>The recent analysis of five protozoan parasite genomes using TRF <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> provided a reasonable reference for testing XSTREAM on genome-scale datasets. Using <it>minD </it>= 90 to mimic the TR domain criterion used by Goto et al., XSTREAM detected significantly more TR proteins from all parasite genomes, including all 64 of the previously identified <it>L. infantum </it>TR proteins <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Further analysis of these 64 TR protein domains revealed that the TR domains identified by both algorithms were comparable in size (data not shown).</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>By testing XSTREAM on a variety of sequence data, we demonstrated the utility of this new genome data-mining tool for identifying TRs with diverse periods and domain sizes, varied levels of degeneracy, and complex architectures. These capabilities should facilitate potentially significant applications. For example, TRs present in parasitic pathogens are known to elicit important immunological responses that may provide antigenic protection (e.g., <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>). New computational approaches for detecting TR proteins might thus help to identify novel protein antigens for diagnostics and vaccine development <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. In addition, since TR domains are characteristic of modular structural proteins, use of XSTREAM may lead to the <it>in silico </it>discovery of phylogenetically diverse proteins with novel biomaterials and biomimetic applications.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>Project Name: XSTREAM</p>
         <p>Project home page and availability: <url>http://jimcooperlab.mcdb.ucsb.edu/xstream</url></p>
         <p>Operating system(s): Platform independent</p>
         <p>Programming language: Java</p>
         <p>Any restrictions to use by non-academics: yes, contact author JBC for details</p>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations used</p>
         </st>
         <p>TR, tandem repeat; TRF, Tandem Repeats Finder <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>; DP, dynamic programming; GRDP, gap-restricted dynamic programming; SSA, sequence self-alignment; SW, sliding window; SE, seed extension; WDP, wrap-around dynamic programming; CC, consensus comparison; ET, edge trimming; CW, comparison wobble; <it>minP</it>, minimum period; <it>minC</it>, minimum copy number; <it>minD</it>, minimum TR domain length; HPS, heuristic partitioning strategy</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>AMN conceived of, designed, implemented, tested, and validated XSTREAM, and wrote the manuscript. JBC conceived of, tested, and validated XSTREAM, and wrote the manuscript. Both authors approved the final manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Appendix</p>
         </st>
         <sec>
            <st>
               <p>Preliminary Notations</p>
            </st>
            <p>&#8226; <it>S </it>= input sequence, which takes values from alphabet {A,C,G,T} for nucleotide sequences and alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} for proteins</p>
            <p>&#183; |<it>S</it>| = length of <it>S</it></p>
            <p>&#183; <it>S</it>[<it>j</it>] = the character at index <it>j </it>in <it>S </it>with <it>j </it>&#8805; 0</p>
            <p>&#183; <it>S</it>[<it>i</it>, <it>j</it>] = the subsequence in <it>S </it>from index <it>i </it>to index <it>j </it>inclusively</p>
            <p>&#8226; <it>Xi </it>= TR domain <it>i</it></p>
            <p>&#183; |<it>Xi</it>| = length of entire TR domain <it>Xi</it></p>
            <p>&#183; <it>Xi</it>[<it>j</it>] = repeat copy <it>j </it>in <it>Xi </it>with <it>j </it>&#8805; 0</p>
            <p>&#183; |<it>Xi</it>[<it>j</it>]| = length of copy <it>j</it></p>
            <p>&#183; |<it>Xi</it>[]| = size of array <it>Xi</it>[]</p>
            <p>&#183; <it>XiS </it>= lowest index of <it>Xi</it>; starting position in <it>S</it></p>
            <p>&#183; <it>XiE </it>= highest index of <it>Xi</it>; ending position in <it>S</it></p>
            <p>&#183; <it>XiSE </it>= index range [<it>XiS</it>, <it>XiE</it>]</p>
            <p>&#183; <it>Ei </it>= copy number (exponent) of <it>Xi</it></p>
            <p>&#183; <it>Ci </it>= consensus sequence of <it>Xi</it></p>
            <p>&#183; <it>Pi </it>= period of <it>Xi </it>= period of <it>Ci</it></p>
            <p>&#183; <it>CEi </it>= consensus error of <it>Xi </it>=</p>
            <p>- <it>Without gaps: </it># of mismatching characters to consensus/total # of characters in aligned <it>Xi</it></p>
            <p>- <it>With gaps: </it>see <it>Consensus Building</it></p>
            <p>&#183; <it>Ii </it>= indel error of <it>Xi </it>= # of gaps in aligned <it>Xi</it>/total # of characters in aligned <it>Xi</it></p>
            <p>&#183; <it>Ri </it>= referential repeat copy of <it>Xi</it>: used during TR domain expansion and maximality</p>
            <p>&#8226; {<it>X</it>} = {<it>X</it><sub>0</sub>, <it>X</it><sub>1</sub>,..., <it>X</it><sub><it>n</it></sub>} = set of all identified TR domains</p>
         </sec>
         <sec>
            <st>
               <p>Pre-Processing</p>
            </st>
            <p>To find repeats of various periods in any FASTA-formatted input sequence <it>S</it>, XSTREAM looks, by default, for exact repetitions (seeds) of lengths 3 and 5. Length 7 is also used if |<it>S</it>| &#8805; 2000. Seed lengths are user-adjustable. XSTREAM records the distance between each pair of adjacent seeds, |<it>p </it>- <it>q</it>|, where the lowest index in <it>S </it>of each seed in the pair is represented by <it>p </it>and <it>q </it>respectively, and <it>p </it>&lt;<it>q</it>. All seed positions and distances between adjacent seeds are stored and accessed using a hash table. In addition, XSTREAM records in an integer array <it>M, </it>the hashcodes and sequence indices for all seeds of minimum length <it>L</it>, where <it>L </it>= 3 by default. For instance, a seed of length <it>L </it>starting in position 5 in <it>S </it>would have its hashcode stored in <it>M</it>[5]. The utility of <it>M </it>is explained shortly.</p>
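            <p>The pre-processing step can be sketched roughly as follows for a single seed length. This is a minimal illustration in Python, not the XSTREAM implementation (which is written in Java): the function names <it>seed_code </it>and <it>preprocess </it>are hypothetical, and a simple collision-free integer encoding stands in for whatever hashcode XSTREAM actually uses.</p>

```python
from collections import defaultdict


def seed_code(word):
    """Integer code for a short ASCII word (assumed stand-in for a hashcode)."""
    code = 0
    for ch in word:
        code = code * 256 + ord(ch)
    return code


def preprocess(S, L=3):
    """Return (M, pairs) for seed length L.

    M[j] holds the code of S[j, j + L - 1], or None when fewer than L
    characters remain. pairs lists (distance, p, q) for adjacent
    occurrences of the same seed, sorted by distance and then by p.
    """
    M = [None] * len(S)
    positions = defaultdict(list)  # hash table: seed code -> start indices
    for j in range(len(S) - L + 1):
        code = seed_code(S[j:j + L])
        M[j] = code
        positions[code].append(j)

    pairs = []
    for starts in positions.values():
        for p, q in zip(starts, starts[1:]):  # adjacent occurrences only
            pairs.append((q - p, p, q))
    pairs.sort()
    return M, pairs


M, pairs = preprocess("PQKYRSACYKYRACYFG")
```

            <p>Sorting the (distance, <it>p</it>, <it>q</it>) triples gives the traversal order used during TR detection: increasing distance first, then increasing index in <it>S</it>.</p>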
         </sec>
         <sec>
            <st>
               <p>TR Detection</p>
            </st>
            <sec>
               <st>
                  <p>Seed Extension</p>
               </st>
               <p>XSTREAM traverses the distance list in order of increasing distance, and for each set of identical distances, moves down <it>S </it>in order of increasing indices. For a given seed pair, let <it>p</it>, <it>q </it>be defined the same as previously and let <it>x</it>, <it>y </it>be the starting positions of two sequence iterators, where <it>x </it>= <it>p </it>+ <it>L</it>, <it>y </it>= <it>q </it>+ <it>L</it>. Further, let <it>d </it>= |<it>p </it>- <it>q</it>|, <it>p</it>* = <it>p </it>+ <it>d </it>- 1, and <it>q</it>* = <it>q </it>+ <it>d </it>+ <it>&#949; </it>- 1 where <it>0 </it>&#8804; <it>&#949; </it>&#8804; <it>g </it>(for explanation of <it>g</it>, see Gap-Restricted Dynamic Programming below; &#949; is explained shortly) and <it>q</it>* &lt; |<it>S</it>|. Because the seeds of each matching pair are of length <it>L</it>, <it>x </it>and <it>y </it>iterate through <it>S </it>in the regions <it>S</it>[<it>p </it>+ <it>L</it>, <it>p</it>*] and <it>S</it>[<it>q </it>+ <it>L</it>, <it>q</it>*]. Note that in the case <it>L </it>= 3, the minimum copy number is 2 for all periods except periods 1 and 2, which cannot have copy number less than 4 and 2.5 respectively. We now refer to array <it>M</it>, which was constructed during seed detection. To bypass individual character comparison, <it>M </it>is interrogated for matching hashcodes. If <it>M</it>[<it>x</it>] = <it>M</it>[<it>y</it>] and (<it>x </it>+ <it>L</it>) &#8804; <it>p</it>* and (<it>y </it>+ <it>L</it>) &#8804; <it>q</it>*, <it>x </it>and <it>y </it>are incremented by <it>L </it>(since each hashcode in <it>M </it>corresponds to a repeat of length <it>L</it>), and a match of <it>L </it>characters is recorded. By comparing hashcodes instead of substrings and by allowing jumping in blocks of <it>L </it>characters, usage of <it>M </it>can decrease XSTREAM running time. If <it>M</it>[<it>x</it>] = <it>M</it>[<it>y</it>] and <it>x </it>&#8804; <it>p</it>* &lt; (<it>x </it>+ <it>L</it>), a match of length <it>min</it>(<it>L</it>, <it>p</it>* - (<it>x </it>- 1)) is recorded, and SE terminates. If <it>M</it>[<it>x</it>] &#8800; <it>M</it>[<it>y</it>] and <it>g </it>= 0, XSTREAM compares the character pair in <it>S </it>at <it>S</it>[<it>x</it>] and <it>S</it>[<it>y</it>]. Whether or not <it>S</it>[<it>x</it>] = <it>S</it>[<it>y</it>], if (<it>x </it>+ 1) &#8804; <it>p</it>* and (<it>y </it>+ 1) &#8804; <it>q</it>*, <it>x </it>and <it>y </it>are incremented by 1, and XSTREAM returns to hashcode comparison using <it>M</it>.</p>
               <p>If the case arises where <it>M</it>[<it>x</it>] &#8800; <it>M</it>[<it>y</it>] and <it>g </it>> 0, a novel procedure termed "comparison wobble" (CW) is invoked. CW allows for efficient approximation of indels using array <it>M </it>and parameter <it>g</it>. This procedure is one-sided, in that it fixes <it>x </it>and allows for variations in <it>y</it>, denoted by <it>y</it>*. We place the following restrictions on <it>y</it>*:</p>
               <p>i) |<it>y</it>* - <it>y</it>| &#8804; <it>g</it></p>
               <p>ii) <it>y</it>* &lt; |<it>S</it>|</p>
               <p>iii) If <it>y</it>* &lt;<it>y</it>, then (<it>y </it>- <it>y</it>*) &#8804; <it>L </it>AND (<it>y </it>- <it>y</it>*) &lt;<it>d</it>. We enforce this constraint to avoid comparing subsequences at the same pair of positions in <it>S </it>more than once.</p>
               <p>iv) <it>y</it>* > &#937;, where &#937; = highest index in <it>S</it>[<it>q </it>+ <it>L</it>, <it>q</it>*] with matching character from the current seed extension &#8211; e.g. if last match was <it>M</it>[15], then &#937; = 15 + <it>L </it>- 1; if last match was <it>S</it>[15], then &#937; = 15. This rule prohibits matching redundancy.</p>
               <p>If &#8707;<it>y</it>* such that <it>M</it>[<it>x</it>] = <it>M</it>[<it>y</it>*], XSTREAM records a match of <it>min</it>(<it>L</it>, <it>p</it>* - (<it>x </it>- 1)), increments <it>x </it>by <it>L</it>, sets <it>y </it>&#8592; (<it>y</it>* + <it>L</it>), and if <it>x </it>&#8804; <it>p</it>*, returns to standard SE (see above paragraph). Because <it>y </it>&#8592; (<it>y</it>* + <it>L</it>), it is possible that <it>y </it>moves beyond <it>q </it>+ <it>d </it>- 1, hence the need for <it>&#949;</it>. In addition, if a match is found when <it>y</it>* &lt;<it>y </it>(prior to updating <it>y</it>), the mismatch record is adjusted to take into account any currently matching characters that were initially found to be non-matching. If <it>M</it>[<it>x</it>] &#8800; <it>M</it>[&#8704;<it>y</it>*], XSTREAM transitions to single character comparison using <it>S</it>, and then if space permits, returns to standard comparison using <it>M</it>. An example of seed extension with CW is shown in Figure <figr fid="F7">7</figr>.</p>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p>Seed Extension Example</p>
                  </caption>
                  <text>
                     <p><b>Seed Extension Example</b>. Extension of the seed pair 'KYR' is illustrated using the input sequence <it>S </it>= PQKYRSACYKYRACYFG (|<it>S</it>| = 17) with parameter values <it>L </it>= 3 and <it>g </it>= 1. A tracing of this SE example is shown for the sequence iterator values (<it>x</it>, <it>y</it>) and the compared subwords in <it>S</it>. The SE subroutine used in each step is indicated in parentheses, where <it>M </it>= hashcode array and CW = comparison wobble.</p>
                  </text>
                  <graphic file="1471-2105-8-382-7"/>
               </fig>
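               <p>For the simpler case <it>g </it>= 0, the seed extension loop might be sketched as below. This is an illustrative Python simplification rather than the authors' Java code: comparison wobble is not implemented, <it>q</it>* is clamped to the end of <it>S </it>as an assumption, and the names <it>seed_code </it>and <it>extend_seed_ungapped </it>are hypothetical.</p>

```python
def seed_code(word):
    """Integer code for a short ASCII word (assumed stand-in for a hashcode)."""
    code = 0
    for ch in word:
        code = code * 256 + ord(ch)
    return code


def extend_seed_ungapped(S, p, q, L=3):
    """Return (matches, mismatches) for extending the seed pair (p, q) with g = 0."""
    # M[j] is the code of the L-mer starting at j, or None near the end of S.
    M = [seed_code(S[j:j + L]) if len(S) >= j + L else None
         for j in range(len(S))]
    d = q - p
    p_star = p + d - 1
    q_star = min(q + d - 1, len(S) - 1)  # assumption: clamp q* inside S
    x, y = p + L, q + L
    matches, mismatches = L, 0  # the seed itself contributes L matches
    while p_star >= x and q_star >= y:
        same_block = M[x] is not None and M[x] == M[y]
        if same_block and p_star >= x + L and q_star >= y + L:
            # Whole block matches and fits: jump ahead by L characters.
            matches += L
            x += L
            y += L
        elif same_block and x + L > p_star:
            # Block reaches the end of the first copy: record and stop.
            matches += min(L, p_star - (x - 1))
            break
        else:
            # Fall back to a single character comparison.
            if S[x] == S[y]:
                matches += 1
            else:
                mismatches += 1
            x += 1
            y += 1
    return matches, mismatches
```

               <p>Under these assumptions, a perfect two-copy repeat such as ABCDEFABCDEF is resolved with a single block comparison after the seed, whereas the Figure 7 sequence, which contains a deletion, yields only mismatches after the seed unless <it>g </it>> 0 and comparison wobble is available.</p>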
            </sec>
            <sec>
               <st>
                  <p>TR Domain Expansion</p>
               </st>
               <p>Seed extension operates on seed pairs, and therefore, if successful, only yields putative TRs of copy number 2. To further extend each potential TR <it>Xi</it>, XSTREAM implements two procedures, although the second one is used only if <it>g </it>> 0. First, <it>x </it>is reset to <it>p</it>. In this way the copy in <it>Xi </it>with the lowest index serves as the character comparison reference repeat <it>Ri</it>. The value given to <it>q </it>depends upon whether XSTREAM is attempting to extend <it>Xi </it>downstream or upstream of <it>Xi</it>'s current sequence region. If downstream, <it>q </it>is incremented by <it>d</it>. If upstream, <it>q </it>is initially set to <it>p </it>- <it>d</it>, and decremented by <it>d </it>thereafter. The first method for domain expansion is exactly the same as seed extension except <it>x </it>= <it>p</it>, <it>y </it>= <it>q</it>, and the evaluated regions in <it>S </it>are <it>S</it>[<it>p</it>, <it>p</it>*] and <it>S</it>[<it>q</it>, <it>q</it>*], where 0 &#8804; <it>q </it>&#8804; (|<it>S</it>| - <it>d</it>). If this procedure is successful, the new copy is added to <it>Xi</it>. If unsuccessful and if <it>g </it>> 0, XSTREAM invokes the second procedure, which uses GRDP (see Gap Restricted Dynamic Programming below) on the same regions in <it>S</it>. GRDP is better, albeit slower, than CW at identifying indel regions. Upon completion of GRDP, the number of matching characters in the alignment is determined and if that number is high enough, the new copy is added to <it>Xi</it>. Following success by either expansion method, <it>q </it>is updated and domain expansion is performed again. If <it>i </it>is not satisfied, domain expansion ceases, and the current candidate TR domain is sent to the maximality function.</p>
            </sec>
            <sec>
               <st>
                  <p>Maximality</p>
               </st>
               <p>The maximality procedure makes use of <it>Ri</it>, with <it>p </it>remaining equal to the lowest index of <it>Ri</it>. This method finds the longest valid prefix and suffix of <it>Ri </it>by searching downstream and upstream of <it>Xi </it>respectively. A DP sequence alignment scoring scheme is used, with match = 2, mismatch = -4, and gaps = -4 (user modifiable). Let <it>l </it>= <it>XiS</it>, <it>r </it>= <it>XiE</it>, <it>left </it>= <it>l </it>- <it>min</it>(<it>Pi</it>, <it>l</it>), and <it>right </it>= <it>r </it>+ <it>min</it>(<it>Pi</it>, |<it>S</it>| - (<it>r </it>+ 1)). Further, let <it>Q</it><sub>1 </sub>= <it>S</it>[<it>left</it>, <it>l </it>- 1], <it>Q</it><sub>2 </sub>= <it>S</it>[<it>r </it>+ 1, <it>right</it>], <it>RiQ</it><sub>1 </sub>= <it>S</it>[(<it>p </it>+ <it>Pi</it>) - <it>min</it>(<it>Pi</it>, <it>l</it>), <it>p </it>+ <it>Pi </it>- 1], and <it>RiQ</it><sub>2 </sub>= <it>S</it>[<it>p</it>, <it>p </it>+ <it>min</it>(<it>Pi</it>, |<it>S</it>| - (<it>r </it>+ 1)) - 1]. Since XSTREAM needs to find the character pair that corresponds to the highest score, it reverses the order of characters for both <it>Q</it><sub>1 </sub>and <it>RiQ</it><sub>1 </sub>prior to alignment. If <it>g </it>> 0, GRDP is used to align <it>Q</it><sub>1 </sub>with <it>RiQ</it><sub>1 </sub>and <it>Q</it><sub>2 </sub>with <it>RiQ</it><sub>2</sub>. If <it>g </it>= 0, the sequences are aligned so that the members of each sequence pair overlap 100%. XSTREAM uses the DP scoring scheme regardless of whether GRDP is used. The highest scoring indices in <it>Q</it><sub>1</sub>, <it>Q</it><sub>2 </sub>are denoted <it>Q</it><sub>1</sub>* and <it>Q</it><sub>2</sub>* respectively. If, at index <it>Q</it><sub>1</sub>*, the score exceeds 0, <it>Xi </it>is extended upstream by (<it>Q</it><sub>1</sub>* + 1) characters, and if the score for index <it>Q</it><sub>2</sub>* is greater than 0, <it>Xi </it>is extended downstream by (<it>Q</it><sub>2</sub>* + 1) characters.</p>
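               <p>A minimal Python sketch of the maximality step for <it>g </it>= 0 is given below, using the default scores stated above. It is an illustration under stated assumptions, not the XSTREAM source: the function names are hypothetical, GRDP is not used, and ties in score are resolved in favor of the shorter extension.</p>

```python
def best_prefix(flank, ref, match=2, mismatch=-4):
    """Length of the highest-scoring prefix of flank against ref (0 if no score > 0)."""
    score = best = 0
    length = 0
    for k, (a, b) in enumerate(zip(flank, ref)):
        score += match if a == b else mismatch
        if score > best:
            best, length = score, k + 1  # highest-scoring index k gives k + 1 characters
    return length


def maximal_extension(S, l, r, p, P):
    """Return (upstream, downstream) extension lengths for a TR occupying S[l, r].

    p is the lowest index of the referential copy and P is the period.
    """
    up = min(P, l)
    down = min(P, len(S) - (r + 1))
    Q1 = S[l - up:l]              # flank upstream of the domain
    Q2 = S[r + 1:r + 1 + down]    # flank downstream of the domain
    RQ1 = S[p + P - up:p + P]     # suffix of the referential copy
    RQ2 = S[p:p + down]           # prefix of the referential copy
    # The upstream pair is reversed so that scoring proceeds away from the domain.
    return best_prefix(Q1[::-1], RQ1[::-1]), best_prefix(Q2, RQ2)
```

               <p>For example, with <it>S </it>= ABCDABCDABXX and a period 4 domain at <it>S</it>[0, 7], the downstream flank ABXX scores 2, 4, 0, -4 against ABCD, so the domain is extended by two characters.</p>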
            </sec>
            <sec>
               <st>
                  <p>Copy Number Computation</p>
               </st>
               <p>For a given <it>Xi</it>, using the indicator function I (I[true] = 1; I[false] = 0):</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2105-8-382-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mo>&#8226;</m:mo>
                              <m:mi>E</m:mi>
                              <m:mi>i</m:mi>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mo>&#8704;</m:mo>
                                       <m:mi>j</m:mi>
                                       <m:mtext>&#160;</m:mtext>
                                       <m:mo>&#8712;</m:mo>
                                       <m:mi>X</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                 </m:munder>
                                 <m:mrow>
                                    <m:mtext>I[|</m:mtext>
                                    <m:mi>X</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>[</m:mtext>
                                    <m:mi>j</m:mi>
                                    <m:mtext>]|</m:mtext>
                                    <m:mo>&#8805;</m:mo>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>]</m:mtext>
                                    <m:mo>+</m:mo>
                                    <m:mtext>I[|</m:mtext>
                                    <m:mi>X</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>[</m:mtext>
                                    <m:mi>j</m:mi>
                                    <m:mtext>]|</m:mtext>
                                    <m:mo>&lt;</m:mo>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>]</m:mtext>
                                    <m:mo>&#8901;</m:mo>
                                    <m:mtext>(|</m:mtext>
                                    <m:mi>X</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>[</m:mtext>
                                    <m:mi>j</m:mi>
                                    <m:mtext>]|/</m:mtext>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mtext>)&#160;</m:mtext>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGfbqrcqWGPbqAcqGH9aqpdaaeqbqaaiabbMeajjabbUfaBjabbYha8jabdIfayjabdMgaPjabbUfaBjabdQgaQjabb2faDjabbYha8jabgwMiZkabdcfaqjabdMgaPjabb2faDjabgUcaRiabbMeajjabbUfaBjabbYha8jabdIfayjabdMgaPjabbUfaBjabdQgaQjabb2faDjabbYha8jabgYda8iabdcfaqjabdMgaPjabb2faDjabgwSixlabbIcaOiabbYha8jabdIfayjabdMgaPjabbUfaBjabdQgaQjabb2faDjabbYha8jabb+caViabdcfaqjabdMgaPjabbMcaPiabbccaGaWcbaGaeyiaIiIaemOAaOMaeeiiaaIaeyicI4SaemiwaGLaemyAaKgabeqdcqGHris5aaaa@6DA2@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>Computing <it>Ei </it>in this way guarantees that <it>Ei </it>&#8804; |<it>Xi</it>[]|. Neither gap characters nor masked characters ('x', see Merging) are considered during <it>copy number computation</it>. <it>Ei </it>is updated whenever XSTREAM changes <it>XiS</it>, <it>XiE</it>, <it>Pi</it>, or <it>Xi</it>'s multiple alignment.</p>
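               <p>The copy number formula can be illustrated with a short Python sketch. The function name <it>copy_number </it>is hypothetical, and the copy lengths passed in are assumed to have been counted already without gap or masked characters.</p>

```python
def copy_number(copy_lengths, period):
    """Copy number: full-length copies count 1, shorter copies count length / period."""
    total = 0.0
    for n in copy_lengths:
        total += 1.0 if n >= period else n / period
    return total


# Two full copies of period 4 plus a partial copy of length 2.
E = copy_number([4, 4, 2], 4)
```

               <p>Because every term is at most 1, the result can never exceed the number of copies in the array.</p>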
            </sec>
            <sec>
               <st>
                  <p>Sequence Masking</p>
               </st>
               <p>After each successful seed extension, XSTREAM masks the sequence space corresponding to the newly detected TR domain in order to reduce both running time and repeat redundancy (see Redundancy Elimination I and Two-stage TR Detection below). Afterward, the next seed pair, if one exists, is extended.</p>
            </sec>
            <sec>
               <st>
                  <p>Period Offset</p>
               </st>
               <p>If <it>g </it>> 0 and comparison wobble is successfully used, then the period <it>Pi </it>for a given TR <it>Xi </it>may need adjustment. To approximate a better period, <it>Pi</it>*, we turn to the offset <it>y</it>* - <it>y </it>for every CW success for a given <it>Xi</it>. Let <it>So </it>= &#931;(<it>y</it>* - <it>y</it>), for all successful extensions, i.e. <it>Xi</it>[&#8704;<it>j</it>] &#8800; <it>Ri</it>. Then, <it>Pi</it>* = <it>Pi </it>+ (<it>So</it>/<it>Ei</it>), and <it>Pi </it>&#8592; <it>Pi</it>*. Therefore, <it>Pi </it>is updated using the average period offset. This function is important for TR domain parsing when <it>g </it>> 0, since <it>Pi </it>is used to derive a temporary <it>Ci</it>, which is needed for TR domain alignment.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>TR Characterization</p>
            </st>
            <sec>
               <st>
                  <p>TR Domain Parsing</p>
               </st>
               <p>In order to best characterize any TR domain <it>Xi</it>, its copies are aligned to one another and used to create a consensus sequence <it>Ci</it>. We describe our consensus derivation procedure shortly. To align <it>Xi</it>, it must be partitioned into its repetitive parts. For the case <it>g </it>= 0, starting from <it>XiS</it>, <it>S</it>[<it>XiS</it>, <it>XiE</it>] is cut into as many tandem fragments of length <it>Pi </it>as possible. Because of maximality, <it>Xi</it>'s last copy may have length less than <it>Pi</it>. Multiple alignment of <it>Xi </it>is achieved by simply stacking all copies in the order they occur in <it>S</it>. If <it>g </it>> 0, partitioning of <it>Xi </it>is much more complex. To preserve practical running time for the case <it>g </it>> 0, we use one of two segmentation tactics. Both methods require a putative consensus sequence <it>Ci </it>for a given <it>Xi</it>. XSTREAM therefore initially partitions <it>Xi </it>in the same way as when <it>g </it>= 0. Afterward, <it>Xi </it>is aligned using a multiple alignment algorithm that we describe shortly. Following alignment, a transient <it>Ci </it>is derived. We now compare/contrast XSTREAM's two partitioning procedures for the case <it>g </it>> 0.</p>
               <p>WDP can optimally parse a TR domain <it>Xi </it>in O(<it>mn</it>) time given a representative copy of length <it>m </it>(i.e. <it>Ci</it>), where <it>m </it>= <it>Pi </it>and <it>n </it>= |<it>Xi</it>| <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. This time complexity remains practical until <it>mn </it>becomes very large. Since XSTREAM has no period limitations, we developed a heuristic partitioning strategy (HPS) that uses GRDP. When <it>mn </it>> 1,000,000 and <it>m </it>> <it>g</it>, XSTREAM invokes HPS; otherwise, WDP is used. Our version of WDP requires two passes through the DP matrix and therefore computes 2 <it>mn </it>scores, whereas GRDP computes &lt; (2<it>g + </it>1)<it>n </it>scores. To ensure that HPS makes fewer DP matrix computations than WDP, we require <it>m </it>> (<it>g </it>+ 1/2), which is equivalent to <it>m </it>> <it>g </it>since <it>m</it>, <it>g </it>only take integer values.</p>
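               <p>The selection rule and the score counts behind it can be written out as a small Python sketch. The function names are hypothetical, and the GRDP figure returned is the upper bound (2<it>g </it>+ 1)<it>n </it>rather than an exact count.</p>

```python
def choose_partitioner(m, n, g, threshold=1_000_000):
    """Pick the partitioning strategy for period m, domain length n, gap limit g."""
    if m * n > threshold and m > g:
        return "HPS"
    return "WDP"


def score_counts(m, n, g):
    """Return (WDP scores, upper bound on GRDP scores)."""
    return 2 * m * n, (2 * g + 1) * n


strategy = choose_partitioner(2000, 1000, 3)
wdp, grdp_bound = score_counts(2000, 1000, 3)
```

               <p>With <it>m </it>= 2000, <it>n </it>= 1000 and <it>g </it>= 3, WDP would compute 4,000,000 scores against a bound of 7,000 for GRDP, which is the kind of gap that motivates HPS for long periods.</p>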
               <p>As mentioned, both partitioning strategies require <it>Ci</it>. WDP aligns <it>Ci </it>to the domain <it>D </it>= <it>S</it>[<it>XiS</it>, <it>XiE</it>]. Afterward, <it>D </it>is cut between every adjacent instance of <it>Ci</it>. HPS works by first building a concatamer of <it>Ci </it>composed of <it>n </it>copies of <it>Ci</it>, where <it>n </it>= |<it>Xi</it>|/|<it>Ci</it>|. Because <it>n </it>may take a non-integer value, the consensus concatamer can have more or fewer copies than an optimal partitioning of <it>Xi</it>. After pairwise alignment to <it>D </it>using GRDP, <it>Xi </it>is segmented in the same way as described for WDP.</p>
            </sec>
            <sec>
               <st>
                  <p>Multiple Alignment</p>
               </st>
               <p>XSTREAM employs the STAR alignment algorithm for multiple sequence alignment. The center sequence is computed using GRDP exclusively. We elected to use GRDP over standard DP because the number of pairwise alignments that are needed increases as a function of (<it>floor</it>(<it>Ei</it>))<sup>2 </sup>(we use the floor function since <it>Ei </it>may be non-integer), in which case the last copy is excluded from being a center sequence. Because our version of STAR does not use standard DP, it will not always compute an optimal center sequence. Nevertheless, to maximize the practicality of XSTREAM for large dataset analyses, we decided that the order of magnitude performance gain provided by GRDP outweighs the possible decrease in multiple alignment quality. Since GRDP requires input sequences of the same length, we temporarily replicate <it>Xi</it>, denoted by <it>Xi</it>*, and add the dash character '-' to the rightmost end of all copies of <it>Xi</it>* where |<it>Xi</it>*[<it>j</it>]| &lt;<it>max</it>(|<it>Xi</it>[&#8704;<it>j</it>]|) until |<it>Xi</it>*[&#8704;<it>j</it>]| = <it>max</it>(|<it>Xi</it>[&#8704;<it>j</it>]|). We then find the center using <it>Xi</it>*. Following center sequence determination, the TR multiple alignment is constructed using the conventional STAR alignment strategy. Because practical running time is emphasized in our implementation, pairwise sequence comparisons during STAR Alignment may be computed in a non-optimal manner using GRDP.</p>
            </sec>
            <sec>
               <st>
                  <p>Consensus Building</p>
               </st>
               <p>XSTREAM's consensus derivation procedure makes use of the majority rule. That is, for the multiple alignment of a given <it>Xi</it>, the majority character in each column of the alignment is selected. If no majority exists, the topmost character is chosen in general, with two exceptions. If |<it>Xi</it>[]| = 2, and if within a given column, one character is a gap and the other is a non-gap, the gap character is added to the consensus. If, on the other hand, |<it>Xi</it>[]| > 2, and if within a column, a gap character is tied in number with one or more non-gap characters, the topmost non-gap character is added to the consensus.</p>
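               <p>These column rules might be sketched in Python as follows. This is one reading of the procedure rather than the authors' code: "majority" is interpreted here as the most frequent character in a column, the rows are assumed to be aligned to equal length, and the function name <it>consensus </it>is hypothetical.</p>

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Consensus of equal-length aligned rows using the majority and tie rules."""
    result = []
    for column in zip(*rows):
        counts = Counter(column)
        top = max(counts.values())
        # Characters sharing the highest count, kept in top-to-bottom order.
        tied = [c for c in column if counts[c] == top]
        if len(set(tied)) == 1:
            result.append(tied[0])           # clear winner
        elif len(rows) == 2 and gap in column:
            result.append(gap)               # two copies, gap versus non-gap
        else:
            non_gap = [c for c in tied if c != gap]
            result.append(non_gap[0])        # topmost tied non-gap character
    return "".join(result)
```

               <p>For instance, the two-copy alignment AB-D over ABCD keeps the gap in the third column, whereas a four-copy column in which two gaps tie with two C characters yields C.</p>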
               <p>To compute the consensus error <it>CEi </it>for a given <it>Xi</it>, we keep track of four variables:</p>
               <p>i) The non-gap counter, denoted <it>nG</it>, tallies every non-gap character that does not match its corresponding consensus character.</p>
               <p>ii) The majority gap counter, <it>mG</it>, records the number of gaps in all columns where the majority character is a gap.</p>
               <p>iii) A user-modifiable constant, <it>g</it>* (=3, by default), specifies the maximum number of consecutive gaps in an alignment row that can be counted toward <it>CEi</it>. For each row of the alignment, we count the number of successive gaps that do not match the consensus until either that number equals <it>g</it>*, a non-gap character is reached, or the consensus contains a gap. We resume counting gaps the next time a gap is encountered in a column where the consensus character is a non-gap. Let <it>cG </it>equal the final count.</p>
               <p>iv) Let <it>tot </it>= total number of characters in the multiple alignment of <it>Xi</it>, including gaps.</p>
               <p>We set <it>CEi </it>= (<it>nG </it>+ <it>cG</it>)/(<it>tot </it>- <it>mG</it>). The quantity <it>mG </it>is subtracted from <it>tot </it>so that gaps in columns with a gap majority do not decrease <it>CEi</it>. Further, the addition of <it>cG </it>to the numerator functions to limit the extent to which gaps increase <it>CEi</it>. We dampen the role gaps play in <it>CEi </it>since they are artificial characters. In addition, we force <it>Pi </it>to equal the number of non-gap characters in <it>Ci</it>, and therefore, if necessary, <it>Pi </it>is updated.</p>
            </sec>
            <sec>
               <st>
                  <p>Edge Trimming</p>
               </st>
               <p>For each <it>Xi</it>, <it>Edge Trimming </it>(ET) moves downstream from <it>XiS </it>and upstream from <it>XiE</it>, deleting characters that mismatch with <it>Ci </it>until the first matching character pair is found from each direction. <it>Xi </it>is realigned if truncation is successful from the top-left, since otherwise we would start the alignment with one or more gaps. If ET is only successful from the bottom right, no realignment is necessary. In this case, XSTREAM removes both the flagged bottom right portion of the alignment as well as any columns that contain all gaps. If ET is a success from either direction, <it>Ci </it>is rebuilt. For each <it>Xi</it>, ET is iteratively invoked until either |<it>Xi</it>[]| = 2 or both edges of <it>Xi </it>agree with <it>Ci</it>.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Post-Processing</p>
            </st>
            <sec>
               <st>
                  <p>Merging</p>
               </st>
               <p>XSTREAM iterates through {<it>X</it>} in order of increasing period. Given <it>Pi</it>, &#8704;<it>i </it>&#8712; {<it>X</it>}, the following routine is executed:</p>
               <p>(1) Define <it>Xtra </it>as <it>min</it>(2<it>Pi </it>- 1, <it>Pi </it>+ <it>min</it>(<it>&#963;</it>, <it>g </it>+ (1 - <it>i</it>)&#183;<it>Pi</it>), <it>max</it>(<it>Pj</it>) &#8704;<it>j </it>&#8712; {<it>X</it>}), where by default, <it>&#963; </it>= 50. <it>Xtra </it>dictates the breadth of periods from which to draw TRs for merging. The conditions restricting <it>Xtra </it>were chosen to avoid messy and insensible TR domain characterizations as well as to maintain practical running time.</p>
               <p>(2) Let TR set {<it>B</it>} = <it>Xj</it>, &#8704;<it>j </it>&#8712; {<it>X</it>}, where <it>i </it>&#8800; <it>j </it>and <it>Pi </it>&#8804; <it>Pj </it>&#8804; <it>Xtra</it>. Set {<it>X</it>} &#8592; ({<it>X</it>} - {<it>B</it>}). Note: from step (3) to step (10), we only refer to TRs from {<it>B</it>}.</p>
               <p>(3) Sort {<it>B</it>} in increasing order of <it>XjS</it>, &#8704;<it>j </it>&#8712; {<it>B</it>}.</p>
               <p>(4) Starting with <it>m </it>= 0, we examine <it>Xm </it>and <it>Xn</it>, &#8704;<it>m</it>, <it>n </it>&#8712; {<it>B</it>}, where <it>n </it>= <it>m </it>+ 1</p>
               <p>(5) Let <it>Q </it>denote the maximum allowable sequence space between two combinable TRs, and set <it>Q </it>= <it>min</it>(<it>&#956;</it><sub>1</sub>, <it>&#956;</it><sub>2</sub>&#183;<it>Pm</it>)&#183;<it>Pm</it>. By default, <it>&#956;</it><sub>1 </sub>= 10 and <it>&#956;</it><sub>2 </sub>= 0.25.</p>
               <p>(6) <b>if </b>|<it>XmSE </it>&#8745; <it>XnSE</it>| &#8800; &#216; <b>or </b>0 &lt; (<it>XnS </it>- <it>XmE</it>) &#8804; <it>Q</it>, compute similarity <it>s </it>of <it>Cm </it>and <it>Cn </it>using the consensus comparison function (refer to consensus comparison section).</p>
               <p><b>else </b>go to step (11).</p>
               <p>(7) <b>if </b><it>s </it>&#8805; <it>i</it>, merge <it>Xm </it>and <it>Xn</it>.</p>
               <p><b>else </b>go to step (11).</p>
               <p>(8) <b>if </b>|<it>XmSE </it>&#8745; <it>XnSE</it>| &#8800; &#216;, perform the following procedure: From step (6) we obtained the index <it>CnP </it>(refer to consensus comparison section) corresponding to the best cyclical permutation of <it>Cn </it>when aligned to <it>Cm</it>. We repartition <it>Xn </it>by slicing its alignment vertically at <it>CnP</it>, thus ensuring <it>Xn </it>is in phase with <it>Xm </it>before consolidation. We then merge <it>Xm </it>and <it>Xn</it>, forming <it>Xmn </it>= (<it>Xm </it>&#8746; <it>Xn </it>- <it>Xm </it>&#8745; <it>Xn</it>). Go to step (10).</p>
               <p>(9) <b>if </b>|<it>XmSE </it>&#8745; <it>XnSE</it>| = &#216;, perform the same procedure as in step (8) with the exception that the sequence space between <it>Xm </it>and <it>Xn </it>must be incorporated into <it>Xmn</it>:</p>
               <p>i) Let <it>z </it>equal the index in <it>S </it>that corresponds to the character in <it>Xn</it>[0] that is in the same alignment column as <it>CnP</it>. Let sequence <it>k </it>= <it>S</it>[<it>XmE </it>+ 1, <it>z </it>- 1].</p>
               <p>ii) Add <it>Xm </it>in its original form to <it>Xmn</it>.</p>
               <p>iii) Tile <it>k </it>in accordance with <it>Cm</it>. To do this, cut <it>k </it>into as many consecutive fragments {<it>f</it>} of length <it>Pm </it>as possible. Start cutting <it>k </it>from the end with the lowest index.</p>
               <p>iv) Given <it>fi</it>, &#8704;<it>i </it>&#8712; {<it>f</it>} (tile fragments in order of increasing indices in <it>k</it>),</p>
               <p indent="1"><b>if </b>|<it>fi</it>| = <it>Pm</it>, use the consensus comparison module to compute similarity <it>s </it>of <it>fi </it>and <it>Cm</it>.</p>
               <p indent="2"><b>if </b><it>s </it>&lt; <it>&#951;</it>, where <it>&#951; </it>&lt; <it>i </it>and <it>&#951; </it>= 0.5 by default, replace all characters in <it>fi </it>that do not match <it>Cm </it>with 'x' and add <it>fi </it>to <it>Xmn</it>.</p>
               <p indent="2"><b>else </b>cut <it>fi </it>at the index corresponding to its best cyclical permutation, resulting in <it>fi</it><sub>1 </sub>and <it>fi</it><sub>2</sub>.</p>
               <p indent="3"><b>if </b>(|<it>Xmn</it>[<it>max</it>(<it>j</it>)]| + |<it>fi</it><sub>1</sub>|) &#8804; (<it>g </it>+ <it>Pm</it>), append <it>fi</it><sub>1 </sub>to <it>Xmn</it>'s last row.</p>
               <p indent="3"><b>else </b><it>fi</it><sub>1 </sub>becomes a new row in <it>Xmn</it>.</p>
               <p indent="3">Regardless of what happens to <it>fi</it><sub>1</sub>, since <it>fi</it><sub>2 </sub>is in phase with <it>Cm</it>, <it>fi</it><sub>2 </sub>becomes a new row in <it>Xmn</it>.</p>
               <p indent="1"><b>else if </b>|<it>fi</it>| &lt;<it>Pm</it>, add <it>fi </it>to <it>Xmn </it>in the same manner as <it>fi</it><sub>1 </sub>(above).</p>
               <p>v) Following the incorporation of <it>k</it>, add <it>Xn </it>to <it>Xmn </it>in the same way as in step (8).</p>
               <p>(10) Remove all gap characters from <it>Xmn</it>, perform multiple alignment on <it>Xmn </it>(without parsing) and derive the consensus. We do not include the 'x' character (see (9 <it>iv</it>)) in the calculations of <it>Emn</it>, <it>CEmn </it>and <it>Imn</it>. <b>if </b><it>Xmn </it>meets TR retention criteria, set <it>Xm </it>&#8592; <it>Xmn </it>and {<it>B</it>} &#8592; {<it>B</it>} - <it>Xn</it>.</p>
               <p>(11) <b>if </b><it>m </it>&lt; |<it>B</it>| - 2, increment <it>m </it>by [0 if merging successful; 1 otherwise] and go to step (4).</p>
               <p><b>else </b>set {<it>X</it>} &#8592; {<it>X</it>} &#8746; {<it>B</it>}.</p>
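               <p>As an illustration only, the two bounds from steps (1) and (5) can be written as small Python functions. The function and parameter names are hypothetical and not part of XSTREAM. The symbol i inside Xtra is read here as the minimum consensus match fraction, which is an assumption, so it is passed explicitly as min_identity rather than given a default.</p>

```python
def merge_period_ceiling(p_i, max_period, min_identity, g=3, sigma=50):
    """Xtra = min(2*Pi - 1, Pi + min(sigma, g + (1 - i)*Pi), max(Pj)).

    min_identity stands in for the symbol i (assumed to be a fraction
    between 0 and 1); max_period is the largest period present in {X}.
    """
    slack = min(sigma, g + (1 - min_identity) * p_i)
    return min(2 * p_i - 1, p_i + slack, max_period)


def max_merge_gap(p_m, mu1=10, mu2=0.25):
    """Q = min(mu1, mu2*Pm) * Pm, the largest allowed space between two TRs."""
    return min(mu1, mu2 * p_m) * p_m
```

               <p>For example, with <it>Pm </it>= 8 the sketch gives <it>Q </it>= 16, and with <it>Pm </it>= 100 the cap of <it>&#956;</it><sub>1 </sub>applies, giving <it>Q </it>= 1000.</p>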
            </sec>
            <sec>
               <st>
                  <p>Finishing Touches</p>
               </st>
               <p>The following TR domain refinement procedures are invoked in the order presented:</p>
               <p>(1) Maximality &#8211; Rerun the maximality function on each <it>Xi</it>, but set <it>Ri </it>&#8592; <it>Ci</it>. We invoke maximality again because using <it>Ci </it>as a reference copy may allow for additional expansion of <it>Xi</it>.</p>
               <p>(2) Realignment &#8211; For each TR in {<it>X</it>}, make a copy of <it>Xi</it>, denoted <it>Xi</it>*, and perform multiple alignment on <it>Xi</it>* using <it>Ci </it>as the center sequence. <it>Ci </it>is not included in the final alignment of <it>Xi</it>*. If <it>CEi</it>* &lt; <it>CEi</it>, we set <it>Xi </it>&#8592; <it>Xi</it>*.</p>
               <p>(3) Reducibility &#8211; Rerun redundancy elimination procedure II (see below) on every realigned TR in {<it>X</it>}.</p>
               <p>(4) Overlap Removal &#8211; If allowed by user, send {<it>X</it>} to redundancy elimination algorithm III (see below).</p>
               <p>(5) Nesting &#8211; By default, send {<it>X</it>} to nesting procedure (see below).</p>
            </sec>
            <sec>
               <st>
                  <p>Consensus Comparison</p>
               </st>
               <p>For clustering different TRs, we compare their consensus sequences. To compare consensus sequences effectively, we take into account TR phase variation &#8211; the same TR can have different starting points, leading to consensus sequences of different phases. More formally, every irreducible <it>Xi </it>can occur in <it>Pi </it>cyclical permutations, and if a given TR <it>Xj </it>has <it>Ej </it>&#8805; 2 + (<it>Pj </it>- 1)/<it>Pj</it>, then <it>Xj </it>has <it>Pj </it>valid consensus sequence phases. Therefore, we must evaluate up to <it>Pi </it>consensus alignments for every pair of TRs with period <it>Pi </it>that also satisfies the same copy number condition as <it>Xj</it>. For simplification, we treat all TRs the same, regardless of copy number. Given a pair of TRs, <it>Xi </it>and <it>Xj </it>where <it>Pi </it>= <it>Pj</it>, XSTREAM fixes <it>Ci </it>and aligns as many phases of <it>Cj </it>to <it>Ci </it>as are needed to establish similarity. If <it>Consensus Comparison </it>is called from the merging procedure, all phases of <it>Cj </it>are aligned to <it>Ci </it>to locate the best-aligned cyclical permutation. The leftmost character in the highest scoring phase of <it>Cj</it>, denoted by <it>CjP</it>, is used during TR merging. Otherwise, only sufficient similarity is needed, and thus XSTREAM may align fewer than all phases of <it>Cj</it>. If <it>g </it>> 0, all gaps are removed from <it>Ci </it>and <it>Cj </it>prior to alignment. All alignments of <it>Ci </it>and <it>Cj </it>are computed using GRDP. For each alignment, XSTREAM counts the number of matching characters and stores the highest match count so far in <it>N</it>. If <it>N</it>/<it>Pi </it>&#8805; <it>i</it>, XSTREAM groups <it>Xi </it>and <it>Xj</it>. The time complexity of comparing <it>Ci </it>and <it>Cj </it>is O(<it>Pi</it><sup>2</sup>) because there are up to <it>Pi </it>alignments, each taking O(<it>Pi</it>) time. For every newly established TR group, the consensus sequence with the lowest index in {<it>X</it>} becomes the group head or referential consensus, and is used for all subsequent comparisons. The time complexity for performing all consensus comparisons of the same period without considering alignment time is O(|<it>X</it>|<sup>2</sup>). Therefore, the total time complexity of <it>Consensus Comparison </it>is O(|<it>X</it>|<sup>2</sup><it>Pi</it><sup>2</sup>).</p>
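               <p>The phase search can be sketched in Python under a simplifying assumption: the consensus sequences are compared position by position without gaps (the <it>g </it>= 0 case), rather than with GRDP. The names best_phase, similar and min_identity are hypothetical. The grouping test follows the same "similarity at least <it>i</it>" convention as merging step (7).</p>

```python
def best_phase(ci, cj):
    """Return (phase, matches) for the cyclic rotation of cj closest to ci.

    phase plays the role of CjP: the index in cj of the character that
    becomes leftmost in the best-matching rotation. Ungapped comparison
    is a simplification; XSTREAM is described as using GRDP here.
    """
    if len(ci) != len(cj) or len(ci) == 0:
        raise ValueError("consensus sequences must share one nonzero period")
    best = (0, -1)
    for phase in range(len(cj)):
        rotated = cj[phase:] + cj[:phase]
        matches = sum(a == b for a, b in zip(ci, rotated))
        if matches > best[1]:
            best = (phase, matches)
    return best


def similar(ci, cj, min_identity):
    """Group two TRs when N / Pi reaches the identity threshold."""
    _, n = best_phase(ci, cj)
    return n / len(ci) >= min_identity
```

               <p>Each rotation costs O(<it>Pi</it>) and there are <it>Pi </it>rotations, which reproduces the O(<it>Pi</it><sup>2</sup>) cost of one pairwise comparison stated above.</p>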
            </sec>
         </sec>
         <sec>
            <st>
               <p>Gap-Restricted Dynamic Programming</p>
            </st>
            <p>A major obstacle to efficient alignment of gapped TRs is dynamic programming (DP), which, for global pairwise sequence alignment, has time complexity O(<it>n</it><sup>2</sup>), where <it>n </it>= TR period. Because optimal alignment of TR copies can be slow for long periods, we explored heuristic options. We decided to implement a non-optimal variant of pairwise global sequence alignment DP, which we call gap-restricted DP (GRDP). GRDP requires a user-modifiable parameter, <it>g</it>, which governs the maximum number of consecutive gaps that can be used during GRDP pairwise alignment. Because of <it>g</it>, the maximum traceable width of the DP matrix is held constant for all periods, is equal to 2<it>g </it>+ 1, and is symmetrically distributed with respect to the main diagonal. As a result, for a fixed <it>g</it>, GRDP has space complexity &#920;(<it>n</it>) and time complexity &#920;(<it>n</it>), so running time grows linearly with period. The following recursion describes GRDP:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-382-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>s</m:mi>
                           <m:mi>c</m:mi>
                           <m:mi>o</m:mi>
                           <m:mi>r</m:mi>
                           <m:mi>e</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>i</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>j</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mi>max</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mtable columnalign="left">
                                    <m:mtr columnalign="left">
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:mi>s</m:mi>
                                             <m:mi>c</m:mi>
                                             <m:mi>o</m:mi>
                                             <m:mi>r</m:mi>
                                             <m:mi>e</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>i</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo>,</m:mo>
                                             <m:mi>j</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mi>g</m:mi>
                                             <m:mi>a</m:mi>
                                             <m:mi>p</m:mi>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr columnalign="left">
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:mi>s</m:mi>
                                             <m:mi>c</m:mi>
                                             <m:mi>o</m:mi>
                                             <m:mi>r</m:mi>
                                             <m:mi>e</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>i</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mi>j</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mi>&#952;</m:mi>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr columnalign="left">
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:mi>s</m:mi>
                                             <m:mi>c</m:mi>
                                             <m:mi>o</m:mi>
                                             <m:mi>r</m:mi>
                                             <m:mi>e</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>i</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo>,</m:mo>
                                             <m:mi>j</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mi>g</m:mi>
                                             <m:mi>a</m:mi>
                                             <m:mi>p</m:mi>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                 </m:mtable>
                              </m:mrow>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGZbWCcqWGJbWycqWGVbWBcqWGYbGCcqWGLbqzcqGGOaakcqWGPbqAcqGGSaalcqWGQbGAcqGGPaqkcqGH9aqpcyGGTbqBcqGGHbqycqGG4baEdaGabaqaauaabaqadeaaaeaacqWGZbWCcqWGJbWycqWGVbWBcqWGYbGCcqWGLbqzcqGGOaakcqWGPbqAcqGHsislcqaIXaqmcqGGSaalcqWGQbGAcqGGPaqkcqGHRaWkcqWGNbWzcqWGHbqycqWGWbaCaeaacqWGZbWCcqWGJbWycqWGVbWBcqWGYbGCcqWGLbqzcqGGOaakcqWGPbqAcqGGSaalcqWGQbGAcqGHsislcqaIXaqmcqGGPaqkcqGHRaWkiiGacqWF4oqCaeaacqWGZbWCcqWGJbWycqWGVbWBcqWGYbGCcqWGLbqzcqGGOaakcqWGPbqAcqGHRaWkcqaIXaqmcqGGSaalcqWGQbGAcqGHsislcqaIXaqmcqGGPaqkcqGHRaWkcqWGNbWzcqWGHbqycqWGWbaCaaaacaGL7baaaaa@779E@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Note that depending on <it>g</it>, we place constraints on where each score possibility can be computed. The parameters <it>gap </it>and <it>&#952; </it>denote gap penalty and match/mismatch values respectively. By default, all DP procedures use values <it>gap </it>= -4, <it>mismatch </it>= -4, and <it>match </it>= 2. An example of GRDP alignment is shown in Figure <figr fid="F8">8</figr>. Also, note that if <it>g </it>= 0, XSTREAM completely disallows gaps, and thus the decision to allow insertions/deletions (indels) is left up to the user. By default, <it>g </it>= 3. In addition, our implementation of GRDP requires input sequences of equal length. In cases where input sequences have different lengths, both sequences are made the same length by appending gap characters to the shorter sequence. Any columns with two gaps in the resulting pairwise alignment are removed. Since standard DP is practical in many situations, several functions of XSTREAM toggle GRDP on and off depending on projections of time complexity. GRDP is used in four major functions: TR domain expansion, TR parsing, multiple alignment, and consensus comparison.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Sequence Alignment using GRDP</p>
               </caption>
               <text>
                  <p><b>Sequence Alignment using GRDP</b>. The matrix on the left represents GRDP sequence alignment of sequences 'ATTCGA' and 'ATCGAT' with <it>g </it>= 2 and space complexity O(<it>n</it><sup>2</sup>). Since <it>g </it>places an upper bound on traceable matrix width, we only use O(<it>n</it>) space, as shown with the matrix on the right. Notice that because the width of the matrix on the right is 2<it>g </it>+ 1, it accommodates all of the relevant information from the matrix on the left. The resulting pairwise alignment is also shown.</p>
               </text>
               <graphic file="1471-2105-8-382-8"/>
            </fig>
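            <p>A minimal Python sketch of a banded global alignment score in the spirit of GRDP follows. It is not XSTREAM's implementation. It assumes that restricting the path to cells within <it>g </it>of the main diagonal captures the gap restriction, stores each row in an array of width 2<it>g </it>+ 1, and returns only the score, not the alignment. The name grdp_score and its signature are hypothetical.</p>

```python
def grdp_score(a, b, g=3, gap=-4, mismatch=-4, match=2):
    """Score a global alignment of equal-length a and b inside a band.

    Cell (i, j) of the full DP matrix is stored at offset k = j - i + g,
    so each row needs only 2*g + 1 slots (linear total work for fixed g).
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    n, width = len(a), 2 * g + 1
    neg = float("-inf")
    prev = [neg] * width
    for k in range(g, width):          # row 0: leading gaps only
        j = k - g
        if n >= j:
            prev[k] = j * gap
    for i in range(1, n + 1):
        cur = [neg] * width
        for k in range(width):
            j = i + k - g
            if j > n or 0 > j:
                continue                # outside the matrix
            if j == 0:
                cur[k] = i * gap        # column 0: leading gaps only
                continue
            theta = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[k] + theta      # diagonal move keeps the offset
            if width > k + 1:
                best = max(best, prev[k + 1] + gap)   # gap move from row i - 1
            if k > 0:
                best = max(best, cur[k - 1] + gap)    # gap move within row i
            cur[k] = best
        prev = cur
    return prev[g]                      # cell (n, n) sits on the main diagonal
```

            <p>With the Figure 8 sequences 'ATTCGA' and 'ATCGAT' and the default scores listed above, this sketch returns 2 for <it>g </it>= 2 (five matches and two gaps) and -12 for <it>g </it>= 0 (two matches and four mismatches).</p>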
         </sec>
         <sec>
            <st>
               <p>Redundancy Elimination</p>
            </st>
            <p>XSTREAM implements three strategies to eliminate two types of TR redundancy &#8211; reducible TR periods and TR domain overlap.</p>
            <p>I) XSTREAM searches for TRs in order of increasing period. As TRs are found, their corresponding sequence space is flagged, preventing further searching in processed sequence regions. This tactic combats both kinds of redundancy and reduces running time.</p>
            <p>II) To combat reducible TR periods, XSTREAM is rerun on the consensus sequence of each TR domain from the input sequence (see Figure <figr fid="F1">1</figr>). If the consensus sequence <it>Ci </it>of <it>Xi </it>contains a TR domain <it>xi </it>that spans <it>Ci</it>'s entire length, XSTREAM repartitions <it>Xi </it>using the consensus of <it>xi</it>, resulting in <it>Xi</it>*, whose period is an even divisor of <it>Xi</it>'s period. <it>Xi</it>* is retained and <it>Xi </it>erased (<it>Xi</it>*&#8594;<it>Xi</it>) if <it>Xi</it>* passes the user-adjustable TR filtration criteria.</p>
            <p>III) The following redundancy elimination method, invoked by default, functions to remove TR domain overlap. The user can control the execution and parameters of this method because it may not always be desirable to remove TR domain overlap and because we are convinced that the amount of reasonable overlap among TR domains is an arbitrary matter. We now state the rules that determine whether for a given TR pair <it>Xi </it>and <it>Xj</it>, XSTREAM deletes one or neither. The rules are enforced in the order they are presented; i.e. rule set (i) must fail to move to rule set (ii) and so on. Let <it>I </it>= |<it>XiSE </it>&#8745; <it>XjSE| </it>(length of intersection of TR domains <it>i</it>, <it>j</it>). By default, <it>&#945; </it>= .9, <it>&#946; </it>= .75, <it>&#947; </it>= .9, and <it>&#948; </it>= .6.</p>
            <p>i) <b>if </b>(<it>&#945;</it>&#183;|<it>Xi</it>|) &#8804; <it>I </it>&#8804; |<it>Xi</it>| <b>and </b>|<it>Xi</it>| &#8804; |<it>Xj</it>|</p>
            <p>&#160;&#160;&#160;<b>if </b>|<it>Xi</it>| &lt; (<it>&#946;</it>&#183;|<it>Xj</it>|)</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;delete <it>Xi</it></p>
            <p>&#160;&#160;&#160;<b>else if </b>|<it>Xi</it>| &lt; (<it>&#947;</it>&#183;<it>|Xj</it>|) <b>and </b>(<it>Ei </it>&lt;<it>Ej </it><b>or </b><it>CEi </it>> <it>CEj</it>)</p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;delete <it>Xi</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;<b>else if </b>|<it>Xi</it>| &#8805; (<it>&#947;</it>&#183;<it>|Xj</it>|) <b>and </b><it>Ei </it>&lt;<it>Ej</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;delete <it>Xi</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<b>else </b>delete <it>Xj</it></p>
            <p>ii) Same as (i) but swap <it>i </it>and <it>j</it></p>
            <p>iii) <b>if </b><it>I </it>&#8805; (<it>&#948;</it>&#183;<it>max</it>(|<it>Xi</it>|, |<it>Xj</it>|))</p>
            <p>&#160;&#160;&#160;<b>if </b><it>CEi </it>&#8805; <it>CEj </it><b>and </b><it>Ei </it>&#8804; <it>Ej</it></p>
            <p>&#160;&#160;&#160;&#160;&#160;&#160;delete <it>Xi</it></p>
            <p>&#160;&#160;&#160;<b>else </b>delete <it>Xj</it></p>
            <p>iv) <b>if </b><it>I </it>&#8805; (<it>&#948;</it>&#183;<it>min</it>(|<it>Xi</it>|, |<it>Xj</it>|))</p>
            <p>&#160;&#160;&#160;delete <it>min</it>(|<it>Xi</it>|, |<it>Xj</it>|)</p>
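            <p>The rule sets above can be sketched in Python as a single decision function. This is one possible reading of the nested conditions, not XSTREAM's code. The class TR, its field names and the function names are hypothetical. <it>E </it>is interpreted as copy number and <it>CE </it>as consensus error, both inferred from their use elsewhere in the text, and for equal lengths rule (iv) is assumed to delete the first TR.</p>

```python
from dataclasses import dataclass


@dataclass
class TR:
    start: int       # first index of the TR domain in S (inclusive)
    end: int         # last index of the TR domain in S (inclusive)
    copies: float    # E, read here as copy number (assumption)
    cons_err: float  # CE, read here as consensus error (assumption)

    @property
    def length(self):
        return self.end - self.start + 1


def intersection(a, b):
    """I: number of sequence positions shared by two TR domains."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def _rule_contained(xi, xj, inter, alpha, beta, gamma):
    """Rule set (i): xi lies almost entirely inside the longer xj."""
    li, lj = xi.length, xj.length
    if not (inter >= alpha * li and li >= inter and lj >= li):
        return None
    if beta * lj > li:
        return xi
    if gamma * lj > li and (xj.copies > xi.copies or xi.cons_err > xj.cons_err):
        return xi
    if li >= gamma * lj and xj.copies > xi.copies:
        return xi
    return xj


def tr_to_delete(xi, xj, alpha=0.9, beta=0.75, gamma=0.9, delta=0.6):
    """Return the TR to delete, or None when both are kept."""
    inter = intersection(xi, xj)
    for a, b in ((xi, xj), (xj, xi)):                 # rule sets (i), (ii)
        victim = _rule_contained(a, b, inter, alpha, beta, gamma)
        if victim is not None:
            return victim
    if inter >= delta * max(xi.length, xj.length):    # rule set (iii)
        if xi.cons_err >= xj.cons_err and xj.copies >= xi.copies:
            return xi
        return xj
    if inter >= delta * min(xi.length, xj.length):    # rule set (iv)
        return xi if xj.length >= xi.length else xj
    return None
```

            <p>Two TR domains that do not intersect have <it>I </it>= 0, so every rule fails and the sketch returns None, keeping both.</p>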
         </sec>
         <sec>
            <st>
               <p>Two-Stage TR Detection</p>
            </st>
            <p>As shown in Table <tblr tid="T1">1</tblr>, XSTREAM allows the user to restrict the TR period range. If <it>MinP </it>&lt; <it>T </it>and <it>MaxP </it>&#8805; <it>T</it>, TR detection proceeds in two phases, where phase I examines periods &#8805; <it>T</it>, and phase II examines periods &lt; <it>T</it>. By default, <it>T </it>= 10. This procedure reduces the frequency of inconsistent results. We now describe our reasoning.</p>
            <p>As mentioned in Redundancy Elimination I, TRs are identified in order of increasing period and sequence space is masked for every successful seed extension. Because of these two facts and because the value of <it>MinP </it>can be altered, it is possible to differentially characterize the same TR domain <it>Xi</it>, or perhaps miss <it>Xi </it>altogether, for the case <it>Pi </it>&#8805; <it>max</it>[all tested <it>MinP </it>values]. This problem can occur because as XSTREAM moves up the period ladder toward <it>Pi</it>, different stretches of sequence space may be removed in and around <it>Xi </it>for different values of <it>MinP</it>. We determined empirically that by first scanning upward from a short period, such as 10, we could greatly mitigate this problem. To illustrate, see Figure <figr fid="F2">2</figr> for an example of a TR domain containing many short period TRs. Without Two-Stage TR Detection, this period 152 TR domain would not be reported since most of its sequence space would be masked by its constituent TRs.</p>
            <p>Following completion of phase I, all masked sequence space is reset to unused, thereby allowing shorter period TRs to be found independently of longer period TRs. Redundancy removal strategies II and III are invoked later and will remove any redundancy caused by XSTREAM's two-stage TR detection procedure.</p>
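            <p>The scan order implied by this procedure can be sketched as follows. The function name period_schedule is hypothetical, and phase I is read as covering every period from <it>T </it>up to <it>MaxP</it>, as the reasoning above suggests. Masking itself is not modelled; the sketch only returns the order in which periods would be visited.</p>

```python
def period_schedule(min_p, max_p, t=10):
    """Return a list of phases, each a list of periods in scan order.

    When the requested range straddles t, periods from t upward are
    scanned first; the masked sequence space would then be reset before
    the shorter periods are scanned in a second phase.
    """
    if t > min_p and max_p >= t:
        phase_one = list(range(t, max_p + 1))
        phase_two = list(range(min_p, t))
        return [phase_one, phase_two]
    return [list(range(min_p, max_p + 1))]
```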
         </sec>
         <sec>
            <st>
               <p>Long Period TR Filter</p>
            </st>
            <p>To ensure pragmatic running time for all possible periods, we implemented a heuristic that governs seed extension for periods greater than or equal to 1000 characters. If |<it>S| </it>&#8805; 2000, during seed detection, an additional hashcode array <it>M</it>* is kept, which stores hashcodes and sequence positions for seeds of maximum length <it>L</it>*, which by default is 7. Then, for every pair of seeds with distance &#8805; 1000, XSTREAM initially invokes a filtration step, which jumps across <it>M</it>* a user-defined number of times <it>t </it>and looks for matching hashcodes. This method is identical to seed extension as described earlier, except that <it>S </it>is not used and <it>x </it>is incremented by <it>floor</it>(<it>d</it>/<it>t</it>) after each hashcode comparison. Thus, if <it>g </it>> 0, CW can be invoked. XSTREAM runs standard seed extension and TR domain expansion (using <it>M</it>* and <it>L</it>*) on periods &#8805; 1000 if and only if <it>t</it>* matches are recorded during the filtering phase, where <it>t</it>* = <it>t</it>/3. Therefore, seed pairs with distances &#8805; 1000 are subjected to a quick and preliminary filter, which although imperfect, drastically reduces running time for input sequences on the chromosome size scale. By default, <it>t </it>= 20.</p>
         </sec>
         <sec>
            <st>
               <p>Nesting</p>
            </st>
            <p>Within each TR consensus sequence, XSTREAM searches for nested TRs &#8211; TRs that occur within TRs. This is a novel feature in the domain of protein analysis software and may provide important information about primary sequence architectures and peptide TR evolution. For a given <it>Xi</it>, we define a nested TR as a TR present in <it>Ci </it>that does not span <it>Ci</it>'s entire length. Since TR degeneracy can complicate identifying nested structures, XSTREAM only looks for nested TRs in consensus sequences. Our procedure detects nested TRs of unlimited nesting depth, with no gaps and no mismatches. This algorithm employs a top-down approach to locating TRs, as opposed to the bottom-up method used by XSTREAM for TR detection in the input sequence. A top-down approach is useful for nested TRs because it identifies the longest period TR first, then in a recursive manner, restarts the algorithm within that TR, and continually digs deeper until no more TRs can be found. By working off the greedy assumption that the longest period TRs are the best candidates for nesting, we avoid issues of TR overlap inherent in the bottom-up strategy. The main drawback to our nesting method is its time complexity, which is O(<it>n</it><sup>3</sup>), where <it>n </it>= <it>Pi</it>, due to the worst-case scenario of comparing subsequences of all possible sizes in all possible sequence regions. We therefore restrict this method to TRs from {<it>X</it>} with periods &#8804; 1000 and only find nested TRs with periods &#8804; 300. We set the minimum nested TR period at 1 for proteins and 2 for nucleotide sequences.</p>
         </sec>
         <sec>
            <st>
               <p>Divide and Conquer</p>
            </st>
            <p>XSTREAM implements a user-adjustable divide and conquer procedure to reduce memory consumption. If enabled, the input sequence is segmented into overlapping fragments of length <it>l </it>prior to TR detection. The last fragment of the input sequence may be of length &lt; <it>l</it>. Overlapping regions have length <it>l</it>*, which is equivalent to the maximum detectable TR period. After all fragments are processed, the set of identified TRs is directed to the merging procedure, which functions to both extend TRs across fragment boundaries and consolidate overlapping regions. By default, <it>l </it>= 100,000 and <it>l</it>* = 10,000.</p>
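            <p>The segmentation step can be sketched in Python as follows. The function name fragment_bounds is hypothetical, and the half-open (start, end) convention is an assumption made for the illustration; only the fragment length <it>l </it>and the overlap <it>l</it>* come from the text.</p>

```python
def fragment_bounds(n, length=100_000, overlap=10_000):
    """Return half-open (start, end) bounds of overlapping fragments.

    Consecutive fragments share `overlap` characters; the last fragment
    may be shorter than `length`.
    """
    if 0 > overlap or overlap >= length:
        raise ValueError("overlap must be non-negative and below length")
    if n == 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + length, n)
        bounds.append((start, end))
        if end == n:
            break
        start = end - overlap   # step back so the next fragment overlaps
    return bounds
```

            <p>For a sequence of 25 characters with <it>l </it>= 10 and <it>l</it>* = 3, the sketch yields the fragments (0, 10), (7, 17), (14, 24) and (21, 25).</p>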
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Support was provided by a Biotechnology Training grant from UC Discovery, and seed funds from the Dean of the Division of Mathematical, Life and Physical Sciences at UCSB. We acknowledge the assistance of Julian Peeters and Roseanne Krauter for genome data downloading and early testing, Gregory Peters for development of the web interface, David Newman for the use of Enterprise Architect version 4.10.739, and Stephen Poole, Terrence Smith, and Arnab Bhattacharya for critically reading the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>An algorithm for approximate tandem repeats</p>
            </title>
            <aug>
               <au>
                  <snm>Landau</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Schmidt</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Sokol</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J Comp Biol</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <fpage>1</fpage>
            <lpage>18</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1089/106652701300099038</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Tandem repeats over the edit distance</p>
            </title>
            <aug>
               <au>
                  <snm>Sokol</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tojeira</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>E30</fpage>
            <lpage>E35</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl309</pubid>
                  <pubid idtype="pmpid" link="fulltext">17237101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Fourteen and counting: unraveling trinucleotide repeat diseases</p>
            </title>
            <aug>
               <au>
                  <snm>Cummings</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Zoghbi</snm>
                  <fnm>HY</fnm>
               </au>
            </aug>
            <source>Hum Molec Genet</source>
            <pubdate>2000</pubdate>
            <volume>9</volume>
            <fpage>909</fpage>
            <lpage>916</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/hmg/9.6.909</pubid>
                  <pubid idtype="pmpid" link="fulltext">10767314</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Complex recombination events at the hypermutable minisatellite CEB1 (D2S90)</p>
            </title>
            <aug>
               <au>
                  <snm>Buard</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vergnaud</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>The EMBO J</source>
            <pubdate>1994</pubdate>
            <volume>13</volume>
            <fpage>3203</fpage>
            <lpage>3210</lpage>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Intragenic tandem repeats generate functional variability</p>
            </title>
            <aug>
               <au>
                  <snm>Verstrepen</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Jansen</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lewitter</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Fink</snm>
                  <fnm>GR</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2005</pubdate>
            <volume>37</volume>
            <fpage>986</fpage>
            <lpage>990</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1462868</pubid>
                  <pubid idtype="pmpid" link="fulltext">16086015</pubid>
                  <pubid idtype="doi">10.1038/ng1618</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A Census of Protein Repeats</p>
            </title>
            <aug>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J Molec Biol</source>
            <pubdate>1999</pubdate>
            <volume>293</volume>
            <fpage>151</fpage>
            <lpage>160</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1006/jmbi.1999.3136</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Protein Repeats: Structures, Functions, and Evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Perez-Iratxeta</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
            </aug>
            <source>J Struc Biol</source>
            <pubdate>2001</pubdate>
            <volume>134</volume>
            <fpage>117</fpage>
            <lpage>131</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1006/jsbi.2001.4392</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Phylogenetic Differences in Content and Intensity of Periodic Proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Gatherer</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>McEwan</snm>
                  <fnm>NR</fnm>
               </au>
            </aug>
            <source>J Molec Evol</source>
            <pubdate>2005</pubdate>
            <volume>60</volume>
            <fpage>447</fpage>
            <lpage>461</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-004-0189-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">15883880</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <aug>
               <au>
                  <snm>Dickerson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Geis</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>The Structure and Action of Proteins</source>
            <publisher>Harper &amp; Row</publisher>
            <pubdate>1969</pubdate>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Molecular architecture and evolution of a modular spider silk protein gene</p>
            </title>
            <aug>
               <au>
                  <snm>Hayashi</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>287</volume>
            <fpage>1477</fpage>
            <lpage>1479</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.287.5457.1477</pubid>
                  <pubid idtype="pmpid" link="fulltext">10688794</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>The Extensins</p>
            </title>
            <aug>
               <au>
                  <snm>Tierney</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Varner</snm>
                  <fnm>JE</fnm>
               </au>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>1987</pubdate>
            <volume>84</volume>
            <fpage>1</fpage>
            <lpage>2</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1056515</pubid>
                  <pubid idtype="pmpid" link="fulltext">16665379</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Mussel Adhesive Plaque Protein Gene is a Novel Member of Epidermal Growth Factor-like Gene Family</p>
            </title>
            <aug>
               <au>
                  <snm>Inoue</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Takeuchi</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Miki</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Odo</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J Biol Chem</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <fpage>6698</fpage>
            <lpage>6701</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.270.12.6698</pubid>
                  <pubid idtype="pmpid" link="fulltext">7896812</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>A potential mediator of collagenous block copolymer gradients in mussel byssal threads</p>
            </title>
            <aug>
               <au>
                  <snm>Qin</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Waite</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>10517</fpage>
            <lpage>10522</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">27926</pubid>
                  <pubid idtype="pmpid" link="fulltext">9724735</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.18.10517</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Copper binding to the octarepeats of the prion protein. Affinity, specificity, folding, and co-operativity; insights from circular dichroism</p>
            </title>
            <aug>
               <au>
                  <snm>Garnett</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Viles</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>J Biol Chem</source>
            <pubdate>2003</pubdate>
            <volume>278</volume>
            <fpage>6795</fpage>
            <lpage>6802</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.M209280200</pubid>
                  <pubid idtype="pmpid" link="fulltext">12454014</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Global analysis of tandem aromatic octapeptide repeats: The significance of the aromatic-glycine motif</p>
            </title>
            <aug>
               <au>
                  <snm>Gazit</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>880</fpage>
            <lpage>883</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.6.880</pubid>
                  <pubid idtype="pmpid" link="fulltext">12075024</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>FG-Rich Repeats of Nuclear Pore Proteins Form a Three-Dimensional Meshwork with Hydrogel-Like Properties</p>
            </title>
            <aug>
               <au>
                  <snm>Frey</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Richter</snm>
                  <fnm>RP</fnm>
               </au>
               <au>
                  <snm>G&#246;rlich</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2006</pubdate>
            <volume>314</volume>
            <fpage>815</fpage>
            <lpage>817</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1132516</pubid>
                  <pubid idtype="pmpid" link="fulltext">17082456</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Cloning, characterization, and serodiagnostic evaluation of Leishmania infantum tandem repeat proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Goto</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Coler</snm>
                  <fnm>RN</fnm>
               </au>
               <au>
                  <snm>Guderian</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mohamath</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Reed</snm>
                  <fnm>SG</fnm>
               </au>
            </aug>
            <source>Infect Immun</source>
            <pubdate>2006</pubdate>
            <volume>74</volume>
            <fpage>3939</fpage>
            <lpage>3945</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1489730</pubid>
                  <pubid idtype="pmpid" link="fulltext">16790767</pubid>
                  <pubid idtype="doi">10.1128/IAI.00101-06</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Bioinformatic Identification of Tandem Repeat Antigens of the <it>Leishmania donovani</it> complex</p>
            </title>
            <aug>
               <au>
                  <snm>Goto</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Coler</snm>
                  <fnm>RN</fnm>
               </au>
               <au>
                  <snm>Reed</snm>
                  <fnm>SG</fnm>
               </au>
            </aug>
            <source>Infect Immun</source>
            <pubdate>2007</pubdate>
            <volume>75</volume>
            <fpage>846</fpage>
            <lpage>851</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1828517</pubid>
                  <pubid idtype="pmpid" link="fulltext">17088350</pubid>
                  <pubid idtype="doi">10.1128/IAI.01205-06</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Interspersed blocks of repetitive and charged amino acids in a dominant immunogen of <it>Plasmodium falciparum</it></p>
            </title>
            <aug>
               <au>
                  <snm>Stahl</snm>
                  <fnm>HD</fnm>
               </au>
               <au>
                  <snm>Crewther</snm>
                  <fnm>PE</fnm>
               </au>
               <au>
                  <snm>Anders</snm>
                  <fnm>RF</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>GV</fnm>
               </au>
               <au>
                  <snm>Coppel</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Bianco</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Mitchell</snm>
                  <fnm>GF</fnm>
               </au>
               <au>
                  <snm>Kemp</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1985</pubdate>
            <volume>82</volume>
            <fpage>543</fpage>
            <lpage>547</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">397076</pubid>
                  <pubid idtype="pmpid" link="fulltext">3881769</pubid>
                  <pubid idtype="doi">10.1073/pnas.82.2.543</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Zinc Fingers and Other Metal-binding Domains</p>
            </title>
            <aug>
               <au>
                  <snm>Berg</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Biol Chem</source>
            <pubdate>1990</pubdate>
            <volume>265</volume>
            <fpage>6513</fpage>
            <lpage>6516</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">2108957</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Simple sequence repeats in proteins and their significance for network evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2005</pubdate>
            <volume>345</volume>
            <fpage>113</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.gene.2004.11.023</pubid>
                  <pubid idtype="pmpid" link="fulltext">15716087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Classification of Proteins Based on Minimal Modular Repeats: Lessons from Nature in Protein Design</p>
            </title>
            <aug>
               <au>
                  <snm>Barney</snm>
                  <fnm>BM</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <fpage>473</fpage>
            <lpage>482</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr050103m</pubid>
                  <pubid idtype="pmpid" link="fulltext">16512661</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A Fast Algorithm for Genome-Wide Analysis of Proteins With Repeated Sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1999</pubdate>
            <volume>35</volume>
            <fpage>440</fpage>
            <lpage>446</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0134(19990601)35:4&lt;440::AID-PROT7>3.0.CO;2-Y</pubid>
                  <pubid idtype="pmpid" link="fulltext">10382671</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Rapid Automatic Detection and Alignment of Repeats in Protein Sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Heger</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Holm</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2000</pubdate>
            <volume>41</volume>
            <fpage>224</fpage>
            <lpage>237</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/1097-0134(20001101)41:2&lt;224::AID-PROT70>3.0.CO;2-Z</pubid>
                  <pubid idtype="pmpid" link="fulltext">10966575</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Tracking repeats using significance and transitivity</p>
            </title>
            <aug>
               <au>
                  <snm>Szklarczyk</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>i311</fpage>
            <lpage>i317</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth911</pubid>
                  <pubid idtype="pmpid" link="fulltext">15262814</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>ProtRepeatsDB: a database of amino acid repeats in genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Kalita</snm>
                  <fnm>MK</fnm>
               </au>
               <au>
                  <snm>Ramasamy</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Duraisamy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Chauhan</snm>
                  <fnm>VS</fnm>
               </au>
               <au>
                  <snm>Gupta</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>336</fpage>
            <lpage>347</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1538635</pubid>
                  <pubid idtype="pmpid" link="fulltext">16827924</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-336</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Amino acid repeat patterns in protein sequences: Their diversity and structural-functional implications</p>
            </title>
            <aug>
               <au>
                  <snm>Katti</snm>
                  <fnm>MV</fnm>
               </au>
               <au>
                  <snm>Sami-Subbu</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ranjekar</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Gupta</snm>
                  <fnm>VS</fnm>
               </au>
            </aug>
            <source>Protein Science</source>
            <pubdate>2000</pubdate>
            <volume>9</volume>
            <fpage>1203</fpage>
            <lpage>1209</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10892812</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>REPPER &#8211; repeats and their periodicities in fibrous proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Gruber</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>S&#246;ding</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lupas</snm>
                  <fnm>AN</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>W239</fpage>
            <lpage>W243</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1160166</pubid>
                  <pubid idtype="pmpid" link="fulltext">15980460</pubid>
                  <pubid idtype="doi">10.1093/nar/gki405</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Tandem repeats finder: a program to analyze DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>573</fpage>
            <lpage>580</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148217</pubid>
                  <pubid idtype="pmpid" link="fulltext">9862982</pubid>
                  <pubid idtype="doi">10.1093/nar/27.2.573</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Beyond tandem repeats: complex pattern structures and distant regions of similarity</p>
            </title>
            <aug>
               <au>
                  <snm>Hauth</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Joseph</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>s31</fpage>
            <lpage>s37</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Sequence Alignment with Tandem Duplication</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>1997</pubdate>
            <volume>4</volume>
            <fpage>351</fpage>
            <lpage>367</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9278065</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>XSTREAM Web Interface</p>
            </title>
            <url>http://jimcooperlab.mcdb.ucsb.edu/xstream</url>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Wellcome Trust Sanger Institute</p>
            </title>
            <url>http://www.sanger.ac.uk</url>
         </bibl>
      </refgrp>
   </bm>
</art>
