<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2008-9-1-r11</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Method</dochead>
      <bibl>
         <title>
            <p>CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Lo</snm>
               <fnm>Wei-Cheng</fnm>
               <insr iid="I1"/>
               <email>b861636@life.nthu.edu.tw</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Lyu</snm>
               <fnm>Ping-Chiang</fnm>
               <insr iid="I1"/>
               <email>pclyu@life.nthu.edu.tw</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu 30013, Taiwan</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>R11</fpage>
         <url>http://genomebiology.com/2008/9/1/R11</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18201387</pubid>
               <pubid idtype="doi">10.1186/gb-2008-9-1-r11</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>11</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>19</day>
               <month>11</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>18</day>
               <month>1</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>18</day>
               <month>01</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Lo et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>A circular permutation search engine</p>
      </shorttitle>
      <shortabs>
         <p>CPSARST (Circular Permutation Search Aided by Ramachandran Sequential Transformation) is an efficient database search tool that provides a new way for rapidly detecting novel relationships among proteins.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>Circular permutation of a protein can be visualized as if the original amino- and carboxyl termini were linked and new ones created elsewhere. It is well documented that circular permutants usually retain native structures and biological functions. Here we report CPSARST (Circular Permutation Search Aided by Ramachandran Sequential Transformation), an efficient database search tool for circular permutants. In this post-genomic era, when the amount of protein structural data is increasing exponentially, it provides a new way to rapidly detect novel relationships among proteins.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010001">Biochemistry and structural biology</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Circular permutation (CP) in a protein structure is the rearrangement of the amino acid sequence such that the amino- and carboxy-terminal regions are interchanged <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. It can be visualized as if the original termini of the polypeptide were linked and new ones created elsewhere <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Since the first observation of naturally occurring circular permutations in plant lectins <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, a substantial number of natural examples have been reported, including some bacterial &#946;-glucanases, swaposins, glucosyltransferases, &#946;-glucosidases, SLH domains, transaldolases, C2 domains (for a review, see <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>), FMN-binding proteins <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, double-&#968; &#946;-barrels <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, glutathione synthetases <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, DNA and other methyltransferases <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B10">10</abbr></abbrgrp>, ferredoxins <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, and proteinase inhibitors <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. In most cases, circular permutants (CPs) have conserved function or enzymatic activity <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B14">14</abbr></abbrgrp>, sometimes with increased functional diversity <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>.</p>
         <p>To reveal the influence of CP on the structure, function and folding mechanism of proteins, many artificial CPs have been generated, including trypsin inhibitor, anthranilate isomerase, dihydrofolate reductase, T4 lysozyme, ribonucleases, aspartate transcarbamoylase, the &#945;-spectrin SH3 domain, the <it>Escherichia coli </it>DsbA protein, ribosomal protein S6 and <it>Bacillus </it>&#946;-glucanase <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. These studies indicate that three-dimensional structure is remarkably insensitive to CP <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and that CPs generally retain their biological functions <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>, although the structural stabilities, the folding nuclei, transition states or pathways might be altered <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. Since CP generally preserves protein structure and function, sometimes with increased stability or activity, it has been applied to trigger crystallization <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, improve enzyme activities <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, determine critical elements <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>, and create novel fusion proteins, the tethered sites of which are not confined to the native termini <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>, such as the well-known fluorescent calcium sensor <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>.</p>
         <p>In spite of these interesting properties and applications, there is still much uncertainty about the genetic mechanisms, the evolutionary importance and the natural prevalence of CP <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B18">18</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. CPs can arise from posttranslational modifications <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B31">31</abbr></abbrgrp> but a majority may arise from genetic events <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. Several genetic and evolutionary mechanisms have been proposed, for instance, duplication/deletion models <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B32">32</abbr></abbrgrp>, duplication-by-permutation models <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B33">33</abbr></abbrgrp>, fusion/fission models <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B30">30</abbr></abbrgrp>, and plasmid-mediated 'cut and paste' <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. However, which mechanism plays the major role, and what proportion each contributes to the evolution of CPs and protein families, remain uncertain. Moreover, because definitions of CP differ between studies, conflicting conclusions have been reached. In general, previous studies that considered the whole protein as the unit that undergoes CP concluded that CP is rare in nature <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B14">14</abbr><abbr bid="B30">30</abbr></abbrgrp> while those viewing the domain as the unit that undergoes CP suggested CP to be frequent <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B29">29</abbr><abbr bid="B34">34</abbr></abbrgrp>.</p>
         <p>In this post-genomic era, the amount of protein structure data is increasing exponentially, and plenty of information should be extractable to reveal the natural prevalence and evolutionary mechanism of CP; however, CP search tools remain scarce. Traditional sequence comparison methods are linearly sequential in nature and therefore inefficient at identifying CP <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B35">35</abbr></abbrgrp>. Three-dimensional structural comparisons may identify more evolutionarily distant CPs <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>; nevertheless, conventional methods such as DALI <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> and CE <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> are also inefficient due to their sequential nature <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. To detect CP, the most exact approach is to use an algorithm that generates all possible CPs of one protein and subsequently aligns them with another protein to find an alignment better than the linear alignment <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B38">38</abbr></abbrgrp>, although this is very time-consuming. Several approaches have been developed to achieve higher efficiency. Uliel <it>et al</it>. <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B38">38</abbr></abbrgrp> proposed a heuristic method based on duplicating one of the two protein sequences followed by manual verifications. Although much faster, it still takes several CPU months to survey tens of thousands of sequences. The requirement for manual examination also makes it impractical for searching large datasets <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Weiner <it>et al</it>. 
<abbrgrp><abbr bid="B2">2</abbr></abbrgrp> condensed amino acid sequences into tiny domain strings to achieve an extremely high speed, scanning hundreds of thousands of sequences in hours; however, false negatives occur when suitable domain annotations are lacking or when a CP disrupts a domain. Structural alignment methods applicable to the identification of CPs have also been developed. For instance, Jung and Lee <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> developed SHEBA to screen the SCOP database. They suggested that CPs are very frequent and many have symmetric structures. However, since internal symmetry may introduce noise into the detection of CPs <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, some false positive predictions can be produced. Despite their ability to detect distantly related CPs, structure-based CP-detecting algorithms may take from seconds to minutes for a single pair-wise comparison <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, making routine database searches infeasible.</p>
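         <p>The exhaustive and duplication-based ideas described above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions: the two sequences are compared by exact string matching with no mutations, whereas the published methods align sequences that have diverged. The function names and the example sequence are hypothetical and not taken from any of the cited tools.</p>
         <p><![CDATA[
```python
def all_circular_permutants(seq):
    """Exhaustive enumeration: one permutant per possible new N-terminus."""
    return [seq[k:] + seq[:k] for k in range(len(seq))]


def find_rotation_offset(seq_a, seq_b):
    """Duplication trick: if seq_b is a circular permutant of seq_a,
    it occurs as a substring of seq_a + seq_a.

    Returns k such that seq_b == seq_a[k:] + seq_a[:k], or None.
    Exact matching only (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        return None
    if not seq_a:
        return 0
    offset = (seq_a + seq_a).find(seq_b)
    return offset if offset != -1 else None


# Hypothetical 10-residue example: new N-terminus at position 4.
parent = "MKTAYIAKQR"
permutant = parent[4:] + parent[:4]
offset = find_rotation_offset(parent, permutant)
```
]]></p>
         <p>The enumeration costs one comparison per residue of the parent, whereas the duplicated string needs a single search; this is the efficiency gain that duplication-based heuristics exploit.</p>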
         <sec>
            <st>
               <p>Overview of CPSARST</p>
            </st>
            <p>Here we present CPSARST (Circular Permutation Search Aided by Ramachandran Sequential Transformation), an efficient tool for searching for CPs. It describes three-dimensional protein structures as one-dimensional text strings by using a Ramachandran sequential transformation (RST) algorithm <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, which transforms protein structures through a Ramachandran (RM) map organized by nearest-neighbor clustering. This linear encoding methodology converts complicated and time-consuming structural comparison problems into string comparisons that can be done very rapidly. CPSARST also achieves high efficiency by duplicating the query structure and working through a 'double filter-and-refine' strategy. These approaches are illustrated in Figure <figr fid="F1">1</figr>. A web service and a stand-alone Java program of CPSARST are available at <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. CPSARST not only inherits the speed advantages of sequence-based methods but also retains the sensitivity to detect distantly related CPs that are mostly detectable only by structure-based methods. To the best of our knowledge, it is the first structural similarity search method that makes large-scale all-against-all database searches for CP practicable. We expect that this procedure can be applied to reveal the evolutionary importance of CP and to detect novel protein structural relationships. Several novel CP relationships detected by CPSARST are reported in this article; in addition, reasonable estimates of the prevalence of CP in protein structural databases have been made by all-against-all database searches of the non-redundant Protein Data Bank (PDB) and SCOP.</p>
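            <p>The screening idea of comparing a plain query with a duplicated query can be sketched as follows. This is a toy illustration, not the CPSARST implementation: it assumes a simple Smith-Waterman local alignment with hypothetical scoring values, an arbitrary alphabet standing in for structural strings, and a hypothetical score-gain threshold (<it>min_gain</it>), whereas CPSARST itself uses a heuristic database search followed by structural refinement.</p>
            <p><![CDATA[
```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).
    The scoring values are illustrative assumptions."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def is_cp_candidate(query, target, min_gain=1.2):
    """Round 1 scores the query as is; round 2 scores the duplicated query.
    A target is flagged when duplication improves the score by at least
    min_gain (a hypothetical threshold chosen for this sketch)."""
    linear = local_alignment_score(query, target)
    doubled = local_alignment_score(query + query, target)
    flagged = linear > 0 and doubled >= min_gain * linear
    return flagged, linear, doubled


# Toy strings: the target is the query with its two halves swapped.
permuted_hit = is_cp_candidate("ABCDEFGHIJ", "FGHIJABCDE")
linear_hit = is_cp_candidate("ABCDEFGHIJ", "ABCDEFGHIJ")
```
]]></p>
            <p>In this toy case the linear comparison can align only one half of the target, while the duplicated query aligns it end to end, so only the permuted target is flagged.</p>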
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Flowchart of CPSARST</p>
               </caption>
               <text>
                  <p>Flowchart of CPSARST. CPSARST uses a 'double filter-and-refine' strategy combining a fast screening and an accurate refinement step, each having two different rounds. In the screening stage, the three-dimensional structure of the query protein is transformed into a one-dimensional structural string by an RST algorithm [40]. This query string is subjected to two rounds of database searches. In round 1, it is searched against a pre-transformed structural string database by a heuristic method. In round 2, it is duplicated prior to the database search. Results of the two rounds are filtered; hits with meaningfully improved similarity scores are considered CP candidates (colored red). In the refinement stage, candidates are analyzed by an accurate structural alignment algorithm, FAST [63], with and without CP manipulation, to determine their reliabilities and to retrieve permutation sites more precisely. After filtering out improbable cases, final answers with detailed information are output. The example used in this figure is a real case with simplified hit lists.</p>
               </text>
               <graphic file="gb-2008-9-1-r11-1"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Performance on random circular permutants</p>
            </st>
            <p>Although CPSARST primarily uses structurally meaningful RM strings to search protein databases, its algorithm is also applicable to amino acid sequences. To evaluate their amino acid sequence-based algorithm, Uliel <it>et al</it>. performed <it>in silico </it>random CP followed by various levels of regular mutations (substitutions, insertions and deletions) on a number of proteins <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. We adapted this approach in a more thorough manner and developed a random CP dataset containing 20,000 chains (RCP dataset; see Materials and methods) to assess the performance of CPSARST with amino acid sequences. Two parameters were monitored: the proportion of cases in which the exact permutation site was retrieved; and the percentage distance of the retrieved permutation site to the exact one, which is defined as:</p>
            <p>
               <display-formula id="M1">
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2008-9-1-r11-i1">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>D</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>%</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mtext>Number&#160;of&#160;residues&#160;off&#160;the&#160;exact&#160;permutation&#160;site</m:mtext>
                              </m:mrow>
                              <m:mrow>
                                 <m:mtext>Sequence&#160;length</m:mtext>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>&#215;</m:mo>
                           <m:mn>100</m:mn>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
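            <p>As a worked illustration of this definition, the following Python sketch computes <it>D </it>(%) for two of the engineered CPs listed in Table 1. It assumes that 'residues off' is the absolute difference between the retrieved and the exact site along the linear sequence; the function name is hypothetical.</p>
            <p><![CDATA[
```python
def percentage_distance(retrieved_site, exact_site, sequence_length):
    """D(%) = residues off the exact permutation site / sequence length x 100.
    Assumes a plain absolute difference between the two site positions."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return abs(retrieved_site - exact_site) / sequence_length * 100


# Values taken from Table 1: (retrieved site, recorded site, size).
d_1fw8 = round(percentage_distance(73, 72, 416), 2)  # 1FW8 vs 3PGK
d_1n02 = round(percentage_distance(51, 50, 102), 2)  # 1N02 vs 2EZM
d_1ajk = round(percentage_distance(84, 84, 214), 2)  # 1AJK vs 2AYH
```
]]></p>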
            <p>As shown in Figure <figr fid="F2">2a</figr>, the percentage of exactly matched cases retrieved by CPSARST remains above 80% until the sequence identity falls to between 30% and 40%. Taking 50% exact matches as a cutoff, CPSARST retrieves the exact permutation site in at least half of the cases as long as the sequence identity is higher than 22%.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Performance on RCPs</p>
               </caption>
               <text>
                  <p>Performance on RCPs. The methodology of CPSARST is applicable not only to structurally meaningful RM strings but also to amino acid sequences. Random CP followed by various degrees of random substitutions, insertions and deletions was performed on 100 amino acid sequences. The performance of CPSARST was monitored by <b>(a) </b>the percentage of cases in which the exact permutation site was retrieved, and <b>(b) </b>the percentage distance of the retrieved permutation site to the exact one. The dashed line in (a) represents a 50% cut, above which more than half of the permutation sites were exactly predicted. When relying on amino acid sequences alone to detect CP, CPSARST can be reliable even if the identity is as low as 20%. UFAU stands for the CP-detecting method developed by Uliel <it>et al</it>. [38].</p>
               </text>
               <graphic file="gb-2008-9-1-r11-2"/>
            </fig>
            <p>The curve of the percentage distance of CPSARST has a half hyperbolic shape (Figure <figr fid="F2">2b</figr>). Provided that the sequence identity is &gt; 20%, the percentage distance is &lt; 1%. Combining these data, we suggest that when our approach is applied to amino acid sequences, it reliably detects CPs with sequence identities as low as about 20%.</p>
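            <p>The kind of <it>in silico </it>benchmark used in this section can be sketched in Python as follows. This is a simplified illustration, not the procedure used to build the RCP dataset: it applies a random CP and random substitutions only (insertions and deletions are omitted), and the function names, substitution model and example sequence are assumptions made for the sketch.</p>
            <p><![CDATA[
```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_circular_permutation(seq, rng):
    """Pick a random new N-terminus (needs at least 2 residues).
    Returns the permuted sequence and the chosen permutation site."""
    site = rng.randrange(1, len(seq))
    return seq[site:] + seq[:site], site


def substitute(seq, rate, rng):
    """Replace each residue by a different one with probability `rate`."""
    out = []
    for residue in seq:
        if rng.random() < rate:
            out.append(rng.choice([aa for aa in AMINO_ACIDS if aa != residue]))
        else:
            out.append(residue)
    return "".join(out)


def sequence_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
parent = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
permuted, site = random_circular_permutation(parent, rng)
mutated = substitute(permuted, 0.3, rng)
identity = sequence_identity(permuted, mutated)
```
]]></p>
            <p>A CP search method can then be scored by comparing the permutation site it reports for <it>mutated </it>against the known <it>site</it>, at each level of remaining sequence identity.</p>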
         </sec>
         <sec>
            <st>
               <p>Accuracy evaluations with engineered circular permutants</p>
            </st>
            <p>Since there are many artificial CPs, each with a definite parent protein, a known permutation site, and sometimes some regular mutations, they provide a good resource to assess the performance of a CP search method. We used keyword searches to find the engineered CPs recorded in the PDB <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, and subjected them to CPSARST searches. As summarized in Table <tblr tid="T1">1</tblr>, among the 15 non-redundant cases, all the parent proteins were successfully retrieved. Their average percentage distance is only 0.08%, which means that the CP sites identified are very close to the exact ones, demonstrating the high accuracy of CPSARST for engineered CPs.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Retrieved parent proteins of engineered CPs by CPSARST</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p>PDB entry</p>
                     </c>
                     <c ca="center">
                        <p>Chain</p>
                     </c>
                     <c ca="center">
                        <p>Size</p>
                     </c>
                     <c ca="left">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>Parent structure/recorded CP site</p>
                     </c>
                     <c ca="left">
                        <p>Retrieved structure/determined CP site</p>
                     </c>
                     <c ca="center">
                        <p><it>D </it>(%)*</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1AJK">1AJK</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A,B</p>
                     </c>
                     <c ca="center">
                        <p>214</p>
                     </c>
                     <c ca="left">
                        <p>Circularly permuted (1-3,1-4)-beta-D-glucan 4-glucanohydrolase H</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/84</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/84</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1AJO">1AJO</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A,B</p>
                     </c>
                     <c ca="center">
                        <p>214</p>
                     </c>
                     <c ca="left">
                        <p>Circularly permuted (1-3,1-4)-beta-D-glucan 4-glucanohydrolase H</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/127</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/127</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1ALQ">1ALQ</ext-link>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>266</p>
                     </c>
                     <c ca="left">
                        <p>CP254 beta-lactamase</p>
                     </c>
                     <c ca="left">
                        <p>3BLM/254</p>
                     </c>
                     <c ca="left">
                        <p>3BLM/254</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1BD7">1BD7</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A,B</p>
                     </c>
                     <c ca="center">
                        <p>176</p>
                     </c>
                     <c ca="left">
                        <p>Circularly permuted BB2-crystallin</p>
                     </c>
                     <c ca="left">
                        <p>1BLBC/87</p>
                     </c>
                     <c ca="left">
                        <p>1BLBC/87</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1CPM">1CPM</ext-link>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>214</p>
                     </c>
                     <c ca="left">
                        <p>Glucanase</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/59</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/59</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1CPN">1CPN</ext-link>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>208</p>
                     </c>
                     <c ca="left">
                        <p>Glucanase</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/59</p>
                     </c>
                     <c ca="left">
                        <p>2AYH/59</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1FW8">1FW8</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>416</p>
                     </c>
                     <c ca="left">
                        <p>Phosphoglycerate kinase</p>
                     </c>
                     <c ca="left">
                        <p>3PGK/72</p>
                     </c>
                     <c ca="left">
                        <p>3PGK/73</p>
                     </c>
                     <c ca="center">
                        <p>0.24</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1G2B">1G2B</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="left">
                        <p>Spectrin alpha chain</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/47</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/47</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1N02">1N02</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>102</p>
                     </c>
                     <c ca="left">
                        <p>Cyanovirin-N</p>
                     </c>
                     <c ca="left">
                        <p>2EZM/50</p>
                     </c>
                     <c ca="left">
                        <p>2EZM/51</p>
                     </c>
                     <c ca="center">
                        <p>0.98</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1P5C">1P5C</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A-D</p>
                     </c>
                     <c ca="center">
                        <p>167</p>
                     </c>
                     <c ca="left">
                        <p>Lysozyme</p>
                     </c>
                     <c ca="left">
                        <p>1LW9A/12</p>
                     </c>
                     <c ca="left">
                        <p>1LW9A/12</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1SWF">1SWF</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A-D</p>
                     </c>
                     <c ca="center">
                        <p>128</p>
                     </c>
                     <c ca="left">
                        <p>Circularly permuted core-streptavidin E51/A46</p>
                     </c>
                     <c ca="left">
                        <p>1STP/51</p>
                     </c>
                     <c ca="left">
                        <p>1STP/51</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1SWG">1SWG</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A-D</p>
                     </c>
                     <c ca="center">
                        <p>128</p>
                     </c>
                     <c ca="left">
                        <p>Circularly permuted core-streptavidin E51/A46</p>
                     </c>
                     <c ca="left">
                        <p>1STP/51</p>
                     </c>
                     <c ca="left">
                        <p>1STP/51</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1TUC">1TUC</ext-link>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="left">
                        <p>alpha-Spectrin</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/20</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/20</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1TUD">1TUD</ext-link>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="left">
                        <p>alpha-Spectrin</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/48</p>
                     </c>
                     <c ca="left">
                        <p>1SHG/48</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <ext-link ext-link-type="pdb" ext-link-id="1UN2">1UN2</ext-link>
                        </p>
                     </c>
                     <c ca="center">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>197</p>
                     </c>
                     <c ca="left">
                        <p>Thiol-disulfide interchange protein</p>
                     </c>
                     <c ca="left">
                        <p>1A2J/100</p>
                     </c>
                     <c ca="left">
                        <p>1A2J/100</p>
                     </c>
                     <c ca="center">
                        <p>0.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Average</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.08</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*Percentage distance of the retrieved permutation site to the exact one. See text for definition.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Pair-wise comparisons of naturally occurring circular permutants</p>
            </st>
            <p>To our knowledge, current structure-based CP-detecting methods work only in a pair-wise fashion. Although CPSARST is a database search procedure, it can be simplified to perform pair-wise comparisons (see Materials and methods). Here, we used naturally occurring CP candidates to test the performance of CPSARST. These candidate pairs were detected by all-against-all searches of a non-redundant PDB dataset (see below for details), followed by removal of engineered permutants. The 'structural diversity' defined by Lu <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, which integrates the concepts of normalized alignment size and root mean square distance (RMSD), was used to evaluate the quality of pair-wise comparisons:</p>
            <p>
               <display-formula id="M2">
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2008-9-1-r11-i2">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>structural&#160;diversity</m:mtext>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mtext>RMSD</m:mtext>
                              </m:mrow>
                              <m:mrow>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mtext>(</m:mtext>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mtext>alignment&#160;size</m:mtext>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mrow>
                                                   <m:mtext>avg(N</m:mtext>
                                                </m:mrow>
                                                <m:mtext>q</m:mtext>
                                             </m:msub>
                                             <m:msub>
                                                <m:mrow>
                                                   <m:mtext>,N</m:mtext>
                                                </m:mrow>
                                                <m:mtext>s</m:mtext>
                                             </m:msub>
                                             <m:mtext>)</m:mtext>
                                          </m:mrow>
                                       </m:mfrac>
                                       <m:mtext>)</m:mtext>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mtext>1</m:mtext>
                                       <m:mtext>.5</m:mtext>
                                    </m:mrow>
                                 </m:msup>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaacaqGZbGaaeiDaiaabkhacaqG1bGaae4yaiaabshacaqG1bGaaeOCaiaabwgacaqGGaGaaeizaiaabMgacaqG2bGaaeyzaiaabkhacaqGZbGaaeyAaiaabshacaqG5bGaeyypa0tcfa4aaSaaaeaacaqGsbGaaeytaiaabofacaqGebaabaGaaeikamaalaaabaGaaeyyaiaabYgacaqGPbGaae4zaiaab6gacaqGTbGaaeyzaiaab6gacaqG0bGaaeiiaiaabohacaqGPbGaaeOEaiaabwgaaeaacaqGHbGaaeODaiaabEgacaqGOaGaaeOtamaaBaaabaGaaeyCaaqabaGaaeilaiaab6eadaWgaaqaaiaabohaaeqaaiaabMcaaaGaaeykamaaCaaabeqaaiaabgdacaqGUaGaaeynaaaaaaaaaa@5FDB@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where avg(N<sub>q</sub>, N<sub>s</sub>) is the average size of the query and subject proteins. A lower structural diversity indicates a higher-quality structural alignment. The results are listed in Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr>. In terms of structural diversity, CPSARST performs better than SHEBA <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and is comparable to SAMO <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. In addition, CPSARST is on average 9.3 times faster than SAMO in these pair-wise comparisons (Table <tblr tid="T2">2</tblr>). Protein size has no obvious effect on the alignment quality of these structure-based methods, whereas the running time increases with protein size. This increase is smallest for CPSARST and much smaller than that for SAMO. Sequence identity greatly influences performance, especially for SHEBA (Table <tblr tid="T3">3</tblr>). The difference between the structural diversities obtained by CPSARST and SAMO does not become obvious until the sequence identity of the CP pair falls below 20%.</p>
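            <p>As an illustration of the structural diversity measure defined above, the following minimal Python sketch evaluates it for a single pair. The function name, argument names and example values are hypothetical and are not taken from the CPSARST implementation.</p>

```python
def structural_diversity(rmsd: float, alignment_size: int,
                         n_query: int, n_subject: int) -> float:
    """RMSD divided by (alignment size / average protein size) ** 1.5."""
    if not alignment_size > 0:
        raise ValueError("alignment_size must be positive")
    avg_size = (n_query + n_subject) / 2.0   # avg(Nq, Ns)
    coverage = alignment_size / avg_size     # normalized alignment size
    return rmsd / coverage ** 1.5


# Hypothetical pair: RMSD of 2.0 angstroms over 100 aligned residues,
# with a 120-residue query and a 130-residue subject.
example = structural_diversity(2.0, 100, 120, 130)
print(round(example, 3))
```

            <p>Because the normalized alignment size appears in the denominator, an alignment covering fewer residues is penalized even when its RMSD is unchanged.</p>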
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Performance of pair-wise comparisons for natural candidate CP pairs over various protein sizes</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c ca="left">
                        <p>Length of the query protein (residues)</p>
                     </c>
                     <c ca="center">
                        <p>No. of candidate CP pairs</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>CPSARST</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>SHEBA</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>SAMO</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Structural diversity</p>
                     </c>
                     <c ca="center">
                        <p>Average running time (s)</p>
                     </c>
                     <c ca="center">
                        <p>Structural diversity</p>
                     </c>
                     <c ca="center">
                        <p>Average running time (s)</p>
                     </c>
                     <c ca="center">
                        <p>Structural diversity</p>
                     </c>
                     <c ca="center">
                        <p>Average running time (s)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>&#8804; 100</p>
                     </c>
                     <c ca="center">
                        <p>135</p>
                     </c>
                     <c ca="center">
                        <p>5.269</p>
                     </c>
                     <c ca="center">
                        <p>0.245</p>
                     </c>
                     <c ca="center">
                        <p>6.600</p>
                     </c>
                     <c ca="center">
                        <p>0.506</p>
                     </c>
                     <c ca="center">
                        <p>4.024</p>
                     </c>
                     <c ca="center">
                        <p>0.765</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>100-150</p>
                     </c>
                     <c ca="center">
                        <p>223</p>
                     </c>
                     <c ca="center">
                        <p>6.629</p>
                     </c>
                     <c ca="center">
                        <p>0.381</p>
                     </c>
                     <c ca="center">
                        <p>10.255</p>
                     </c>
                     <c ca="center">
                        <p>0.767</p>
                     </c>
                     <c ca="center">
                        <p>4.359</p>
                     </c>
                     <c ca="center">
                        <p>2.243</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>150-200</p>
                     </c>
                     <c ca="center">
                        <p>464</p>
                     </c>
                     <c ca="center">
                        <p>6.105</p>
                     </c>
                     <c ca="center">
                        <p>0.520</p>
                     </c>
                     <c ca="center">
                        <p>12.730</p>
                     </c>
                     <c ca="center">
                        <p>0.955</p>
                     </c>
                     <c ca="center">
                        <p>4.591</p>
                     </c>
                     <c ca="center">
                        <p>3.554</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>200-250</p>
                     </c>
                     <c ca="center">
                        <p>177</p>
                     </c>
                     <c ca="center">
                        <p>4.410</p>
                     </c>
                     <c ca="center">
                        <p>0.922</p>
                     </c>
                     <c ca="center">
                        <p>10.683</p>
                     </c>
                     <c ca="center">
                        <p>1.390</p>
                     </c>
                     <c ca="center">
                        <p>3.499</p>
                     </c>
                     <c ca="center">
                        <p>6.793</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>250-300</p>
                     </c>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>6.645</p>
                     </c>
                     <c ca="center">
                        <p>1.063</p>
                     </c>
                     <c ca="center">
                        <p>11.092</p>
                     </c>
                     <c ca="center">
                        <p>1.774</p>
                     </c>
                     <c ca="center">
                        <p>4.277</p>
                     </c>
                     <c ca="center">
                        <p>10.820</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>&gt; 300</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>6.918</p>
                     </c>
                     <c ca="center">
                        <p>1.894</p>
                     </c>
                     <c ca="center">
                        <p>6.976</p>
                     </c>
                     <c ca="center">
                        <p>2.224</p>
                     </c>
                     <c ca="center">
                        <p>4.423</p>
                     </c>
                     <c ca="center">
                        <p>22.345</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Average</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.838</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1.269</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>7.753</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
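            <p>The 'Average' row of Table 2 and the 9.3-fold speed difference quoted in the text can be checked with a short calculation. The sketch below assumes that the averages are unweighted means over the six size bins (an assumption, although it does reproduce the published values); the variable and function names are illustrative only.</p>

```python
from decimal import Decimal, ROUND_HALF_UP

# Average running times (s) per size bin, copied from Table 2.
running_time = {
    "CPSARST": ["0.245", "0.381", "0.520", "0.922", "1.063", "1.894"],
    "SHEBA": ["0.506", "0.767", "0.955", "1.390", "1.774", "2.224"],
    "SAMO": ["0.765", "2.243", "3.554", "6.793", "10.820", "22.345"],
}


def unweighted_mean(values):
    """Mean over bins, ignoring the number of pairs in each bin."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.001"), ROUND_HALF_UP)


averages = {name: unweighted_mean(times) for name, times in running_time.items()}
speedup_vs_samo = (averages["SAMO"] / averages["CPSARST"]).quantize(
    Decimal("0.1"), ROUND_HALF_UP)
print(averages, speedup_vs_samo)
```

            <p>Decimal arithmetic is used here so that the half-way case for CPSARST (5.025 / 6) is rounded the same way as in the table.</p>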
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Performance of pair-wise comparisons for natural candidate CP pairs over various sequence identities</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Identity (%)</p>
                     </c>
                     <c ca="center">
                        <p>No. of candidate CP pairs</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Structural diversity</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>CPSARST</p>
                     </c>
                     <c ca="center">
                        <p>SHEBA</p>
                     </c>
                     <c ca="center">
                        <p>SAMO</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>&#8804; 10</p>
                     </c>
                     <c ca="center">
                        <p>823</p>
                     </c>
                     <c ca="center">
                        <p>6.309</p>
                     </c>
                     <c ca="center">
                        <p>11.180</p>
                     </c>
                     <c ca="center">
                        <p>4.396</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>10-20</p>
                     </c>
                     <c ca="center">
                        <p>152</p>
                     </c>
                     <c ca="center">
                        <p>5.864</p>
                     </c>
                     <c ca="center">
                        <p>13.881</p>
                     </c>
                     <c ca="center">
                        <p>4.994</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>20-30</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>3.581</p>
                     </c>
                     <c ca="center">
                        <p>4.506</p>
                     </c>
                     <c ca="center">
                        <p>3.363</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>30-40</p>
                     </c>
                     <c ca="center">
                        <p>33</p>
                     </c>
                     <c ca="center">
                        <p>1.868</p>
                     </c>
                     <c ca="center">
                        <p>3.284</p>
                     </c>
                     <c ca="center">
                        <p>2.210</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>40-50</p>
                     </c>
                     <c ca="center">
                        <p>40</p>
                     </c>
                     <c ca="center">
                        <p>1.755</p>
                     </c>
                     <c ca="center">
                        <p>3.096</p>
                     </c>
                     <c ca="center">
                        <p>1.544</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>&gt; 50</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>1.385</p>
                     </c>
                     <c ca="center">
                        <p>2.247</p>
                     </c>
                     <c ca="center">
                        <p>1.520</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>CPSARST runs very rapidly in pair-wise comparisons. When searching databases, it is faster still, because it does not work in a pair-wise manner but uses a 'double filter-and-refine' strategy. Chen estimated that comparing two proteins with SAMO typically takes around ten seconds <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. Searching the current PDB (approximately 90,000 polypeptides) by one-against-all comparisons would therefore require about 15,000 minutes, whereas CPSARST can complete the same one-against-all search in 1.7 minutes (see below). As these naturally occurring cases show, CPSARST achieves a high speed with a reasonable compromise in alignment accuracy.</p>
         </sec>
         <sec>
            <st>
               <p>Protein structural database searches</p>
            </st>
            <p>To examine the database searching performance of CPSARST, we used two non-redundant protein databases: the 90% sequence identity subsets of the PDB (January 2007) and of the ASTRAL SCOP dataset (v.1.71) <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, abbreviated as nrPDB-90 (14,422 polypeptides) and nrSCOP-90 (11,688 domains), respectively (see Additional data files 1 and 2 for lists of entry IDs). As summarized in Table <tblr tid="T4">4</tblr>, the all-against-all survey of nrPDB-90 took 65.7 hours. Since this database contains approximately 208 million protein pairs (14,422 &#215; 14,422), CPSARST scanned around 52,800 pairs per minute. At this speed, a full search of the current PDB could be finished in 1.7 minutes per query protein. Compared with the 6.4 minutes required by the sequence-based UFAU method (developed by S Uliel, A Fliess, A Amir and R Unger) <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> and the 15,000 minutes estimated for the structure-based SAMO <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, CPSARST is considerably faster. In addition, CPSARST provides two parameters, the expectation value (E-value) and the CP score, with which the user can evaluate the significance of the retrieved hits.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Statistics of protein structural database searches</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Database</p>
                     </c>
                     <c ca="center">
                        <p>nrPDB-90</p>
                     </c>
                     <c ca="center">
                        <p>nrSCOP-90</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>No. of proteins</p>
                     </c>
                     <c ca="center">
                        <p>14,422</p>
                     </c>
                     <c ca="center">
                        <p>11,688</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>No. of candidate pairs</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>Detected by amino acid sequence</p>
                     </c>
                     <c ca="center">
                        <p>5,020</p>
                     </c>
                     <c ca="center">
                        <p>1,802</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>Detected only by Ramachandran string</p>
                     </c>
                     <c ca="center">
                        <p>252,287</p>
                     </c>
                     <c ca="center">
                        <p>196,533</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>Confirmed after the refinement stage</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>Total</p>
                     </c>
                     <c ca="center">
                        <p>2,911</p>
                     </c>
                     <c ca="center">
                        <p>4,228</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>Symmetric CP</p>
                     </c>
                     <c ca="center">
                        <p>682</p>
                     </c>
                     <c ca="center">
                        <p>1,161</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total no. of protein pairs</p>
                     </c>
                     <c ca="center">
                        <p>208.0 &#215; 10<sup>6</sup></p>
                     </c>
                     <c ca="center">
                        <p>136.6 &#215; 10<sup>6</sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total running time (minutes)</p>
                     </c>
                     <c ca="center">
                        <p>3,942</p>
                     </c>
                     <c ca="center">
                        <p>1,974</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>No. of protein pairs scanned per minute</p>
                     </c>
                     <c ca="center">
                        <p>52,764</p>
                     </c>
                     <c ca="center">
                        <p>69,204</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
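            <p>The throughput figures in Table 4 and the per-query estimates given in the text follow from simple arithmetic, which the Python sketch below reproduces. The database sizes and running times are those of Table 4; the PDB size of about 90,000 polypeptides and the ten seconds per SAMO comparison are the estimates quoted in the text; the function names are illustrative.</p>

```python
def pairs_per_minute(n_proteins: int, total_minutes: float) -> float:
    """All-against-all throughput: n x n ordered pairs per minute."""
    return n_proteins * n_proteins / total_minutes


def minutes_per_query(database_size: int, rate: float) -> float:
    """Time for one query to be compared against every database entry."""
    return database_size / rate


nrpdb_rate = pairs_per_minute(14_422, 3_942)
nrscop_rate = pairs_per_minute(11_688, 1_974)

# One-against-all search of about 90,000 PDB polypeptides.
pdb_query_minutes = minutes_per_query(90_000, nrpdb_rate)
# SAMO at roughly ten seconds per pair, converted to minutes.
samo_minutes = 90_000 * 10 / 60

print(round(nrpdb_rate), round(nrscop_rate))
print(round(pdb_query_minutes, 1), samo_minutes)
```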
            <p>As a database search method, CPSARST provides a list of hits ranked by the statistically meaningful E-value. Given that a hit has a similarity score <it>S</it>, the E-value is the number of different alignments with scores equivalent to or better than <it>S </it>that are expected to occur in this particular database search by chance <abbrgrp><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>. A lower E-value indicates a higher significance for the score. This statistical significance is a useful indicator of the reliability of the search results.</p>
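            <p>For readers unfamiliar with E-values, the sketch below uses the standard Karlin-Altschul form, E = K m n exp(-lambda S), to show how the expected number of chance hits falls as the score rises and grows with the size of the search space. Whether CPSARST computes its E-values in exactly this way is an assumption here, and the parameter values K and lambda are hypothetical placeholders rather than values used by CPSARST.</p>

```python
import math


def expected_hits(score: float, query_len: int, db_len: int,
                  k: float, lam: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical statistical parameters and search-space sizes.
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 150, 3_000_000

e_low_score = expected_hits(40, QUERY_LEN, DB_LEN, K, LAMBDA)
e_high_score = expected_hits(80, QUERY_LEN, DB_LEN, K, LAMBDA)
print(e_low_score, e_high_score)
```

            <p>In this form the E-value scales linearly with the database length, which is why the same score is less significant in a larger database.</p>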
            <p>To determine the extent to which two proteins are related by a CP, we used the CP scoring scheme described by Vesterstrom and Taylor <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. The minimum value of this CP score is -1 for a pair of completely linearly aligned proteins, and its maximum value is 1 for a perfect CP alignment. In general, a small positive CP score indicates that only a small fraction of the protein is permuted, whereas a larger one indicates that the CP site is closer to the middle of the polypeptide chain.</p>
            <p>In the survey of nrPDB-90 and nrSCOP-90, we set the RMSD cutoff to 5 &#197;, the E-value cutoff to 0.1 and the CP score threshold to 0.2. Under these criteria, 2,911 and 4,228 candidate pairs were identified in nrPDB-90 and nrSCOP-90, respectively. For nrPDB-90, the 2,911 candidate pairs comprised 1,822 different polypeptides; that is, 12.6% (1,822 of 14,422) of the polypeptides have a CP relationship with at least one other polypeptide. For nrSCOP-90, the proportion is 17.6% (2,060 of 11,688).</p>
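            <p>The selection criteria above can be sketched as a simple filter over search hits. In the Python example below, the Hit record, the protein identifiers and the example values are hypothetical, and treating each cutoff as inclusive is an assumption; only the three threshold values and the reported counts are taken from the text.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    rmsd: float      # in angstroms
    e_value: float
    cp_score: float  # ranges from -1 to 1


RMSD_CUTOFF = 5.0
E_VALUE_CUTOFF = 0.1
CP_SCORE_THRESHOLD = 0.2


def is_candidate(hit: Hit) -> bool:
    """Apply the three cutoffs (assumed inclusive) to one hit."""
    return (RMSD_CUTOFF >= hit.rmsd
            and E_VALUE_CUTOFF >= hit.e_value
            and hit.cp_score >= CP_SCORE_THRESHOLD)


def fraction_with_cp_partner(hits, database_size: int) -> float:
    """Fraction of database entries found in at least one candidate pair."""
    members = set()
    for hit in hits:
        if is_candidate(hit):
            members.update((hit.query, hit.subject))
    return len(members) / database_size


hits = [
    Hit("protA", "protB", 2.1, 1e-5, 0.45),  # passes all three cutoffs
    Hit("protA", "protC", 6.3, 1e-8, 0.60),  # RMSD too large
    Hit("protD", "protE", 3.0, 0.5, 0.30),   # E-value too high
    Hit("protF", "protG", 1.8, 1e-3, 0.05),  # CP score too low
]
candidates = [h for h in hits if is_candidate(h)]
fraction = fraction_with_cp_partner(hits, 7)

# Reported proportions, recomputed from the counts given in the text.
print(round(100 * 1_822 / 14_422, 1), round(100 * 2_060 / 11_688, 1))
```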
         </sec>
         <sec>
            <st>
               <p>Novel circular permutation family detected by CPSARST</p>
            </st>
            <p>After visual inspection of superimposed CP pairs detected by CPSARST, we found that proteins with very different functions and divergent amino acid sequences can share structural CP relationships, forming novel CP families that are difficult to identify using conventional comparison methods. For instance, although glycine betaine-binding proteins (GBBPs), molybdate-binding proteins and <it>Klebsiella aerogenes </it>cysteine regulon transcriptional activator CysB share similar overall structures when judged by eye, their sequence identity is low (&lt; 24%; calculated by FASTA <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>) and their structural relatedness is hard to detect by conventional methods (Figure <figr fid="F3">3</figr>). CPSARST detected CP relationships among the GBBPs themselves and among these three groups of proteins. To our knowledge, these CP relationships have not been reported previously. Figure <figr fid="F3">3</figr> illustrates that the functional and evolutionary relationships among these proteins cannot be correctly determined from their raw sequences: their ligand-interacting residues are not well aligned, and proteins with more similar functions are separated while those with less similar functions cluster together in the phylogram tree. However, the circularly permuted sequences retrieved by CPSARST can be well aligned, and the resulting phylogram tree agrees with the functional relatedness among these proteins. A superimposition of six of these proteins is also shown in Figure <figr fid="F3">3</figr> to demonstrate their structural similarity and the conserved position of their ligand-binding pockets.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>A novel CP family detected by CPSARST</p>
               </caption>
               <text>
                  <p>A novel CP family detected by CPSARST. Entries 2b4lA ([PDB:<ext-link ext-link-type="pdb" ext-link-id="2B4L">2B4L</ext-link>], chain A), 1r9lA ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1R9L">1R9L</ext-link>], chain A) and 1sw1A ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1SW1">1SW1</ext-link>], chain A) are GBBPs. Entries 1atg ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1ATG">1ATG</ext-link>]) and 1amf ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1AMF">1AMF</ext-link>]) are molybdate-binding proteins (MoBPs) and 1al3 ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1AL3">1AL3</ext-link>]) is the cysteine regulon transcriptional activator CysB from <it>Klebsiella aerogenes</it>. Any pair of these proteins shares &lt; 24% sequence identity (calculated by FASTA [48]). <b>(a) </b>Multiple sequence alignment of these GBBPs, MoBPs and CysB does not clearly reveal their functional and evolutionary relationships. Residues interacting with the ligands [65-67] are colored red; they are rather scattered. GBBPs and MoBPs are basically ligand transporters while CysB is a transcriptional regulator; however, the phylogram tree built from this alignment places CysB and MoBPs on the same branch and separates the three GBBPs into two branches; these evolutionary relationships do not agree with their functional relatedness. <b>(b) </b>Multiple circularly permuted sequence alignment and structural superimposition of these six proteins. The numbers after '_cp' following PDB entry IDs stand for the residue numbers of the new amino termini after circular permutations, which are indicated by colored arrows. The ligand-interacting residues are better clustered in this alignment (gray regions) and the phylogram tree agrees well with the functional relatedness. The image of the superimposed proteins shows that these proteins have similar overall structures and the positions of their ligand-binding pockets are conserved (ligands are shown as yellow stick models); the colors used in this image are the same as in the alignment text and phylogram tree. Structures shown in this report were all drawn using PyMOL [68]. Multiple sequence alignments and the tree building were performed with Clustal W [69].</p>
               </text>
               <graphic file="gb-2008-9-1-r11-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Circular permutants detected by CPSARST</p>
            </st>
            <p>We examined the candidate pairs detected by CPSARST with RMSD &#8804; 3.5 &#197; by visual inspection of superimposed structures and found that approximately 55%, 25% and 20% are mainly alpha, mainly beta, and alpha-beta structures, respectively. These CP pairs are listed, each with a superimposed image, in Additional data file 3; many well-known CP cases are listed, such as some lectins, glucanases, transaldolases, methyltransferases, ferredoxins, protease inhibitors and GTPases. Furthermore, a large number of these CP relationships have not been reported yet, for example, chorismate mutases ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1CSM">1CSM</ext-link>] versus [PDB:<ext-link ext-link-type="pdb" ext-link-id="2AO2">2AO2</ext-link>]); some (approximately 20%) even involve hypothetical proteins, implying that CPSARST can be applied to suggest possible functions for hypothetical proteins.</p>
            <p>Rat Rab3A is a small G protein with GTPase activity <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>. CPSARST detected that it has a CP relationship with a conserved hypothetical protein YlqF from <it>Bacillus subtilis</it>, the structure of which was determined by the New York Structural Genomics Research Consortium. When we searched with YlqF against the PDB using the DALI server <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, a number of isomerases, elongation factors, G proteins, transferases and other hypothetical proteins were returned with unconvincing structural alignment quality, that is, small alignment sizes and large RMSD values (Additional data file 4). However, CPSARST detected that many G proteins superimpose well with YlqF, suggesting that it may possess GTP binding/GTPase activity (Table <tblr tid="T5">5</tblr>). Figure <figr fid="F4">4</figr> shows that DALI can only partially align Rab3A and YlqF (alignment size, 96; RMSD, 2.9 &#197;), while CPSARST successfully detects the CP relationship between them (alignment size, 130; RMSD, 3.2 &#197;).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>CP relationship between GTPase and hypothetical protein YlqF</p>
               </caption>
               <text>
                  <p>CP relationship between GTPase and hypothetical protein YlqF. Rab3A ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1ZBD">1ZBD</ext-link>], chain A) is a small G protein with GTPase activity [49] while YlqF ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1PUJ">1PUJ</ext-link>], chain A) is a conserved hypothetical protein from <it>B. subtilis</it>. <b>(a) </b>These two proteins can be only partially aligned structurally by DALI [36] (left); however, CPSARST detects their CP relationship (right). If the 64 residue amino-terminal region of Rab3A (in cyan text) is permuted to the carboxyl terminus, it can be extensively aligned to YlqF with an RMSD of 3.2 &#197; (right). The transparent cyan and pink arrows indicate the amino termini of Rab3A and YlqF, respectively. <b>(b) </b>The superimposition of Rab3A and YlqF made by CPSARST (cross-eye stereo view). Colors are the same as in (a). Residues shown as cyan/pink and blue/red spacefill models are the amino and carboxyl termini, respectively.</p>
               </text>
               <graphic file="gb-2008-9-1-r11-4"/>
            </fig>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Top 20 CP relationships detected from the nrPDB-90 dataset for hypothetical protein YlqF*</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>No.</p>
                     </c>
                     <c ca="left">
                        <p>PDB entry/size</p>
                     </c>
                     <c ca="left">
                        <p>E-value</p>
                     </c>
                     <c ca="center">
                        <p>RMSD (&#197;)/alignment size</p>
                     </c>
                     <c ca="left">
                        <p>Function</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1ZBD">1ZBD</ext-link>/203</p>
                     </c>
                     <c ca="left">
                        <p>4.00E-13</p>
                     </c>
                     <c ca="center">
                        <p>3.17/130</p>
                     </c>
                     <c ca="left">
                        <p>Rabphilin-3A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1KY2">1KY2</ext-link>/182</p>
                     </c>
                     <c ca="left">
                        <p>4.00E-13</p>
                     </c>
                     <c ca="center">
                        <p>3.07/122</p>
                     </c>
                     <c ca="left">
                        <p>GTP-binding protein YPT7P</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2F7S">2F7S</ext-link>/217</p>
                     </c>
                     <c ca="left">
                        <p>4.00E-13</p>
                     </c>
                     <c ca="center">
                        <p>3.52/125</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-27B</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2NZJ">2NZJ</ext-link>/175</p>
                     </c>
                     <c ca="left">
                        <p>8.00E-13</p>
                     </c>
                     <c ca="center">
                        <p>2.94/123</p>
                     </c>
                     <c ca="left">
                        <p>GTP-binding protein REM 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>5</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1T91">1T91</ext-link>/207</p>
                     </c>
                     <c ca="left">
                        <p>9.00E-13</p>
                     </c>
                     <c ca="center">
                        <p>3.06/123</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>6</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1X3S">1X3S</ext-link>/195</p>
                     </c>
                     <c ca="left">
                        <p>2.00E-12</p>
                     </c>
                     <c ca="center">
                        <p>2.80/117</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-18</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>7</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1YU9">1YU9</ext-link>/175</p>
                     </c>
                     <c ca="left">
                        <p>6.00E-12</p>
                     </c>
                     <c ca="center">
                        <p>2.70/123</p>
                     </c>
                     <c ca="left">
                        <p>GTP-binding protein, GTPase domain</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2EW1">2EW1</ext-link>/201</p>
                     </c>
                     <c ca="left">
                        <p>6.00E-12</p>
                     </c>
                     <c ca="center">
                        <p>2.74/128</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-30</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>9</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2GF9">2GF9</ext-link>/189</p>
                     </c>
                     <c ca="left">
                        <p>7.00E-12</p>
                     </c>
                     <c ca="center">
                        <p>2.89/126</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-3D</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1YVD">1YVD</ext-link>/169</p>
                     </c>
                     <c ca="left">
                        <p>8.00E-12</p>
                     </c>
                     <c ca="center">
                        <p>2.12/123</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-22A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>11</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1PUI">1PUI</ext-link>/210</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>3.00/130</p>
                     </c>
                     <c ca="left">
                        <p>Probable GTP-binding protein engB</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>12</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2O52">2O52</ext-link>/200</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.92/127</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-4B</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>13</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1U8Y">1U8Y</ext-link>/168</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.81/110</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Ral-A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>14</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1HUQ">1HUQ</ext-link>/164</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.80/123</p>
                     </c>
                     <c ca="left">
                        <p>Rab5C, GTPase domain</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>15</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2HUP">2HUP</ext-link>/201</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>3.11/129</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-43</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>16</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1FZQ">1FZQ</ext-link>/181</p>
                     </c>
                     <c ca="left">
                        <p>1.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.58/123</p>
                     </c>
                     <c ca="left">
                        <p>ADP-ribosylation factor-like protein 3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>17</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2OCB">2OCB</ext-link>/180</p>
                     </c>
                     <c ca="left">
                        <p>3.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.78/121</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-9B</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>18</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1OIV">1OIV</ext-link>/191</p>
                     </c>
                     <c ca="left">
                        <p>4.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>2.81/121</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein Rab-11A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>19</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="2FN4">2FN4</ext-link>/181</p>
                     </c>
                     <c ca="left">
                        <p>4.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>3.11/129</p>
                     </c>
                     <c ca="left">
                        <p>Ras-related protein R-Ras</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>20</p>
                     </c>
                     <c ca="left">
                        <p><ext-link ext-link-type="pdb" ext-link-id="1Z0F">1Z0F</ext-link>/179</p>
                     </c>
                     <c ca="left">
                        <p>6.00E-11</p>
                     </c>
                     <c ca="center">
                        <p>3.04/121</p>
                     </c>
                     <c ca="left">
                        <p>Rab14, member Ras oncogene family</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*YlqF ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1PUJ">1PUJ</ext-link>], chain A) is a conserved hypothetical protein from <it>B. subtilis</it>. This structure was determined by the New York Structural Genomics Research Consortium (NYSGRC).</p>
               </tblfn>
            </tbl>
            <p>Jung and Lee <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> suggested that when a pair of proteins can be well aligned, with or without CP of the sequences, they are symmetric CPs. Under this definition, proteins containing repeats or duplications will be included. However, Uliel <it>et al</it>. <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> argued that these should be differentiated from true CPs. In our view, a CP relationship between symmetric proteins should be accepted only when a reasonable increase in sequence identity is observed after the CP. For instance, <it>B. subtilis </it>thiaminase I <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and <it>Variovorax sp. Pal2 </it>phosphonopyruvate hydrolase <abbrgrp><abbr bid="B52">52</abbr></abbrgrp> are a pair of symmetric TIM-barrel proteins detected by CPSARST that superimpose well, with (alignment size, 151; RMSD, 2.4 &#197;) or without (alignment size, 158; RMSD, 2.7 &#197;) CP. Their sequence identity rises from 10.1% to 24.3% upon CP. As shown in Figure <figr fid="F5">5</figr>, their ligand-interacting residues are not well aligned without CP, whereas with CP these functionally important residues in each protein are aligned with physicochemically related amino acids in the other protein. Therefore, we suggest that this is a true CP case.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Symmetric CP with significant sequence clues</p>
               </caption>
               <text>
                  <p>Symmetric CP with significant sequence clues. Proteins with symmetric structure may have symmetric CPs [29]. <it>B. subtilis </it>thiaminase I ([PDB:<ext-link ext-link-type="pdb" ext-link-id="1YAD">1YAD</ext-link>]) [51] and <it>Variovorax sp. Pal2 </it>phosphonopyruvate hydrolase ([PDB:<ext-link ext-link-type="pdb" ext-link-id="2DUA">2DUA</ext-link>]) [52] shown here are symmetric TIM-barrel proteins. Although their structures can be well aligned by both linear and CP alignments, significant sequence conservation is observed only in the latter. <b>(a) </b>Linear alignment performed by DALI [36]. The upper text demonstrates that the sequence identity calculated from these structurally aligned residues is 10.1%. Ligand-interacting residues in both proteins are highlighted green; four of them are aligned with identical or physicochemically similar amino acids (gray highlighted strips). The lower image is the superimposition of these two structures. Terminal unaligned regions are shown as ribbons to make the spatial closeness of the termini more easily observable. In this linear alignment, the amino termini of the two proteins are close to each other, as are the carboxyl termini. <b>(b) </b>CP relationship detected by CPSARST. After CP, the sequence identity rises significantly to 24.3% and there are nine ligand-interacting residues aligned with identical or similar amino acids. The amino- and carboxy-terminal halves of 1yadA bounded by the putative CP site are colored cyan and blue, respectively. The orientation of 1yadA in the superimposed image is the same as that in (a). In this CP alignment, the amino and carboxyl termini of the two proteins are separated, a feature of symmetric CP.</p>
               </text>
               <graphic file="gb-2008-9-1-r11-5"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Detecting circular permutants with low sequence identities</p>
            </st>
            <p>Generally speaking, although protein similarity search methods based on amino acid sequence alignments are much faster than those based on structural comparisons, they are less sensitive in detecting remote homology <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>. In detecting CP, sequence-based methods face considerable challenges because of the evolutionary complexity and diversity of circular permutants. Except for the post-translational modification model, all the proposed mechanisms for CP involve at least two stages of genetic modification in evolution (see Background), implying that the formation of a CP may require a long period during which other common mutations (substitutions, insertions and deletions) can accumulate to such an extent that the circular permutants have diverged substantially from the parent protein in sequence. Therefore, sequence-based methods may be limited in identifying distantly related CPs. For instance, Uliel <it>et al</it>. used an amino acid sequence-based heuristic algorithm to screen the entire Swiss-Prot database (version 34.0; approximately 80,000 proteins) and the Pfam database <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> for CP pairs, and identified only 32 cases <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. However, in the same year, Jung and Lee <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> used a structure-based algorithm to survey a protein dataset (3,035 domains) collected from SCOP and reported that approximately 47% (1,433 of 3,035) of the domains each had at least one circular permutant. Furthermore, they discovered that fewer than 0.3% of the abundant symmetric CPs have > 30% sequence identity. Although this large difference is partially caused by the fact that Uliel <it>et al</it>. used more stringent criteria to identify CP, it nonetheless indicates that amino acid sequence-based methods can miss many distantly related CPs <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
            <p>Among the CP candidate pairs detected by CPSARST in nrSCOP-90, 27.5% can be considered as symmetric CPs (Table <tblr tid="T4">4</tblr>). Similar to the observation of Jung and Lee, few of these symmetric CPs (2.6%) have sequence identities > 30%. Furthermore, although 91% of the naturally occurring CP pairs listed in Table <tblr tid="T2">2</tblr> have sequence identities &#8804; 20%, CPSARST shows good performance when compared with other structure-based methods. These data demonstrate that CPSARST is able to detect CPs with low sequence identities.</p>
         </sec>
         <sec>
            <st>
               <p>Speed improvements</p>
            </st>
            <p>In most cases, it is not easy for a database search method to achieve high accuracy and high speed simultaneously; instead, a compromise is usually reached. Given that previous structure-based CP-detecting methods such as SAMO require more than 15,000 minutes <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> per query to search the current PDB, it is reasonable to weight speed more heavily than accuracy in CP searching, especially in this post-genomic era when the amount of protein structural data is increasing rapidly. CPSARST has been shown to achieve accuracy substantially higher than that of sequence-based UFAU (Figure <figr fid="F2">2</figr>) and comparable to that of structure-based SAMO (Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr>); as for speed, it can scan 52,800 database proteins per minute (Table <tblr tid="T4">4</tblr>), approximately 4 and 8,824 times faster than UFAU and SAMO, respectively. This improvement in speed is achieved by two features. First, CPSARST transforms the three-dimensional information of protein structures into one-dimensional text strings and thus converts structural comparison problems into text sequence alignment problems, which can be solved much more rapidly. Second, in both the screening and refinement stages, CPSARST does not rely on the absolute quality of the alignments; by focusing on the relative quality between two rounds of alignments, it can rapidly sieve out useful information. We call this strategy 'double filter-and-refine'. We propose that it is efficient, flexible and applicable to other biological research fields, especially where the data analyses require large-scale computational power.</p>
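            <p>The idea of reducing CP detection to a text problem can be illustrated with a toy example. The sketch below is not the CPSARST algorithm; it only shows a generic string-doubling trick for exact circular permutations of one-dimensional strings, and the function name `find_cp_site` and the example strings are hypothetical. Real structural strings would require an inexact alignment rather than exact substring matching.</p>

```python
def find_cp_site(query, subject):
    """Return the 0-based position in `query` that becomes the first
    character when `query` is circularly permuted into `subject`.

    Returns 0 when the strings are identical (no permutation needed)
    and None when `subject` is not a circular permutation of `query`.
    """
    if len(query) != len(subject):
        return None
    if not query:
        return 0
    # Every circular permutation of `query` occurs as a substring of
    # `query + query`, so one substring search covers all rotations.
    position = (query + query).find(subject)
    if position == -1:
        return None
    return position


print(find_cp_site("ABCDEFGH", "DEFGHABC"))  # 3
print(find_cp_site("ABCDEFGH", "ABCDEFGH"))  # 0
print(find_cp_site("ABCDEFGH", "ABCDEFHG"))  # None
```

            <p>In this toy setting a single substring search replaces a separate comparison for every possible permutation site, which is why a text representation can be searched so quickly.</p>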
         </sec>
         <sec>
            <st>
               <p>The prevalence and definition of circular permutation</p>
            </st>
            <p>Previous studies have reached conflicting conclusions; some presumed that CP is rare in nature <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B14">14</abbr><abbr bid="B30">30</abbr></abbrgrp> - approximately 5% as indicated by Vogel and Morea <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> - while others supposed that CP is frequent <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B29">29</abbr><abbr bid="B34">34</abbr></abbrgrp> - approximately 47% as estimated by Jung and Lee <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. In our observation, studies based on structural analyses usually discover more CPs than sequence-based ones; in addition, studies that consider the whole protein as the unit that undergoes CP tend to conclude that CP is rare, whereas those viewing the domain as the unit that undergoes CP tend to suggest that CP is frequent.</p>
            <p>As we have discussed, it is reasonable that more cases of CP are detected by structural comparison than by amino acid sequence alignment. However, although proteins with similar structures are usually functionally related <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>, when a pair of structurally and functionally similar proteins share extremely low sequence identity, we still cannot exclude the possibility that they are just the products of convergent evolution <abbrgrp><abbr bid="B56">56</abbr><abbr bid="B57">57</abbr><abbr bid="B58">58</abbr></abbrgrp> and do not share the same origin. In the case of identifying CP, it is noteworthy that even if a pair of proteins shows a high degree of CP topologically, it does not directly mean that an evolutionary CP event has indeed taken place. Therefore, we argue that detecting CP only by structure would result in too many false positives when judged from the point of view of molecular evolution. This is why we have set up a user-adjustable sequence identity filter in the web service of CPSARST <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> (see Materials and methods). When this filter was not enabled, the prevalence of CP estimated by CPSARST was 12.6-17.6% (see Results). When we considered that a real CP should have a higher sequence identity in the CP alignment than in the linear alignment, around one-fourth of the candidate pairs counted in Table <tblr tid="T4">4</tblr> were filtered out, lowering the estimated prevalence of CP to 9.0-13.0%.</p>
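            <p>The sequence identity criterion discussed above amounts to comparing two alignments of the same protein pair. The following sketch is illustrative only; the function names, the aligned-string representation with '-' marking gaps, the choice to compute identity over non-gap columns and the optional `min_gain` parameter are assumptions, and they do not describe the interface of the CPSARST web service.</p>

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned, non-gap columns ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def passes_identity_filter(linear_identity, cp_identity, min_gain=0.0):
    """Keep a pair only if the CP alignment is more identical than the
    linear alignment by more than `min_gain` percentage points."""
    return cp_identity - linear_identity > min_gain


# Thiaminase I versus phosphonopyruvate hydrolase, values from the text.
print(passes_identity_filter(10.1, 24.3))   # True
# A made-up pair whose identity does not improve after CP.
print(passes_identity_filter(18.0, 17.5))   # False
# Toy aligned strings: 3 matches over 4 non-gap columns.
print(percent_identity("AC-DE", "ACGDF"))   # 75.0
```

            <p>Raising `min_gain` above zero would make the hypothetical filter stricter, in the spirit of a user-adjustable threshold.</p>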
            <p>The frequency of CP estimated by CPSARST is only about one-third of that estimated by Jung and Lee <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, probably because of the more stringent criteria used by CPSARST. We set the RMSD cutoff as 5 &#197;, the CP score threshold as 0.2 and the minimum permutation size as 20% for a pair of proteins to be considered as CP candidates; similar criteria were not seen in the report of Jung and Lee. Also, considering their methodology, it is likely that proteins containing repeats and duplications are regarded as CPs, many of which have been treated as false cases by Uliel <it>et al</it>. <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and us (see Materials and methods). When we loosened the criteria to 6 &#197; (RMSD cutoff), 0.1 (CP score threshold) and 10% (minimum permutation size), and did not filter out proteins containing repeats, the CP prevalence estimated by CPSARST was 34.7-36.7% (see Additional data file 5 for statistics), similar to Jung and Lee's estimation. However, since they did not provide any supplementary list of their CP candidates, we are unable to check our speculation.</p>
            <p>To our knowledge, all the currently available CP-detecting methods are more sensitive to global CP (the unit undergoing CP is the whole protein) than partial CP (the CP is within a region of the protein), as is CPSARST. To detect partial CP, domain databases such as SCOP and Pfam are usually used as the target databases instead of the PDB and Swiss-Prot. Although considering the domain as the unit undergoing CP, that is, partial CP, can identify more candidates (as shown in Table <tblr tid="T4">4</tblr>), some scientists have argued that these cases should be considered as 'swaps' rather than CPs <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. This controversy is another cause of the conflicting conclusions about the prevalence of CP.</p>
            <p>To sum up, despite the conflicting conclusions made by previous studies, there seem to be rational explanations for this situation. We suppose that the identification of CPs requires a precise definition of CP depending on the purpose of the study. In our opinion, if evolutionary importance and mechanisms are concerned, global CP with reasonable sequence identity limitation will be suitable, while partial CP without limitation of sequence identity in the definition may help scientists to discover novel functional relationships among proteins and to reveal the principles of protein folding.</p>
         </sec>
         <sec>
            <st>
               <p>Possible applications of CPSARST</p>
            </st>
            <p>The performance of CPSARST suggests that it is an efficient approach to the detection of CPs in large protein structural datasets; routine bank-against-bank searches are thus achievable. The multiple indexes produced by CPSARST, for example, the structural similarity score, statistically meaningful E-value, sequence identity, alignment size, RMSD and CP score, are useful for developing automated procedures such as a functional assignment system for novel hypothetical proteins. Also, information retrieved by bank-against-bank searches can be organized into a CP database.</p>
            <p>Since the first observation of CP in plant lectins <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, many natural and artificial cases have been studied and several CP detecting methods have been developed; however, there is still no CP database and no standard procedure for evaluating CP detection methods. We suppose that a well-organized CP database will help move this field forward. It could provide a standard for the evaluations of CP-related programs, such as CP search tools and predictors of viable CP sites <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>, and provide information to reveal the evolutionary mechanisms of CP.</p>
            <p>CP has been applied to X-ray crystallography <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, modification of enzymes <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, creation of novel fusion proteins <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B28">28</abbr></abbrgrp>, and construction of protein switches and sensors <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. All these applications depend on a proper choice of position to create CP. A CP database offering plenty of materials for the discovery of the rules by which Nature selects CP sites should be advantageous to the technical applications of CP.</p>
            <p>Although CP is an interesting phenomenon, there is still much uncertainty about its evolutionary mechanisms and importance <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B18">18</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Weiner <it>et al</it>. <abbrgrp><abbr bid="B60">60</abbr></abbrgrp> have proposed that the frequency of incomplete or intermediate CP may help determine the major mechanism of CP. The 'double filter-and-refine' strategy of CPSARST is very flexible. With extended boundary criteria, CPSARST can specifically detect incomplete or intermediate CP. The ability of CPSARST to perform rapid bank-against-bank searches by structural comparisons gives it the potential to reveal how, why and to what extent Nature achieves protein evolutionary and functional diversity by using CP.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have developed an efficient circular permutation search method, CPSARST, which linearly encodes protein structures as text strings and achieves a structural similarity searching speed thousands of times as high as that of related algorithms. When tested with engineered CPs, CPSARST successfully retrieved all the natural proteins with accurate permutation site predictions. Its ability to identify natural CPs is also comparable to that of other structure-based CP-detecting methods. Its high efficiency makes routine database surveys and bank-against-bank searches achievable. After all-against-all searches of non-redundant PDB and SCOP, we have found that most candidate CP pairs share sequence identity &lt; 20%, explaining why previous sequence-based CP-detecting methods have identified far fewer CP cases than structure-based algorithms. Based on these search results, we have suggested that the identification of CPs requires a suitable definition of CP depending on the purpose of the study. If global CP with reasonable sequence identity limitation is considered as true CP, the prevalence of CP in protein structural databases is estimated to be 16% by CPSARST, whereas the prevalence of partial CP without limitation of sequence identity in the definition is estimated to be 36%. Several new CP cases have been detected and reported here, including a novel CP family consisting of microbial GBBPs, molybdate-binding proteins and a cysteine regulon transcriptional activator. In this post-genomics era, when the amount of protein structural data is increasing exponentially, CPSARST can provide a new way to rapidly detect novel relationships among proteins and help to reveal how Nature achieves protein evolutionary and functional diversity by using CP. Its web service and stand-alone Java program are available at <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <p>All the developments and experiments were performed on an IBM e-server 336 machine with dual 3.2 GHz Intel processors, 1 GB RAM and the Linux operating system.</p>
         <sec>
            <st>
               <p>Linear encoding of protein structures</p>
            </st>
            <p>CPSARST describes three-dimensional protein structures as one-dimensional strings by using the RST algorithm <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. The torsion angles (<it>&#966;</it>,<it>&#968;</it>) of a number of proteins were plotted onto a 10&#176; &#215; 10&#176; dissected RM map. The 1,296 cells on this map were then clustered into 22 groups by nearest-neighbor clustering <abbrgrp><abbr bid="B61">61</abbr></abbrgrp> based on their spot numbers and angular distances. These groups were assigned a set of English letters called 'Ramachandran codes'. Coordinates of a protein structure could accordingly be transformed into a text string. The scoring matrix for these codes was produced by using a 'regenerative approach' <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. This linear encoding system converts complicated and time-consuming structural comparison problems into sequence comparisons, which can be done very rapidly. It has been applied to protein structural similarity searching and achieved speeds hundreds of thousands of times higher than CE with an acceptable compromise of accuracy <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. The structural string generated by RST is different in nature from the amino acid sequence; therefore, we termed it 'Ramachandran sequence' or 'Ramachandran string'.</p>
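            <p>The binning step described above can be illustrated with a minimal sketch. It only shows how a pair of torsion angles maps to one of the 36 &#215; 36 = 1,296 cells of a 10&#176; grid and how a cell-to-letter table turns a list of angles into a string. The function names and the letter table are hypothetical; in particular, the placeholder table is not the published 22-group clustering, which is defined in the RST report <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>.</p>

```python
import math

CELL_DEG = 10                      # cell width in degrees, as stated in the text
CELLS_PER_AXIS = 360 // CELL_DEG   # 36 cells along each torsion axis


def cell_index(phi, psi):
    """Map a (phi, psi) pair in degrees to one of the 36 x 36 = 1,296 map cells."""
    # Shift angles from [-180, 180) to [0, 360); the modulo wraps +180 onto -180.
    col = int(math.floor((phi + 180.0) % 360.0 / CELL_DEG))
    row = int(math.floor((psi + 180.0) % 360.0 / CELL_DEG))
    return row * CELLS_PER_AXIS + col


def encode(torsions, cell_to_code):
    """Turn a list of (phi, psi) pairs into a string using a cell-to-letter table."""
    return "".join(cell_to_code[cell_index(phi, psi)] for phi, psi in torsions)


# Hypothetical placeholder table: NOT the published 22-group clustering.
# It only cycles through 22 letters so that the sketch runs end to end.
LETTERS = "ABCDEFGHIJKLMNOPQRSTUV"
PLACEHOLDER_TABLE = [LETTERS[i % len(LETTERS)] for i in range(CELLS_PER_AXIS ** 2)]
```

            <p>With a real clustering, only the table would change; the string produced by such an encoding can then be handled by ordinary sequence comparison tools.</p>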
         </sec>
         <sec>
            <st>
               <p>Generation and analyses of random circular permutants</p>
            </st>
            <p>A hundred polypeptide sequences each longer than 100 residues and sharing &lt; 40% sequence identities were randomly selected from the PDB to perform <it>in silico </it>circular permutations. Regular mutations, i.e. substitutions, insertions and deletions, were introduced in the ratio 150:1:1 to generate random CPs, resulting in 100 levels of decreasing sequence identities/similarities for every polypeptide sequence. The collection of these computer-generated random CPs is called the RCP dataset.</p>
            <p>The substitution rates of various amino acids used to generate the RCP dataset were calculated by analyzing a large number of multiple alignment blocks, the sequences of which shared &lt; 45% identity, as described previously <abbrgrp><abbr bid="B62">62</abbr></abbrgrp>. Since every sequence in the RCP dataset was evolved independently to avoid any possible bias, we supposed that it is suitable for the evaluation of CP detection methods. RCP has two subsets, the identity subset and similarity subset, each containing 10,000 CP pairs (100 parent sequences &#215; 100 circular permutants). They are listed in Additional data file 6.</p>
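            <p>A minimal sketch of how such <it>in silico </it>random CPs could be generated is given below. It is an illustration under stated assumptions rather than the procedure used to build the RCP dataset: the function names are hypothetical, and substitutions are drawn uniformly from the 20 amino acids, whereas the dataset used empirically derived substitution rates <abbrgrp><abbr bid="B62">62</abbr></abbrgrp>. Only the 150:1:1 ratio of substitutions, insertions and deletions is taken from the text.</p>

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def circular_permute(seq, site):
    """Move the first `site` residues to the C-terminal end (0-based cut point)."""
    return seq[site:] + seq[:site]


def mutate_once(seq, rng):
    """Apply one substitution, insertion or deletion, chosen in the ratio 150:1:1."""
    kind = rng.choices(["sub", "ins", "del"], weights=[150, 1, 1])[0]
    pos = rng.randrange(len(seq))
    if kind == "sub":
        # Assumption: uniform replacement; the paper uses empirical substitution rates.
        return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]
    if kind == "ins":
        return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos:]
    # Deletion; keep at least one residue so later calls remain valid.
    return seq[:pos] + seq[pos + 1:] if len(seq) > 1 else seq


def random_cp(parent, n_mutations, seed=0):
    """Circularly permute `parent` at a random site, then apply random mutations."""
    rng = random.Random(seed)
    child = circular_permute(parent, rng.randrange(1, len(parent)))
    for _ in range(n_mutations):
        child = mutate_once(child, rng)
    return child
```

            <p>Calling <it>random_cp</it> with an increasing number of mutations yields a series of permutants with decreasing identity to the parent sequence.</p>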
            <p>Comparisons between each parent sequence and its CPs in the identity subset of the RCP dataset were performed by the traditional heuristic method BLAST <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>. Two parameters were monitored to assess the performance: the percentage of cases in which the exact permutation site was retrieved; and the average percentage distance of the found permutation site to the exact one (see Results). Two further parameters were monitored to optimize the filter for RM sequence searches: the ratio of similarity scores and the negative logarithm in base 10 (-log<sub>10</sub>) of the E-value ratios, before and after the duplication of query sequences (see Additional data file 7 for the results). We found that all the score ratios are equal to or higher than 1, indicating that when the sequence of a CP is duplicated (DL), it always aligns to its parent sequence at least as well as the normal-length (NL) sequence does. As to the E-value ratios, that is, -log<sub>10</sub>(<it>E-value</it><sub><it>DL</it></sub>/<it>E-value</it><sub><it>NL</it></sub>), approximately 80% of them are larger than 2, which stands for a 10<sup>2</sup>-fold improvement of the significance of the similarity score after duplicating the query sequence (see Results for detailed information about E-values).</p>
         </sec>
         <sec>
            <st>
               <p>Screening of circular permutant candidates</p>
            </st>
            <p>It has been supposed that using heuristic methods like BLAST to search for CPs is difficult because an unambiguous reconstruction of the alignment results is problematic <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. CPSARST, however, overcomes this problem by duplicating the query structure, doing two rounds (with and without the duplication) of database searches, and comparing the two sets of results. The hits with improved alignment qualities are picked as CP candidates, the permutation sites of which can be easily determined from the alignment results of duplicated sequences. In the screening stage, the search results of RM strings were filtered with simple criteria based on previous studies and on our experimental results on RCP amino acid sequences mentioned above. The permutation site should lie between 20% and 80% along the length of the query protein, ensuring a significant permutation size (20%). It has been supposed that a tiny permutation size is unlikely to be a real CP <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. The size of the candidate may differ from that of the query protein by at most 50% because proteins of very different sizes are improbable candidates for CPs <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. The similarity score of the duplicated query string (<it>Score</it><sub><it>DL</it></sub>) should be higher than that of the normal query string (<it>Score</it><sub><it>NL</it></sub>), and the -log<sub>10 </sub>value of the E-value ratio should be larger than -0.5 (see Additional data file 7 for detailed information about these settings):</p>
            <p>
               <display-formula id="M3">
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2008-9-1-r11-i3">
                     <m:semantics>
                        <m:mrow>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>S</m:mi>
                                 <m:mi>c</m:mi>
                                 <m:mi>o</m:mi>
                                 <m:mi>r</m:mi>
                                 <m:msub>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mi>D</m:mi>
                                       <m:mi>L</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>S</m:mi>
                                 <m:mi>c</m:mi>
                                 <m:mi>o</m:mi>
                                 <m:mi>r</m:mi>
                                 <m:msub>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mi>N</m:mi>
                                       <m:mi>L</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>></m:mo>
                           <m:mn>1</m:mn>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaajuaGdaWcaaqaaiaadofacaWGJbGaam4BaiaadkhacaWGLbWaaSbaaeaacaWGebGaamitaaqabaaabaGaam4uaiaadogacaWGVbGaamOCaiaadwgadaWgaaqaaiaad6eacaWGmbaabeaaaaGccqGH+aGpcaaIXaaaaa@3F58@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M4">
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2008-9-1-r11-i4">
                     <m:semantics>
                        <m:mrow>
                           <m:mo>&#8722;</m:mo>
                           <m:msub>
                              <m:mrow>
                                 <m:mi>log</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>10</m:mn>
                              </m:mrow>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>E</m:mi>
                                 <m:mi>v</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>l</m:mi>
                                 <m:mi>u</m:mi>
                                 <m:msub>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mi>D</m:mi>
                                       <m:mi>L</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>E</m:mi>
                                 <m:mi>v</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>l</m:mi>
                                 <m:mi>u</m:mi>
                                 <m:msub>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mi>N</m:mi>
                                       <m:mi>L</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>></m:mo>
                           <m:mo>&#8722;</m:mo>
                           <m:mn>0.5</m:mn>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaacqGHsislciGGSbGaai4BaiaacEgadaWgaaWcbaGaaGymaiaaicdaaeqaaOGaaiikaKqbaoaalaaabaGaamyraiaadAhacaWGHbGaamiBaiaadwhacaWGLbWaaSbaaeaacaWGebGaamitaaqabaaabaGaamyraiaadAhacaWGHbGaamiBaiaadwhacaWGLbWaaSbaaeaacaWGobGaamitaaqabaaaaOGaaiykaiabg6da+iabgkHiTiaaicdacaGGUaGaaGynaaaa@4A4C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
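            <p>The screening criteria above can be summarized in a short sketch. The function and parameter names are hypothetical, the size difference is measured relative to the query length (an assumption, since the text does not state the reference length), and both E-values are assumed to be positive so that the logarithm is defined.</p>

```python
import math


def passes_screening(site, query_len, subject_len,
                     score_dl, score_nl, evalue_dl, evalue_nl,
                     min_perm_fraction=0.20, max_size_diff=0.50,
                     min_neg_log10_ratio=-0.5):
    """Return True if a hit satisfies the four screening criteria in the text."""
    # 1. The permutation site lies between 20% and 80% of the query length.
    fraction = site / query_len
    if min_perm_fraction > fraction or fraction > 1.0 - min_perm_fraction:
        return False
    # 2. Candidate and query sizes differ by at most 50%
    #    (assumption: relative to the query length).
    if abs(subject_len - query_len) / query_len > max_size_diff:
        return False
    # 3. Score_DL / Score_NL > 1, written without the division.
    if not score_dl > score_nl:
        return False
    # 4. -log10(Evalue_DL / Evalue_NL) > -0.5 (assumes positive E-values).
    neg_log10_ratio = -math.log10(evalue_dl / evalue_nl)
    return neg_log10_ratio > min_neg_log10_ratio
```

            <p>For example, a hit whose E-value improves from 10<sup>-10</sup> to 10<sup>-20</sup> after duplication has a -log<sub>10 </sub>ratio of 10 and passes the last criterion, whereas one whose E-value worsens from 10<sup>-10</sup> to 10<sup>-5</sup> has a value of -5 and is rejected.</p>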
         </sec>
         <sec>
            <st>
               <p>Refinement of the search results</p>
            </st>
            <p>The refinement of the search results of RM sequences was performed by FAST, an accurate structural alignment algorithm <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B63">63</abbr></abbrgrp>, and a CP scoring scheme developed by Vesterstrom and Taylor <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, following these steps. Step 1: for each candidate, the putative permutation site is parsed from the alignment result of the duplicated query string. Step 2: two rounds of FAST structural alignments are performed. The first round is a normal linear alignment. In the second round, the circularly permuted alignment, the PDB file of the query structure is manipulated by exchanging the amino- and carboxy-terminal halves according to the putative permutation site so that FAST will do the structural alignment 'backside first'. Step 3: if the FAST alignment size after CP is no larger than 50% of the size of the smaller of the query and subject proteins, the candidate is screened out. Step 4: the RMSD cutoff of the CP alignment is set as 5 &#197;. Step 5: in order to differentiate true CPs from proteins with internal repeats or duplications, two criteria are set: the alignment size of the CP alignment should be larger than that of the linear alignment; and the FAST similarity score <abbrgrp><abbr bid="B63">63</abbr></abbrgrp> or TOP score <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> (see formula (2)) calculated from the CP alignment should show at least a 25% improvement over the linear alignment. Step 6: the CP score <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> is calculated from the positions aligned by FAST. It has a theoretical minimum value of -1 (a completely linear alignment) and a maximum value of 1 (a perfect CP). Although Vesterstrom and Taylor suggested that an alignment with a CP score higher than 0.25 can be considered as a significant CP, we find that a threshold of 0.2 is still suitable in our multi-filter system. Step 7: the putative CP site is refined by parsing the output of the FAST structural alignment.</p>
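            <p>Steps 3 to 6 amount to a chain of filters on the linear and circularly permuted alignments, which can be sketched as follows. The data structure and function names are hypothetical, and the alignment sizes, RMSD values, similarity scores and CP score are taken as given inputs rather than computed. The sketch also assumes that a higher similarity score means a better alignment, that the score is positive, and that the CP score threshold is inclusive.</p>

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    size: int      # number of aligned residue pairs
    rmsd: float    # in angstroms
    score: float   # structural similarity score (assumed: higher is better)


def passes_refinement(linear, permuted, query_len, subject_len, cp_score,
                      rmsd_cutoff=5.0, min_gain=0.25, cp_threshold=0.2):
    """Apply the filters of Steps 3 to 6 to one candidate pair."""
    smaller = min(query_len, subject_len)
    # Step 3: the CP alignment must cover more than 50% of the smaller protein.
    if not permuted.size > 0.5 * smaller:
        return False
    # Step 4: RMSD cutoff on the CP alignment.
    if permuted.rmsd > rmsd_cutoff:
        return False
    # Step 5a: the CP alignment must be larger than the linear alignment.
    if not permuted.size > linear.size:
        return False
    # Step 5b: at least a 25% gain in similarity score over the linear alignment.
    if not permuted.score >= (1.0 + min_gain) * linear.score:
        return False
    # Step 6: CP score threshold (inclusive comparison is an assumption).
    return cp_score >= cp_threshold
```

            <p>Keeping each criterion as a separate early return makes it straightforward to loosen or disable a single filter, for example when repeats should not be excluded.</p>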
         </sec>
         <sec>
            <st>
               <p>Pair-wise circularly-permuted structural alignments</p>
            </st>
            <p>The procedure of the database search tool CPSARST can be simplified to perform pair-wise structure alignments as follows. First, transform the query and subject protein structures into RM sequences Q and S, respectively. Second, duplicate Q string to QQ, and align it to S. Third, find the best local alignment and trace it back to the 'start point', which is the putative permutation site. For example, if in the best local alignment, the fragment between residues <it>q</it><sub>1 </sub>and <it>q</it><sub>2 </sub>of Q is aligned to the fragment between <it>s</it><sub>1 </sub>and <it>s</it><sub>2 </sub>of S, then the permutation site of Q will be traced back to <it>q</it><sub>1 </sub>- <it>s</it><sub>1 </sub>+ 1. Fourth, introduce a CP into the query structure according to the putative CP site. Compare this new structure with the subject protein by using an accurate structural alignment algorithm mentioned above.</p>
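            <p>The second and third steps of this procedure can be sketched with a small local alignment routine. This is an illustration only: it uses a plain Smith-Waterman alignment with a simple match/mismatch score and a linear gap penalty as a stand-in for the Ramachandran-code scoring matrix, and all function names are hypothetical. Wrapping the traced-back value into the range 1 to the query length is also an assumption, added so that the result indexes the original (non-duplicated) query.</p>

```python
def best_local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman with linear gaps; returns (score, a_start, b_start), 1-based."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell to the start of the local alignment.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i + 1, j + 1


def putative_permutation_site(q, s):
    """Duplicate Q, align QQ to S, and trace the site back as q1 - s1 + 1."""
    _, q1, s1 = best_local_alignment(q + q, s)
    site = q1 - s1 + 1
    # Assumption: wrap into 1..len(q) so the site refers to the original query.
    return (site - 1) % len(q) + 1


def apply_permutation(q, site):
    """Start the query at the 1-based `site`, moving the preceding part to the end."""
    return q[site - 1:] + q[:site - 1]
```

            <p>For the toy strings Q = EFGHIJABCD and S = ABCDEFGHIJ, the best local alignment of QQ to S starts at <it>q</it><sub>1 </sub>= 7 and <it>s</it><sub>1 </sub>= 1, so the traced-back site is 7, and permuting Q at that site reproduces S.</p>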
         </sec>
         <sec>
            <st>
               <p>Implementation</p>
            </st>
            <p>CPSARST basically works on the structurally meaningful RM strings transformed by RST; however, since many errors and inconsistencies have been reported in PDB entries <abbrgrp><abbr bid="B64">64</abbr></abbrgrp>, a few polypeptides (approximately 2%) cannot be successfully transformed into RM strings. Therefore, in the implementation of CPSARST, we have added two extra rounds of amino acid sequence alignment searches, one with the normal-length sequence and the other with the duplicated sequence, prior to the RM string searches. In addition, the sequence homology filter can be enabled to guarantee a higher evolutionary significance of the search results (see Discussion), and several parameters can be adjusted by users according to their needs or the properties of their materials.</p>
            <sec>
               <st>
                  <p>Word size and gap penalties</p>
               </st>
               <p>These are traditional parameters used by sequence alignment search tools such as BLAST <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> and FASTA <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. For CPSARST, a smaller word size can provide a more accurate determination of the CP site while taking more running time. In our experience, lower gap penalties can give CPSARST higher sensitivity, although there is a trade-off for running time, too. Generally speaking, these parameters have only minor effects on the performance.</p>
            </sec>
            <sec>
               <st>
                  <p>Permutation size limit and circular permutation score threshold</p>
               </st>
               <p>It has been supposed that a tiny permutation size is unlikely to be a real CP <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, but there is as yet no consensus on a generally suitable permutation size limit. Setting a large limit ensures that CPSARST identifies unambiguous CP relationships; however, novel cases can thus be missed. If the query protein is large enough, for example, &gt; 150 residues, a small size limit such as 10% may still work well, but we would suggest a 15% limit for general situations. The CP score threshold has effects on the search quality of CPSARST similar to those of the permutation size limit (see Results and Materials and methods for further information).</p>
            </sec>
            <sec>
               <st>
                  <p>RMSD cutoff and structural similarity improvement filter</p>
               </st>
               <p>More closely related protein structures have a lower RMSD when superimposed. This is also true for CPs. The RMSD cutoff can therefore be used as a basic quality control in the same way as in conventional structural comparison tools. The normalized structural similarity score of FAST <abbrgrp><abbr bid="B63">63</abbr></abbrgrp> is another basic quality control. Candidate pairs without enough improvement in structural similarity after CP can be screened out.</p>
               <p>Examples of practical settings for these parameters can be found in Additional data file 8. CPSARST is available at <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>CP, circular permutation; CPs, circular permutants; CPSARST, Circular Permutation Search Aided by Ramachandran Sequential Transformation; DL, duplicated; GBBP, glycine betaine-binding protein; NL, normal length; PDB, Protein Data Bank; RM, Ramachandran; RCP, random circular permutation; RMSD, root mean square distance; RST, Ramachandran sequential transformation.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>WCL designed and carried out this study and drafted the manuscript. PCL conceived the study, participated in its design and helped to draft the manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Additional data files</p>
         </st>
         <p>The following additional data are available with the online version of this paper. Additional data file <supplr sid="S1">1</supplr> lists the nrPDB-90 dataset, the 90% sequence identity subset of the PDB (January 2007). Additional data file <supplr sid="S2">2</supplr> lists the nrSCOP-90 dataset, the 90% sequence identity subset of SCOP (v.1.71). Additional data file <supplr sid="S3">3</supplr> is a table listing candidate CP pairs in the nrPDB-90 dataset detected by CPSARST with RMSD &#8804; 3.5 &#197;. Additional data file <supplr sid="S4">4</supplr> is a list of the structural neighbors of the hypothetical protein YlqF in PDB retrieved by DALI <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>. Additional data file <supplr sid="S5">5</supplr> is a table showing statistical results of protein structural database searches with broad criteria by CPSARST. Additional data file <supplr sid="S6">6</supplr> lists the RCP dataset, a collection of 20,000 <it>in silico </it>random CPs. Additional data file <supplr sid="S7">7</supplr> is a plot summarizing the score and E-value ratios calculated from the RCP dataset. Additional data file <supplr sid="S8">8</supplr> is a list of the parameter settings used throughout this article.</p>
         <suppl id="S1">
            <title>
               <p>Additional data file 1</p>
            </title>
            <caption>
               <p>The nrPDB-90 dataset</p>
            </caption>
            <text>
               <p>The 90% sequence identity subset of the PDB (January 2007).</p>
            </text>
            <file name="gb-2008-9-1-r11-S1.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional data file 2</p>
            </title>
            <caption>
               <p>The nrSCOP-90 dataset</p>
            </caption>
            <text>
               <p>The 90% sequence identity subset of SCOP (v.1.71).</p>
            </text>
            <file name="gb-2008-9-1-r11-S2.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional data file 3</p>
            </title>
            <caption>
               <p>Candidate CP pairs in the nrPDB-90 dataset detected by CPSARST with RMSD &#8804; 3.5 &#197;</p>
            </caption>
            <text>
               <p>Protein structures shown in this large table were drawn by using Chime <abbrgrp><abbr bid="B70">70</abbr></abbrgrp>.</p>
            </text>
            <file name="gb-2008-9-1-r11-S3.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S4">
            <title>
               <p>Additional data file 4</p>
            </title>
            <caption>
               <p>Structural neighbors of the hypothetical protein YlqF in PDB retrieved by DALI <abbrgrp><abbr bid="B50">50</abbr></abbrgrp></p>
            </caption>
            <text>
               <p>Structural neighbors of the hypothetical protein YlqF in PDB retrieved by DALI <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>.</p>
            </text>
            <file name="gb-2008-9-1-r11-S4.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S5">
            <title>
               <p>Additional data file 5</p>
            </title>
            <caption>
               <p>Statistical results of protein structural database searches with broad criteria</p>
            </caption>
            <text>
               <p>Statistical results of protein structural database searches with broad criteria.</p>
            </text>
            <file name="gb-2008-9-1-r11-S5.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S6">
            <title>
               <p>Additional data file 6</p>
            </title>
            <caption>
               <p>The RCP dataset</p>
            </caption>
            <text>
               <p>A collection of 20,000 <it>in silico </it>random CPs.</p>
            </text>
            <file name="gb-2008-9-1-r11-S6.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S7">
            <title>
               <p>Additional data file 7</p>
            </title>
            <caption>
               <p>Score and E-value ratios calculated from the RCP dataset</p>
            </caption>
            <text>
               <p>Score and E-value ratios calculated from the RCP dataset.</p>
            </text>
            <file name="gb-2008-9-1-r11-S7.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S8">
            <title>
               <p>Additional data file 8</p>
            </title>
            <caption>
               <p>Parameter settings used throughout this article</p>
            </caption>
            <text>
               <p>Parameter settings used throughout this article.</p>
            </text>
            <file name="gb-2008-9-1-r11-S8.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was supported by the National Science Council, Taiwan, ROC (NSC grant numbers: 95-3112-B-007-006 and 96-3112-B-007-006). We thank the authors of the BLAST and FAST algorithms, which were extensively used in this study.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Circular permutations in the molecular evolution of DNA methyltransferases.</p>
            </title>
            <aug>
               <au>
                  <snm>Jeltsch</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>1999</pubdate>
            <volume>49</volume>
            <fpage>161</fpage>
            <lpage>164</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/PL00006529</pubid>
                  <pubid idtype="pmpid" link="fulltext">10368444</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Rapid motif-based prediction of circular permutations in multi-domain proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Weiner</snm>
                  <fnm>J</fnm>
                  <suf>3rd</suf>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bornberg-Bauer</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>932</fpage>
            <lpage>937</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti085</pubid>
                  <pubid idtype="pmpid" link="fulltext">15788783</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Crystal structure of a natural circularly permuted jellyroll protein: 1,3-1,4-beta-D-glucanase from <it>Fibrobacter succinogenes</it>.</p>
            