<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-323</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Identification of putative domain linkers by a neural network &#8211; application to a large sequence database</p>
         </title>
         <aug>
            <au id="A1" da="yes">
               <snm>Miyazaki</snm>
               <fnm>Satoshi</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A2" ca="yes">
               <snm>Kuroda</snm>
               <fnm>Yutaka</fnm>
               <insr iid="I3"/>
               <email>ykuroda@cc.tuat.ac.jp</email>
            </au>
            <au id="A3">
               <snm>Yokoyama</snm>
               <fnm>Shigeyuki</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>yokoyama@biochem.u-tokyo.ac.jp</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan</p>
            </ins>
            <ins id="I2">
               <p>RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan</p>
            </ins>
            <ins id="I3">
               <p>Department of Biotechnology and Life Science, Graduate School of Technology, Tokyo University of Agriculture and Technology, 2-24-16, Nakamachi, Koganei, 184-8588, Tokyo, Japan</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>323</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/323</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16800897</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-323</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>24</day>
               <month>2</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>27</day>
               <month>6</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>27</day>
               <month>6</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Miyazaki et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p/>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences).</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCRs), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of the domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large scale protein analyses.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Structural genomics/proteomics projects seek to establish high-throughput techniques by promoting routine protein structure determination either by X-ray crystallography or NMR spectroscopy <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. However, the determination of large protein structures remains a major hurdle, especially for NMR, which requires elaborate techniques and time-consuming analyses <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Even when X-ray crystallography is employed, the average size of proteins determined by this method and listed in the PDB (Protein Data Bank) is only about 230 residues. This situation reflects not only the difficulty of determining large protein structures, but also that of expressing and purifying them. Meanwhile, most large proteins are assembled from structural domains, which are structurally independent units that are able to fold into a native structure even when isolated from the rest of the protein. Thus, dissecting large proteins into their structural domains can provide several candidates for swift structural analysis by either X-ray crystallography or NMR spectroscopy.</p>
         <p>Protein dissection is often a long and tedious process. Limited proteolysis is the prevalent experimental method for determining structural domain boundaries <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>, but it does not alleviate the problems related to the expression and purification of large proteins. Screening methods for detecting natively folded proteins without relying on a specific functional activity have recently been developed <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>, and they may serve as tools to isolate natively folded domains from a library of randomly generated protein fragments, thus alleviating the need to first purify the full-length protein. However, experimental methods are usually time-consuming, and less expensive computer-aided methods for detecting putative domains in protein sequences have practical value for all types of high-throughput proteomics projects <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
         <p>Various theoretical methods for identifying domains in protein sequences have recently been reported. These include well-established sequence similarity searches against existing domain databases, such as Pfam or SMART <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. A major limitation of these methods is their inherent inability to identify completely novel domains. On the other hand, methods that do not rely on a pre-existing domain database can be valuable tools in high-throughput structural genomics projects as they can identify novel, natively folded domains suitable for structural analysis <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. Thus, the prediction of domain organization based on sequence information alone is presently an actively investigated topic <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
         <p>Recently, several domain prediction methods based on sequence information alone have been developed. They draw on the statistics of residue contact in domains <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, the statistics of domain size distribution <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, the sequence characteristics of domain linkers <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>, the amino acid composition of domain linkers <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>, covariance analysis <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, and the conservation of hydrophobic clusters <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Some of these methods use neural networks to detect the sequence characteristics of domain boundaries <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. Neural networks <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> have been successfully applied to the prediction of several aspects of protein structure, such as secondary structures <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>, &#946; turns <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, structural classes <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, and stabilization centers <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, but their use in domain boundary recognition is relatively new <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
         <p>In this paper, we used our neural network <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to search for putative domain linker regions in the SWISSPROT database <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. The aim of the present study was threefold. First, we asked if our neural network &#8211; which was trained with a small data set of 74 multi-domain proteins derived from SCOP <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> &#8211; could be applied to a practical problem, specifically, that of detecting protein domains for structural genomics/proteomics projects from a large sequence dataset. Second, we were interested in comparing our predictions, which rely only on sequence characteristics, with traditional methods that detect domains by sequence similarity to domain databases; here, we used the Protein Data Bank (PDB) <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> and the Conserved Domain Database (CDD) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Last, we examined the possibility of improving the detection of domain boundaries by combining the detection of the putative domain linkers with that of the low-complexity regions, which encode unstructured protein sequence segments. Overall, the present analysis confirmed our previous study, and indicated that our neural network can efficiently detect domain boundaries even when applied to a large and "real" sequence database.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Detection of putative domain linkers by the neural network</p>
            </st>
            <p>In many applications, including ours, it is critical to reduce the number of false positives because of their experimental costs, while false negatives are not as detrimental. In our neural network, a 'cutoff' parameter determines the balance between specificity and sensitivity (i.e., the balance of false positives and false negatives) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Thus, we searched for putative domain linkers in 101602 SWISSPROT sequences using high cutoff values, ranging from 0.90 to 0.98, to minimize false predictions even at the cost of missing existing linkers. The number of putative domain linkers identified by our neural network ranged from 1469 to 20876 for cutoffs of 0.98 and 0.90, respectively. As expected, the use of a higher cutoff parameter increased the number of correct predictions, but decreased the total number of predicted domain linkers (Table <tblr tid="T1">1</tblr>). Overall, the same conclusions are reached regardless of the cutoff value within the range of 0.90 to 0.98. The following discussion is based on a search with a cutoff value of 0.95, which yielded 8133 putative domain linkers, representing 1.4% of the data set on a residue number basis (Table <tblr tid="T1">1</tblr>). These figures correspond to approximately one putative linker predicted for every 12 sequences, which is a tractable number for a high-throughput experiment.</p>
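            <p>As a quick check of the figures quoted above, the following minimal sketch recomputes the linker density from the Table 1 entries for the full data set and for the cutoff of 0.95. The variable names are illustrative only and are not part of the original analysis.</p>

```python
# Figures copied from Table 1: the "All" row and the cutoff 0.95 row.
TOTAL_SEQUENCES = 101602
TOTAL_RESIDUES = 37315215
LINKER_REGIONS = 8133
LINKER_RESIDUES = 529884

# Average number of SWISSPROT sequences per predicted linker.
sequences_per_linker = TOTAL_SEQUENCES / LINKER_REGIONS

# Share of all residues that fall inside a predicted linker.
residue_percent = 100.0 * LINKER_RESIDUES / TOTAL_RESIDUES

print(f"one putative linker per {sequences_per_linker:.1f} sequences")
print(f"{residue_percent:.2f}% of all residues lie in putative linkers")
```

            <p>Both values agree with the rounded figures given in the text (about one linker per 12 sequences and 1.4% of the residues).</p>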
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <tblbdy cols="5">
                  <r>
                     <c ca="center">
                        <p>Sequence regions detected</p>
                     </c>
                     <c ca="center">
                        <p>No. of sequences<sup>a</sup></p>
                     </c>
                     <c ca="center">
                        <p>No. of sequence regions<sup>b</sup></p>
                     </c>
                     <c ca="center">
                        <p>No. of residues<sup>c</sup></p>
                     </c>
                     <c ca="center">
                        <p>% residues<sup>d</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>All</p>
                     </c>
                     <c ca="center">
                        <p>101602</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>37315215</p>
                     </c>
                     <c ca="center">
                        <p>100.00</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PDB</p>
                     </c>
                     <c ca="center">
                        <p>38470</p>
                     </c>
                     <c ca="center">
                        <p>410090</p>
                     </c>
                     <c ca="center">
                        <p>10210325</p>
                     </c>
                     <c ca="center">
                        <p>27.36</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>CDD</p>
                     </c>
                     <c ca="center">
                        <p>64349</p>
                     </c>
                     <c ca="center">
                        <p>124888</p>
                     </c>
                     <c ca="center">
                        <p>16207467</p>
                     </c>
                     <c ca="center">
                        <p>43.43</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Low-complexity regions (45, 3.4, 3.75)<sup>e</sup></p>
                     </c>
                     <c ca="center">
                        <p>48641</p>
                     </c>
                     <c ca="center">
                        <p>70373</p>
                     </c>
                     <c ca="center">
                        <p>8474412</p>
                     </c>
                     <c ca="center">
                        <p>22.71</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Low-complexity regions (45, 2.9, 3.2)</p>
                     </c>
                     <c ca="center">
                        <p>6735</p>
                     </c>
                     <c ca="center">
                        <p>8539</p>
                     </c>
                     <c ca="center">
                        <p>803001</p>
                     </c>
                     <c ca="center">
                        <p>2.15</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Low-complexity regions (45, 2.6, 2.9)</p>
                     </c>
                     <c ca="center">
                        <p>3208</p>
                     </c>
                     <c ca="center">
                        <p>3970</p>
                     </c>
                     <c ca="center">
                        <p>359227</p>
                     </c>
                     <c ca="center">
                        <p>0.96</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Low-complexity regions (45, 2.45, 2.75)</p>
                     </c>
                     <c ca="center">
                        <p>2340</p>
                     </c>
                     <c ca="center">
                        <p>2786</p>
                     </c>
                     <c ca="center">
                        <p>250796</p>
                     </c>
                     <c ca="center">
                        <p>0.67</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.90)<sup>f</sup></p>
                     </c>
                     <c ca="center">
                        <p>14239</p>
                     </c>
                     <c ca="center">
                        <p>20876</p>
                     </c>
                     <c ca="center">
                        <p>1051607</p>
                     </c>
                     <c ca="center">
                        <p>2.82</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.91)</p>
                     </c>
                     <c ca="center">
                        <p>12670</p>
                     </c>
                     <c ca="center">
                        <p>18193</p>
                     </c>
                     <c ca="center">
                        <p>953097</p>
                     </c>
                     <c ca="center">
                        <p>2.55</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.92)</p>
                     </c>
                     <c ca="center">
                        <p>11160</p>
                     </c>
                     <c ca="center">
                        <p>15620</p>
                     </c>
                     <c ca="center">
                        <p>856149</p>
                     </c>
                     <c ca="center">
                        <p>2.29</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.93)</p>
                     </c>
                     <c ca="center">
                        <p>9554</p>
                     </c>
                     <c ca="center">
                        <p>13053</p>
                     </c>
                     <c ca="center">
                        <p>752119</p>
                     </c>
                     <c ca="center">
                        <p>2.02</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.94)</p>
                     </c>
                     <c ca="center">
                        <p>7977</p>
                     </c>
                     <c ca="center">
                        <p>10591</p>
                     </c>
                     <c ca="center">
                        <p>644472</p>
                     </c>
                     <c ca="center">
                        <p>1.73</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.95)</p>
                     </c>
                     <c ca="center">
                        <p>6387</p>
                     </c>
                     <c ca="center">
                        <p>8133</p>
                     </c>
                     <c ca="center">
                        <p>529884</p>
                     </c>
                     <c ca="center">
                        <p>1.42</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.96)</p>
                     </c>
                     <c ca="center">
                        <p>4819</p>
                     </c>
                     <c ca="center">
                        <p>5892</p>
                     </c>
                     <c ca="center">
                        <p>415150</p>
                     </c>
                     <c ca="center">
                        <p>1.11</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.97)</p>
                     </c>
                     <c ca="center">
                        <p>3099</p>
                     </c>
                     <c ca="center">
                        <p>3592</p>
                     </c>
                     <c ca="center">
                        <p>281009</p>
                     </c>
                     <c ca="center">
                        <p>0.75</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Putative domain linkers (0.98)</p>
                     </c>
                     <c ca="center">
                        <p>1326</p>
                     </c>
                     <c ca="center">
                        <p>1469</p>
                     </c>
                     <c ca="center">
                        <p>128455</p>
                     </c>
                     <c ca="center">
                        <p>0.34</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Low-complexity regions (45, 2.9, 3.2) + Putative domain linkers (0.95)<sup>g</sup></p>
                     </c>
                     <c ca="center">
                        <p>10364</p>
                     </c>
                     <c ca="center">
                        <p>13946</p>
                     </c>
                     <c ca="center">
                        <p>1139983</p>
                     </c>
                     <c ca="center">
                        <p>3.06</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Statistics of SWISSPROT sequences. <sup>a </sup>Number of SWISSPROT sequences that contained the detected sequence regions. <sup>b </sup>Number of sequence regions detected in the SWISSPROT sequences. <sup>c </sup>Total number of residues in the detected sequence regions. <sup>d </sup>Percentage of residues in the detected regions relative to all of the residues in the SWISSPROT sequences. <sup>e </sup>The values of the three parameters used for the SEG program, namely, the trigger window, the trigger and extension complexities are listed in the parentheses. <sup>f </sup>The cutoff parameter used for our neural network is indicated in the parentheses. <sup>g </sup>Predictions obtained by merging putative domain linkers and the low-complexity regions.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Assignment of 'putative structural domains'</p>
            </st>
            <p>For the purposes of this discussion, we define 'putative structural domains' as sequence segments with high similarity to PDB or CDD sequences (sequence identity > 30% and sequence overlap > 85%; see the Materials and methods section for details). Putative structural domains are thus expected to fold into a native structure or at least to form a domain, and we used them to assess the correctness of the predicted domain boundaries. As anticipated, a substantial fraction of the SWISSPROT sequences is covered by putative structural domains. Specifically, from a total of 101602 SWISSPROT sequences, 38470 sequences (corresponding to 38% and 27% on a sequence and residue basis, respectively) had similarity to a PDB sequence, and 64349 sequences (43% on a residue basis) had similarity to a CDD sequence (Table <tblr tid="T1">1</tblr>).</p>
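            <p>The two thresholds in this definition can be expressed as a simple filter. The sketch below is illustrative only: the Hit record and its field names are hypothetical, and the way identity and overlap are measured is the one described in the Materials and methods section, which is not reproduced here.</p>

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one similarity hit against a PDB or CDD sequence."""
    start: int            # first residue of the matched segment in the query
    end: int              # last residue of the matched segment in the query
    identity_pct: float   # sequence identity, in percent
    overlap_pct: float    # sequence overlap, in percent


def is_putative_domain(hit, min_identity=30.0, min_overlap=85.0):
    """Apply the two strict thresholds quoted in the text."""
    return hit.identity_pct > min_identity and hit.overlap_pct > min_overlap


hits = [Hit(10, 120, 42.0, 90.0), Hit(200, 260, 30.0, 95.0), Hit(300, 410, 55.0, 60.0)]
domains = [(h.start, h.end) for h in hits if is_putative_domain(h)]
print(domains)
```

            <p>Only the first hypothetical hit passes both thresholds; the second fails on identity (exactly 30% is not above the threshold) and the third on overlap.</p>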
         </sec>
         <sec>
            <st>
               <p>Correlation between predicted linkers and putative structural domain termini</p>
            </st>
            <p>Our method for evaluating the correctness of the predicted domain linkers was to assess their positions relative to those of putative structural domains. To this end, we classified the putative domain linkers into four classes (Figure <figr fid="F1">1A</figr>; see Materials and methods). Linkers that matched the termini of putative structural domains at both ends or at only one end were classified into classes 1 and 2, respectively, and were considered as correctly predicted. Putative domain linkers overlapping with putative structural domains are likely to break them into two non-foldable sequences. They were thus counted as incorrect predictions, and classified in class 4. Finally, putative linkers that were located far away from any putative structural domain (farther than the error window discussed below) were categorized in class 3. These linkers could not be evaluated as either correct or incorrect.</p>
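            <p>The four-class scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the Materials and methods section: linkers and domains are represented as (start, end) residue positions, the N-terminal end of a linker is compared with the C-termini of the domains and its C-terminal end with their N-termini, and an end counts as matched when the distance is smaller than the error window. The function name and the example coordinates are hypothetical.</p>

```python
def classify_linker(linker, domains, window=20):
    """Return the class (1, 2, 3 or 4) of a linker relative to putative domains.

    linker  -- (start, end) residue positions of the predicted linker
    domains -- list of (start, end) positions of putative structural domains
    window  -- error window in residues (assumed strict: distance must be smaller)
    """
    l_start, l_end = linker
    # The linker start should sit next to the C-terminus of a preceding domain.
    left = any(window > abs(l_start - d_end) for _, d_end in domains)
    # The linker end should sit next to the N-terminus of a following domain.
    right = any(window > abs(l_end - d_start) for d_start, _ in domains)
    if left and right:
        return 1  # correct match at both ends
    if left or right:
        return 2  # correct match at one end
    # No terminus nearby: either inside a domain (class 4) or far from any (class 3).
    overlaps = any(d_end >= l_start and l_end >= d_start for d_start, d_end in domains)
    return 4 if overlaps else 3


example_domains = [(1, 100), (121, 250)]
for example_linker in [(101, 120), (251, 270), (160, 180), (400, 420)]:
    print(example_linker, classify_linker(example_linker, example_domains))
```

            <p>With these hypothetical coordinates, the four example linkers fall into classes 1, 2, 4 and 3, respectively.</p>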
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Classification of the predicted linkers and the low complexity regions</p>
               </caption>
               <text>
                  <p>Classification of the predicted linkers and the low complexity regions. (A) Schematic representation of the positions of the predicted domain boundaries relative to the putative structural domains. The four classes are: correct matches at both ends (class 1), correct matches at either end (class 2), overlaps (class 4), and unmatched locations (class 3). Percentages of putative domain linkers (B) and low-complexity regions (C) in the four classes. An error window parameter, on the horizontal axis, is used to accommodate the terminal ambiguity of the assigned sequence regions. When the distance between the end of a putative domain linker (B) or a low-complexity region (C) and the end of a putative structural domain was smaller than the error window, we considered the position of the predicted domain boundary to be correct. The error window parameter was varied from 5 to 50 residues.</p>
               </text>
               <graphic file="1471-2105-7-323-1"/>
            </fig>
            <p>The putative structural domains as defined above may contain multiple structural domains, and, hence, some linkers in class 4 may be correctly located. Our calculations thus slightly underestimate the actual performance of both the neural network and the LCR predictions (see also next section). However, the underestimation is likely to be very small, and to concern only a few percent of the putative linkers, as most proteins in the PDB (and many in the CDD) are single structural domain proteins <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>.</p>
            <p>The above classification was performed by allowing an error window between the position of the predicted linker and the termini of the putative structural domain. As expected, when the error window was increased, the occurrence of correct matches increased while that of the overlaps decreased. With an error window of 20 residues, the percentages of correct matches (classes 1 and 2), overlaps (class 4) and unknown locations (class 3) were 27.5%, 9.2% and 63.4%, respectively (Figure <figr fid="F1">1B</figr>). Thus, 75% of the putative domain linkers with predictions that could be evaluated (classes 1, 2 and 4) were correctly located, suggesting that the boundaries of the putative structural domains can be predicted with reasonable confidence. On the other hand, almost two-thirds of the putative domain linkers were predicted in regions without a corresponding putative structural domain nearby, possibly delimiting novel structural domains not yet classified in the PDB or CDD (Figure <figr fid="F2">2</figr>).</p>
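            <p>The 75% figure follows from the class percentages once the linkers that cannot be evaluated (class 3) are set aside. A minimal arithmetic sketch, using the percentages reported for the 20-residue error window (variable names are illustrative):</p>

```python
# Percentages reported for an error window of 20 residues (Figure 1B).
correct_pct = 27.5   # classes 1 and 2
overlap_pct = 9.2    # class 4
unknown_pct = 63.4   # class 3, cannot be evaluated

# Fraction of correct predictions among the linkers that can be evaluated.
evaluable_pct = correct_pct + overlap_pct
correct_fraction = correct_pct / evaluable_pct

print(f"{100.0 * correct_fraction:.1f}% of the evaluable linkers are correctly located")
```

            <p>The result rounds to 75%. The three reported percentages sum to 100.1% because each is rounded to one decimal place.</p>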
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Putative domain linkers and low-complexity regions assigned in SWISSPROT sequences</p>
               </caption>
               <text>
                  <p>Putative domain linkers and low-complexity regions assigned in SWISSPROT sequences. Each thick black horizontal bar represents a SWISSPROT sequence used as a test sequence. The SWISSPROT ID number is indicated on the top left of the corresponding sequence. In each SWISSPROT sequence, sequence regions similar to PDB and CDD sequences were assigned as putative structural domains. A green horizontal bar represents a sequence region similar to a PDB sequence. Similarly, the horizontal bars colored in blue, red and magenta represent sequence regions similar to CDD sequences, corresponding to the Pfam, SMART and LOAD (Library Of Ancient Domains) libraries, respectively. Sequence regions predicted to be putative domain linkers are designated by vertical bars in colors ranging from yellow to brown, according to the neural network output values. Low-complexity regions are designated by cyan rectangles overlaid on black bars.</p>
               </text>
               <graphic file="1471-2105-7-323-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Detection of low-complexity regions</p>
            </st>
            <p>Most large-scale sequence databases contain a substantial number of long, unstructured, disordered regions that may interfere with systematic searches for structural domains. Thus, the detection of unstructured portions of proteins, as defined by low-complexity regions (LCRs), which are unlikely to fold into a globular structure <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, or by structurally disordered regions <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, may help predict domain boundaries, although this was not the original intent. Here, we examined whether LCRs, as detected by SEG <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, overlapped with domain boundaries. Two parameters in the SEG program, called trigger and extension complexity, control the balance between the number of detected regions (Table <tblr tid="T1">1</tblr>) and the ratio of correct matches relative to incorrect ones (data not shown). To analyze approximately as many regions as there were putative linkers detected with the cutoff of 0.95, we set the trigger complexity to 2.9 and the extension complexity to 3.2, which yielded 8539 low-complexity regions (Table <tblr tid="T1">1</tblr>). Using an error window of 20 residues, the percentages of correct matches (classes 1 and 2), overlaps (class 4) and unknown locations (class 3) were 26.3%, 10.3% and 63.4%, respectively (Figure <figr fid="F1">1C</figr>). Thus, the positions of the LCRs correlate with the termini of the putative structural domains at a level similar to that observed for the domain linkers (Figure <figr fid="F1">1B</figr>).</p>
         </sec>
         <sec>
            <st>
               <p>Comparison of domain boundaries detected by domain linkers and LCRs</p>
            </st>
            <p>Although both the domain linker and LCR predictions correlate well with the putative structural domain termini, it is important to note that the LCRs and linkers are located in different sequence regions. Indeed, only 2561 out of 8539 LCRs overlapped with the putative domain linkers predicted by our neural network, and, in turn, 2643 out of 8133 putative linkers were detected by the SEG program (Table <tblr tid="T2">2</tblr>). Furthermore, the sequence entropy of the putative linkers was higher than that of the LCRs, with the maximum of the sequence entropy distribution at around 3.5 for the linkers, while it was only 3.0 for the LCRs (sequences with complexity values lower than 2.9 are unlikely to fold into a globular structure). Thus, our neural network appears to preferentially detect non-globular regions with higher sequence complexity than those detected by SEG. These results indicate that LCRs and linker sequences have different characteristics, and that the two methods are complementary for identifying domain boundaries (Figure <figr fid="F3">3</figr>).</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Putative domain linkers<sup>a</sup></p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Low-complexity regions<sup>b</sup></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Uniquely linker regions</p>
                     </c>
                     <c ca="center">
                        <p>Overlapped with low-complexity regions</p>
                     </c>
                     <c ca="center">
                        <p>Overlapped with putative domain linkers</p>
                     </c>
                     <c ca="center">
                        <p>Uniquely low-complexity regions</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Correct matches of both ends (class 1)</p>
                     </c>
                     <c ca="center">
                        <p>236 (4.3%)</p>
                     </c>
                     <c ca="center">
                        <p>94 (3.6%)</p>
                     </c>
                     <c ca="center">
                        <p>97 (3.8%)</p>
                     </c>
                     <c ca="center">
                        <p>101 (1.7%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Correct matches of either end (class 2)</p>
                     </c>
                     <c ca="center">
                        <p>1241 (22.6%)</p>
                     </c>
                     <c ca="center">
                        <p>665 (25.2%)</p>
                     </c>
                     <c ca="center">
                        <p>706 (27.6%)</p>
                     </c>
                     <c ca="center">
                        <p>1358 (22.7%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Unknown locations (class 3)</p>
                     </c>
                     <c ca="center">
                        <p>3469 (63.2%)</p>
                     </c>
                     <c ca="center">
                        <p>1684 (63.7%)</p>
                     </c>
                     <c ca="center">
                        <p>1544 (60.3%)</p>
                     </c>
                     <c ca="center">
                        <p>3851 (64.4%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Overlaps (class 4)</p>
                     </c>
                     <c ca="center">
                        <p>544 (9.9%)</p>
                     </c>
                     <c ca="center">
                        <p>200 (7.6%)</p>
                     </c>
                     <c ca="center">
                        <p>214 (8.4%)</p>
                     </c>
                     <c ca="center">
                        <p>668 (11.2%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Total</p>
                     </c>
                     <c ca="center">
                        <p>5490</p>
                     </c>
                     <c ca="center">
                        <p>2643</p>
                     </c>
                     <c ca="center">
                        <p>2561</p>
                     </c>
                     <c ca="center">
                        <p>5978</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Overlaps between putative domain linkers and low-complexity regions. <sup>a </sup>The putative domain linkers were assigned by the neural network with a cutoff of 0.95. <sup>b </sup>The low-complexity regions were assigned by the SEG program with a trigger window of 45 residues, a trigger complexity of 2.9, and an extension complexity of 3.2.</p>
               </tblfn>
            </tbl>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Complexity distribution</p>
               </caption>
               <text>
                  <p>Complexity distribution. The sequence entropy distributions are shown for the putative domain linkers (thick solid line) and the low-complexity regions (thick broken line) longer than 45 residues. The sequence entropy was calculated by a sliding window of 45 residues over the putative domain linkers <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B51">51</abbr></abbrgrp>. The thin solid line represents the sequence entropy of all of the putative domain linkers (including those shorter than 45 residues) calculated with a window equal to the length of the linker.</p>
               </text>
               <graphic file="1471-2105-7-323-3"/>
            </fig>
            <p>As a result of their complementarity, the sensitivity of the domain detection was clearly improved by combining the LCR and linker predictions (Table <tblr tid="T1">1</tblr>; Figure <figr fid="F3">3</figr>). A combined search yielded 13946 domain boundaries, i.e., only 2726 fewer than the sum of the LCR and linker sequences (8539 + 8133 = 16672). Furthermore, the domain boundary sequences identified by a combined LCR-linker search were categorized into the 4 classes in percentages similar to those identified by the separate LCR and linker searches. Thus, the total number of correctly predicted domain termini increased 1.6-fold, while the fraction of incorrect predictions (false positives) remained unchanged.</p>
         </sec>
         <sec>
            <st>
               <p>Comparison with random guesses</p>
            </st>
            <p>As a further assessment of the ability of both our neural network and the SEG program to detect putative structural domain termini, we estimated the success rate of a blind prediction. The blind prediction was defined as the probability that a randomly assigned residue in the query sequence matches a putative structural domain terminal residue within the allowed error (see Methods). We compared the random guesses with our neural network and SEG predictions using a quality index calculated as the ratio of correct predictions relative to the sum of correct and incorrect predictions <abbrgrp><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr></abbrgrp>, which is computed as the number of sequences in classes 1 and 2 divided by the number in classes 1, 2 and 4. Figure <figr fid="F4">4</figr> clearly shows that the quality index of the blind prediction is far below those of the two other methods. This result strongly supports our initial assumption that the occurrences of both the putative domain linkers and the low-complexity regions near the putative structural domain terminal regions are not fortuitous.</p>
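            <p>The quality index defined above can be illustrated with a minimal Python sketch. The function name quality_index is hypothetical, and the counts are simply those listed in Table 2, with the two sub-columns of each method added together; the sketch is an arithmetic illustration, not the program used in this study.</p>
<![CDATA[
```python
def quality_index(class1: int, class2: int, class4: int) -> float:
    """Correct matches (classes 1 and 2) divided by correct plus incorrect
    matches (classes 1, 2 and 4). Class 3 (unknown locations) is ignored."""
    correct = class1 + class2
    total = correct + class4
    if total == 0:
        raise ValueError("no predictions in classes 1, 2 or 4")
    return correct / total


# Counts from Table 2: for each method, the "unique" and "overlapped"
# sub-columns are summed to cover all predictions of that method.
linkers = quality_index(236 + 94, 1241 + 665, 544 + 200)
lcrs = quality_index(97 + 101, 706 + 1358, 214 + 668)
print(f"linkers: {linkers:.3f}, LCRs: {lcrs:.3f}")
```
]]>
            <p>Because class 3 is excluded from the denominator, the index measures how often a prediction that can be judged against a putative structural domain is judged correct.</p>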
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Comparison with blind prediction</p>
               </caption>
               <text>
                  <p>Comparison with blind prediction. The success rate (prediction quality index) of blind prediction is plotted as a function of the error window parameter (cross marks). The prediction quality indices for domain linkers (diamonds), low-complexity regions (squares), and a combined prediction (triangles) are also shown.</p>
               </text>
               <graphic file="1471-2105-7-323-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Domain termini and error windows</p>
            </st>
            <p>From a practical viewpoint, it is important to evaluate the error window within which the boundaries are predicted. The exact position of a domain boundary is inherently ambiguous, for two reasons. First, PDB sequences may include several unstructured terminal residues (without coordinates), causing some uncertainty about the exact positions of the putative structural domain termini. The uncertainty arising from the CDD sequences is even larger. Second, the smoothing windows used to reduce spurious predictions introduce ambiguity in the positions of the predicted domain linkers, as they smear their C and N termini. These issues can be examined using an error window parameter that accommodates the positional ambiguity generated by both the putative structural domain termini and the predicted domain linkers (or LCRs). As shown in Figure <figr fid="F5">5</figr>, the positions of the first and last residues of the predicted domain linkers are distributed randomly around the positions of the last and first residues, respectively, of the putative structural domains. This shows that the error distribution is random with a maximum at 0 residues, confirming that the linker positions are accurately assigned. The error is clearly limited to about 20 residues, and to 10 residues in most cases. Furthermore, the dependence of the prediction quality index on the error window also indicates that the ambiguity is limited to about 20 residues, as the index reaches 70% for a 15-residue error window and then rapidly levels off for larger windows (Figure <figr fid="F4">4</figr>).</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Correlation between the positions of domain linkers and putative structural domains</p>
               </caption>
               <text>
                  <p>Correlation between the positions of domain linkers and putative structural domains. The horizontal scale represents the number of residues in the error window between the linker termini and the corresponding putative structural domain termini. This is calculated as the number of residues separating the last residue (or the first residue) of a domain linker in Classes 1 and 2 from the first residue (or respectively the last residue) of the corresponding putative structural domain. (A) Distribution calculated for putative structural domains detected by similarity to PDB and CDD, (B) to PDB, and (C) to CDD.</p>
               </text>
               <graphic file="1471-2105-7-323-5"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Our study strongly suggests that sequence characteristics alone, as detected by either our neural network or SEG, can identify domain boundaries in protein sequences even without sequence similarity to existing domain databases. There is a clear correlation between the termini of putative structural domains and the positions of both the domain linkers and the LCRs. Furthermore, our neural network and SEG are complementary for detecting domain boundaries, and when combined, the sensitivity of the domain boundary prediction is increased without decreasing its specificity. Overall, our study shows that a domain identification protocol based on domain boundary prediction can be applied to practical problems, such as the identification of novel structural domains, and thus will yield new targets for large-scale protein analyses.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Sequence databases and estimation of the putative structural domains</p>
            </st>
            <p>A total of 101602 SWISSPROT protein sequences <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> were used in the present investigation. Since the putative structural domains needed to be structurally independent units, we located all of the sequences with high similarity to PDB <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> and CDD <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> sequences, using the BLAST and RPS-BLAST programs <abbrgrp><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr></abbrgrp>. To ensure the structural identity as much as possible, we required a sequence identity greater than 30% and a sequential overlap greater than 85% over the entire length of the corresponding PDB or CDD sequence. Thus, putative structural domains detected by similarity to a PDB sequence are likely to fold into a structure similar to the corresponding PDB structure. Analogously, putative structural domains detected by similarity to sequences in CDD, which is a compilation of conserved protein domain sequences imported from Pfam <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> and SMART <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, are likely to correspond to natively folded domains, although their structures have not necessarily been determined.</p>
         </sec>
         <sec>
            <st>
               <p>Putative domain linkers predicted by the neural network</p>
            </st>
            <p>We used a neural network with two hidden units <abbrgrp><abbr bid="B50">50</abbr></abbrgrp> trained to distinguish between domain linker and non-linker regions. The prediction procedure was identical to that reported in our previous paper <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, except for the following two points. (1) The prediction was carried out over the entire protein sequence, namely from the start to the end of each target sequence, because the SWISSPROT sequences may contain unstructured termini. In our previous study, by contrast, we assumed that a 60 residue length is the minimum for a polypeptide to fold independently, and we omitted the 60 terminal residues of the multi-domain protein sequences from the prediction, because the protein structures were known, and we knew that no unstructured termini were present. (2) Predicted domain linkers were not ranked, because under the stringent conditions (cutoff 0.90&#8211;0.98; see below) examined here, the prediction success rate was sufficiently high without such a procedure.</p>
            <p>The smoothing window size and the threshold parameters were fixed to 19 and 0.5, respectively, as in our previous study. However, we set the cutoff parameter to values ranging from 0.90 to 0.98, because a high cutoff yields a better prediction specificity at the cost of the prediction sensitivity. The specificity and sensitivity for the first ranked domain linkers predicted with a cutoff of 0.90 are 81.8% and 10.3%, respectively, as calculated with a ten-fold jack-knife <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Low-complexity regions</p>
            </st>
            <p>Sequence entropy (also called Shannon's entropy) has been used to quantify the complexity of amino acid sequences, and several studies have examined the relationship between the sequence entropy and the globularity of proteins <abbrgrp><abbr bid="B42">42</abbr><abbr bid="B43">43</abbr></abbrgrp>. According to these studies, the sequence entropy of globular proteins is generally high, with a lower limit of around 2.9.</p>
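            <p>As an illustration of this complexity measure, the following minimal Python sketch computes the Shannon entropy of the amino acid composition of a window, and its profile along a sequence. The function names sequence_entropy and entropy_profile, the use of the base-2 logarithm (bits) and the example sequences are assumptions made for illustration; this is not the SEG implementation, which uses its own complexity definitions.</p>
<![CDATA[
```python
import math
from collections import Counter
from typing import List


def sequence_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of `window`."""
    if not window:
        raise ValueError("window must not be empty")
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(sequence: str, window: int = 45) -> List[float]:
    """Entropy of every full-length window slid along `sequence`."""
    return [
        sequence_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical 45-residue sequences, for illustration only.
    repetitive = "SSSSGGGG" * 5 + "SSSSG"
    diverse = "ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEF"
    print(round(sequence_entropy(repetitive), 2))
    print(round(sequence_entropy(diverse), 2))
```
]]>
            <p>With 20 amino acid types, the entropy in bits cannot exceed log2(20), i.e., about 4.32, which is reached only when all residue types are equally frequent.</p>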
            <p>SEG is a program that identifies low-complexity regions in protein sequences <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. This program was originally intended to distinguish between globular and non-globular regions. In this study, we used SEG to check whether a correlation between the low-complexity regions and the putative structural domain termini existed. Three parameters in SEG, the trigger window length, the trigger complexity and the extension complexity, are used to assign low-complexity regions. We set the trigger window length to 45 residues, in line with previous studies <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B51">51</abbr></abbrgrp>. To obtain a number of LCRs similar to that of the linkers predicted with a cutoff of 0.95, the trigger and extension complexities were set to 2.9 and 3.2, respectively (Table <tblr tid="T1">1</tblr> and Figures <figr fid="F1">1</figr> and <figr fid="F3">3</figr>).</p>
         </sec>
         <sec>
            <st>
               <p>Evaluation of putative domain linkers and low-complexity regions</p>
            </st>
            <p>We evaluated the validity of the prediction of the domain boundaries from their positions relative to the putative structural domains as defined above. The predicted domain boundaries were divided into four classes (Figure <figr fid="F1">1A</figr>), using an error window to accommodate the ambiguity in the terminal positions of both the predicted domain boundaries and the putative structural domains. A predicted domain boundary was considered to be correctly located when its end was separated from a putative structural domain by fewer residues than specified by the error window (Figure <figr fid="F1">1A</figr>). Predicted domain boundaries with both ends located within the error window of the N and C terminal ends of two putative structural domains are categorized in class 1 (correct matches of both ends; Table <tblr tid="T2">2</tblr>). Class 2 includes predicted domain boundaries in which only one end is located within the error window of a putative structural domain terminus (correct matches of either end). Class 3 consists of predicted domain boundaries that are separated from any putative structural domain by a number of residues larger than the error window (unknown locations). Class 4 consists of predicted domain boundaries that overlap with a putative structural domain (overlaps).</p>
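            <p>The classification can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the program used in this study: the function name classify, the string labels (used instead of class numbers), the 1-based inclusive coordinates, and the rule that a match within the error window takes precedence over an overlap are all choices made for the example.</p>
<![CDATA[
```python
from typing import List, Tuple

Region = Tuple[int, int]  # (first residue, last residue), 1-based, inclusive


def classify(boundary: Region, domains: List[Region], window: int) -> str:
    """Label a predicted boundary relative to putative structural domains."""
    start, end = boundary
    # The boundary start is compared with domain C-termini (a domain ends,
    # then the linker begins); the boundary end with domain N-termini.
    start_ok = any(abs(start - d_end) <= window for _, d_end in domains)
    end_ok = any(abs(d_start - end) <= window for d_start, _ in domains)
    if start_ok and end_ok:
        return "both_ends"
    if start_ok or end_ok:
        return "either_end"
    # Assumption: overlap is only tested when neither end matches a terminus.
    if any(start <= d_end and d_start <= end for d_start, d_end in domains):
        return "overlap"
    return "unknown"


# Hypothetical two-domain protein, for illustration only.
domains = [(10, 100), (130, 250)]
for region in [(101, 129), (105, 170), (40, 60), (300, 320)]:
    print(region, classify(region, domains, window=20))
```
]]>
            <p>Returning descriptive labels keeps the sketch independent of the class numbering; the labels correspond to correct matches of both ends, correct matches of either end, overlaps and unknown locations.</p>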
         </sec>
         <sec>
            <st>
               <p>Random guess</p>
            </st>
            <p>We assumed the success rate of a blind prediction, <it>i.e</it>. a prediction without any <it>a priori </it>information, to be the probability that a randomly assigned position matches a terminal residue of a putative structural domain. Four classes were defined similarly to those used to evaluate the putative domain linkers and the low-complexity regions. For example, a randomly picked residue was considered to be correctly located and was classified in class 1, when the end of a putative structural domain was found within the error window. The success rates (quality index) for the blind prediction, the putative domain linkers and the low-complexity regions were calculated as the rate of correct matches (classes 1 and 2) relative to both the correct and incorrect matches (classes 1, 2 and 4).</p>
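            <p>The probability underlying the blind prediction can be sketched in Python for a single sequence. The function name blind_match_probability, the coordinate convention and the example domains are hypothetical, and the sketch covers only the match probability, not the full four-class breakdown described above.</p>
<![CDATA[
```python
from typing import List, Tuple

Region = Tuple[int, int]  # (first residue, last residue), 1-based, inclusive


def blind_match_probability(length: int, domains: List[Region], window: int) -> float:
    """Fraction of residue positions that lie within `window` residues of
    the first or last residue of any putative structural domain."""
    if length <= 0:
        raise ValueError("length must be positive")
    termini = [t for domain in domains for t in domain]
    hits = sum(
        1
        for pos in range(1, length + 1)
        if any(abs(pos - t) <= window for t in termini)
    )
    return hits / length


# Hypothetical 300-residue protein with two putative domains.
example = blind_match_probability(300, [(10, 100), (130, 250)], window=20)
print(f"{example:.3f}")
```
]]>
            <p>Counting positions rather than sampling them gives the exact probability for a uniformly chosen residue; the value grows with the error window, which is why the blind prediction has to be compared with the other methods at the same window size.</p>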
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>S.M. designed the study, wrote the programs, analyzed the data, and wrote the paper under the supervision of Y.K. Y.K. conceived the study, analyzed the data and wrote the paper with S.M. S.Y. supervised S.M. and the study.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank the members of the Protein Research Group (RIKEN, GSC) for discussions, and the Informatics Infrastructure Team (RIKEN, GSC) for the computational environment. The training of the neural network was performed on a Fujitsu VPP700E supercomputer at RIKEN, Wako campus. Satoshi Miyazaki passed away during the course of this work. He was a gifted graduate student, a kind and generous person. Y.K. and S.Y. wish to dedicate this paper to his memory.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Coverage of protein sequence space by current structural genomics targets</p>
            </title>
            <aug>
               <au>
                  <snm>O'Toole</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Raymond</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Cygler</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Struct Funct Genomics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>2-3</issue>
            <fpage>47</fpage>
            <lpage>55</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1026156025612</pubid>
                  <pubid idtype="pmpid" link="fulltext">14649288</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Shining a light on structural genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Nat Struct Biol</source>
            <pubdate>1998</pubdate>
            <volume>5 Suppl</volume>
            <fpage>643</fpage>
            <lpage>645</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/1334</pubid>
                  <pubid idtype="pmpid" link="fulltext">9699614</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science</p>
            </title>
            <aug>
               <au>
                  <snm>Shapiro</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Lima</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Structure</source>
            <pubdate>1998</pubdate>
            <volume>6</volume>
            <issue>3</issue>
            <fpage>265</fpage>
            <lpage>267</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0969-2126(98)00030-6</pubid>
                  <pubid idtype="pmpid">9551549</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The PRESAGE database for structural genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Barken</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Levitt</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <issue>1</issue>
            <fpage>251</fpage>
            <lpage>253</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148148</pubid>
                  <pubid idtype="pmpid" link="fulltext">9847193</pubid>
                  <pubid idtype="doi">10.1093/nar/27.1.251</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing</p>
            </title>
            <aug>
               <au>
                  <snm>Mallick</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Goodwill</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>Fitz-Gibbon</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <issue>6</issue>
            <fpage>2450</fpage>
            <lpage>2455</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">15949</pubid>
                  <pubid idtype="pmpid" link="fulltext">10706641</pubid>
                  <pubid idtype="doi">10.1073/pnas.050589297</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Structural genomics projects in Japan</p>
            </title>
            <aug>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hirota</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Kigawa</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Yabuki</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Shirouzu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Terada</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ito</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Matsuo</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nishimura</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kyogoku</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Miki</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Masui</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Kuramitsu</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nat Struct Biol</source>
            <pubdate>2000</pubdate>
            <volume>7 Suppl</volume>
            <fpage>943</fpage>
            <lpage>945</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/80712</pubid>
                  <pubid idtype="pmpid" link="fulltext">11103994</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The impact of structural genomics: expectations and outcomes</p>
            </title>
            <aug>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2006</pubdate>
            <volume>311</volume>
            <issue>5759</issue>
            <fpage>347</fpage>
            <lpage>351</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1121018</pubid>
                  <pubid idtype="pmpid" link="fulltext">16424331</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>NMR spectroscopy of large molecules and multimolecular assemblies in solution</p>
            </title>
            <aug>
               <au>
                  <snm>Wider</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Wuthrich</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <issue>5</issue>
            <fpage>594</fpage>
            <lpage>601</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(99)00011-1</pubid>
                  <pubid idtype="pmpid" link="fulltext">10508768</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Folding of thermolysin fragments. Identification of the minimum size of a carboxyl-terminal fragment that can fold into a stable native-like structure</p>
            </title>
            <aug>
               <au>
                  <snm>Dalzoppo</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Vita</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fontana</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1985</pubdate>
            <volume>182</volume>
            <issue>2</issue>
            <fpage>331</fpage>
            <lpage>340</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(85)90349-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">3923205</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>The domain organization of streptokinase: nuclear magnetic resonance, circular dichroism, and functional characterization of proteolytic fragments</p>
            </title>
            <aug>
               <au>
                  <snm>Parrado</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Conejero-Lara</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Dobson</snm>
                  <fnm>CM</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1996</pubdate>
            <volume>5</volume>
            <issue>4</issue>
            <fpage>693</fpage>
            <lpage>704</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">8845759</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>The structural aspects of limited proteolysis of native proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>Biochim Biophys Acta</source>
            <pubdate>1998</pubdate>
            <volume>1382</volume>
            <issue>2</issue>
            <fpage>191</fpage>
            <lpage>206</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9540791</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Identification of protein domains by shotgun proteolysis</p>
            </title>
            <aug>
               <au>
                  <snm>Christ</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Winter</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2006</pubdate>
            <volume>358</volume>
            <issue>2</issue>
            <fpage>364</fpage>
            <lpage>371</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2006.01.057</pubid>
                  <pubid idtype="pmpid" link="fulltext">16516923</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Rapid protein-folding assay using green fluorescent protein</p>
            </title>
            <aug>
               <au>
                  <snm>Waldo</snm>
                  <fnm>GS</fnm>
               </au>
               <au>
                  <snm>Standish</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Berendzen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Terwilliger</snm>
                  <fnm>TC</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>1999</pubdate>
            <volume>17</volume>
            <issue>7</issue>
            <fpage>691</fpage>
            <lpage>695</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/10904</pubid>
                  <pubid idtype="pmpid" link="fulltext">10404163</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Toward development of a screen to identify randomly encoded, foldable sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Hagihara</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>PS</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <issue>10</issue>
            <fpage>6619</fpage>
            <lpage>6624</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124452</pubid>
                  <pubid idtype="pmpid" link="fulltext">11997470</pubid>
                  <pubid idtype="doi">10.1073/pnas.102172099</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Computer-aided NMR assay for detecting natively folded structural domains</p>
            </title>
            <aug>
               <au>
                  <snm>Hondoh</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Kato</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>2006</pubdate>
            <volume>15</volume>
            <issue>4</issue>
            <fpage>871</fpage>
            <lpage>883</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1110/ps.051880406</pubid>
                  <pubid idtype="pmpid" link="fulltext">16522794</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>SMART: a web-based tool for the study of genetically mobile domains</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Doerks</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>231</fpage>
            <lpage>234</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102444</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592234</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.231</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>SMART, a simple modular architecture research tool: identification of signaling domains</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Milpetz</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>11</issue>
            <fpage>5857</fpage>
            <lpage>5864</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">34487</pubid>
                  <pubid idtype="pmpid" link="fulltext">9600884</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.11.5857</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>The Pfam protein families database</p>
            </title>
            <aug>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Cerruti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Etwiller</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Griffiths-Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Howe</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>1</issue>
            <fpage>276</fpage>
            <lpage>280</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99071</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752314</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.276</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>CDD: a database of conserved domain alignments with links to domain three-dimensional structure</p>
            </title>
            <aug>
               <au>
                  <snm>Marchler-Bauer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Panchenko</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Shoemaker</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Thiessen</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>Geer</snm>
                  <fnm>LY</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>1</issue>
            <fpage>281</fpage>
            <lpage>283</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99109</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752315</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.281</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Tani</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Matsuo</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>2000</pubdate>
            <volume>9</volume>
            <issue>12</issue>
            <fpage>2313</fpage>
            <lpage>2321</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11206052</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Protein domain identification and improved sequence similarity searching using PSI-BLAST</p>
            </title>
            <aug>
               <au>
                  <snm>George</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2002</pubdate>
            <volume>48</volume>
            <issue>4</issue>
            <fpage>672</fpage>
            <lpage>681</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.10175</pubid>
                  <pubid idtype="pmpid" link="fulltext">12211035</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Delineation of modular proteins: domain boundary prediction from sequence information</p>
            </title>
            <aug>
               <au>
                  <snm>Kong</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ranganathan</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>2</issue>
            <fpage>179</fpage>
            <lpage>192</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bib/5.2.179</pubid>
                  <pubid idtype="pmpid" link="fulltext">15260897</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Prediction of the location of structural domains in globular proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Kikuchi</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Nemethy</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Scheraga</snm>
                  <fnm>HA</fnm>
               </au>
            </aug>
            <source>J Protein Chem</source>
            <pubdate>1988</pubdate>
            <volume>7</volume>
            <issue>4</issue>
            <fpage>427</fpage>
            <lpage>471</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF01024890</pubid>
                  <pubid idtype="pmpid">3255372</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Domain size distributions can predict domain boundaries</p>
            </title>
            <aug>
               <au>
                  <snm>Wheelan</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Marchler-Bauer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>7</issue>
            <fpage>613</fpage>
            <lpage>618</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.7.613</pubid>
                  <pubid idtype="pmpid" link="fulltext">11038331</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Characterization and prediction of linker sequences of multi-domain proteins by a neural network</p>
            </title>
            <aug>
               <au>
                  <snm>Miyazaki</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J Struct Funct Genomics</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <issue>1</issue>
            <fpage>37</fpage>
            <lpage>51</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1014418700858</pubid>
                  <pubid idtype="pmpid" link="fulltext">12836673</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>PPRODO: prediction of protein domain boundaries using neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Sim</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>SY</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>59</volume>
            <issue>3</issue>
            <fpage>627</fpage>
            <lpage>632</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20442</pubid>
                  <pubid idtype="pmpid" link="fulltext">15789433</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Sequence-based prediction of protein domains</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>12</issue>
            <fpage>3522</fpage>
            <lpage>3530</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">484172</pubid>
                  <pubid idtype="pmpid" link="fulltext">15240828</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh684</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Improvement of domain linker prediction by incorporating loop-length-dependent characteristics</p>
            </title>
            <aug>
               <au>
                  <snm>Tanaka</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Biopolymers</source>
            <pubdate>2006</pubdate>
            <volume>84</volume>
            <issue>2</issue>
            <fpage>161</fpage>
            <lpage>168</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/bip.20361</pubid>
                  <pubid idtype="pmpid" link="fulltext">16134173</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Characteristics and prediction of domain linker sequences in multi-domain proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Tanaka</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Kuroda</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yokoyama</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J Struct Funct Genomics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>2-3</issue>
            <fpage>79</fpage>
            <lpage>85</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1026163008203</pubid>
                  <pubid idtype="pmpid" link="fulltext">14649291</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Armadillo: domain boundary prediction by amino acid composition</p>
            </title>
            <aug>
               <au>
                  <snm>Dumontier</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Yao</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Feldman</snm>
                  <fnm>HJ</fnm>
               </au>
               <au>
                  <snm>Hogue</snm>
                  <fnm>CW</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2005</pubdate>
            <volume>350</volume>
            <issue>5</issue>
            <fpage>1061</fpage>
            <lpage>1073</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2005.05.037</pubid>
                  <pubid idtype="pmpid" link="fulltext">15978619</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Rigden</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Protein Eng</source>
            <pubdate>2002</pubdate>
            <volume>15</volume>
            <issue>2</issue>
            <fpage>65</fpage>
            <lpage>77</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/protein/15.2.65</pubid>
                  <pubid idtype="pmpid" link="fulltext">11917143</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>SnapDRAGON: a method to delineate protein structural domains from sequence data</p>
            </title>
            <aug>
               <au>
                  <snm>George</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2002</pubdate>
            <volume>316</volume>
            <issue>3</issue>
            <fpage>839</fpage>
            <lpage>851</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2001.5387</pubid>
                  <pubid idtype="pmpid" link="fulltext">11866536</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Prediction of structural and functional features of protein and nucleic acid sequences by artificial neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Hirst</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Sternberg</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>1992</pubdate>
            <volume>31</volume>
            <issue>32</issue>
            <fpage>7211</fpage>
            <lpage>7218</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi00147a001</pubid>
                  <pubid idtype="pmpid">1510913</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Predicting the secondary structure of globular proteins using neural network models</p>
            </title>
            <aug>
               <au>
                  <snm>Qian</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sejnowski</snm>
                  <fnm>TJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1988</pubdate>
            <volume>202</volume>
            <issue>4</issue>
            <fpage>865</fpage>
            <lpage>884</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(88)90564-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">3172241</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Prediction of protein secondary structure at better than 70% accuracy</p>
            </title>
            <aug>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1993</pubdate>
            <volume>232</volume>
            <issue>2</issue>
            <fpage>584</fpage>
            <lpage>599</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1993.1413</pubid>
                  <pubid idtype="pmpid" link="fulltext">8345525</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Prediction of the location and type of beta-turns in proteins using neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Shepherd</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Gorse</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1999</pubdate>
            <volume>8</volume>
            <issue>5</issue>
            <fpage>1045</fpage>
            <lpage>1055</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10338015</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Neural networks for secondary structure and structural class predictions</p>
            </title>
            <aug>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Karplus</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1995</pubdate>
            <volume>4</volume>
            <issue>2</issue>
            <fpage>275</fpage>
            <lpage>285</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">7757016</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Stabilization centers in proteins: identification, characterization and predictions</p>
            </title>
            <aug>
               <au>
                  <snm>Dosztanyi</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Fiser</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1997</pubdate>
            <volume>272</volume>
            <issue>4</issue>
            <fpage>597</fpage>
            <lpage>612</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1997.1242</pubid>
                  <pubid idtype="pmpid" link="fulltext">9325115</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000</p>
            </title>
            <aug>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>45</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102476</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592178</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.45</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>SCOP database in 2002: refinements accommodate structural genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Lo Conte</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>TJ</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Murzin</snm>
                  <fnm>AG</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>1</issue>
            <fpage>264</fpage>
            <lpage>267</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99154</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752311</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.264</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>The Protein Data Bank</p>
            </title>
            <aug>
               <au>
                  <snm>Berman</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Westbrook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Feng</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Gilliland</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bhat</snm>
                  <fnm>TN</fnm>
               </au>
               <au>
                  <snm>Weissig</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>IN</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>235</fpage>
            <lpage>242</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102472</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592235</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.235</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Analysis of compositionally biased regions in sequence databases</p>
            </title>
            <aug>
               <au>
                  <snm>Wootton</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Federhen</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Methods Enzymol</source>
            <pubdate>1996</pubdate>
            <volume>266</volume>
            <fpage>554</fpage>
            <lpage>571</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8743706</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Sequence complexity of disordered protein</p>
            </title>
            <aug>
               <au>
                  <snm>Romero</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2001</pubdate>
            <volume>42</volume>
            <issue>1</issue>
            <fpage>38</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/1097-0134(20010101)42:1&lt;38::AID-PROT50&gt;3.0.CO;2-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">11093259</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Logical analysis of the mechanism of protein folding. I. Predictions of helices, loops and beta-structures from primary structure</p>
            </title>
            <aug>
               <au>
                  <snm>Nagano</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1973</pubdate>
            <volume>75</volume>
            <issue>2</issue>
            <fpage>401</fpage>
            <lpage>420</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(73)90030-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">4728695</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Predictions of structural homologies in cytochrome c proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Lewis</snm>
                  <fnm>PN</fnm>
               </au>
               <au>
                  <snm>Scheraga</snm>
                  <fnm>HA</fnm>
               </au>
            </aug>
            <source>Arch Biochem Biophys</source>
            <pubdate>1971</pubdate>
            <volume>144</volume>
            <issue>2</issue>
            <fpage>576</fpage>
            <lpage>583</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0003-9861(71)90363-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">5106152</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>Prediction of protein conformation</p>
            </title>
            <aug>
               <au>
                  <snm>Chou</snm>
                  <fnm>PY</fnm>
               </au>
               <au>
                  <snm>Fasman</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>1974</pubdate>
            <volume>13</volume>
            <issue>2</issue>
            <fpage>222</fpage>
            <lpage>245</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi00699a002</pubid>
                  <pubid idtype="pmpid">4358940</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>The Protein Data Bank and structural genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Westbrook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Feng</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Berman</snm>
                  <fnm>HM</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>489</fpage>
            <lpage>491</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165515</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520059</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg068</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>Basic local alignment search tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <issue>3</issue>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1990.9999</pubid>
                  <pubid idtype="pmpid" link="fulltext">2231712</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <issue>17</issue>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Learning representations by back-propagating errors</p>
            </title>
            <aug>
               <au>
                  <snm>Rumelhart</snm>
                  <fnm>DE</fnm>
               </au>
               <au>
                  <snm>Hinton</snm>
                  <fnm>GE</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1986</pubdate>
            <volume>323</volume>
            <fpage>533</fpage>
            <lpage>536</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1038/323533a0</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>Non-globular domains in protein sequences: automated segmentation using complexity measures</p>
            </title>
            <aug>
               <au>
                  <snm>Wootton</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Comput Chem</source>
            <pubdate>1994</pubdate>
            <volume>18</volume>
            <issue>3</issue>
            <fpage>269</fpage>
            <lpage>285</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0097-8485(94)85023-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">7952898</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
