<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-195</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Ab initio and homology based prediction of protein domains by recursive neural networks</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Walsh</snm>
               <fnm>Ian</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>ian.walsh@ucd.ie</email>
            </au>
            <au id="A2">
               <snm>Martin</snm>
               <mi>JM</mi>
               <fnm>Alberto</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>albertoj@ucd.ie</email>
            </au>
            <au id="A3">
               <snm>Mooney</snm>
               <fnm>Catherine</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>catherine.mooney@ucd.ie</email>
            </au>
            <au id="A4">
               <snm>Rubagotti</snm>
               <fnm>Enrico</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>enrico.rubagotti@ucd.ie</email>
            </au>
            <au id="A5">
               <snm>Vullo</snm>
               <fnm>Alessandro</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>alessandro.vullo@ucd.ie</email>
            </au>
            <au id="A6" ca="yes">
               <snm>Pollastri</snm>
               <fnm>Gianluca</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>gianluca.pollastri@ucd.ie</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland</p>
            </ins>
            <ins id="I2">
               <p>Complex and Adaptive Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>1</issue>
         <fpage>195</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/195</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19558651</pubid>
               <pubid idtype="doi">10.1186/1471-2105-10-195</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>17</day>
               <month>10</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>26</day>
               <month>6</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>26</day>
               <month>6</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Walsh et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality.</p>
               <p>We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within &#177; 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within &#177; 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: <url>http://distill.ucd.ie/shandy/</url> and we plan to run them on a multi-genomic scale and to make the results public in the near future.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. In this case the prediction can be applied to each protein domain separately, decreasing prediction times, and increasing prediction accuracy especially in the absence of homologues/templates and when interactions among residues are long ranging. Although domain-domain interactions would have to be ignored when predicting domain structures separately, stages for domain-domain interaction prediction can be designed <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp> to tie the domains together resulting in the final three dimensional (3D) structure. The detection of structural templates from sequence can also be improved when only considering the sequence that corresponds to each domain, since the domain itself is more likely to be evolutionarily conserved. Fold recognition methods also perform better when using individual domains rather than the entire protein <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>Experimental structural determination methods become hard to apply when considering large proteins of many domains. In X-Ray crystallography and NMR spectroscopy difficulties often arise when protein domains are joined by less flexible boundary regions. Also, NMR structural determination errors tend to arise when the protein is very long. As a result, experimental methods often determine structures by only examining individual domains or at most a few domains together <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>Methods for the prediction of protein domains, similarly to methods for the prediction of the 3D structure, can be classified as template-based or template-free (which we will refer to as "ab initio"), depending on whether the prediction incorporates structural information from putative homologues from the Protein Data Bank <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The simplest form of domain prediction assumes all domains are continuous (i.e. domain <it>n </it>entirely follows domain <it>n </it>- 1 in the sequence). The main objective of these approaches is to identify domain boundary regions. Other methods try to assign residues to particular domains when the domains are discontinuous or split across the sequence (e.g. domain <it>n </it>is surrounded by domain <it>n </it>- 1 in the sequence). Often these latter methods rely on the availability of accurate 3D models (e.g. modelled by homology), from which the structure is parsed into domains using a 3D to domain parsing algorithm. DOMpro <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and its server <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> use ranked structural homologues to construct a 3D structure using Modeller <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>; Protein Domain Parser <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> is then used to assign the domains. If no homologues are found within a given threshold then ab initio predictions of protein domain boundaries are made from sequence alignments, secondary structure and solvent accessibility predictions. RosettaDom <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> uses many 3D structure models predicted from Rosetta <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and the Taylor domain parsing algorithm <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <p>SnapDragon <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> performs 100 structural predictions from its 3D ab initio system and assigns domains based on an efficient domain parsing algorithm. Methods that rely on 3D structural models are often computationally expensive, which makes them inapplicable to very large scale predictions.</p>
         <p>The Domain Guess by Size method <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> guesses domain boundaries solely based on the length distribution of proteins of known structure and is a useful baseline for benchmarking especially ab initio methods. DomSSEA <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> predicts domain boundaries from aligning predicted secondary structure against a database of 3D structures with annotated domain information in the CATH <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> database. Armadillo <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> is also simple and effective &#8211; it predicts domain linkers by statistics on the amino acid composition of domain boundaries.</p>
         <p>In this paper we concentrate on the evaluation of continuous domain prediction. In other words, we are interested in predicting domain boundaries rather than in predicting which domain a residue belongs to. To this end, we ignore the problem of discontinuous domains. Domain boundaries are important features of a protein and have been given particular attention over the years: an analysis of domain boundaries was carried out in <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> with the aim of designing boundaries for domain fusion; boundaries are important for inter-domain coupling <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>; altering the length of boundaries connecting domains has been shown to affect protein stability, folding rates and domain-domain orientation <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>; ultimately, if the location of protein boundaries is known, barring discontinuous domains, domain identity follows.</p>
         <p>Currently well over half of all known protein sequences show some detectable degree of similarity to one or more sequences of known structure. Nearly three quarters of newly deposited structures in the PDB <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> show significant similarity to previously deposited structures <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The state of the art predictors at the CASP 6 and 7 competitions <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> all contain a template-based component. Homology information is particularly appealing for domain boundary prediction since only some domains for a protein may have homologues while some domains may not, but the boundary can still be inferred by subtracting the homologues from the sequence.</p>
         <p>Our method consists of learning boundaries defined by SCOP <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> from evolutionary information in the form of PSI-BLAST <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> sequence alignments, predicted template-based structural information in the form of secondary structure <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, solvent accessibility <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, <it>&#981; </it>and <it>&#968; </it>torsion angles <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, contact density <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and residue-residue contact maps <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Along with these features, weighted non-gap/gaps in PDB templates and weighted SCOP template definitions are used. All templates are found by simple PSI-BLAST searches on the PDB and SCOP databases. We train 1D Bidirectional Recurrent Neural Networks (BRNN) <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> for the prediction of SCOP defined domain boundaries. The novelty of the method is both in the soft prediction (we do not assume any single piece of information to be true, but rather provide all of them to the RNN) and in the input design, with both SCOP and PDB template profiles used, alongside structural predictions. The structural predictions themselves are made using weighted templates from the PDB, and are significantly better than those obtained by deriving the information directly from the templates <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
         <p>We show that template information improves over ab initio even for low quality templates, when we design a specialised system for this case. The ab initio predictions compare well with other state-of-the-art ab initio predictors, and the addition of template information always improves over ab initio. As homologues become more accurate, predictions are often nearly perfect. It is important to stress that, when homology information is available, our algorithm does not take it as the final answer, but rather utilises the homology input in combination with accurate template-based structural information and sequence alignments. This, on average, yields significant improvements over baselines where boundaries are inferred directly from the SCOP homologues.</p>
         <p>Although we use a simple PSI-BLAST-based protocol to find suitable templates, our system is fully modular and may easily incorporate more sophisticated stages with better sensitivity to remote homology (perhaps even by utilising boundary predictions as templates). The method is fast and can be applied to 1000 multi domain proteins in one day on a single 2 GHz core.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>Learning domain boundaries consists of mapping <it>f</it>(&#183;): &#8464; &#8594; <inline-formula><graphic file="1471-2105-10-195-i1.gif"/></inline-formula> where &#8464; = (<it>i</it><sub>1</sub>,...,<it>i</it><sub><it>N</it></sub>) and <inline-formula><graphic file="1471-2105-10-195-i1.gif"/></inline-formula> = (<it>o</it><sub>1</sub>,...,<it>o</it><sub><it>N</it></sub>) are the input and output sequences of length <it>N</it>. Each <it>o</it><sub><it>j </it></sub>&#8712; {0,1} is the output symbol at position <it>j</it>, resulting in a binary classification problem of domain residues and domain boundary residues. Element <it>i</it><sub><it>j </it></sub>&#8712; <it>I </it>is the input encoding for position <it>j </it>in the sequence. The input encoding is a real-valued vector, <it>i</it><sub><it>j </it></sub>&#8712; &#8477;<sup><it>n</it></sup>, where the design choices of <it>n </it>and <it>i</it><sub><it>j </it></sub>largely determine the power of the mapping.</p>
         <p>A residue's property at position <it>j </it>in the sequence will often depend on local information surrounding <it>j </it>and long range information far up and/or down the sequence. We map residues into boundary/non-boundary states by a Bidirectional Recurrent Neural Network (BRNN) <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>:</p>
         <p>
            <display-formula id="M1">
               <graphic file="1471-2105-10-195-i2.gif"/>
            </display-formula>
         </p>
         <p>where <inline-formula><graphic file="1471-2105-10-195-i3.gif"/></inline-formula> and <inline-formula><graphic file="1471-2105-10-195-i4.gif"/></inline-formula> are vectors of hidden states capturing contextual information, respectively, from the left side and right side of the input sequence, and the functions which govern the update of <inline-formula><graphic file="1471-2105-10-195-i3.gif"/></inline-formula>, <inline-formula><graphic file="1471-2105-10-195-i4.gif"/></inline-formula> and of the output <it>o</it><sub><it>j </it></sub>(respectively <inline-formula><graphic file="1471-2105-10-195-i5.gif"/></inline-formula>, <inline-formula><graphic file="1471-2105-10-195-i6.gif"/></inline-formula> and <inline-formula><graphic file="1471-2105-10-195-i7.gif"/></inline-formula>) are realised by Multi-Layered Perceptrons with one hidden layer. <it>S </it>in the equations represents the amount of contextual information that is provided explicitly to the <inline-formula><graphic file="1471-2105-10-195-i5.gif"/></inline-formula> and <inline-formula><graphic file="1471-2105-10-195-i6.gif"/></inline-formula> networks, or maximum <it>shortcut </it>length (see below for more details). The amount of context signal is learned alongside the hidden representation and depends on the error signal produced for a particular protein at a particular residue. This is in contrast to the static window methods where a context window is chosen a priori <abbrgrp><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr></abbrgrp> resulting in experiments to determine window sizes using a validation set. In this case danger of overfitting may arise for windows that are too large, especially when the training sets are small. BRNNs are trained by the standard gradient descent algorithm. The gradient of the error (the mutual entropy between target and network output) is computed via an extension of the backpropagation algorithm <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. 
BRNNs have been successfully applied to many predictive tasks for proteins <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp>.</p>
         <p>As outputs for individual residues are predicted independently, the raw probabilities of residues being in a domain boundary, <it>o</it><sub><it>j</it></sub>, contain many local peaks. This is a common problem and has also been reported in <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B38">38</abbr></abbrgrp>. In order to mitigate it we use a second stage BRNN that maps the output of the first one into the boundary/non-boundary sequence. The <it>j</it><sup><it>th </it></sup>input to this second network includes the first-stage predictions in position <it>j </it>and first-stage predictions averaged over multiple contiguous windows. This input at <it>j </it>is the array <it>I</it><sub><it>j</it></sub>:</p>
         <p>
            <display-formula id="M2">
               <graphic file="1471-2105-10-195-i8.gif"/>
            </display-formula>
         </p>
         <p>where <it>k</it><sub><it>f </it></sub>= <it>j </it>+ <it>f</it>(2<it>w </it>+ 1), 2<it>w </it>+ 1 is the size of the window over which first-stage predictions are averaged and 2<it>p </it>+ 1 is the number of windows considered. In the tests we use <it>w </it>= 7 and <it>p </it>= 7, as in <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Capturing long range dependencies is difficult, especially when using gradient descent <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. The second stage BRNN described above mitigates this problem, and the presence of shortcut connections (dependencies between a hidden vector and <it>S </it>preceding ones with <it>S </it>> 1, as in eqn. 1) also helps shorten paths between distant residues. A further way to tackle the problem, which we attempt here, is similar to that described in <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, and relies on placing shortcut connections over longer ranges, corresponding to predicted contact pairs (see the next section for more details).</p>
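         <p>As an illustration only, the construction of the second-stage input can be sketched as follows. The function name, the zero value used for windows that fall entirely outside the sequence, and the averaging of partially overlapping windows over their in-sequence part are assumptions of this sketch; they are not taken from the original implementation.</p>

```python
def second_stage_input(o, j, w=7, p=7):
    """Build the second-stage input for position j from first-stage outputs o.

    Returns [o_j, avg(window f=-p), ..., avg(window f=+p)], where window f is
    centred on k_f = j + f(2w + 1) and spans 2w + 1 residues.
    """
    n = len(o)
    features = [o[j]]
    for f in range(-p, p + 1):
        centre = j + f * (2 * w + 1)      # k_f = j + f(2w + 1)
        lo = max(centre - w, 0)           # clip the window to the sequence
        hi = min(centre + w, n - 1)
        if lo > hi:
            # Assumption: a window entirely outside the sequence contributes 0.
            features.append(0.0)
        else:
            window = o[lo:hi + 1]
            features.append(sum(window) / len(window))
    return features


# Toy first-stage output: a flat profile of 45 residues.
example = second_stage_input([1.0] * 45, j=22, w=7, p=2)
```

         <p>With the values used in the tests (<it>w </it>= 7, <it>p </it>= 7) this yields 16 numbers per residue: the prediction at <it>j </it>plus 15 window averages covering 225 sequence positions.</p>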
         <sec>
            <st>
               <p>Interaction BRNN</p>
            </st>
            <p>Long ranging information, such as the one usually determining beta-sheets, is difficult to capture using most algorithms. A particular residue, <it>i</it>, may be highly coupled with another residue, <it>j</it>, far up or down the sequence. A standard BRNN (or, for that matter, most models we are aware of) fails to capture this dependency because of the vanishing gradient problem <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, whereby the gradient of the error rapidly approaches zero as it is propagated backwards through a neural network with multiple layers. An attempt to solve this problem is to place connections into the BRNN between residues that are near each other in three-dimensional space but might span large sequence separations, as for instance in <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. These interacting connections should allow the model to propagate information (and backpropagate error signals) spanning large sequence separations. Although boundaries are not expected to be coupled with other boundaries, this should improve the prediction accuracy of residues interacting within a domain and thus the overall accuracy.</p>
            <p>Let us define the estimated probability of contact between residues <it>i </it>and <it>j </it>as <it>P</it><sub><it>i, j</it></sub>.</p>
            <p>When examining the contacts of residue <it>j </it>we look at non-overlapping contiguous windows of contact probabilities up-sequence from <it>j</it>:</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-10-195-i9.gif"/>
               </display-formula>
            </p>
            <p>where <it>u</it><sub><it>h </it></sub>is an array:</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-10-195-i10.gif"/>
               </display-formula>
            </p>
            <p>and <it>k</it><sub><it>h </it></sub>= <it>j </it>+ <it>hw</it>. <it>w </it>is the window size over which probabilities are considered, <it>p </it>is the number of windows considered, which is the same as the number of shortcut connections. Windows down-sequence, (<it>d</it><sub>-1</sub>,...,<it>d</it><sub>-<it>p</it></sub>), are also taken into account.</p>
            <p>We set shortcut connections between all pairs (<it>j</it>, <it>f</it><sub><it>h</it></sub>) such that <it>f</it><sub><it>h </it></sub>= <it>argmax</it><sub><it>y</it></sub><it>u</it><sub><it>h</it>, <it>y</it></sub>, and <it>f</it><sub>-<it>h </it></sub>= <it>argmax</it><sub><it>y</it></sub><it>d</it><sub>-<it>h</it>, <it>y</it></sub>.</p>
            <p>This interaction-based BRNN (IBRNN) takes the form;</p>
            <p>
               <display-formula id="M3">
                  <graphic file="1471-2105-10-195-i11.gif"/>
               </display-formula>
            </p>
            <p>Notice how the connection strength is multiplied by the probabilities of contact, as estimated by our contact map predictor <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
            <p>In our models we use <it>w </it>= 15 and <it>p </it>= 5, which means that we connect residue <it>i </it>with the one residue over each 15-residue window of the protein that we deem to be most likely to interact with <it>i</it>.</p>
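            <p>The selection of shortcut partners described above can be sketched as follows. This is a minimal illustration, not the original code: the function name is hypothetical, and the exact window layout (window <it>h </it>covering sequence offsets (<it>h </it>- 1)<it>w </it>+ 1 to <it>hw </it>from <it>j</it>) is an assumption, since the precise definition is given only in the displayed formula.</p>

```python
def shortcut_partners(P, j, w=15, p=5):
    """Pick one shortcut partner per window around residue j.

    P is a square matrix of estimated contact probabilities. For each of the
    p windows up-sequence and p windows down-sequence, the residue with the
    highest contact probability to j is selected (the argmax in the text).
    Returns a list of (partner_index, contact_probability) pairs; the
    probability is what would scale the connection strength.
    """
    n = len(P)
    partners = []
    for direction in (1, -1):                 # up-sequence, then down-sequence
        for h in range(1, p + 1):
            # Assumed layout: window h spans offsets (h-1)*w + 1 .. h*w from j.
            offsets = range((h - 1) * w + 1, h * w + 1)
            candidates = [j + direction * d for d in offsets
                          if n > j + direction * d >= 0]
            if candidates:                    # skip windows outside the chain
                best = max(candidates, key=lambda y: P[j][y])
                partners.append((best, P[j][best]))
    return partners


# Toy example: 40 residues, three non-zero contact probabilities for j = 20.
n = 40
P = [[0.0] * n for _ in range(n)]
P[20][23], P[20][29], P[20][12] = 0.9, 0.4, 0.7
toy_partners = shortcut_partners(P, j=20, w=5, p=2)
```

            <p>In this sketch a window with no predicted contacts still yields a partner, but its probability of zero would switch the corresponding connection off once the connection strength is multiplied by it.</p>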
         </sec>
         <sec>
            <st>
               <p>Training, testing set</p>
            </st>
            <p>We start from all chains found in SCOP <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> release 1.73 that were solved by X-ray crystallography with resolution &#8804; 3.0 <it>&#197; </it>and R-factor &#8804; 30%. We then use UniqueProt <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> to reduce sequence similarity. We run UniqueProt with options -m custom (those sequences that appear first in the input file are more likely to appear in the output &#8211; the sequences are first sorted by decreasing quality), and HSSP <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> distance of 20 (multidomain proteins tend to be longer than 100 amino acids). We leave in boundaries for discontinuous domains, which makes the problem harder than just identifying continuous domain boundaries. In total there are 646 multi domain proteins (set M646) and 321 single domain proteins (S321) in our set. The total number of boundaries is 929.</p>
            <p>However, it is important to notice that, since we do not cast the problem as that of mapping a protein into its number of domains, but rather as that of mapping a residue into its boundary vs. non-boundary state, the effective number of examples is the number of residues in the sets (304,221 in total, of which 24,257 boundary residues) rather than the number of proteins, or boundaries. This makes the results of learning quite stable with respect to small variations in initial training conditions or small changes in the architectural parameters of the networks (as observed in preliminary experiments, not shown).</p>
            <sec>
               <st>
                  <p>PDB and SCOP templates</p>
               </st>
               <p>For each of the proteins in the dataset we search for structural templates in the PDB available on March 25th, 2008 (excluding all entries shorter than 10 residues, leaving 108,076 chains).</p>
               <p>To generate PDB templates for a protein we run three rounds of PSI-BLAST with parameters <it>b </it>= 3000, <it>e </it>= 10<sup>-3 </sup>and <it>h </it>= 10<sup>-10 </sup>against the version of the NR database as available on March 3, 2004 containing over 1.4 million sequences. The NR database is first redundancy reduced at a 98% threshold, leading to a final 1.05 million sequences. We then run a fourth round of PSI-BLAST against the PDB using the PSSM generated in the first three rounds. In this fourth round we use a high expectation parameter (<it>e </it>= 10) to include as many hits as possible. We remove from each set of templates all sequences with similarity exceeding 95% between the query and the template to avoid including the query sequence in its own set of templates and to exclude PDB resubmissions of the same structure at different resolution, other chains in N-mers and close homologues. Figure <figr fid="F1">1</figr> shows the distribution of the templates with this 95% threshold imposed on the sequence identity.</p>
               <fig id="F1">
                  <title>
                     <p>Figure 1</p>
                  </title>
                  <caption>
                     <p>Best hit distribution</p>
                  </caption>
                  <text>
                     <p><b>Best hit distribution</b>. Distribution of best-hit SCOP (blue) and best-hit PDB (red) sequence identity in the PSI-BLAST templates. Hits above 95% sequence identity excluded.</p>
                  </text>
                  <graphic file="1471-2105-10-195-1"/>
               </fig>
               <p>To train template-based predictions in marginal sequence similarity conditions we create a second set of templates excluding all templates that have a PSI-BLAST hit exceeding 25% sequence identity to the query sequence. To generate SCOP templates we label every PDB template in these two sets with their SCOP defined domain boundaries. We use the 1.73 version of SCOP released in November 2007 which contains 34,494 PDB entries and a total of 97,178 domains. As not all PDB structures have been classified by SCOP the set of SCOP templates is a subset of the PDB templates. Figure <figr fid="F2">2</figr> shows the distribution of the templates with this 25% threshold imposed on the sequence identity.</p>
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Best hit distribution, max 25% seq ID allowed</p>
                  </caption>
                  <text>
                     <p><b>Best hit distribution, max 25% seq ID allowed</b>. Distribution of best-hit SCOP (blue) and best-hit PDB (red) sequence identity in the PSI-BLAST templates. Hits above 25% sequence identity excluded.</p>
                  </text>
                  <graphic file="1471-2105-10-195-2"/>
               </fig>
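               <p>The two template sets and their SCOP-annotated subsets can be illustrated with the sketch below. The record layout, the chain identifiers and the boundary positions are hypothetical and serve only to show the filtering logic; the actual hits come from the PSI-BLAST searches described above.</p>

```python
def filter_templates(hits, max_identity):
    """Keep hits whose sequence identity (percent) does not exceed the cap."""
    return [hit for hit in hits if max_identity >= hit["identity"]]


def scop_subset(hits, scop_boundaries):
    """Keep only templates classified by SCOP, labelled with their boundaries."""
    return [dict(hit, boundaries=scop_boundaries[hit["id"]])
            for hit in hits if hit["id"] in scop_boundaries]


# Hypothetical PSI-BLAST hits for one query (identifiers are made up).
hits = [
    {"id": "tmplA", "identity": 98.0},   # near-identical: always excluded
    {"id": "tmplB", "identity": 60.0},
    {"id": "tmplC", "identity": 22.0},
]
# Hypothetical SCOP annotation: only some PDB templates are classified.
scop_boundaries = {"tmplB": [120], "tmplC": [85, 210]}

pdb_all = filter_templates(hits, max_identity=95)        # all-quality set
pdb_marginal = filter_templates(hits, max_identity=25)   # marginal set
scop_all = scop_subset(pdb_all, scop_boundaries)
scop_marginal = scop_subset(pdb_marginal, scop_boundaries)
```

               <p>Because the SCOP templates are obtained by labelling PDB templates, each SCOP set is by construction a subset of the corresponding PDB set.</p>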
            </sec>
         </sec>
         <sec>
            <st>
               <p>Input design</p>
            </st>
            <p>The input vector at position <it>j</it>,</p>
            <p>
               <display-formula id="M4">
                  <graphic file="1471-2105-10-195-i12.gif"/>
               </display-formula>
            </p>
            <p>contains evolutionary information from multiple sequence alignments <inline-formula><graphic file="1471-2105-10-195-i13.gif"/></inline-formula>, predicted structural features <inline-formula><graphic file="1471-2105-10-195-i14.gif"/></inline-formula>, SCOP templates <inline-formula><graphic file="1471-2105-10-195-i15.gif"/></inline-formula>, and gap information from the PDB templates <inline-formula><graphic file="1471-2105-10-195-i16.gif"/></inline-formula>. The evolutionary profile, <inline-formula><graphic file="1471-2105-10-195-i13.gif"/></inline-formula>, contains 20 units, one for each of the amino acids. The predicted structural features consist of: secondary structure (3 classes), solvent accessibility (4 classes), coarse contact density (4 classes), local structural motifs based on <it>&#981; </it>- <it>&#968; </it>angles (14 classes) (see <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> for a precise definition), and contact maps <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B31">31</abbr></abbrgrp>. The structural predictions are based on average weighted PDB templates and sequence information and were shown to be better than simply taking the values directly from the templates <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. All predictors produce the probability of belonging to a particular structural class and it is these probabilities that are encoded into the <inline-formula><graphic file="1471-2105-10-195-i14.gif"/></inline-formula> part of the input.</p>
            <p>Contact maps should play a special role when predicting domain boundaries. The structurally compact domain regions are clearly distinguishable by visual inspection of a true map as the regions with maximal contact, while the boundary regions contain minimal contact (see figure <figr fid="F3">3</figr> for an example). This observation was exploited in <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> where minimal contact average was determined using covariance analysis on the multiple sequence alignments. Here we derive three numbers which describe contact density in three regions surrounding <it>j </it>from maps at a 13 &#197; threshold:</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Multi-Domain Contact Map example</p>
               </caption>
               <text>
                  <p><b>Multi-Domain Contact Map example</b>. 13 &#197; contact map for protein 1KJQA which contains three domains. The three domains are clearly distinguishable in the true contact map as the areas with most of the contacting residue pairs. The SCOP definition of the domains for this protein is: domain 1 = residues 1&#8211;111, domain 2 = residues 112&#8211;317 and domain 3 = residues 318&#8211;391. The bounding boxes for each of the domains are labeled. Note that a smaller number of contacts fall outside the domains, indicating domain-domain interactions.</p>
               </text>
               <graphic file="1471-2105-10-195-3"/>
            </fig>
            <p>
               <display-formula id="M5">
                  <graphic file="1471-2105-10-195-i17.gif"/>
               </display-formula>
            </p>
            <p>where <it>T</it><sub><it>j</it></sub>, <it>M</it><sub><it>j</it></sub>, <it>B</it><sub><it>j </it></sub>correspond to the top left, middle, and bottom right contact/non-contact ratio of the boxes surrounding <it>j </it>&#8211; see figure <figr fid="F4">4</figr>. <it>C</it><sub><it>x</it>, <it>y </it></sub>and <it>NC</it><sub><it>x</it>, <it>y </it></sub>are the contacts and non-contacts for residue pair (<it>x</it>, <it>y</it>), where trivial contacts |<it>x </it>- <it>y</it>| &#8804; 3 are ignored. The maps are obtained from a new version of the predictor XXStout <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> which also takes into account template information from the PDB <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Multi-Domain Contact Map bounding boxes</p>
               </caption>
               <text>
                  <p><b>Multi-Domain Contact Map bounding boxes</b>. Again protein 1KJQA. The blue, white and yellow areas show the bounding boxes used in the calculations of <it>T</it><sub><it>j</it></sub>, <it>M</it><sub><it>j </it></sub>and <it>B</it><sub><it>j </it></sub>for domain boundary residue 111.</p>
               </text>
               <graphic file="1471-2105-10-195-4"/>
            </fig>
            <p>Ideally a boundary can be identified by large <it>T</it><sub><it>j </it></sub>and <it>B</it><sub><it>j </it></sub>and small <it>M</it><sub><it>j </it></sub>for all <it>j</it>. In an initial experiment we found that local contacting residue pairs are much less informative for determining boundary/non-boundary residues than the global contacting profiles provided by <it>T</it><sub><it>j</it></sub>, <it>M</it><sub><it>j </it></sub>and <it>B</it><sub><it>j </it></sub>(results not shown).</p>
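            <p>As an illustration only, the three ratios can be sketched in Python as follows. The exact box limits are those of equation 5 and figure 4; this sketch assumes, hypothetically, that the top left box holds pairs lying entirely before <it>j</it>, the bottom right box holds pairs lying entirely after <it>j</it>, and the middle box holds pairs that span <it>j</it>. The function name, the 0-based indexing and the handling of boxes without non-contacts are likewise assumptions of the sketch.</p>
            <p><![CDATA[
```python
from typing import List, Tuple


def contact_ratios(contact_map: List[List[int]], j: int,
                   min_sep: int = 4) -> Tuple[float, float, float]:
    """Return (T_j, M_j, B_j) contact/non-contact ratios for 0-based residue j.

    ASSUMPTION (not taken from equation 5): the boxes are
      T: pairs (x, y) with x < y <= j   (top left of the map)
      B: pairs (x, y) with j <= x < y   (bottom right of the map)
      M: pairs (x, y) with x < j < y    (pairs spanning j)
    """
    n = len(contact_map)
    # Each box keeps [contacts, non-contacts].
    boxes = {"T": [0, 0], "M": [0, 0], "B": [0, 0]}
    for x in range(n):
        # Starting at x + min_sep skips trivial pairs with |x - y| <= 3.
        for y in range(x + min_sep, n):
            if y <= j:
                box = "T"
            elif x >= j:
                box = "B"
            else:
                box = "M"
            boxes[box][0 if contact_map[x][y] else 1] += 1

    def ratio(contacts: int, non_contacts: int) -> float:
        # ASSUMPTION: a box without non-contacts reports its raw contact count.
        if non_contacts == 0:
            return float(contacts)
        return contacts / non_contacts

    return ratio(*boxes["T"]), ratio(*boxes["M"]), ratio(*boxes["B"])
```
]]></p>
            <p>On a toy map with two compact blocks of residues, the residue separating the blocks yields a middle ratio of zero and clearly larger top left and bottom right ratios, which is the pattern described above for a boundary.</p>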
            <p>In the results section we show that <inline-formula><graphic file="1471-2105-10-195-i14.gif"/></inline-formula> with <it>T</it><sub><it>j</it></sub>, <it>M</it><sub><it>j </it></sub>and <it>B</it><sub><it>j </it></sub>improves boundary prediction when the <it>j </it>contact maps are sufficiently accurate. The number of units in <inline-formula><graphic file="1471-2105-10-195-i14.gif"/></inline-formula> is 3 (secondary structure) + 4 (solvent accessibility) + 4 (contact density) + 14 (structural motifs) + 3 (contact maps) = 28.</p>
            <sec>
               <st>
                  <p>Homology information</p>
               </st>
               <p>Along with structural predictions we input to the network the weighted number of boundaries that we observe in SCOP templates. If <it>Q </it>is the total number of templates found for a protein, the first element of the vector <inline-formula><graphic file="1471-2105-10-195-i15.gif"/></inline-formula> is:</p>
               <p>
                  <display-formula id="M6">
                     <graphic file="1471-2105-10-195-i18.gif"/>
                  </display-formula>
               </p>
               <p>where <it>B</it><sub><it>p </it></sub>is equal to one if template number <it>p </it>contains a boundary in the position that aligns to the <it>j</it>-th residue in the query protein. Note that we extend the original definition of SCOP boundaries by 5 residues towards both termini. If the identity between template <it>p </it>and the query is <it>id</it><sub><it>p </it></sub>and the quality of a template (measured as <it>X</it>-ray resolution + R-factor/20, as in <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>) is <it>q</it><sub><it>p </it></sub>then the weight, <it>w</it><sub><it>p</it></sub>, is:</p>
               <p>
                  <display-formula id="M7">
                     <graphic file="1471-2105-10-195-i19.gif"/>
                  </display-formula>
               </p>
               <p>Taking the cube of the identity between template and query drastically reduces the contribution of low-similarity templates when good templates are available. For instance a 90% identity template is weighted two orders of magnitude more than a 20% one. In preliminary tests (not shown) this measure performed better than a number of alternatives. The second and third elements of the vector <inline-formula><graphic file="1471-2105-10-195-i15.gif"/></inline-formula> encode the weighted average coverage and similarity of a column of the template profile as follows:</p>
               <p>
                  <display-formula id="M8">
                     <graphic file="1471-2105-10-195-i20.gif"/>
                  </display-formula>
               </p>
               <p>where <it>c</it><sub><it>p </it></sub>is the coverage of the sequence by template <it>p </it>(i.e. the fraction of non-gaps in the alignment), and</p>
               <p>
                  <display-formula id="M9">
                     <graphic file="1471-2105-10-195-i21.gif"/>
                  </display-formula>
               </p>
               <p>Finally weighted gap and non-gap information from the PDB templates used to make the structural predictions are input. These are computed identically to equations 6, 7, 8, and 9 except instead of boundary and non-boundary classes there are gap and non-gap classes. The intuitive reasoning behind <inline-formula><graphic file="1471-2105-10-195-i16.gif"/></inline-formula> is that domains should be evolutionarily conserved and non-gap values indicate there is a structural fragment in the PDB similar to the query sequence. Both <inline-formula><graphic file="1471-2105-10-195-i15.gif"/></inline-formula> and <inline-formula><graphic file="1471-2105-10-195-i16.gif"/></inline-formula> contain 5 units resulting in a total input size of: |<it>E</it>| + |<it>struc</it>| + |<it>SCOP</it>| + |<it>PDB</it>| = 20 + 28 + 5 + 5 = 58</p>
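               <p>The effect of cubing the identity can be checked with a short Python sketch. It assumes, for illustration, that the weight is the product of a quality factor and the cubed identity, and that the boundary signal is the weighted average of the per-template boundary indicators; the function names and the tuple layout are hypothetical, and the quality factor is taken as given rather than derived from resolution and R-factor.</p>
               <p><![CDATA[
```python
from typing import Sequence, Tuple


def template_weight(identity: float, quality: float = 1.0) -> float:
    """Weight of one template.

    ASSUMPTION: the weight is quality * identity**3, with identity in [0, 1].
    The quality factor is passed in as given (hypothetical parameter).
    """
    return quality * identity ** 3


def weighted_boundary_signal(
        templates: Sequence[Tuple[int, float, float]]) -> float:
    """Weighted average of boundary indicators at one query position.

    Each template is (has_boundary, identity, quality), where has_boundary
    is 1 if the template has a boundary aligned to this position, else 0.
    """
    weights = [template_weight(identity, quality)
               for _, identity, quality in templates]
    total = sum(weights)
    if total == 0:
        return 0.0  # no usable templates: no boundary evidence
    weighted = sum(w * b for w, (b, _, _) in zip(weights, templates))
    return weighted / total


# A 90% identity template outweighs a 20% one by roughly two orders of magnitude.
ratio = template_weight(0.9) / template_weight(0.2)
```
]]></p>
               <p>With equal quality factors the ratio is about 91, so a single high-identity template with a boundary dominates the signal even when a low-identity template disagrees.</p>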
            </sec>
            <sec>
               <st>
                  <p>Measuring performances</p>
               </st>
               <p>To evaluate domain boundary prediction we adopt the domain boundary score used by CASP 6 and 7 <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. A score is awarded for any predicted boundary, <it>P</it>, that lies within eight residues of any true boundary, <it>T</it>. If <it>d</it><sub><it>P</it>, <it>T </it></sub>is the smallest sequence separation between <it>P </it>and <it>T </it>(0 in case of any overlap):</p>
               <p>
                  <display-formula id="M10">
                     <graphic file="1471-2105-10-195-i22.gif"/>
                  </display-formula>
               </p>
               <p>The normalised domain boundary score between all predicted and true domain boundaries is:</p>
               <p>
                  <display-formula id="M11">
                     <graphic file="1471-2105-10-195-i23.gif"/>
                  </display-formula>
               </p>
               <p>where <it>np </it>and <it>nt </it>are the total number of predicted domain boundaries and true domain boundaries respectively. Taking the maximum domain boundary count between predicted and true, <it>max</it>(<it>np, nt</it>), penalises over-prediction and incorporates both sensitivity (recall) and specificity (precision) into one measure. <inline-formula><graphic file="1471-2105-10-195-i24.gif"/></inline-formula> ensures that only the closest (predicted vs. true) boundaries are considered; all other values are ignored.</p>
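               <p>The structure of the normalised score can be sketched in Python as below. The per-pair score is a hypothetical stand-in for equation 10: the sketch assumes a linear decay that gives full credit at a separation of at most one residue and no credit beyond eight. Boundaries are also simplified to single positions rather than regions, and the function names are illustrative.</p>
               <p><![CDATA[
```python
from typing import Callable, Sequence


def linear_pair_score(d: int, window: int = 8) -> float:
    """Score for a predicted/true boundary pair separated by d residues.

    ASSUMPTION: linear decay inside the window, zero outside it.
    This stands in for equation 10 and may differ from it.
    """
    if d > window:
        return 0.0
    return min(1.0, (window - d + 1) / window)


def normalised_boundary_score(
        predicted: Sequence[int],
        true: Sequence[int],
        pair_score: Callable[[int], float] = linear_pair_score) -> float:
    """Sum of best per-prediction scores divided by max(np, nt)."""
    denominator = max(len(predicted), len(true))
    if denominator == 0:
        return 0.0  # undefined when there are no boundaries; 0.0 in this sketch
    total = 0.0
    for p in predicted:
        # Only the closest true boundary counts for each predicted boundary.
        total += max((pair_score(abs(p - t)) for t in true), default=0.0)
    return total / denominator


# One good prediction plus one spurious prediction: the extra boundary
# halves the score through the max(np, nt) normalisation.
example = normalised_boundary_score([100, 250], [104])
```
]]></p>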
               <p>We also consider our performance on single domain proteins, through the F-measure which is the harmonic mean of precision and recall. If <it>TP </it>is the number of proteins correctly predicted as single domain, <it>Pred </it>is the number of proteins predicted as single domain, and <it>Obs </it>is the true number of single domain proteins, recall is <inline-formula><graphic file="1471-2105-10-195-i25.gif"/></inline-formula> and precision is <inline-formula><graphic file="1471-2105-10-195-i26.gif"/></inline-formula>. Note that template quality, where we refer to it, is always the highest sequence identity between the query and the PDB templates found.</p>
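               <p>For concreteness, the single domain F-measure can be computed as in the following Python sketch, which takes recall as <it>TP</it>/<it>Obs </it>and precision as <it>TP</it>/<it>Pred</it>. The function name and the convention of returning zero for empty denominators are assumptions of the sketch.</p>
               <p><![CDATA[
```python
def single_domain_f_measure(tp: int, pred: int, obs: int) -> float:
    """Harmonic mean of precision (tp / pred) and recall (tp / obs).

    tp:   proteins correctly predicted as single domain
    pred: proteins predicted as single domain
    obs:  proteins that truly are single domain
    ASSUMPTION: empty denominators yield 0.0 rather than an error.
    """
    precision = tp / pred if pred else 0.0
    recall = tp / obs if obs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 8 correct out of 10 predicted, with 16 true single domain proteins.
f = single_domain_f_measure(tp=8, pred=10, obs=16)
```
]]></p>
               <p>In this made-up example precision is 0.8 and recall is 0.5, so the F-measure is 8/13, or roughly 0.62.</p>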
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>We train and test using a 5-fold cross validation procedure. The following models were trained:</p>
         <p indent="1">&#8226; Ab initio: All structural predictions are made using our ab initio structural prediction servers <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. In this case we use no contact information, as it led to no improvements in preliminary tests.</p>
         <p indent="1">&#8226; SCOP95: This model takes as input predicted structural information from our template-based structural predictors <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B41">41</abbr></abbrgrp>, PDB gap/non-gap information and SCOP templates.</p>
         <p indent="1">&#8226; SCOP25: Same as SCOP95 but trained on 25% thresholded templates, i.e. this time no template is allowed that shows more than 25% sequence identity to the query, including to the structural predictors.</p>
         <p indent="1">&#8226; PDB95: This is identical to the SCOP95 models except it now contains no SCOP template information. Note that, although SCOP is a subset of PDB and PDB information is input to this system, it does not include domain boundary annotations.</p>
         <p indent="1">&#8226; PDB25: Same as PDB95 but trained on 25% thresholded templates.</p>
         <p indent="1">&#8226; PDB95_NC and PDB25_NC: Identical to PDB95 and PDB25 except the contact profile in equation 5 is removed.</p>
         <p indent="1">&#8226; IBRNN95: This is identical to PDB95 except the BRNN now propagates its information and backpropagates its error along additional shortcut connections that correspond to contacting residue pairs.</p>
         <p indent="1">&#8226; IBRNN25: Same as IBRNN95 but trained on 25% thresholded templates.</p>
         <p>All these models have the same architecture, except for extra or missing inputs, and are trained by gradient descent. We ran only a small number (fewer than 10) of initial experiments, on the sets randomly split into half training and half test, to determine a good size for the architecture, while the cross-validations themselves are run only once. Varying the number of parameters of the networks in the initial tests between approximately 5,000 and 10,000 only led to very small changes (at most 0.5%) in predictive quality. When training we place an extra &#177; 5 residues around the SCOP boundary definitions. However, when testing, the original SCOP definition is used. Since the problem is extremely imbalanced the optimal threshold (the one that maximises the boundary score, for which see below) for determining boundaries is generally less than 0.5. For this reason we determine the optimal threshold on the training folds and test using this threshold on the test fold.</p>
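         <p>The threshold selection step can be sketched in Python as follows. This is a minimal illustration, not the code used here: the function name, the candidate grid and the <it>score_fn </it>callback (a stand-in for the boundary score) are all hypothetical.</p>
         <p><![CDATA[
```python
from typing import Callable, List, Sequence


def select_threshold(
        train_outputs: Sequence[Sequence[float]],
        train_labels: Sequence[Sequence[bool]],
        score_fn: Callable[[List[List[bool]], Sequence[Sequence[bool]]], float],
        candidates: Sequence[float]) -> float:
    """Return the candidate threshold that scores best on the training folds.

    train_outputs: per-protein lists of per-residue boundary probabilities
    train_labels:  per-protein lists of true boundary flags
    score_fn:      hypothetical scoring callback (higher is better)
    """
    def score_at(threshold: float) -> float:
        # Binarise every residue output at this threshold, then score.
        predictions = [[p >= threshold for p in protein]
                       for protein in train_outputs]
        return score_fn(predictions, train_labels)

    # max() keeps the first candidate in case of ties.
    return max(candidates, key=score_at)


def apply_threshold(outputs: Sequence[float], threshold: float) -> List[bool]:
    """Binarise test-fold outputs with the threshold chosen on training folds."""
    return [p >= threshold for p in outputs]
```
]]></p>
         <p>Because the threshold is chosen without looking at the test fold, the reported scores are not inflated by tuning on the evaluation data.</p>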
         <sec>
            <st>
               <p>95% distribution</p>
            </st>
            <p>Figure <figr fid="F5">5</figr> shows the domain boundary scores for SCOP95, PDB95 and Ab initio for the 95% template distribution, as a function of template quality. As expected SCOP95 is always clearly better than PDB95 when 20&#8211;95% templates are available, with differences ranging from 18.2% to 39.2%. Overall SCOP95 has a domain boundary score of 66.5% while PDB95 has 43.0%. When only considering templates with similarity greater than 25% these overall values rise to 69.3% and 45.3% respectively. In fact SCOP95 is always better than PDB95 except for a slight decrease in the sequence identity region [15,20)%. When examining the [0,25)% region as a whole we see that SCOP95 has a significantly larger domain boundary score of 40.4% as opposed to an ab initio score of 26.1%. The good performance of SCOP95 in these difficult regions may be due to finding low sequence identity templates where the networks can learn to determine the boundary by subtracting the template from the sequence. Indeed SCOP95 outperforms Ab initio in all template regions above 10% sequence identity. Ab initio is always worse than PDB95 in the [25,95]% similarity region, with an overall domain boundary score 19.7% worse, suggesting the BRNN learns to determine boundaries from accurately predicted structural information.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Comparison between models with 95% max sequence ID templates</p>
               </caption>
               <text>
                  <p><b>Comparison between models with 95% max sequence ID templates</b>. Comparing models across the 95% template distribution. Domain boundary scores as a function of best hit PDB sequence identity. Blue is SCOP95, red is PDB95 and green is ab initio.</p>
               </text>
               <graphic file="1471-2105-10-195-5"/>
            </fig>
            <p>In the [0,25)% region Ab initio is mostly better than PDB95 apart from the [15,20)% interval, for an overall score of 26.1% for Ab initio vs. 21.4% for PDB95. This suggests that PDB templates and template-based structural predictions are of little help when the templates are noisy. However, when a specialised system is built that only learns from noisy templates (see next section), it is still possible to glean enough information from templates to outperform the ab initio predictor. This suggests that, more than the noise itself, the small number of examples in the [0,25)% region is the main reason why PDB95 performs worse than ab initio here.</p>
            <p>In order to assess whether contact information improves domain boundary prediction with this template distribution we compare PDB95 with ab initio and with a version of PDB95 that is identical except that the contact inputs in equation 5 are removed (PDB95_NC). In the [0,25)% region (see figure <figr fid="F6">6</figr>), PDB95_NC performs better than PDB95 (24.4% vs. 21.4%) but still slightly less well than ab initio (26.1%). As expected, when templates improve (>25% identity) contact information becomes helpful, leading to significantly better domain boundary location prediction compared to both PDB95_NC and ab initio (PDB95 45.3%, PDB95_NC 34.8%, ab initio 25.6%). This indicates that contact information is indeed useful when good quality contact maps are available.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Comparison between models with 95% max sequence ID templates, with or without contact information</p>
               </caption>
               <text>
                  <p><b>Comparison between models with 95% max sequence ID templates, with or without contact information</b>. Comparing the PDB only models with contact information and without. Domain boundary scores as a function of best hit PDB sequence identity. Blue is PDB95 with contact information, red is PDB95 without contacts (PDB95_NC) and green is Ab initio.</p>
               </text>
               <graphic file="1471-2105-10-195-6"/>
            </fig>
            <p>Figures <figr fid="F7">7</figr> and <figr fid="F8">8</figr> show that our machine learning method, trained in 5-fold cross-validation on the M646 set (see Methods for details), improves over a simple baseline where equation 6 (a weighted average of boundary/non-boundary classes in the templates, normalised between 0 and 1) is adopted as the prediction from the SCOP templates without using any machine learning filtering. Absolute performances are shown in figure <figr fid="F7">7</figr>, while figure <figr fid="F8">8</figr> focusses on the difference between SCOP95 and the baseline. Preliminary tests showed that this is a better baseline than ones where only the best template or the top ten templates are considered. This is also the same vector provided as input to our system, hence it is a fair baseline to compare the system against as any gains represent enrichment of the information contained in the templates. It is worth noting that the deviations of the absolute results (in Figure <figr fid="F7">7</figr>) of either the baseline or SCOP95 are greater than the deviations of the difference between SCOP95 and baseline on a protein, i.e. the SCOP95 gain is more stable than its absolute score, likely because the variability of the quality of the template is eliminated from the latter (SCOP95 and baseline "see" the same templates). The differences between the prediction and the SCOP baseline are less than 2 standard deviations in all regions of sequence identity to the best template except [40%,50%) and [80%,90%). However the differences are nearly always of the same sign, and overall our system beats the baseline by 5.5%, which is more than 4 standard deviations. The gain in the [25%,100%) area (5.7%) is also more than 4 standard deviations. Encouragingly, in the difficult region (i.e. [0,25)%) there is also a 4.8% improvement over the baseline, although this is marginal, at 1.5 standard deviations.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>SCOP95 predictor vs. baseline</p>
               </caption>
               <text>
                  <p><b>SCOP95 predictor vs. baseline</b>. The baseline results (red bins) and SCOP95 results (blue bins) as a function of the identity to the best SCOP template. See text for more details.</p>
               </text>
               <graphic file="1471-2105-10-195-7"/>
            </fig>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>SCOP95 predictor vs. baseline 2</p>
               </caption>
               <text>
                  <p><b>SCOP95 predictor vs. baseline 2</b>. The difference between the SCOP95 predictor score and the baseline score (boundaries directly extracted from SCOP templates, as passed as input to the predictor). The size of the blocks represents the error. Although most differences in individual bins are not significant possibly due to the small size of the sample, the overall difference, and difference in the [25,95)% interval are significant, while the gain in the [0,25)% interval is marginal. See text for more details.</p>
               </text>
               <graphic file="1471-2105-10-195-8"/>
            </fig>
            <p>Finally table <tblr tid="T1">1</tblr> shows the F-measures on single domains for all the models trained on the 95% template distribution. In the [0,25)% region ab initio has the best single domain F-measure. Again the SCOP95 model is better by 3.7% at predicting single domain proteins compared to its corresponding baseline. As the templates improve we notice a clear gap between SCOP95 and the PDB only models of PDB95 and PDB95_NC (SCOP95 improves by 9&#8211;10%). When there are only PDB templates available ab initio slightly outperforms both PDB95 and PDB95_NC but the larger increase in boundary score outweighs this for [25,95)% templates.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Single domain F-scores</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>SCOP95</p>
                     </c>
                     <c ca="left">
                        <p>Baseline95</p>
                     </c>
                     <c ca="left">
                        <p>PDB95</p>
                     </c>
                     <c ca="left">
                        <p>PDB95_NC</p>
                     </c>
                     <c ca="left">
                        <p>Ab initio</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[0,25)%</p>
                     </c>
                     <c ca="left">
                        <p>83.7%</p>
                     </c>
                     <c ca="left">
                        <p>80.0%</p>
                     </c>
                     <c ca="left">
                        <p>85.0%</p>
                     </c>
                     <c ca="left">
                        <p>86.2%</p>
                     </c>
                     <c ca="left">
                        <p>88.1%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[25, end)%</p>
                     </c>
                     <c ca="left">
                        <p>84.9%</p>
                     </c>
                     <c ca="left">
                        <p>84.8%</p>
                     </c>
                     <c ca="left">
                        <p>75.8%</p>
                     </c>
                     <c ca="left">
                        <p>74.5%</p>
                     </c>
                     <c ca="left">
                        <p>77.0%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Single domain F-measures for all models trained with the 95% template distribution.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>25% distribution</p>
            </st>
            <p>In this case we exclude all templates showing a sequence identity greater than 25% to the query. The aim is to build systems that specialise on low-quality templates both by providing more low-quality examples and by not providing any good-quality ones. Figure <figr fid="F9">9</figr> shows the domain boundary scores for all the models considered in this region. As expected SCOP25 and PDB25 are now always above Ab initio, with a much greater margin and confidence than SCOP95 and PDB95.</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Results for max 25% identity templates</p>
               </caption>
               <text>
                  <p><b>Results for max 25% identity templates</b>. Domain boundary scores for the models trained on templates with at most 25% sequence identity to the query.</p>
               </text>
               <graphic file="1471-2105-10-195-9"/>
            </fig>
            <p>However, the F-measure on PDB25 is 8.8% worse than the same model without contact input (PDB25_NC), see table <tblr tid="T2">2</tblr>. Although the contact profile increases the boundary score, it may lead to over-predicting boundaries. The overall boundary score for PDB25_NC is 31.1% (PDB95_NC was 24.4% in this region), an increase of 1.3% over ab initio for this region. This, coupled with the fact that PDB25_NC has the highest single domain F-measure, makes it the best SCOP-less template model for the [0,25)% region. The SCOP-based model predicts boundaries with a score of 49% and clearly outperforms its baseline again, on average by 7% (roughly twice the error). Although evaluated on a different distribution of proteins, SCOP25 now has a much higher domain boundary score (+8.6%), at the price of a decrease in single domain F-measure (-5.5%).</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Single domain F-scores, max template ID 25%</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>SCOP25</p>
                     </c>
                     <c ca="left">
                        <p>Baseline25</p>
                     </c>
                     <c ca="left">
                        <p>PDB25</p>
                     </c>
                     <c ca="left">
                        <p>PDB25_NC</p>
                     </c>
                     <c ca="left">
                        <p>Ab initio</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[0,25)%</p>
                     </c>
                     <c ca="left">
                        <p>78.2%</p>
                     </c>
                     <c ca="left">
                        <p>70.4%</p>
                     </c>
                     <c ca="left">
                        <p>70.8%</p>
                     </c>
                     <c ca="left">
                        <p>79.6%</p>
                     </c>
                     <c ca="left">
                        <p>79.1%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Single domain F-measures for all models trained in the 25% template distribution.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Interaction BRNN</p>
            </st>
            <p>Table <tblr tid="T3">3</tblr> shows the overall results when using interaction connections within the BRNN trained with PDB only templates (IBRNN25 and IBRNN95). We can further improve boundary prediction by 3.4% with almost no change in single domain F-measure in the [25,95)% region by using the IBRNN. However, the residue-residue contacts are too noisy in the [0,25)% region and therefore single domain F-measure is low compared to other models due to under-prediction of boundaries. When training with [0,25)% templates (IBRNN25) the domain boundary score and single domain F-measure fall by 9% and 5% respectively. Clearly, explicit processing of contacts improves predictions but predicted maps need to be of fair to good quality, especially in order to prevent over-prediction of boundaries and corresponding worsening of single domain predictions.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>IBRNN scores</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>IBRNN25</p>
                     </c>
                     <c ca="left">
                        <p>IBRNN95</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[25,95)% dbs</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>48.7%(+3.4%)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[25,95)% F</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>75.3%(-0.5%)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[0,25)% dbs</p>
                     </c>
                     <c ca="left">
                        <p>27.3%(-9%)</p>
                     </c>
                     <c ca="left">
                        <p>27.4%(+6%)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>[0,25)% F</p>
                     </c>
                     <c ca="left">
                        <p>65.8%(-5%)</p>
                     </c>
                     <c ca="left">
                        <p>77.9%(-7.1%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>IBRNN overall domain boundary scores (dbs) and single domain F-measure (F). In brackets the increase or decrease over the normal BRNN with contact input (i.e. PDB25 and PDB95).</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Comparison with other predictors</p>
            </st>
            <p>Comparison of different domain predictors is difficult because previous methods were based on different datasets, domain definitions, benchmarks, cross validations and evaluation procedures. Thus, we take the comparisons made here with caution. State of the art results at CASP 7 have domain boundary scores between 65&#8211;69%. Our four best models SCOP95, SCOP25, IBRNN95, PDB25_NC achieve overall domain boundary scores of 66.5%, 48.9%, 46.5% and 31.1%. Figures <figr fid="F10">10</figr> and <figr fid="F11">11</figr> show the recall and precision of our models as a function of the distance from the true boundary within which a prediction is considered a success. Increasing the distance from 8 to 20 results in small improvements in both precision and recall, slightly more so for the less accurate systems (e.g. Ab initio). It should be noted that this is in essence equivalent to measuring the sensitivity of the results to artificially widening boundary regions.</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>Recall of domain boundaries</p>
               </caption>
               <text>
                  <p><b>Recall of domain boundaries</b>. Recall of domain boundaries as a function of distance from the true boundary.</p>
               </text>
               <graphic file="1471-2105-10-195-10"/>
            </fig>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Precision of domain boundaries</p>
               </caption>
               <text>
                  <p><b>Precision of domain boundaries</b>. Precision of domain boundaries as a function of distance from the true boundary.</p>
               </text>
               <graphic file="1471-2105-10-195-11"/>
            </fig>
            <p>From the plots we can see that template-based models clearly outperform Ab initio both in the domain boundary score and in the F-measure on single domain results. At a distance of 8 from the true boundary our recall is 71.2%, 54.1%, 56.5% and 40.5% for the models SCOP95, SCOP25, IBRNN95 and PDB25_NC respectively. The precision of the four models in the same order is 74.2%, 60.7%, 54.2% and 33.7%. The recall and precision (at a distance of 8) of the best server groups at CASP were (all derived from CASP7 assessment plots): DomPro recall 79% and precision 67%, Lee recall 75% and precision 64%, RosettaDom recall 65% and precision 70%, and Ginzu recall 59% and precision 79%. Direct comparisons would not be fair here for two major reasons. First, we have built SCOP domain predictors, and CASP assignments are normally different from SCOP. Second, and more importantly, while we show that by combining templates and sequence we perform better than with either alone, we obtain templates by PSI-BLAST, which has much lower sensitivity than many of the fold recognition components used by the top systems at CASP. However, we have run our methods on the Free Modelling (FM) CASP7 targets (i.e. those for which no suitable templates could be found according to the assessors), allowing only pre-CASP7 templates to be input (as available at the end of April 2006). Of the 10 FM single domain targets we predict 9 correctly (T0287, T0300, T0307, T0309, T0314, T0319, T0350, T0353, T0361) and one (T0296, the longest one at 445 residues) incorrectly as having two boundaries. As for multi-domain proteins, there was not a single fully FM case. 
If we focus on multi-domain targets containing at least one domain classified as FM: we predict both boundaries in T0356 (three domains: FM, Template-Based Modelling (TBM), FM) correctly (within 20 residues); we predict the boundary correctly in T0347 (TBM, FM), although we also predict a second, spurious boundary; we correctly predict T0316 as being 3-domain and place one of the two boundaries correctly; and we predict the number of domains of T0321 (TBM, TBM/FM) correctly but misplace the boundary by 28 residues.</p>
            <p>It should be noted that in none of these cases do we find PSI-BLAST templates, so we effectively predict all of them ab initio. CASP's domain assessment also focussed on T0301, which was considered a hard TBM prediction. The assessment article <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> cites one outstanding prediction for this target with an NDO score (Normalized Domain Overlap <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>) of 90; in this case we correctly predict the protein to be 2-domain, for an NDO score of 62.2.</p>
            <p>Most evaluations in the literature are carried out at a distance of &#177; 20 residues from the true domain boundary. Our ab initio model has a recall of 50.8% and a precision of 38.7% for domain boundaries within 20 residues of the true boundary. Hence, although this roughly matches the state-of-the-art (see below), in the ab initio case predictions are only of limited practical use. However, two points should be noted: for a majority of known protein sequences it is possible to identify a putative homologue in the PDB (for instance, upwards of 80% of queries at the last two CASP competitions have been assessed as template-based); and even in the ab initio case it is possible to achieve a higher precision at the cost of reduced recall. For instance, we obtain a 55% precision for a recall of 21.2%. Single domains are predicted with a recall of 80.3% and a precision of 78.1% on our dataset. Table <tblr tid="T4">4</tblr> shows a summary of some other methods and a short description of the datasets used. Random is a predictor in which we place the correct number of boundaries within a protein, but at random positions. In this case the Precision/Recall are 24.5%/17.6%, or approximately 2/3 and 1/3 of those of our ab initio system. ChopNet <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> has a reported boundary recall of 46&#8211;51% (when the boundary is within &#177; 20 of the true boundary) and a single domain recall of 73% on their SCOP-defined dataset when training on both a CATH and a SCOP dataset. When training on a SCOP-only dataset, as in this study, the recall of boundaries seems to be slightly reduced but the single domain recall is drastically reduced to 49%. 
The ab initio version of DOMAC <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp> (DomPro at CASP7) achieves a recall of 88.5% and a precision of 46.5% on single domain proteins, and achieves 27% and 14% recall and precision of domain boundaries within 20 residues, corresponding to an F1 score (the harmonic mean of recall and precision) of 18.4%. The dataset is a balanced, high-quality dataset manually curated by Holland et al. <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. In order to compare our methods with DOMAC we have also tested our ab initio predictor on this set, and obtain somewhat different recall and precision (16.3% and 19.7%), which yield a similar F1 (17.8%). The Domain Guess by Size algorithm <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> has a recall of 50% for domains shorter than 400 amino acids on a dataset with domain definitions from the Conserved Domain Database <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>. This seems surprisingly good for such a simple method. However, predictions were considered correct if a correct prediction was made in one of the top ten predictions, with the accuracy decreasing somewhat when considering only the best hit. SnapDragon <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> correctly identifies 47% of its single domain proteins. It also achieves a recall of 42.3% and a precision of 39.8% for the boundaries on a dataset mixing discontinuous and continuous protein domains. The true boundary sizes here were enlarged to a minimum of 21 residues, with a correct boundary being within &#177; 10 of this true boundary, which makes our &#177; 20 boundary distance comparable. Armadillo <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> achieves a recall of 37% and a precision of 36% on boundaries with a simple amino acid propensity index. Again, boundaries were considered correct within &#177; 20 residues.</p>
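            <p>The F1 values quoted above follow directly from the reported recall and precision. As a small worked check, the sketch below computes the harmonic mean for the two pairs of figures given in the text for the Holland dataset; the function name is our own choice for illustration.</p>

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision as reported in the text for the Holland dataset.
print(round(f_measure(27.0, 14.0), 1))   # ab initio DOMAC
print(round(f_measure(16.3, 19.7), 1))   # ab initio predictor of this study
```

            <p>The two calls give 18.4 and 17.8 respectively, matching the F1 scores reported for DOMAC and for our ab initio predictor.</p>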
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Comparison with other methods</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="center">
                        <p>Method</p>
                     </c>
                     <c ca="left">
                        <p>Dataset: number(domain definition)</p>
                     </c>
                     <c ca="left">
                        <p>Recall boundary</p>
                     </c>
                     <c ca="left">
                        <p>Precision boundary</p>
                     </c>
                     <c ca="left">
                        <p>Recall single</p>
                     </c>
                     <c ca="left">
                        <p>Precision single</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>This study</p>
                     </c>
                     <c ca="left">
                        <p>967(<it>SCOP</it>)</p>
                     </c>
                     <c ca="left">
                        <p>50.8%</p>
                     </c>
                     <c ca="left">
                        <p>38.7%</p>
                     </c>
                     <c ca="left">
                        <p>80.3%</p>
                     </c>
                     <c ca="left">
                        <p>78.1%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Random</p>
                     </c>
                     <c ca="left">
                        <p>967(<it>SCOP</it>)</p>
                     </c>
                     <c ca="left">
                        <p>17.6%</p>
                     </c>
                     <c ca="left">
                        <p>24.5%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>ChopNet</p>
                     </c>
                     <c ca="left">
                        <p>2127(<it>SCOP</it>) + 1300(<it>CATH</it>)</p>
                     </c>
                     <c ca="left">
                        <p>46&#8211;51%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>73%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>DOMAC</p>
                     </c>
                     <c ca="left">
                        <p>156 Holland <abbrgrp><abbr bid="B48">48</abbr></abbrgrp></p>
                     </c>
                     <c ca="left">
                        <p>27%</p>
                     </c>
                     <c ca="left">
                        <p>14%</p>
                     </c>
                     <c ca="left">
                        <p>88.5%</p>
                     </c>
                     <c ca="left">
                        <p>46.5%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>DGS</p>
                     </c>
                     <c ca="left">
                        <p>1236(<it>CDD</it>) <abbrgrp><abbr bid="B49">49</abbr></abbrgrp></p>
                     </c>
                     <c ca="left">
                        <p>50%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>SnapDragon</p>
                     </c>
                     <c ca="left">
                        <p>414(<it>Taylor</it>) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp></p>
                     </c>
                     <c ca="left">
                        <p>42.3%</p>
                     </c>
                     <c ca="left">
                        <p>39.8%</p>
                     </c>
                     <c ca="left">
                        <p>47%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Armadillo</p>
                     </c>
                     <c ca="left">
                        <p>585(<it>CATH </it>+ <it>VAST </it>+ <it>SCOP</it>)</p>
                     </c>
                     <c ca="left">
                        <p>37%</p>
                     </c>
                     <c ca="left">
                        <p>36%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Results for some other methods on their datasets compared to the overall ab initio results in this study.</p>
                  <p>See text for method citations.</p>
               </tblfn>
            </tbl>
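            <p>The Random row of Table 4 refers to a baseline that places the correct number of boundaries at random positions. A minimal sketch of such a baseline is given below, assuming that positions are drawn uniformly and without repetition along the chain; the text does not specify the sampling scheme, so this choice, the function name and the example chain length are assumptions.</p>

```python
import random


def random_boundaries(length, n_boundaries, rng):
    """Place n_boundaries at distinct random positions (1-based) in a chain.

    Assumption: uniform sampling without replacement over all residues.
    """
    return sorted(rng.sample(range(1, length + 1), n_boundaries))


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
print(random_boundaries(300, 2, rng))  # hypothetical 300-residue, 3-domain protein
```

            <p>Scoring such random placements with the same &#177; 20 residue criterion used for the other rows gives a baseline against which the precision and recall of a real predictor can be read.</p>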
            <p>Finally, we directly compare Ab initio and SCOP25 with the predictor in <abbrgrp><abbr bid="B50">50</abbr></abbrgrp> and with PPRODO <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and report the results in Table <tblr tid="T5">5</tblr>. In this case the two predictors are optimised for two-domain proteins rather than for a mixture of single and multiple-domain ones. For this reason we test the predictors, where possible, on both our sets and the sets they were optimised on. On the PPRODO set Ab initio roughly matches PPRODO's Recall (64.6% vs. 65.5%) but not its Precision (48.3% vs. 65.5%). The comparison is slightly more favourable against the predictor in <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, which has published Precision and Recall of 62%. On single domains our Recall is similar to PPRODO's. SCOP25 (no templates of any kind are input that show an identity greater than 25% to the query) fares better than Ab initio and roughly equivalently to PPRODO, with a Recall/Precision of 69.8%/57.8%. On our sets PPRODO performs quite well, with a Recall/Precision of 56.5%/51.3%, higher than Ab initio (50.8%/38.7%) but this time substantially lower than SCOP25 (59%/66.3%). All systems perform roughly equally well on single domains, with Recalls of 80% or just above. We were not able to get a version of the CAT dataset also used in <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>, and could not obtain a copy of the predictor in <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, so we could not test it on our sets. In Figure <figr fid="F12">12</figr> we report a ROC curve for PPRODO, Ab initio and SCOP25 on our sets. In this case success is measured per residue, rather than per boundary. It is important to notice that we use the original assignment of boundaries adopted by the different programs, i.e. a boundary is extended by 20 residues in both directions to determine positives for PPRODO, and by 5 for Ab initio and SCOP25. 
In this case the AUC (area under the curve) is 0.76 for PPRODO, 0.78 for Ab initio and 0.87 for SCOP25. If we consider boundaries to be extended by 20 residues, the Ab initio and SCOP25 AUCs decrease to 0.73 and 0.81, respectively. If we test PPRODO on boundaries extended by 5 residues on both sides, its AUC climbs slightly, to 0.77.</p>
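            <p>The per-residue evaluation described above can be sketched as follows. The example assumes that every residue within the chosen extension of a true boundary is a positive, that each predictor outputs one boundary score per residue, and that the AUC is computed as the probability that a positive residue is scored above a negative one (ties counted as one half). The function names, the toy chain and the scores are hypothetical and are not taken from any of the predictors compared here.</p>

```python
def residue_labels(length, boundaries, extension):
    """Label residue i (1-based) as 1 if within `extension` of a true boundary."""
    return [
        1 if any(extension >= abs(i - b) for b in boundaries) else 0
        for i in range(1, length + 1)
    ]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative residue")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count one half
    return wins / (len(pos) * len(neg))


# Hypothetical 12-residue chain with one true boundary at residue 6.
scores = [0.1, 0.1, 0.2, 0.4, 0.7, 0.9, 0.8, 0.5, 0.3, 0.2, 0.1, 0.1]
for ext in (1, 3):
    labels = residue_labels(12, [6], ext)
    print(ext, round(auc(labels, scores), 3))
```

            <p>Because the extension changes which residues count as positives, the same per-residue scores can yield different AUC values, which is why the extension used for each program is stated explicitly.</p>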
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Comparison with other methods 2</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="center">
                        <p>Method</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>PPRODO sets</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>M646+S321 sets</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Recall</p>
                     </c>
                     <c ca="left">
                        <p>Precision</p>
                     </c>
                     <c ca="left">
                        <p>Single</p>
                     </c>
                     <c ca="left">
                        <p>Recall</p>
                     </c>
                     <c ca="left">
                        <p>Precision</p>
                     </c>
                     <c ca="left">
                        <p>Single</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Ab initio</p>
                     </c>
                     <c ca="left">
                        <p>64.6%</p>
                     </c>
                     <c ca="left">
                        <p>48.3%</p>
                     </c>
                     <c ca="left">
                        <p>70.0%</p>
                     </c>
                     <c ca="left">
                        <p>50.8%</p>
                     </c>
                     <c ca="left">
                        <p>38.7%</p>
                     </c>
                     <c ca="left">
                        <p>80.3%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>SCOP25</p>
                     </c>
                     <c ca="left">
                        <p>69.8%</p>
                     </c>
                     <c ca="left">
                        <p>57.8%</p>
                     </c>
                     <c ca="left">
                        <p>70.2%</p>
                     </c>
                     <c ca="left">
                        <p>59.0%</p>
                     </c>
                     <c ca="left">
                        <p>66.3%</p>
                     </c>
                     <c ca="left">
                        <p>80.0%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPRODO <abbrgrp><abbr bid="B51">51</abbr></abbrgrp></p>
                     </c>
                     <c ca="left">
                        <p>65.5%</p>
                     </c>
                     <c ca="left">
                        <p>65.5%</p>
                     </c>
                     <c ca="left">
                        <p>70.2%</p>
                     </c>
                     <c ca="left">
                        <p>56.5%</p>
                     </c>
                     <c ca="left">
                        <p>51.3%</p>
                     </c>
                     <c ca="left">
                        <p>81.9%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B50">50</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                     <c ca="left">
                        <p>62.0%</p>
                     </c>
                     <c ca="left">
                        <p>62.0%</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Comparisons with PPRODO <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>.</p>
               </tblfn>
            </tbl>
            <fig id="F12">
               <title>
                  <p>Figure 12</p>
               </title>
               <caption>
                  <p>Ab initio vs. SCOP25 vs. PPRODO</p>
               </caption>
               <text>
                  <p><b>Ab initio vs. SCOP25 vs. PPRODO</b>. ROC curves for Ab initio, SCOP25 and PPRODO (see text for more details).</p>
               </text>
               <graphic file="1471-2105-10-195-12"/>
            </fig>
            <p>As our results show, template information, when handled by the best systems (SCOP25 or SCOP95, depending on template quality), can only improve on the ab initio system for all sequence identity ranges, even in the difficult [0,25)% region. Template-based comparisons are even harder, as the data available are more sparse. On a simple two-domain set DomSSEA <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> achieves a domain boundary recall of 49%, again within &#177; 20 residues, and correctly predicts 82.3% of single domains. DOMAC <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> (DOMAC is the hybrid of DomPro <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and template-based modelling) achieves a domain boundary recall of 50% and a precision of 76.5% (F = 60.5%) within 20 residues of the true boundary for its template-based part, and an F-measure of 83.7% on single domains, again on the Holland dataset.</p>
            <p>Our best template-based system (SCOP95) has boundary recall and precision of 74.0% and 77.1% (F = 75.5%) at &#177; 20 residues and correctly classifies 85.3% of single domain proteins. Even when we only use marginal templates (SCOP25, at most 25% sequence identity to the query) we achieve boundary recall and precision of 59% and 66.3% (F = 62.4%) and predict 80% of single domain proteins correctly. Although measured on different sets, all measures are roughly as good as those of the state-of-the-art systems DomSSEA and DOMAC.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have developed a fast system for the prediction of SCOP-defined domain boundaries that takes advantage of template-based structural predictions and SCOP templates. Within the limits of comparing systems on different datasets, we have shown that our ab initio system compares well with state-of-the-art ab initio predictors. Our best template-based systems outperform the ab initio system even when only poor templates are available, suggesting that not only can they be used for effective domain annotation in the presence of SCOP templates, but they may also achieve state-of-the-art performance when only twilight or no templates ([0,25)% sequence identity to the query) are available. We have also shown that our machine learning systems outperform baselines where boundary definitions are extracted directly from the best SCOP template, or from weighted and unweighted profiles of templates. Moreover, we have shown that when high-quality contact maps are factored into the prediction via a sophisticated machine learning model, it may be possible to achieve even better results. The systems are entirely automated and can be run on a genome scale on a small cluster of PCs.</p>
         <p>Our future work will focus on a number of directions: training and testing our systems on marginal templates, for instance obtained by subtler homology detection algorithms than PSI-BLAST; building a large-scale database of domain predictions, to be made publicly available and fed into the prediction loop alongside SCOP definitions; studying different domain definitions, for instance those in CATH and PrISM; and testing the hypothesis that exon information can lead to improved ab initio predictions <abbrgrp><abbr bid="B52">52</abbr><abbr bid="B53">53</abbr></abbrgrp>. Finally, we have set up a public web server implementing the methods described in this manuscript. The URL of the server is <url>http://distill.ucd.ie/shandy/</url>.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>IW designed and developed all the predictors, and wrote most of the first draft of the manuscript. AJMM contributed to the design of the homology component. CM produced the final homology detection component and contributed to the manuscript. ER designed an alternative pipeline and provided constant challenges to the development of the final systems. AV assisted in many phases of the development and provided useful suggestions. GP sparked the process, supervised all phases, wrote parts of the initial draft, produced the final version of the manuscript, and set up the web server. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work is supported by Science Foundation Ireland grant 05/RFP/CMS0029, grant RP/2005/219 from the Health Research Board of Ireland and a UCD President's Award 2004.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Computational prediction of domain interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Pagel</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Strack</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Oesterheld</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stumpflen</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Frishman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Methods Mol Biol</source>
            <pubdate>2007</pubdate>
            <volume>369</volume>
            <fpage>3</fpage>
            <lpage>15</lpage>
         </bibl>
         <bibl id="B2">
            <title>
               <p>An integrated approach to the prediction of domain-domain interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Deng</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sun</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>269</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1481624</pubid>
                  <pubid idtype="pmpid" link="fulltext">16725050</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Threading methods for protein structure prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hadley</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Bioinformatics, sequence, structure and databanks</source>
            <publisher>Heidelberg: Springer Verlag</publisher>
            <editor>Higgins D, Taylor WM</editor>
            <pubdate>2000</pubdate>
            <fpage>1</fpage>
            <lpage>13</lpage>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Solution Structure of the N-Terminal F1 Module Pair from Human Fibronectin</p>
            </title>
            <aug>
               <au>
                  <snm>Potts</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bright</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bolton</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pickford</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>1999</pubdate>
            <volume>38</volume>
            <issue>26</issue>
            <fpage>8304</fpage>
            <lpage>8312</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10387076</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Recent transformations in structural biology</p>
            </title>
            <aug>
               <au>
                  <snm>Matthews</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Methods in Enzymology</source>
            <pubdate>1997</pubdate>
            <volume>276</volume>
            <fpage>3</fpage>
            <lpage>10</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>The Protein Data Bank</p>
            </title>
            <aug>
               <au>
                  <snm>Berman</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Westbrook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Feng</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Gilliland</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bhat</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Weissig</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>235</fpage>
            <lpage>242</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102472</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592235</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility and Recursive Neural Networks</p>
            </title>
            <aug>
               <au>
                  <snm>Cheng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sweredoski</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Data Mining and Knowledge Discovery</source>
            <pubdate>2006</pubdate>
            <volume>13</volume>
            <issue>1</issue>
            <fpage>1</fpage>
            <lpage>10</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>An Accurate, Hybrid Protein Domain Prediction Server</p>
            </title>
            <aug>
               <au>
                  <snm>Cheng</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <fpage>354</fpage>
            <lpage>356</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Comparative protein modelling by satisfaction of spatial restraints</p>
            </title>
            <aug>
               <au>
                  <snm>Sali</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Blundell</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1993</pubdate>
            <volume>234</volume>
            <fpage>779</fpage>
            <lpage>815</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">8254673</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>PDP: protein domain parser</p>
            </title>
            <aug>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>3</issue>
            <fpage>429</fpage>
            <lpage>430</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12584135</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Chivian</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Malmstr&#246;m</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>61</volume>
            <issue>7</issue>
            <fpage>193</fpage>
            <lpage>200</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16187362</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions</p>
            </title>
            <aug>
               <au>
                  <snm>Simons</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kooperberg</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1997</pubdate>
            <volume>268</volume>
            <issue>1</issue>
            <fpage>209</fpage>
            <lpage>25</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9149153</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Protein structural domain identification</p>
            </title>
            <aug>
               <au>
                  <snm>Taylor</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Protein Engineering</source>
            <pubdate>1999</pubdate>
            <volume>12</volume>
            <issue>3</issue>
            <fpage>203</fpage>
            <lpage>216</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10235621</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>SnapDRAGON: a method to delineate protein structural domains from sequence data</p>
            </title>
            <aug>
               <au>
                  <snm>George</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>2002</pubdate>
            <volume>316</volume>
            <issue>2</issue>
            <fpage>839</fpage>
            <lpage>851</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11866536</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Domain size distributions can predict domain boundaries</p>
            </title>
            <aug>
               <au>
                  <snm>Wheelan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Marchler-Bauer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>7</issue>
            <fpage>613</fpage>
            <lpage>618</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11038331</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Rapid protein domain assignment from amino acid sequence using predicted secondary structure</p>
            </title>
            <aug>
               <au>
                  <snm>Marsden</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>McGuffin</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Protein Science</source>
            <pubdate>2002</pubdate>
            <volume>11</volume>
            <fpage>2814</fpage>
            <lpage>2824</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2373756</pubid>
                  <pubid idtype="pmpid" link="fulltext">12441380</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>CATH: A Hierarchic Classification of Protein Domain Structures</p>
            </title>
            <aug>
               <au>
                  <snm>Orengo</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Michie</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Swindells</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Structure</source>
            <pubdate>1997</pubdate>
            <volume>5</volume>
            <issue>8</issue>
            <fpage>1093</fpage>
            <lpage>108</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9309224</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Armadillo: domain boundary prediction by amino acid composition</p>
            </title>
            <aug>
               <au>
                  <snm>Dumontier</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Yao</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Feldman</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hogue</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2005</pubdate>
            <volume>350</volume>
            <issue>5</issue>
            <fpage>1061</fpage>
            <lpage>73</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15978619</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>An analysis of protein domain linkers: their classification and role in protein folding</p>
            </title>
            <aug>
               <au>
                  <snm>George</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Protein Engineering</source>
            <pubdate>2002</pubdate>
            <volume>15</volume>
            <issue>11</issue>
            <fpage>871</fpage>
            <lpage>879</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12538906</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Role of linkers in communication between protein modules</p>
            </title>
            <aug>
               <au>
                  <snm>Gokhale</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Khosla</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Current Opinion in Chemical Biology</source>
            <pubdate>2000</pubdate>
            <volume>4</volume>
            <issue>1</issue>
            <fpage>22</fpage>
            <lpage>27</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10679375</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Optimizing the Stability of Single-Chain Proteins by Linker Length and Composition Mutagenesis</p>
            </title>
            <aug>
               <au>
                  <snm>Robinson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sauer</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>PNAS</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>11</issue>
            <fpage>5929</fpage>
            <lpage>5934</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">34497</pubid>
                  <pubid idtype="pmpid" link="fulltext">9600894</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Linker length and composition influence the flexibility of Oct-1 DNA binding</p>
            </title>
            <aug>
               <au>
                  <snm>van Leeuwen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Strating</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rensen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>de Laat</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Vliet</snm>
                  <mnm>van der</mnm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>EMBO J</source>
            <pubdate>1997</pubdate>
            <volume>16</volume>
            <issue>8</issue>
            <fpage>2043</fpage>
            <lpage>2053</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1169807</pubid>
                  <pubid idtype="pmpid" link="fulltext">9155030</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Improving the Accuracy of Protein Secondary Structure Prediction Using Structural Alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Montgomerie</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sundaraj</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gallin</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Wishart</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>301</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1550433</pubid>
                  <pubid idtype="pmpid" link="fulltext">16774686</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>CASP Home page</p>
            </title>
            <url>http://predictioncenter.org/</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Assessment of predictions submitted for the CASP7 domain prediction category</p>
            </title>
            <aug>
               <au>
                  <snm>Tress</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cheng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Joo</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Seo</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Chivian</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Ezkurdia</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2007</pubdate>
            <volume>69</volume>
            <issue>8</issue>
            <fpage>137</fpage>
            <lpage>51</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17680686</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>SCOP: a structural classification of proteins database for the investigation of sequences and structures</p>
            </title>
            <aug>
               <au>
                  <snm>Murzin</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1995</pubdate>
            <volume>247</volume>
            <fpage>536</fpage>
            <lpage>540</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">7723011</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Porter: a new, accurate server for protein secondary structure prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>McLysaght</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>8</issue>
            <fpage>1719</fpage>
            <lpage>20</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15585524</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information</p>
            </title>
            <aug>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mooney</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vullo</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>201</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1800867</pubid>
                  <pubid idtype="pmpid" link="fulltext">17224043</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Protein Structural Motif Prediction in Multidimensional <it>f</it>-Space leads to improved Secondary Structure Prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Mooney</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vullo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2006</pubdate>
            <volume>13</volume>
            <issue>8</issue>
            <fpage>1489</fpage>
            <lpage>1502</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17061924</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>A two-stage approach for improved prediction of residue contact maps</p>
            </title>
            <aug>
               <au>
                  <snm>Vullo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Walsh</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>180</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1484494</pubid>
                  <pubid idtype="pmpid" link="fulltext">16573808</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Exploiting the past and the future in protein secondary structure prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Frasconi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Soda</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>937</fpage>
            <lpage>946</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10743560</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Prediction of protein secondary structure at better than 70% accuracy</p>
            </title>
            <aug>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1993</pubdate>
            <volume>232</volume>
            <fpage>584</fpage>
            <lpage>599</lpage>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Conservation and prediction of solvent accessibility in protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1994</pubdate>
            <volume>20</volume>
            <issue>3</issue>
            <fpage>216</fpage>
            <lpage>26</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7892171</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Learning internal representations by error propagation</p>
            </title>
            <aug>
               <au>
                  <snm>Rumelhart</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hinton</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Parallel distributed processing: explorations in the microstructure of cognition</source>
            <pubdate>1986</pubdate>
            <volume>1</volume>
            <issue>foundations</issue>
            <fpage>318</fpage>
            <lpage>62</lpage>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Prediction of Coordination Number and Relative Solvent Accessibility in Proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Fariselli</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Casadio</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2002</pubdate>
            <volume>47</volume>
            <fpage>142</fpage>
            <lpage>153</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11933061</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Przybylski</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2002</pubdate>
            <volume>47</volume>
            <fpage>228</fpage>
            <lpage>235</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11933069</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Domains, motifs and clusters in the protein universe</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Curr Opin Chem Biol</source>
            <pubdate>2003</pubdate>
            <volume>7</volume>
            <issue>1</issue>
            <fpage>5</fpage>
            <lpage>11</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12547420</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Learning long-term dependencies with gradient descent is difficult</p>
            </title>
            <aug>
               <au>
                  <snm>Bengio</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Frasconi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Simard</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>IEEE Trans Neural Networks</source>
            <pubdate>1994</pubdate>
            <volume>5</volume>
            <issue>2</issue>
            <fpage>157</fpage>
            <lpage>66</lpage>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Learning Protein Secondary Structure from Sequential and Relational Data</p>
            </title>
            <aug>
               <au>
                  <snm>Ceroni</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Frasconi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Neural Networks</source>
            <pubdate>2005</pubdate>
            <volume>18</volume>
            <issue>8</issue>
            <fpage>1029</fpage>
            <lpage>39</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16182513</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Walsh</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Ba&#250;</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Mooney</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vullo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>BMC Structural Biology</source>
            <pubdate>2009</pubdate>
            <volume>9</volume>
            <fpage>5</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2654788</pubid>
                  <pubid idtype="pmpid" link="fulltext">19183478</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Creating representative protein sequence sets</p>
            </title>
            <aug>
               <au>
                  <snm>Mika</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3789</fpage>
            <lpage>91</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">169026</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824419</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>The HSSP database of protein structure-sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1996</pubdate>
            <volume>24</volume>
            <issue>1</issue>
            <fpage>201</fpage>
            <lpage>205</lpage>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Rigden</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Protein Engineering</source>
            <pubdate>2002</pubdate>
            <volume>15</volume>
            <issue>2</issue>
            <fpage>65</fpage>
            <lpage>77</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11917143</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Enlarged representative set of protein structures</p>
            </title>
            <aug>
               <au>
                  <snm>Hobohm</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1994</pubdate>
            <volume>3</volume>
            <fpage>522</fpage>
            <lpage>24</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2142698</pubid>
                  <pubid idtype="pmpid" link="fulltext">8019422</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>Distill: A suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Ba&#250;</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>AJM</fnm>
               </au>
               <au>
                  <snm>Mooney</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vullo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Walsh</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>402</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1574355</pubid>
                  <pubid idtype="pmpid" link="fulltext">16953874</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Sequence-based prediction of protein domains</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>12</issue>
            <fpage>3522</fpage>
            <lpage>3530</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">484172</pubid>
                  <pubid idtype="pmpid" link="fulltext">15240828</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>A benchmark for domain assignment from protein 3-dimensional structure and its applications</p>
            </title>
            <aug>
               <au>
                  <snm>Holland</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Veretnik</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shindyalov</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Bourne</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2006</pubdate>
            <volume>361</volume>
            <fpage>562</fpage>
            <lpage>590</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16863650</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>CDD: a curated Entrez database of conserved domain alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Marchler-Bauer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Anderson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>DeWeese-Scott</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fedorova</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Geer</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hurwitz</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Jackson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jacobs</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lanczycki</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Liebert</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Madej</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Marchler</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Mazumder</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Nikolskaya</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Panchenko</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rao</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Shoemaker</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Simonyan</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Song</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Thiessen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Vasudevan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yamashita</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Yin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>383</fpage>
            <lpage>387</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165534</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520028</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Sequence-based protein domain boundary prediction using BP neural network with various property profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Ye</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Zhou</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2008</pubdate>
            <volume>71</volume>
            <fpage>300</fpage>
            <lpage>307</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17932915</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>Pprodo: prediction of protein domain boundaries using neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Sim</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>SY</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>59</volume>
            <fpage>627</fpage>
            <lpage>632</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15789433</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B52">
            <title>
               <p>The exon theory of genes</p>
            </title>
            <aug>
               <au>
                  <snm>Gilbert</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Cold Spring Harbor symposia on quantitative biology</source>
            <pubdate>1987</pubdate>
            <volume>52</volume>
            <fpage>901</fpage>
            <lpage>5</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2456887</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B53">
            <title>
               <p>Testing the exon theory of genes: the evidence from protein structure</p>
            </title>
            <aug>
               <au>
                  <snm>Gilbert</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1994</pubdate>
            <volume>265</volume>
            <issue>5169</issue>
            <fpage>202</fpage>
            <lpage>207</lpage>
         </bibl>
      </refgrp>
   </bm>
</art>
