<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-437</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>STAR: predicting recombination sites from amino acid sequence</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Bauer</snm>
               <mi>C</mi>
               <fnm>Denis</fnm>
               <insr iid="I1"/>
               <email>d.bauer@imb.uq.edu.au</email>
            </au>
            <au id="A2">
               <snm>Bod&#233;n</snm>
               <fnm>Mikael</fnm>
               <insr iid="I2"/>
               <email>mikael@itee.uq.edu.au</email>
            </au>
            <au id="A3">
               <snm>Thier</snm>
               <fnm>Ricarda</fnm>
               <insr iid="I3"/>
               <email>r.thier@uq.edu.au</email>
            </au>
            <au id="A4">
               <snm>Gillam</snm>
               <mi>M</mi>
               <fnm>Elizabeth</fnm>
               <insr iid="I3"/>
               <email>e.gillam@uq.edu.au</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute for Molecular Bioscience, The University of Queensland, QLD 4072, Australia</p>
            </ins>
            <ins id="I2">
               <p>School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Australia</p>
            </ins>
            <ins id="I3">
               <p>School of Biomedical Sciences, The University of Queensland, QLD 4072, Australia</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>437</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/437</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17026775</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-437</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>29</day>
               <month>4</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>08</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>08</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Bauer et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Designing novel proteins with site-directed recombination has enormous prospects. By locating effective recombination sites for swapping sequence parts, the probability that hybrid sequences have the desired properties is increased dramatically. The prohibitive requirements for applying current tools led us to investigate machine learning to assist in finding useful recombination sites from amino acid sequence alone.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We present STAR, the Site Targeted Amino acid Recombination predictor, which produces, for each position in an amino acid sequence, a score indicating the structural disruption caused by recombination. Example predictions, contrasted with those of alternative tools, illustrate STAR's utility in determining useful recombination sites. Overall, the correlation coefficient between the output of the experimentally validated protein design algorithm SCHEMA and the prediction of STAR is very high (0.89).</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>STAR allows the user to explore useful recombination sites in amino acid sequences with unknown structure and unknown evolutionary origin. The predictor service is available from <url>http://pprowler.itee.uq.edu.au/star</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Recombinant DNA techniques, such as DNA shuffling, generate more diverse libraries than random mutagenesis with a relatively high fraction of functional proteins <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Like mutagenesis but unlike <it>de novo </it>protein design, recombination deals with native sequences whose effectiveness for some particular function is established. As protein design tools, recombinatorial techniques dramatically reduce the combinatorial space of possible sequences to an area which can be explored <it>in vitro </it>more easily.</p>
         <p>Site-directed (as opposed to random) recombination systematically reduces the amino acid sequence space for consideration by identifying specific sites in parental sequences at which their parts can be interchanged <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. This paper develops and evaluates a method that uses machine learning to suggest recombination sites solely from the amino acid sequences of the parents. Alternative tools may be more precise but require the parents to be either structurally resolved <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> or phylogenetically well characterised <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
         <p>SCHEMA predicts structural disruption from the contact maps of the parents <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. SCHEMA can be applied either to identify recombination sites (using its so-called S-profile) or to evaluate the folding potential of specific hybrids (using its E-value). RASPP (Recombination as a Shortest-Path Problem) was introduced to generate and evaluate candidate hybrids using the E-value <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
         <p>SCHEMA's S-profile essentially identifies sequence segments that are likely to fold independently from the rest of the protein <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The profile is the series of sums of possible disruptions (caused by recombination) within a window centered on each residue of the protein (see Equation 1). The assumption behind SCHEMA is that protein function is preserved by not interfering with these structural "building blocks". Different segments are sampled from a family of parent proteins and recombined with the main structural features left intact <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Indeed, sites identified by SCHEMA match successful recombination sites used in <it>in vitro </it>experiments <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Additionally, the application of SCHEMA to the <it>&#946;</it>-lactamase and cytochrome P450 families has been helpful in evaluating (and maximising) the ratio of folded and functional enzyme hybrids <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B8">8</abbr></abbrgrp>. Note that SCHEMA's "building blocks" must not be confused with domains or other structural or functional subdivisions (motifs, modules or exons). Placing recombination sites only at the boundaries of such structural subdivisions severely hampers exploration <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Additionally, to enhance a function, intra-domain segments may need to be perturbed.</p>
         <p>Unfortunately, SCHEMA requires the full tertiary description of the protein structure (as in the Protein Data Bank, PDB). This requirement severely limits the candidate proteins to the small group whose structures are already resolved. Because structure determination is expensive, time-consuming and complicated, the number of proteins with known structure is likely to remain small compared with the number of known sequences.</p>
         <p>FamClash is another method for evaluating the potency of hybrids <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. FamClash checks, for every amino acid pair [<it>i</it>, <it>j</it>] (residue at position <it>i </it>in parent 1 and position <it>j </it>in parent 2), whether charge, volume and hydrophobicity are in agreement with the conserved properties at [<it>i</it>, <it>j</it>] in the protein family to which both parents belong. Determining the conserved properties of the family requires a preprocessing step. Each residue pair in the <it>m</it>th member of the family forms a unique 3D coordinate [<m:math name="1471-2105-7-437-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>j</m:mi></m:mrow><m:mi>m</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemyAaKMaeiilaWIaemOAaOgabaGaemyBa0gaaaaa@32E3@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-7-437-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>V</m:mi><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>j</m:mi></m:mrow><m:mi>m</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGwbGvdaqhaaWcbaGaemyAaKMaeiilaWIaemOAaOgabaGaemyBa0gaaaaa@3309@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-7-437-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>H</m:mi><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>j</m:mi></m:mrow><m:mi>m</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGibasdaqhaaWcbaGaemyAaKMaeiilaWIaemOAaOgabaGaemyBa0gaaaaa@32ED@</m:annotation></m:semantics></m:math>] with <m:math name="1471-2105-7-437-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mrow><m:mi>i</m:mi><m:mo>,</m:mo><m:mi>j</m:mi></m:mrow><m:mi>m</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemyAaKMaeiilaWIaemOAaOgabaGaemyBa0gaaaaa@32E3@</m:annotation></m:semantics></m:math> = <it>C</it>(<it>k</it>) + <it>C</it>(<it>l</it>), where <it>k </it>is the residue at position <it>i </it>in <it>m </it>and <it>l </it>at <it>j </it>and <it>C</it>(&#183;) is the value of a lookup table containing the charge of the residue (<it>V </it>is the volume and <it>H </it>the hydrophobicity, respectively). The space is partitioned into cubes. Residue pairs whose coordinates are within one cube are defined to be similar. A position [<it>i</it>, <it>j</it>] is conserved if 20% of the family members have a residue pair for [<it>i</it>, <it>j</it>] that is within the cube. Each residue pair in the hybrid can now be assessed with the triplet of the mean values of the conserved residue pairs in the protein family. Any deviation between hybrid and conserved property in the family is denoted as a "clash". Positions with no conservation are simply ignored. A limitation of FamClash is its reliance on members of the parents' protein family. A sufficiently large protein family needs to be established and analysed to properly identify conservation.</p>
         <p>Saraf et al. developed OPTCOMB <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, which, like RASPP, identifies the optimal recombination sites and, additionally, is able to limit the parental sequence fragments for the library to the most promising ones. FamClash <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> is used as an objective function. Thus, OPTCOMB identifies recombination sites which produce hybrids with the minimal number of clashes. Like RASPP, OPTCOMB is able to use any function that evaluates a protein, including SCHEMA <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> or the current method.</p>
         <p>Using machine learning methods, we developed STAR (Site Targeted Amino acid Recombination predictor) to extend a SCHEMA-like analysis to proteins for which no structure has been solved. STAR predicts the maximum number of connections that can be broken by recombination for each position in the parent &#8211; SCHEMA's single-parent S-score (see Equation 1). A minimum in this STAR-profile corresponds to a region where the protein structure has lower contact density and is thus less prone to disruption by recombination.</p>
         <p>Notably, SCHEMA's S-score has largely been superseded by the E-value (which is now the recommended choice according to the original SCHEMA authors). Since the intention here is partly to explore the applicability of machine learning tools to assist in the determination of recombination sites, we choose to focus on the simpler strategy based on the S-score as calculated from a single sequence.</p>
         <p>For comparison, I-Mutant2.0 <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, MUpro <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and Conseq <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> predict stability changes caused by single-point mutation from amino acid sequence data. These methods are not specifically designed to assist in the determination of recombination sites.</p>
         <p>However, by exhaustively testing all possible amino acid substitutions for each position, we can identify crucial amino acids within the sequence, and compare these scores with STAR's. I-Mutant2.0 and MUpro are both machine learning-based and predict a stability change (positive or negative) of the whole protein for a given single-point mutation. If a residue is crucial for retaining the current structure, substitutions would result in large negative predictions. Conseq uses phylogenetic trees to calculate a conservation score for each residue, which should correlate with its importance.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>The goal is to develop a model using machine learning methods to predict SCHEMA's <it>single-</it>parent S-profile from an amino acid sequence. From a machine learning point of view, the goal presents a significant challenge since SCHEMA relies crucially on structural features, not amino acid composition. As per Equation 1 the S-score counts the number of connections that break if position <it>i </it>serves as the point at which sequence parts are exchanged.</p>
         <p>
            <m:math name="1471-2105-7-437-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
               <m:semantics>
                  <m:mrow>
                     <m:msub>
                        <m:mi>S</m:mi>
                        <m:mi>i</m:mi>
                     </m:msub>
                     <m:mo>=</m:mo>
                     <m:mstyle displaystyle="true">
                        <m:munderover>
                           <m:mo>&#8721;</m:mo>
                           <m:mrow>
                              <m:mi>j</m:mi>
                              <m:mo>=</m:mo>
                              <m:mi>i</m:mi>
                              <m:mo>&#8722;</m:mo>
                              <m:mi>w</m:mi>
                              <m:mo>+</m:mo>
                              <m:mn>1</m:mn>
                           </m:mrow>
                           <m:mi>i</m:mi>
                        </m:munderover>
                        <m:mrow>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mi>j</m:mi>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>j</m:mi>
                                    <m:mo>+</m:mo>
                                    <m:mi>w</m:mi>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mn>2</m:mn>
                                 </m:mrow>
                              </m:munderover>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>l</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mi>k</m:mi>
                                          <m:mo>+</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>+</m:mo>
                                          <m:mi>w</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>c</m:mi>
                                          <m:mrow>
                                             <m:mi>k</m:mi>
                                             <m:mi>l</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                     </m:mstyle>
                     <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                     <m:mrow>
                        <m:mo>(</m:mo>
                        <m:mn>1</m:mn>
                        <m:mo>)</m:mo>
                     </m:mrow>
                  </m:mrow>
                  <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGaemyAaKgabeaakiabg2da9maaqahabaWaaabCaeaadaaeWbqaaiabdogaJnaaBaaaleaacqWGRbWAcqWGSbaBaeqaaaqaaiabdYgaSjabg2da9iabdUgaRjabgUcaRiabigdaXaqaaiabdQgaQjabgUcaRiabdEha3jabgkHiTiabigdaXaqdcqGHris5aaWcbaGaem4AaSMaeyypa0JaemOAaOgabaGaemOAaOMaey4kaSIaem4DaCNaeyOeI0IaeGOmaidaniabggHiLdaaleaacqWGQbGAcqGH9aqpcqWGPbqAcqGHsislcqWG3bWDcqGHRaWkcqaIXaqmaeaacqWGPbqAa0GaeyyeIuoakiaaxMaacaWLjaWaaeWaaeaacqaIXaqmaiaawIcacaGLPaaaaaa@5D00@</m:annotation>
               </m:semantics>
            </m:math>
         </p>
         <p><it>S<sub>i </sub></it>describes, for each sequence position <it>i</it>, the number of contacts within the window (<it>i </it>- <it>w</it>, <it>i </it>+ <it>w</it>) that could be broken if the recombination site is positioned at <it>i</it>. <it>c</it><sub><it>kl </it></sub>= 1 if residues <it>k </it>and <it>l </it>are within 4.5 &#197; of one another, and <it>c</it><sub><it>kl </it></sub>= 0 otherwise. In the <it>multi-</it>parent S-score, <it>c<sub>kl </sub></it>is derived from a multiply-aligned contact map <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. In that case, <it>c<sub>kl </sub></it>only counts when the two residues <it>k </it>and <it>l </it>come from different parents and represent different amino acids. The single-parent S-score neglects the overlap between parents and is thus an upper bound on the multi-parent S-score (and converges to it with decreasing parent sequence similarity). Following the configuration used in <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, <it>w </it>is set to 14 residues.</p>
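         <p>As an illustration, the triple sum of Equation 1 can be written as a short Python function. This is a minimal sketch rather than the SCHEMA or STAR implementation: the function names, the 0-based indexing and the handling of windows that extend past the chain ends (positions outside the chain are simply skipped) are assumptions made for the example.</p>

```python
def window_contacts(contacts, start, w):
    """Count contacts c_kl over all residue pairs inside one window.

    The window covers positions start .. start + w - 1 (0-based). Positions
    outside the chain are skipped, which is an assumption of this sketch.
    """
    n = len(contacts)
    first = max(start, 0)
    stop = min(start + w, n)  # exclusive upper bound
    total = 0
    for k in range(first, stop):
        for l in range(k + 1, stop):
            total += contacts[k][l]
    return total


def s_profile(contacts, w=14):
    """Single-parent S-score for every position i, following Equation 1."""
    n = len(contacts)
    # Every window start j is shared by up to w positions, so count it once.
    per_window = {j: window_contacts(contacts, j, w) for j in range(1 - w, n)}
    return [sum(per_window[j] for j in range(i - w + 1, i + 1))
            for i in range(n)]


# Toy chain of four residues with a single contact between residues 1 and 2.
toy = [[0, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
print(s_profile(toy, w=2))  # [0, 1, 1, 0]
```

         <p>With <it>w </it>= 2 only neighbouring residues share a window, so the single contact is counted once for each of the two positions whose windows contain it.</p>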
         <p>To build a data set for training and evaluating machine learning models, the binary contact map was derived for each of the proteins in a 945-protein data set (extracted from PDB) by employing a Euclidean cut-off distance of 4.5 &#197;. Equation 1 was then evaluated for each residue in the set. For practical purposes, the single-parent S-score is normalised to fall in the 0&#8211;1 interval: <m:math name="1471-2105-7-437-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>i</m:mi><m:mi>N</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaemyAaKgabaGaemOta4eaaaaa@3088@</m:annotation></m:semantics></m:math> = <it>tanh</it>(<it>S</it><sub><it>i</it></sub>/<it>max</it>) where we have preset <it>max </it>= 637 from gleaning the data set. We call this score the <it>calculated </it>STAR-score. The sequence set represents a diverse range of proteins and has no pairs with more than 25% sequence similarity <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The data set is made available from the predictor home page.</p>
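         <p>The two preprocessing steps described above, thresholding distances at 4.5 &#197; and squashing the S-score with <it>tanh</it>, might be sketched as follows. The text does not state which atoms define the distance between two residues, so the sketch assumes one representative coordinate per residue; the function names are likewise illustrative.</p>

```python
import math

MAX_S = 637.0  # normalisation constant quoted in the text
CUTOFF = 4.5   # contact distance in angstrom, as in the text


def contact_map(coords, cutoff=CUTOFF):
    """Binary contact map from one (x, y, z) coordinate per residue.

    Using a single representative coordinate per residue is an assumption
    of this sketch, not a detail taken from the paper.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for k in range(n):
        for l in range(k + 1, n):
            if cutoff >= math.dist(coords[k], coords[l]):
                contacts[k][l] = contacts[l][k] = 1
    return contacts


def normalise(s, max_s=MAX_S):
    """Map a non-negative S-score into the 0-1 interval via tanh(S / max)."""
    return math.tanh(s / max_s)
```

         <p>Because <it>S<sub>i </sub></it>is a count and therefore non-negative, <it>tanh </it>keeps the normalised score between 0 and 1.</p>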
      </sec>
      <sec>
         <st>
            <p>Model development</p>
         </st>
         <p>PSI-BLAST profiles are used successfully for most protein structure prediction problems, including secondary structure, residue contacts, and surface accessibility <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. In a standard fashion, each protein chain in the data set is represented using the PSI-BLAST profile (PSI-BLAST is run with 3 iterations over Genbank's non-redundant protein set). The profile implicitly incorporates information about sequence variability and the location of indels within a family of proteins <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
         <p>We evaluate two major types of machine learning algorithms, namely Support Vector Regression and Neural Networks. Both types have repeatedly been found superior for relevant prediction problems (e.g. secondary structure prediction <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>, contact number and solvent accessibility prediction <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>). The input window size of all models is set to 15 residues: the residue for which the SCHEMA score is predicted plus the 7 residues immediately upstream and the 7 residues immediately downstream. We use 2-fold crossvalidation to develop and test models, meaning that one simulation run involves training and testing two models. Each model is trained on half the data and tested on the remaining half, controlled so that each data point appears as a test point in exactly one model. Preliminary simulations with these models showed that the differences between 10-fold and 2-fold crossvalidation are negligible (not shown), so we consistently use 2-fold crossvalidation to minimise the computational time required. We repeated the runs to check the consistency of the results, and report the average accuracy below.</p>
         <p>The STAR-score is not predicted with reasonable accuracy if the machine learning algorithm is presented only with the sequence data encoded as PSI-BLAST profiles (see Table <tblr tid="T1">1</tblr>). Instead, we first present the sequence to an existing secondary structure predictor and then use its output as the input to our STAR-score predictor. To predict the secondary structure, we use the Continuum Secondary Structure Predictor <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, which produces a probability for each secondary structure state. Notably, the Continuum Secondary Structure Predictor has a state-of-the-art classification accuracy of <it>Q</it><sub>3 </sub>= 77.3%, and preliminary trials show no significant difference in STAR-score prediction accuracy if the true secondary structure is used as input instead.</p>
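         <p>The 15-residue input window and the 2-fold split can be sketched in a few lines of Python. This is an illustration only: zero padding at the chain ends, the random assignment of items to the two halves and the function names are assumptions, since the text does not specify them.</p>

```python
import random


def input_windows(profile, flank=7):
    """Return one flattened window of 2 * flank + 1 rows per residue.

    profile holds one feature row per residue. Rows beyond either chain
    end are filled with zeros, which is an assumption of this sketch.
    """
    width = len(profile[0])
    pad = [[0.0] * width for _ in range(flank)]
    padded = pad + [list(row) for row in profile] + pad
    windows = []
    for i in range(len(profile)):
        rows = padded[i:i + 2 * flank + 1]
        windows.append([value for row in rows for value in row])
    return windows


def two_fold_splits(n_items, seed=0):
    """Return two (train, test) index pairs; each item is tested exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    half = n_items // 2
    first, second = order[:half], order[half:]
    return [(first, second), (second, first)]
```

         <p>Swapping the roles of the two halves guarantees that every item appears in exactly one test set, which is the property the crossvalidation described above relies on.</p>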
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>The STAR-score prediction accuracy.</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Configuration</p>
                  </c>
                  <c ca="center">
                     <p>r</p>
                  </c>
               </r>
               <r>
                  <c cspan="2">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (0 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.56</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (20 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.56</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (40 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.56</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>BRNN (7+7 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.66</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>The STAR-score prediction accuracy (established from test data using the correlation with the <it>calculated </it>STAR-score) when sequence data was presented directly to the machine learning algorithm.</p>
            </tblfn>
         </tbl>
         <p>The simple Feed Forward Neural Network (FFNN) and the Bidirectional Recurrent Neural Network (BRNN) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> are trained and evaluated on the 945-protein data set. Training uses gradient descent to minimise the error as measured on the single output node. The learning rate is <it>&#951; </it>= 0.001. A variety of hidden node numbers <it>h </it>(including not using a hidden layer at all) are trialled. For all neural networks, training data is presented in batches of 100 windows before the weights are changed. A total of 40,000 sequences are presented in random order before training is stopped. Preliminary studies indicated that this number is sufficient for convergence.</p>
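         <p>To make the batch update concrete, the following sketch trains the simplest configuration (no hidden layer) by gradient descent, accumulating the gradient over 100 windows before each weight change. It is not the authors' code: the linear output node, the squared-error loss and the counting of individual windows rather than whole sequences are assumptions made to keep the example small.</p>

```python
import random


def train_linear(windows, targets, eta=0.001, batch=100,
                 presentations=40000, seed=0):
    """Batch gradient descent for a single linear output node.

    Gradients of the squared error are summed over `batch` randomly drawn
    windows before the weights change. The loss, the linear output and the
    unit of `presentations` (windows) are assumptions of this sketch.
    """
    rng = random.Random(seed)
    dim = len(windows[0])
    weights = [0.0] * dim
    bias = 0.0
    grad_w = [0.0] * dim
    grad_b = 0.0
    for step in range(1, presentations + 1):
        idx = rng.randrange(len(windows))
        x, t = windows[idx], targets[idx]
        err = sum(wi * xi for wi, xi in zip(weights, x)) + bias - t
        for d in range(dim):
            grad_w[d] += err * x[d]
        grad_b += err
        if step % batch == 0:  # apply the accumulated update
            for d in range(dim):
                weights[d] -= eta * grad_w[d]
                grad_w[d] = 0.0
            bias -= eta * grad_b
            grad_b = 0.0
    return weights, bias
```

         <p>Called on a toy problem such as <it>train_linear</it>([[0.0], [0.5], [1.0]], [0.0, 1.0, 2.0]), the weight approaches 2 and the bias approaches 0. Hidden layers and the recurrent connections of the BRNN are outside the scope of this sketch.</p>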
         <p>Recent findings suggest that Support Vector Regression (SVR) exceeds the accuracy reached by many neural networks <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. Essentially, support vector regression operates by finding so-called support vectors that collectively represent the function in a feature space. A kernel function maps the input sequence encoding into the feature space. Support vector regression can be understood as minimising a tube wrapped around the hypothesis function. During training, the "tube" is defined by an <it>&#949;</it>-insensitive loss function (where <it>&#949; </it>represents the size of a margin to the hypothesis function). Sample targets outside this margin are penalised. For <it>&#949;</it>, we use the standard value of 0.1. We examine optimisation using <it>&#949;</it>-SVR and <it>&#957;</it>-SVR with the same protein data set. <it>&#957;</it>-SVR replaces the <it>&#949; </it>hyper-parameter with <it>&#957;</it>, to control the number of support vectors (<it>&#957; </it>= 0.5 in all simulations). The standard stopping criterion is used and <it>C </it>is set to 0.5 (which delivered a better result than the standard value of 1). We trial both the Linear and Gaussian kernel functions (with <it>&#947; </it>= 1). Note that the exploration of the parameter space is far from exhaustive.</p>
         <p>We measure prediction accuracy by the correlation coefficient <it>r </it>between the calculated STAR-score <it>t<sub>i </sub></it>and the predicted score <it>p<sub>i </sub></it>, where the index <it>i </it>represents the position in the sequence; <it>r </it>is defined for a single chain.</p>
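         <p>The <it>&#949;</it>-insensitive loss mentioned above has a simple closed form: deviations inside the margin cost nothing, and deviations outside it are penalised linearly by the amount they exceed the margin. A minimal sketch, with an illustrative function name:</p>

```python
def epsilon_insensitive_loss(target, prediction, epsilon=0.1):
    """Zero inside the margin, otherwise the distance beyond the margin."""
    return max(0.0, abs(target - prediction) - epsilon)
```

         <p>With the value of 0.1 used here, a prediction of 0.55 for a target of 0.5 incurs no loss, whereas a prediction of 0.8 lies about 0.2 outside the tube.</p>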
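         <p>For reference, the correlation coefficient between calculated and predicted scores of one chain can be computed as in the following sketch. It uses the standard Pearson form, in which the angle brackets denote the mean over positions and the deviations under the square roots are squared; the function names are illustrative.</p>

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation(targets, predictions):
    """Pearson correlation between calculated and predicted scores.

    Both lists must have the same length and must not be constant,
    otherwise the denominator is zero.
    """
    t_mean = mean(targets)
    p_mean = mean(predictions)
    cov = mean([(t - t_mean) * (p - p_mean)
                for t, p in zip(targets, predictions)])
    var_t = mean([(t - t_mean) ** 2 for t in targets])
    var_p = mean([(p - p_mean) ** 2 for p in predictions])
    return cov / (math.sqrt(var_t) * math.sqrt(var_p))
```

         <p>A value of 1 indicates that the predicted profile follows the calculated one up to a positive linear rescaling, which is sufficient when only the positions of minima matter.</p>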
         <p>
            <m:math name="1471-2105-7-437-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
               <m:semantics>
                  <m:mrow>
                     <m:mi>r</m:mi>
                     <m:mo>=</m:mo>
                     <m:mfrac>
                        <m:mrow>
                           <m:mo>&#9001;</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>t</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>&#8722;</m:mo>
                           <m:mo>&#9001;</m:mo>
                           <m:msub>
                              <m:mi>t</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>&#9002;</m:mo>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#8901;</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>p</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>&#8722;</m:mo>
                           <m:mo>&#9001;</m:mo>
                           <m:msub>
                              <m:mi>p</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>&#9002;</m:mo>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#9002;</m:mo>
                        </m:mrow>
                        <m:mrow>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mo>&#9001;</m:mo>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>t</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mo>&#9001;</m:mo>
                                       <m:msub>
                                          <m:mi>t</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#9002;</m:mo>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mn>2</m:mn>
                                 </m:msup>
                                 <m:mo>&#9002;</m:mo>
                              </m:mrow>
                           </m:msqrt>
                           <m:mo>&#8901;</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mo>&#9001;</m:mo>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>p</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mo>&#9001;</m:mo>
                                       <m:msub>
                                          <m:mi>p</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#9002;</m:mo>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mn>2</m:mn>
                                 </m:msup>
                                 <m:mo>&#9002;</m:mo>
                              </m:mrow>
                           </m:msqrt>
                        </m:mrow>
                     </m:mfrac>
                     <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                     <m:mrow>
                        <m:mo>(</m:mo>
                        <m:mn>2</m:mn>
                        <m:mo>)</m:mo>
                     </m:mrow>
                  </m:mrow>
                  <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGYbGCcqGH9aqpdaWcaaqaaiabgMYiHlabcIcaOiabdsha0naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IaeyykJeUaemiDaq3aaSbaaSqaaiabdMgaPbqabaGccqGHQms8cqGGPaqkcqGHflY1cqGGOaakcqWGWbaCdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabgMYiHlabdchaWnaaBaaaleaacqWGPbqAaeqaaOGaeyOkJeVaeiykaKIaeyOkJepabaWaaOaaaeaacqGHPms4cqGGOaakcqWG0baDdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabgMYiHlabdsha0naaBaaaleaacqWGPbqAaeqaaOGaeyOkJeVaeiykaKIaeyOkJepaleqaaOGaeyyXIC9aaOaaaeaacqGHPms4cqGGOaakcqWGWbaCdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabgMYiHlabdchaWnaaBaaaleaacqWGPbqAaeqaaOGaeyOkJeVaeiykaKIaeyOkJepaleqaaaaakiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@72CD@</m:annotation>
               </m:semantics>
            </m:math>
         </p>
         <p>where &#10216;&#183;&#10217; is the mean over the positions of the chain. Ideal performance means that <it>t<sub>i </sub></it>and <it>p<sub>i </sub></it>are perfectly and positively correlated (<it>r </it>= 1). All reported values are averages over all chains, each evaluated when it appears as a test case.</p>
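         <p>As an illustration only, the evaluation measure can be sketched in a few lines of Python. The sketch reads Equation 2 as the standard Pearson correlation coefficient, with squared deviations under the square roots, and averages it over test chains. The function names <it>pearson_r </it>and <it>mean_r </it>are hypothetical and are not part of the STAR software.</p>
         <p>
```python
import math


def pearson_r(t, p):
    """Correlation coefficient r between calculated scores t and predicted scores p of one chain."""
    if len(t) != len(p) or len(t) == 0:
        raise ValueError("t and p must be non-empty and of equal length")
    n = len(t)
    mean_t = sum(t) / n
    mean_p = sum(p) / n
    # Mean product of deviations (numerator) and mean squared deviations (denominator terms).
    cov = sum((ti - mean_t) * (pi - mean_p) for ti, pi in zip(t, p)) / n
    var_t = sum((ti - mean_t) ** 2 for ti in t) / n
    var_p = sum((pi - mean_p) ** 2 for pi in p) / n
    if var_t == 0 or var_p == 0:
        raise ValueError("r is undefined for a constant profile")
    return cov / (math.sqrt(var_t) * math.sqrt(var_p))


def mean_r(chains):
    """Average r over a list of (t, p) pairs, one pair per test chain."""
    values = [pearson_r(t, p) for t, p in chains]
    return sum(values) / len(values)
```
         </p>
         <p>Computing <it>r </it>per chain and then averaging, rather than pooling all residues, gives short and long chains equal weight in the reported value.</p>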
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>The results obtained with the predicted secondary structure as input are shown in Table <tblr tid="T2">2</tblr>. The BRNN appears to perform slightly better than the other algorithms (<it>r </it>= 0.89 versus 0.80 to 0.86), but the small number of trials prevents us from ranking them confidently.</p>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>The STAR-score prediction accuracy.</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Configuration</p>
                  </c>
                  <c ca="center">
                     <p>Correlation coefficient <it>r</it></p>
                  </c>
               </r>
               <r>
                  <c cspan="2">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (0 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.86</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (20 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.86</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FFNN (40 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.86</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>BRNN (7+7 hidden)</p>
                  </c>
                  <c ca="center">
                     <p>0.89</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>&#949;</it>-SVR (Linear)</p>
                  </c>
                  <c ca="center">
                     <p>0.82</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>&#949;</it>-SVR (Gaussian)</p>
                  </c>
                  <c ca="center">
                     <p>0.80</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><it>&#957;</it>-SVR (Gaussian)</p>
                  </c>
                  <c ca="center">
                     <p>0.83</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>The STAR-score prediction accuracy (established from test data using the correlation with the <it>calculated </it>STAR-score) when sequence data was presented as the predicted 3-class Continuum Secondary Structure.</p>
            </tblfn>
         </tbl>
         <p>The average correlation coefficient between the predicted and calculated STAR-scores for the BRNN on test data (set aside from the 945-protein set using cross-validation) is 0.89, where 1.0 represents perfect agreement and 0.0 represents no correlation; the mean squared error is 0.0028. Approximately 18% of the sequences in the STAR data set have a sequence similarity exceeding 25% with sequences in the training data set of the secondary structure predictor <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. To ensure that this overlap does not inflate the estimated accuracy, we assessed the STAR prediction accuracy separately for the sequences that have less than 25% similarity with the set used for training the secondary structure predictor. The average correlation coefficient on these 771 sequences was 0.88 (average mean squared error 0.0028).</p>
         <p>The most essential information in the S-profile is the position of its minima <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. In addition to the correlation, the distance between the positions of the minima in the predicted function and those in the target function illustrates the suitability of the method for protein design. For the BRNN model, the average distance between the positions of the predicted and the target minima is only 3.42 residues.</p>
         <p>The high correlation between the scores and the low average distance between their minima illustrate the feasibility of replacing the S-score with STAR, which allows the user to explore structural building blocks <it>in silico </it>of as yet unresolved or even hypothetical proteins with little loss of precision.</p>
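         <p>The text does not spell out how minima are detected or paired, so the following Python sketch is only one plausible reading. It assumes that a minimum is an interior position scoring strictly lower than both neighbours, and that each target minimum is paired with the nearest predicted minimum. The function names are hypothetical.</p>
         <p>
```python
def local_minima(profile):
    """Interior indices whose score is strictly lower than that of both neighbours."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] > profile[i] and profile[i + 1] > profile[i]]


def mean_minima_distance(target, predicted):
    """Mean distance in residues from each target minimum to the nearest predicted minimum."""
    t_min = local_minima(target)
    p_min = local_minima(predicted)
    if not t_min or not p_min:
        raise ValueError("both profiles need at least one local minimum")
    # Assumption: nearest-neighbour pairing; the original pairing rule is not stated.
    return sum(min(abs(i - j) for j in p_min) for i in t_min) / len(t_min)
```
         </p>
         <p>Under this definition flat-bottomed minima are ignored; a profile containing plateaus would need a tolerance or prior smoothing.</p>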
         <p>The web interface of STAR requires the user to input a single amino acid sequence (in FASTA format) and then predicts the single-parent SCHEMA profile. The profile can be presented as a graph, allowing the user to quickly assess regions of low connectivity that lend themselves to recombination. The individual scores are also presented numerically with added details (including the continuum secondary structure). A suitable recombination site is one with a low score, i.e. a position which, should it be the point of exchange, disrupts few predicted connections in the parent sequence.</p>
         <p>A family of cephalosporinase genes from four microbial species subjected to random DNA shuffling yielded an eight-fold increase of moxalactamase activity in a single cycle of shuffling <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. We presented the sequence of PDB entry <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link> (one of the four cephalosporinases used in the original study) to STAR. We calculated the single-parent S-score (the calculated STAR-score) from the contact map of <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link> and the multi-parent S-score using a sequence alignment between <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link> and <ext-link ext-link-type="pdb" ext-link-id="1G68">1G68</ext-link> (also used in the original study, exhibiting 40% sequence identity with <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>). Both SCHEMA profiles were superimposed on the predicted STAR-score (see Figure <figr fid="F1">1</figr>). We also show the successful recombination sites as determined in the original random shuffling experiment. The same data were used for evaluating the original SCHEMA algorithm (cf. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>). For practical reasons we were unable to test FamClash (but cf. <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>The SCHEMA-profile, the STAR-profile, and post-processed profiles for Conseq, MUpro and I-Mutant2.0 for the protein <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link></p>
            </caption>
            <text>
               <p><b>The SCHEMA-profile, the STAR-profile, and post-processed profiles for Conseq, MUpro and I-Mutant2.0 for the protein </b><ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>. (a) The multi-parent S-scores (normalised) along with the calculated and predicted STAR-scores for <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>. The profiles indicate the number of disrupted connections (y-axis) at each sequence position (x-axis). (b) The post-processed score from MUpro for <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>. The profile indicates the structural stability change caused by mutation (y-axis) for each sequence residue (x-axis). (c) The post-processed score from I-Mutant2.0 for <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>. The profile indicates the structural stability change caused by mutation (y-axis) for each sequence residue (x-axis). (d) The post-processed score from Conseq. The profile indicates the level of amino acid conservation (y-axis) for each sequence residue (x-axis). The successful recombination sites from a random DNA shuffling experiment are added to each graph and plotted as vertical lines [1].</p>
            </text>
            <graphic file="1471-2105-7-437-1"/>
         </fig>
         <p>As can be seen in Figure <figr fid="F1">1</figr>, the successful recombination sites match the minima of all profiles with the exception of the site close to the C-terminus. The last site is deemed unsuitable by both the original SCHEMA algorithm and STAR. The full structure shown in Figure <figr fid="F2">2</figr> illustrates that regions which have divergent SCHEMA- and STAR-scores are at the surface of the molecule and should typically have lower connectivity (a generalisation that STAR seems to use).</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>The <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link> protein structure</p>
            </caption>
            <text>
               <p><b>The </b><ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link><b> protein structure. </b>Three residues are identified, each translating into divergent SCHEMA- and STAR-scores.</p>
            </text>
            <graphic file="1471-2105-7-437-2"/>
         </fig>
         <p>To put the example prediction in context, we adapted alternative methods that were not explicitly designed for this purpose but are potentially useful for identifying recombination sites. For I-Mutant2.0 and MUpro, the predicted stability change is summed over all 19 possible amino acid substitutions at each position of the sequence. The lower the summed score at a position, the more sensitive the current structure is to perturbation there. The score profile is smoothed with a kernel averaging over 10 neighbouring residues, then normalised and inverted so that it can be compared directly with STAR's profile. This post-processed prediction thus has minima at positions with high substitution tolerance. Conseq predicts a conservation score, which is similarly smoothed. This modified Conseq profile has minima at positions with low evolutionary diversity. Low evolutionary diversity is taken to indicate residues essential for the structure or function of the protein.</p>
         <p>The I-Mutant2.0, MUpro and Conseq predictions are shown in Figure <figr fid="F1">1</figr> along with the biologically verified recombination sites of <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link>. I-Mutant2.0 and MUpro predict minima at the verified recombination sites, supporting the assumption that cuts should be made in less sensitive regions (those with high acceptance of substitutions). Conseq, on the other hand, does not seem directly applicable to the identification of recombination sites. It should be noted, however, that the experimental recombination data for <ext-link ext-link-type="pdb" ext-link-id="1BLS">1BLS</ext-link> are far from exhaustive. Our comparison merely serves to illustrate that there currently appears to be no silver bullet for recombination site prediction.</p>
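         <p>The post-processing of the stability predictions can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used to produce Figure 1. The window is assumed to span up to five residues on either side of each position and to be truncated at the termini, normalisation is assumed to be min-max scaling to [0, 1], and inversion is assumed to be 1 minus the scaled value. All function names are hypothetical.</p>
         <p>
```python
def summed_change(changes_by_position):
    """Sum the predicted stability changes of all substitutions at each position."""
    return [sum(changes) for changes in changes_by_position]


def smooth(profile, half_width=5):
    """Moving average over up to half_width residues on either side of each position."""
    n = len(profile)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)       # window is truncated at the N-terminus
        hi = min(n, i + half_width + 1)   # and at the C-terminus
        window = profile[lo:hi]
        out.append(sum(window) / len(window))
    return out


def normalise_and_invert(profile):
    """Min-max scale to [0, 1], then flip so that the highest raw value maps to 0."""
    low, high = min(profile), max(profile)
    if high == low:
        raise ValueError("cannot normalise a constant profile")
    return [1.0 - (x - low) / (high - low) for x in profile]


def post_process(changes_by_position, half_width=5):
    """Summed, smoothed, normalised and inverted profile with minima at tolerant positions."""
    summed = summed_change(changes_by_position)
    return normalise_and_invert(smooth(summed, half_width))
```
         </p>
         <p>With this orientation, a position whose substitutions are predicted to be least destabilising receives the lowest post-processed value, matching the convention that minima mark candidate recombination sites.</p>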
         <p>STAR uses continuum secondary structure as input. However, STAR does more than avoid breaking helices. Rat reductase (<ext-link ext-link-type="pdb" ext-link-id="1AMO">1AMO</ext-link>) has a bundle of short helices, interrupted by coiled segments (around residues 375&#8211;450). As the user of the online service can verify, STAR recognises the bundle and predicts a constant high connectivity score.</p>
         <p>PurN and GART glycinamide ribonucleotide transformylase (70% sequence identity) were recombined and functional hybrid proteins were selected <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>. Recombination was restricted to occur between amino acid positions 50 and 150. Figure <figr fid="F3">3</figr> shows the multi-parent S-profile determined from the PDB entry <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> and GART <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The predicted STAR-profile was generated from the sequence of <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link>. For comparison, the graph also shows the calculated single-sequence S-profile for <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link>. Consistent with the high sequence similarity between <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> and GART, the single- and multi-parent profiles are very similar. The <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> structure is incomplete for residues 110&#8211;133 and hence the calculated S-profiles are undefined for this segment. The predictor was presented with the full sequence (which is known for <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link>) and correctly characterises this segment as unsuitable for recombination. Finally, we removed all residues in the structurally unresolved segment to illustrate the predictor's ability to cope with incomplete sequence data.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>The SCHEMA-profile and the STAR-profile for the protein <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link></p>
            </caption>
            <text>
               <p><b>The SCHEMA-profile and the STAR-profile for the protein </b><ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link>. The multi-parent S-score for <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> and GART (normalised) and the single-parent S-score for <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> (normalised), along with the predicted STAR-score for <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link>. The gap is caused by the lack of information in the PDB file for residues 110 to 133. The prediction for the complete sequence accurately disqualifies recombination in this area, and elsewhere agrees with the prediction generated for a sequence from which these 23 residues were removed. The successful recombination sites from a DNA shuffling experiment are added and plotted as vertical lines [23, 24].</p>
            </text>
            <graphic file="1471-2105-7-437-3"/>
         </fig>
         <p>MUpro and I-Mutant2.0 performed poorly on <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> (data not shown). Conseq failed completely to generate an output since <ext-link ext-link-type="pdb" ext-link-id="1CDD">1CDD</ext-link> has too few family members to generate the required alignment.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>SCHEMA-based guidance can increase the fraction of properly folded proteins resulting from a single round of recombination from a mere 9% to 75% <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. However, access to tertiary and quaternary structures is limited, imposing a severe restriction on the use of algorithms like SCHEMA. If exploration beyond the few sequences that have been rigorously characterised is required (e.g. to use promising products generated by an initial round of recombination), we cannot use methods that assume access to tertiary or quaternary structure.</p>
         <p>As of April 2006, there are about 35,000 structurally resolved proteins in the Protein Data Bank but several hundred thousand known proteins in sequence databases. Since STAR requires only the protein sequence as input, it enables the protein engineer to choose candidate proteins on the basis of functional properties (say, specific enzymatic activity), and not be limited to those for which full structural or extensive protein family information is available.</p>
         <p>STAR has been trained on data generated by the normalised version of the single-parent S-profile on the basis of resolved protein structures that are representative of the whole protein universe (as known through the PDB). STAR is able to generalise so that each position in any hypothetical or as yet unresolved protein can be accurately evaluated in relation to its potential to disrupt the structure (should it be used as a recombination site). In this context, we review recent algorithms intended for predicting structural stability changes caused by single-point mutagenesis, and adapt them to similarly identify sites prone to disrupt structure.</p>
      </sec>
      <sec>
         <st>
            <p>Availability</p>
         </st>
         <p>&#8226; Project name: STAR</p>
         <p>&#8226; Project home page: <url>http://pprowler.itee.uq.edu.au/star</url></p>
         <p>&#8226; Operating System(s): Platform independent</p>
         <p>&#8226; Programming language: Java servlet</p>
         <p>&#8226; Licence: a licence is required for non-academic use</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>DCB researched and developed the SCHEMA score prediction model under the supervision of MB and RT. DCB and MB drafted the paper, and RT and EMG provided substantive feedback on the draft and helped finish the manuscript. The prediction service was developed by MB and DCB.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors wish to acknowledge Dr. Zheng Yuan for help with designing the data set.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>DNA shuffling of a family of genes from diverse species accelerates directed evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Crameri</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Raillard</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Bermudez</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Stemmer</snm>
                  <fnm>WP</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1998</pubdate>
            <volume>391</volume>
            <issue>6664</issue>
            <fpage>288</fpage>
            <lpage>291</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/34663</pubid>
                  <pubid idtype="pmpid" link="fulltext">9440693</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>General method for sequence-independent site-directed chimeragenesis</p>
            </title>
            <aug>
               <au>
                  <snm>Hiraga</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2003</pubdate>
            <volume>330</volume>
            <issue>2</issue>
            <fpage>287</fpage>
            <lpage>296</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0022-2836(03)00590-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">12823968</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Library analysis of SCHEMA-guided protein recombination</p>
            </title>
            <aug>
               <au>
                  <snm>Meyer</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Silberg</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Voigt</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Endelman</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Mayo</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>ZG</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>2003</pubdate>
            <volume>12</volume>
            <issue>8</issue>
            <fpage>1686</fpage>
            <lpage>1693</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1110/ps.0306603</pubid>
                  <pubid idtype="pmpid" link="fulltext">12876318</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>On the conservative nature of intragenic recombination</p>
            </title>
            <aug>
               <au>
                  <snm>Drummond</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Silberg</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Wilke</snm>
                  <fnm>CO</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <issue>15</issue>
            <fpage>5380</fpage>
            <lpage>5385</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">556249</pubid>
                  <pubid idtype="pmpid" link="fulltext">15809422</pubid>
                  <pubid idtype="doi">10.1073/pnas.0500729102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Protein building blocks preserved by recombination</p>
            </title>
            <aug>
               <au>
                  <snm>Voigt</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Martinez</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>ZG</fnm>
               </au>
               <au>
                  <snm>Mayo</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Nat Struct Biol</source>
            <pubdate>2002</pubdate>
            <volume>9</volume>
            <issue>7</issue>
            <fpage>553</fpage>
            <lpage>558</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12042875</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>FamClash: a method for ranking the activity of engineered enzymes</p>
            </title>
            <aug>
               <au>
                  <snm>Saraf</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Horswill</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Benkovic</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Maranas</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <issue>12</issue>
            <fpage>4142</fpage>
            <lpage>4147</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">384708</pubid>
                  <pubid idtype="pmpid" link="fulltext">14981242</pubid>
                  <pubid idtype="doi">10.1073/pnas.0400065101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Site-directed protein recombination as a shortest-path problem</p>
            </title>
            <aug>
               <au>
                  <snm>Endelman</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Silberg</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>ZG</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Protein Eng Des Sel</source>
            <pubdate>2004</pubdate>
            <volume>17</volume>
            <issue>7</issue>
            <fpage>589</fpage>
            <lpage>594</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/protein/gzh067</pubid>
                  <pubid idtype="pmpid" link="fulltext">15331774</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Functional evolution and structural conservation in chimeric cytochromes p450: calibrating a structure-guided approach</p>
            </title>
            <aug>
               <au>
                  <snm>Otey</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Silberg</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Voigt</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Endelman</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Bandara</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Chem Biol</source>
            <pubdate>2004</pubdate>
            <volume>11</volume>
            <issue>3</issue>
            <fpage>309</fpage>
            <lpage>318</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.chembiol.2004.02.018</pubid>
                  <pubid idtype="pmpid" link="fulltext">15123260</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Design of combinatorial protein libraries of optimal size</p>
            </title>
            <aug>
               <au>
                  <snm>Saraf</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Gupta</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Maranas</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>60</volume>
            <issue>4</issue>
            <fpage>769</fpage>
            <lpage>777</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20490</pubid>
                  <pubid idtype="pmpid" link="fulltext">16001404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure</p>
            </title>
            <aug>
               <au>
                  <snm>Capriotti</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Fariselli</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Casadio</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Web Server</issue>
            <fpage>W306</fpage>
            <lpage>W310</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1160136</pubid>
                  <pubid idtype="pmpid" link="fulltext">15980478</pubid>
                  <pubid idtype="doi">10.1093/nar/gki375</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Prediction of protein stability changes for single-site mutations using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Cheng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Randall</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2006</pubdate>
            <volume>62</volume>
            <issue>4</issue>
            <fpage>1125</fpage>
            <lpage>1132</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20810</pubid>
                  <pubid idtype="pmpid" link="fulltext">16372356</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>ConSeq: the identification of functionally and structurally important residues in protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Berezin</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Glaser</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Rosenberg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Paz</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pupko</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Fariselli</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Casadio</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ben-Tal</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>8</issue>
            <fpage>1322</fpage>
            <lpage>1324</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth070</pubid>
                  <pubid idtype="pmpid" link="fulltext">14871869</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Selection of representative protein data sets</p>
            </title>
            <aug>
               <au>
                  <snm>Hobohm</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Scharf</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Protein Sci</source>
            <pubdate>1992</pubdate>
            <volume>1</volume>
            <issue>3</issue>
            <fpage>409</fpage>
            <lpage>417</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">1304348</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Protein secondary structure prediction based on position-specific scoring matrices</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1999</pubdate>
            <volume>292</volume>
            <issue>2</issue>
            <fpage>195</fpage>
            <lpage>202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1999.3091</pubid>
                  <pubid idtype="pmpid" link="fulltext">10493868</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Prediction of coordination number and relative solvent accessibility in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Fariselli</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Casadio</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2002</pubdate>
            <volume>47</volume>
            <issue>2</issue>
            <fpage>142</fpage>
            <lpage>153</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.10069</pubid>
                  <pubid idtype="pmpid" link="fulltext">11933061</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>A novel method of protein secondary structure prediction with high segment overlap measure: SVM approach</p>
            </title>
            <aug>
               <au>
                  <snm>Hua</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sun</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2001</pubdate>
            <volume>308</volume>
            <fpage>397</fpage>
            <lpage>407</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2001.4580</pubid>
                  <pubid idtype="pmpid" link="fulltext">11327775</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures</p>
            </title>
            <aug>
               <au>
                  <snm>Bod&#233;n</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Yuan</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Bailey</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>68</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1386714</pubid>
                  <pubid idtype="pmpid" link="fulltext">16478545</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-68</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Exploiting the past and the future in protein secondary structure prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Frasconi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Soda</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Pollastri</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <issue>11</issue>
            <fpage>937</fpage>
            <lpage>946</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.11.937</pubid>
                  <pubid idtype="pmpid" link="fulltext">10743560</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Predictive approaches for choosing hyperparameters in Gaussian processes</p>
            </title>
            <aug>
               <au>
                  <snm>Sundararajan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Keerthi</snm>
                  <fnm>SS</fnm>
               </au>
            </aug>
            <source>Neural Comput</source>
            <pubdate>2001</pubdate>
            <volume>13</volume>
            <issue>5</issue>
            <fpage>1103</fpage>
            <lpage>1118</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1162/08997660151134343</pubid>
                  <pubid idtype="pmpid">11359646</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Better prediction of protein contact number using a support vector regression analysis of amino acid sequence</p>
            </title>
            <aug>
               <au>
                  <snm>Yuan</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>248</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1277819</pubid>
                  <pubid idtype="pmpid" link="fulltext">16221309</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-248</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Basic local alignment search tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <issue>3</issue>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1990.9999</pubid>
                  <pubid idtype="pmpid" link="fulltext">2231712</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Using a residue clash map to functionally characterize protein recombination hybrids</p>
            </title>
            <aug>
               <au>
                  <snm>Saraf</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Maranas</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Protein Eng</source>
            <pubdate>2003</pubdate>
            <volume>16</volume>
            <issue>12</issue>
            <fpage>1025</fpage>
            <lpage>1034</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/protein/gzg129</pubid>
                  <pubid idtype="pmpid" link="fulltext">14983083</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Rapid generation of incremental truncation libraries for protein engineering using alpha-phosphothioate nucleotides</p>
            </title>
            <aug>
               <au>
                  <snm>Lutz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ostermeier</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Benkovic</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <issue>4</issue>
            <fpage>E16</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29623</pubid>
                  <pubid idtype="pmpid" link="fulltext">11160936</pubid>
                  <pubid idtype="doi">10.1093/nar/29.4.e16</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A combinatorial approach to hybrid enzymes independent of DNA homology</p>
            </title>
            <aug>
               <au>
                  <snm>Ostermeier</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shim</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Benkovic</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>1999</pubdate>
            <volume>17</volume>
            <issue>12</issue>
            <fpage>1205</fpage>
            <lpage>1209</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/70754</pubid>
                  <pubid idtype="pmpid" link="fulltext">10585719</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
