<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-107</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>EFICAz<sup>2</sup>: enzyme function inference by a combined approach enhanced by machine learning</p>
         </title>
         <aug>
            <au id="A1" ce="yes">
               <snm>Arakaki</snm>
               <mi>K</mi>
               <fnm>Adrian</fnm>
               <insr iid="I1"/>
               <email>adrian.arakaki@gatech.edu</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Huang</snm>
               <fnm>Ying</fnm>
               <insr iid="I2"/>
               <email>yih007@ucsd.edu</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Skolnick</snm>
               <fnm>Jeffrey</fnm>
               <insr iid="I1"/>
               <email>skolnick@gatech.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia, 30318, USA</p>
            </ins>
            <ins id="I2">
               <p>California Institute for Telecommunications and Information Technology, University of California, San Diego, La Jolla, CA, 92093, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>1</issue>
         <fpage>107</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/107</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19361344</pubid>
               <pubid idtype="doi">10.1186/1471-2105-10-107</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>18</day>
               <month>11</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>13</day>
               <month>4</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>13</day>
               <month>4</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Arakaki et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) and those that are heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have developed two new EFICAz components, analogous to the two FDR-based components, where the discrimination between homo- and heterofunctional members is based on the evaluation, via Support Vector Machine models, of all the aligned positions between the query sequence and the multiple sequence alignments associated with the enzyme families. Benchmark results indicate that: i) the new SVM-based components outperform their FDR-based counterparts, and ii) both SVM-based and FDR-based components generate unique predictions. We developed classification tree models to optimally combine the results from the six EFICAz components into a final EC number prediction. The new implementation of our approach, EFICAz<sup>2</sup>, exhibits greatly improved prediction precision at MTTSI &lt; 30% compared to the original EFICAz, with only a slight decrease in prediction recall. A comparative analysis of enzyme function annotation of the human proteome by EFICAz<sup>2 </sup>and KEGG shows that: i) when both sources make EC number assignments for the same protein sequence, the assignments tend to be consistent and ii) EFICAz<sup>2 </sup>generates considerably more unique assignments than KEGG.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Performance benchmarks and the comparison with KEGG demonstrate that EFICAz<sup>2 </sup>is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction. The EFICAz<sup>2 </sup>web service is available at: <url>http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html</url></p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>From a purely biochemical point of view, enzymes constitute the most important group of proteins. They are versatile, catalyzing most chemical reactions involved in the metabolism of living organisms, and abundant, representing approximately 15% to 35% of a given proteome <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. Enzymes are classified according to the Enzyme Commission (EC) system, a hierarchical system that assigns a unique four-field number to each enzymatic activity <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The first field of an EC number indicates the general class of catalyzed reaction. The second and third fields depend on different criteria related to the chemical features of the substrate and the product of the reaction, and the fourth field is a sequential number without any special meaning. A comprehensive and detailed enzyme function annotation of the available genomes is necessary not only to increase our understanding of the biochemistry of living organisms, but also to gain more insight into the evolutionary processes that originated the diversity of enzymes currently found in nature <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The precise assignment of EC numbers to catalytic proteins is a vital requirement for the correct reconstruction of metabolic pathways <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Moreover, reconstructed metabolic pathways play a key role in many biomedical approaches <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, but the success of these applications strongly depends on the quality of the functional annotations of the enzymes comprising such pathways <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
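         <p>The four-field structure of EC numbers described above can be illustrated with a minimal Python sketch. The names <b>ECNumber</b>, <b>parse_ec</b> and <b>three_field</b> are hypothetical and are not part of EFICAz; the sketch also assumes that every specified field is a plain integer and that an unspecified field is written as "-".</p>

```python
from typing import NamedTuple, Optional


class ECNumber(NamedTuple):
    """Four-field EC number; a field is None when left unspecified ("-")."""
    ec_class: int
    subclass: Optional[int]
    sub_subclass: Optional[int]
    serial: Optional[int]


def parse_ec(text: str) -> ECNumber:
    """Parse a string such as "1.1.1.1" or "3.4.21.-" into its four fields."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    fields = [None if p == "-" else int(p) for p in parts]
    if fields[0] is None:
        raise ValueError("the first (class) field must be specified")
    return ECNumber(*fields)


def three_field(ec: ECNumber) -> str:
    """Return the three-field description, dropping the serial number."""
    return ".".join("-" if f is None else str(f) for f in ec[:3])
```

         <p>Under these assumptions, a four-field number such as 3.4.21.4 reduces to the three-field description 3.4.21, which corresponds to the two levels of detail (three-field and four-field) distinguished in the benchmarks.</p>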
         <p>Despite the great importance of precise EC number assignments, enzyme functions, like other molecular, cellular or physiological functions, are often inferred from sequence similarity to previously characterized proteins <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. In this annotation modality, commonly known as "prediction by homology transfer", the (incorrect) assumption is that all homologs have the same function <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. This functional annotation strategy is negatively affected by at least two factors. The first factor is the functional diversity of highly similar sequences observed in many protein families <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. For example, to transfer detailed enzyme function, given by four-field EC numbers, with an average precision of at least 90%, a sequence identity threshold of 60% is required <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. However, the functional annotation of many genomes has been carried out employing much lower thresholds <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. The second factor is the structural and functional modularity of proteins <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>; thus, when the modular nature of proteins is disregarded, functional annotations based on best database hits are often erroneous <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Mainly due to these factors, sequence similarity-based annotation strategies result in a high number of errors <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> that often propagate in public databases <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. For instance, functional assignments inferred by sequence similarity in the Gene Ontology sequence database (GOSeqLite) have an estimated error rate of 49% <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. 
Other approaches for enzyme function prediction do not directly depend on the level of similarity between sequences. For example, several methods are based on the identification of specific structural patterns associated with functional sites <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>, but they are limited by the requirement that the query protein's structure be solved. Yet other approaches are based on the analysis of properties of proteins such as tissue specificity, subcellular location and phylogenetic information <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, or genome context and other functional association evidence <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. However, these methods also suffer from the lack of consistent and comprehensive database annotations for these kinds of sequence-independent features.</p>
         <p>To address the limitations of transfer of enzyme function by sequence similarity, we developed EFICAz (Enzyme Function Inference by a Combined Approach), an engine for large-scale high-precision enzyme function inference <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The original implementation of EFICAz combines the predictions of four independent methods: (C1) <b>CHIEFc family based FDR recognition</b>: detection of Functionally Discriminating Residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (C2) <b>Multiple Pfam family based FDR recognition</b>: detection of FDRs in combinations of Pfam families that concurrently detect a particular enzyme function, (C3) <b>CHIEFc family specific SIT evaluation</b>: pairwise sequence comparison using a CHIEFc family specific Sequence Identity Threshold (SIT), and (C4) <b>High specificity multiple PROSITE pattern recognition</b>: detection of multiple PROSITE patterns that, taken together, are specifically associated with a particular enzyme function. Since each predictive component was designed to be highly precise and predictions made by any pair of components do not completely overlap (including C1 and C2, which only differ in the way the protein families are defined), at the final stage, EFICAz makes a particular EC number assignment when one or more of the four component methods predict a given EC number. Since EFICAz and its components have been fully described before <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, here, we briefly introduce the basics of the predictive components based on the recognition of FDRs and highlight possible improvements.</p>
         <p>A CHIEFc or Pfam enzyme family <it>E </it>is defined by a multiple alignment of sequences evolutionarily related to a seed group of sequences sharing a particular EC number EC<sub><it>E</it></sub>. FDRs are residues in specific positions of the alignment, selected via an Evolutionary Footprinting method <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> for their ability to discriminate between homo-functional and hetero-functional family members. Homo- and hetero-functional family members are defined as sequences annotated or not annotated with the EC number EC<sub><it>E</it></sub>, respectively. To apply an FDR recognition method, we first determine if a query sequence <b>q </b>is a member of an enzyme family <it>E </it>by evaluating a Hidden Markov Model derived from <it>E</it>. If so, we check if <b>q </b>exhibits conservation of the FDRs associated with <it>E</it>. When both conditions are fulfilled, we predict that <b>q </b>is a homo-functional member of <it>E </it>and assign the EC number EC<sub><it>E </it></sub>to the query sequence <b>q</b>. A figure illustrating the concept of FDRs can be found in Additional file <supplr sid="S1">1</supplr>: Figure S1. Example of Functionally Discriminating Residues (FDRs). A potential pitfall of the FDR recognition methods is that if the number of FDRs for a given enzyme family is too small, it can be difficult to achieve high prediction precision, because the matching of a very small number of residues in an alignment is more likely to occur by chance. Conversely, if the number of FDRs is too large, the prediction recall might suffer, because the matching of a large number of residues in an alignment imposes a very restrictive condition. In principle, these issues could be addressed by techniques more advanced than FDR matching in terms of their ability to detect the signals characteristic of homo-functional enzyme family members in the query sequence. 
In this work, we describe the development of a method for enzyme function inference that is based on this premise. We employ a Support Vector Machine (SVM) learning approach <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> that evaluates all the aligned positions between a query sequence and the multiple sequence alignment associated with a given Pfam or CHIEFc enzyme family. We term these components: (C5) <b>CHIEFc family based SVM evaluation </b>and (C6) <b>Multiple Pfam family based SVM evaluation</b>, and our benchmarks show that they yield higher predictive performance than their counterparts based on FDR recognition.</p>
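         <p>The two-step logic of FDR recognition (family membership, then conservation of the FDRs) can be sketched as follows. This is only an illustration under stated assumptions: the representation of FDRs as a mapping from alignment column to accepted residues, and the names <b>FdrProfile</b>, <b>matches_fdrs</b> and <b>predict_ec</b>, are hypothetical, and the HMM-based membership test is represented by a precomputed boolean rather than implemented.</p>

```python
from typing import Dict, Optional, Set

# Hypothetical FDR definition for one enzyme family: alignment column index
# mapped to the set of residues accepted at that column.
FdrProfile = Dict[int, Set[str]]


def matches_fdrs(aligned_query: str, fdrs: FdrProfile) -> bool:
    """Return True when the aligned query conserves every FDR position.

    aligned_query is the query already aligned to the family alignment,
    with "-" marking gaps, so that string indices are alignment columns.
    """
    for column, allowed in fdrs.items():
        if column >= len(aligned_query):
            return False
        if aligned_query[column] not in allowed:
            return False
    return True


def predict_ec(is_family_member: bool, aligned_query: str,
               fdrs: FdrProfile, family_ec: str) -> Optional[str]:
    """Assign the family EC number only if both conditions hold."""
    if is_family_member and matches_fdrs(aligned_query, fdrs):
        return family_ec
    return None
```

         <p>The sketch makes the pitfall discussed above concrete: a profile with very few columns is easy to match by chance, whereas a profile with many columns rejects any query that deviates at a single position.</p>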
         <suppl id="S1">
            <title>
               <p>Additional file 1</p>
            </title>
            <text>
               <p><b>Figure S1</b>. Example of Functionally Discriminating Residues (FDRs).</p>
            </text>
            <file name="1471-2105-10-107-S1.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>As mentioned above, in the previous implementation of EFICAz, all EC numbers predicted by the four original component methods were reported, whether they agreed with each other or not. Here, based on estimations of the method's performance that are more realistic than those published before <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B27">27</abbr></abbrgrp>, we show that such a strategy tends to negatively affect prediction precision, especially at low levels of maximal test to training sequence identity (MTTSI, formally defined in the Methods section). To address this issue, we have developed a tree-based classification algorithm <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> that applies a set of hierarchical rules to generate an EC number assignment from the list of the component methods that predict such EC number and the query sequence's MTTSI. We have included the two additional SVM-based component methods as well as the classification tree algorithm in the current implementation of EFICAz, which we term EFICAz<sup>2</sup>. According to the results of our performance benchmarks, EFICAz<sup>2 </sup>is dramatically more precise than EFICAz at low MTTSI, while it shows only a modest decrease in recall in this MTTSI regime.</p>
         <p>The rest of this paper is organized as follows: in the Results and Discussion section, we describe the development and benchmarking of the SVM-based enzyme function inference method and the classification tree algorithm to generate the final EC number prediction, and present a comparative study of enzyme function annotations of the human proteome by EFICAz<sup>2 </sup>and KEGG <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. In the Conclusions section, we summarize the present work, stress its significance, and discuss its limitations. Finally, in the Methods section, we describe the data sources and procedures for training and benchmarking of EFICAz<sup>2</sup>, provide details about the statistical analyses and technical aspects of the generation of SVM and classification tree models, and describe the data sources for the comparative analysis of enzyme function annotation of the human proteome.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <sec>
            <st>
               <p>Novel EFICAz components based on SVM</p>
            </st>
            <p>Two of the four component methods in the original implementation of EFICAz are based on the identification of homo-functional members of a given CHIEFc (C1) or Multiple Pfam enzyme family (C2), i.e., members whose enzymatic activity coincides with that of the seed enzymes that originated the family. The criterion followed by these methods to consider a query sequence as homo-functional (and therefore make the corresponding EC assignment) is the matching of FDRs. Since FDRs constitute a subset of all residues in the multiple sequence alignment associated with an enzyme family, we reasoned that an algorithm operating over all the aligned positions (i.e., with access to all possible information) could achieve higher discriminatory power, at least in certain cases. This situation is analogous to that of patterns and profiles for the identification of protein families and domains in the PROSITE database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
            <p>Initially, PROSITE consisted of patterns alone and was later enriched by the inclusion of profiles. Although, in general, PROSITE profiles exhibit increased sensitivity with respect to patterns, profiles and patterns complement each other, i.e. both types of descriptors offer unique advantages in particular cases <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <p>Our implementation of the profile-like approach to the recognition of homo-functional sequences is based on SVM models associated with each enzyme family. The basic idea of the SVM algorithm is mapping the data from an input space into a high-dimensional feature space via a kernel function, and finding a hyper-plane to separate positive and negative samples in the feature space <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The training of the SVM models is carried out using the whole set of aligned residues in the corresponding multiple sequence alignment, which includes both positives (homo-functional sequences) and negatives (hetero-functional sequences) (see Methods section, "Support vector machine models"). The new component methods were termed: (C5) CHIEFc family based SVM evaluation and (C6) Multiple Pfam family based SVM evaluation. In order to compare the performance of the new SVM-based components to that of the FDR-based components, we carried out extensive benchmarking. First, we trained the two FDR-based (C1 and C2) and the two SVM-based components (C5 and C6) using previous releases of the corresponding databases; these specific versions of the component methods were later included in EFICAz<sup>2 </sup>version 10, based on Release 10 of UniProt <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> (see Methods section, "Datasets for the training of different EFICAz<sup>2 </sup>versions"). Then, we selected test sequences from all of the well annotated, newly added Swiss-Prot sequences in UniProt Release 12.6 that were not included in Release 10. Finally, for each test sequence, we collected the enzyme function predicted by each of the four components under evaluation and calculated the average precision and recall (see Methods section, "Benchmarking of EFICAz<sup>2 </sup>version 10"). The statistical significance of the differences in the methods' performance was evaluated as described in "Statistical analyses", in the Methods section.</p>
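            <p>The last benchmarking step, computing precision and recall per EC number and then averaging, can be sketched as below. The sketch assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)) and treats an undefined ratio as 0; the exact definitions used in the benchmark are those given in the Methods section and may differ. The function names are hypothetical.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def per_ec_precision_recall(
    records: Iterable[Tuple[Set[str], Set[str]]],
) -> Dict[str, Tuple[float, float]]:
    """Compute (precision, recall) for each EC number.

    Each record is (true EC numbers, predicted EC numbers) for one
    test sequence.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for true_ecs, predicted_ecs in records:
        for ec in predicted_ecs:
            if ec in true_ecs:
                tp[ec] += 1
            else:
                fp[ec] += 1
        for ec in true_ecs - predicted_ecs:
            fn[ec] += 1
    result = {}
    for ec in set(tp) | set(fp) | set(fn):
        predicted = tp[ec] + fp[ec]
        actual = tp[ec] + fn[ec]
        # Assumption: an undefined ratio (zero denominator) is scored as 0.
        precision = tp[ec] / predicted if predicted else 0.0
        recall = tp[ec] / actual if actual else 0.0
        result[ec] = (precision, recall)
    return result


def average(values: Iterable[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty collection."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

            <p>In a benchmark of the kind described above, the records would first be grouped by MTTSI interval, and the per-EC values within each interval would then be averaged.</p>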
            <p>Figure <figr fid="F1">1</figr> shows a comparison of the performance of the FDR-based (C1) and the SVM-based approaches (C5) applied to three-field EC number (Figure <figr fid="F1">1AB</figr>) and four-field EC number CHIEFc enzyme families (Figure <figr fid="F1">1CD</figr>). In the case of three-field EC number classifiers, the SVM-based method achieves significantly higher average recall at MTTSI lower than 30% and higher than 80% (Figure <figr fid="F1">1A</figr>), but shows no significant difference in average precision (Figure <figr fid="F1">1B</figr>). The SVM-based implementation for four-field EC number classifiers also shows an advantage in terms of average recall at MTTSI higher than 80% (Figure <figr fid="F1">1C</figr>), in addition to a significant increase of average precision at MTTSI between 30% and 40% (Figure <figr fid="F1">1D</figr>). Figure <figr fid="F2">2</figr> shows a comparison of the performance of the FDR-based (C2) and the SVM-based approaches (C6) applied to three-field EC number (Figure <figr fid="F2">2AB</figr>) and four-field EC number Multiple Pfam enzyme families (Figure <figr fid="F2">2CD</figr>). For three-field EC number classifiers, the SVM-based method exhibits significantly higher average recall in the 40% to 50% and higher than 80% MTTSI intervals (Figure <figr fid="F2">2A</figr>), and significantly higher average precision in the 30% to 40% MTTSI interval (Figure <figr fid="F2">2B</figr>). For four-field EC number classifiers, the improvements in average recall (Figure <figr fid="F2">2C</figr>) and precision (Figure <figr fid="F2">2D</figr>) of the SVM-based approach applied to Multiple Pfam families occur in the same MTTSI intervals as the improvements observed when this approach is applied to CHIEFc families (Figure <figr fid="F1">1C, D</figr>). In summary, in all the cases where the differences are statistically significant, the SVM-based methods show improved performance with respect to the corresponding FDR-based implementations. 
In fact, with only a few exceptions, the SVM-based methods exhibit the same or better average recall and precision than the FDR-based ones, although in several MTTSI intervals the current benchmark does not contain enough test sequences to make the differences between methods statistically significant.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Prediction performance of the FDR-based and SVM-based approaches applied to CHIEFc enzyme families</p>
               </caption>
               <text>
                  <p><b>Prediction performance of the FDR-based and SVM-based approaches applied to CHIEFc enzyme families</b>. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the FDR-based (blue columns) and SVM-based (red columns) approaches are plotted at different intervals of maximal test to training sequence identity (MTTSI). The average of each performance indicator is taken over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz<sup>2 </sup>version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average +/- standard deviation.</p>
               </text>
               <graphic file="1471-2105-10-107-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Prediction performance of the FDR-based and SVM-based approaches applied to Multiple Pfam enzyme families</p>
               </caption>
               <text>
                  <p><b>Prediction performance of the FDR-based and SVM-based approaches applied to Multiple Pfam enzyme families</b>. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the FDR-based (blue columns) and SVM-based (red columns) approaches are plotted at different intervals of maximal test to training sequence identity (MTTSI). The average of each performance indicator is taken over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz<sup>2 </sup>version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average +/- standard deviation.</p>
               </text>
               <graphic file="1471-2105-10-107-2"/>
            </fig>
            <p>Since EFICAz works by combining the predictions of different non-completely overlapping methods, even if the FDR- and the SVM-based approaches had identical average performance, both could still be useful, provided that each method can generate its own set of unique predictions. Figure <figr fid="F3">3</figr> shows the fraction of test sequences correctly predicted by either approach, both approaches, or neither of them, when implemented on three-field or four-field EC number classifiers based on Pfam or CHIEFc enzyme families. Although the overlap of the approaches is high, each method provides a set of unique predictions, with a higher contribution from the SVM-based approach for three-field EC number classifiers (10.0% and 6.3% for Multiple Pfam and CHIEFc enzyme families, respectively), and similar contributions from each approach for four-field EC number classifiers. Thus, we decided to keep the FDR-based predictive components and incorporate the SVM-based components: (C5) CHIEFc family based SVM evaluation and (C6) Multiple Pfam family based SVM evaluation in the new version of EFICAz.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Prediction overlap of FDR-based and SVM-based methods</p>
               </caption>
               <text>
                  <p><b>Prediction overlap of FDR-based and SVM-based methods</b>. The fractions of test sequences (corresponding to the benchmark described in "Benchmarking of EFICAz<sup>2 </sup>version 10", in the Methods section) correctly predicted by three or four-field EC number classifiers applied to Multiple Pfam or CHIEFc enzyme families are represented. For each combination of enzyme family and level of description of the classifiers, we show the fraction corresponding to unique predictions made by the FDR-based (blue) or SVM-based method (green), and the fraction corresponding to predictions made by both (orange) or neither of the methods (yellow).</p>
               </text>
               <graphic file="1471-2105-10-107-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Combination rules based on classification trees</p>
            </st>
            <p>The original version of EFICAz adopted the simple strategy of predicting a given EC number when at least one of its four components did <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. Figure <figr fid="F4">4</figr> shows the result of a benchmark that compares the performance of three different implementations of EFICAz (version 10), in terms of average recall (Figure <figr fid="F4">4AC</figr>) and average precision (Figure <figr fid="F4">4BD</figr>), distinguishing between two levels of detail of enzyme function given by three-field (Figure <figr fid="F4">4AB</figr>) or four-field EC numbers (Figure <figr fid="F4">4CD</figr>). In contrast to the results from previous benchmarks <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B27">27</abbr></abbrgrp>, the original EFICAz implementation shows poor average precision at MTTSI &lt; 30% (Figure <figr fid="F4">4BD</figr>, green columns). The discrepancy arises because in this work we employed a more rigorous way to estimate the precision of our method (see Methods section, "Benchmarking of EFICAz<sup>2 </sup>version 10"). We analyzed the effect of adding the two SVM-based components to EFICAz, bringing the total number of component methods to six (Figure <figr fid="F4">4</figr>, blue columns). As expected, a general pattern of increased recall (Figure <figr fid="F4">4AC</figr>) and decreased precision (Figure <figr fid="F4">4BD</figr>) with respect to the original four-component EFICAz can be observed, although only for three-field EC number classifiers at MTTSI &lt; 30% was the decrease in precision statistically significant.</p>
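            <p>The "at least one component" rule of the original EFICAz amounts to taking the union of the component predictions. A minimal sketch is given below; the function name <b>combine_predictions</b> and the <b>min_votes</b> parameter are hypothetical and are introduced only to show how requiring agreement among components changes the outcome.</p>

```python
from collections import Counter
from typing import Dict, Set


def combine_predictions(component_predictions: Dict[str, Set[str]],
                        min_votes: int = 1) -> Set[str]:
    """Report every EC number predicted by at least min_votes components.

    component_predictions maps a component label (e.g. "C1") to the set of
    EC numbers it predicts for one query sequence. With min_votes=1 this is
    the plain union of all component predictions.
    """
    votes = Counter(
        ec for ecs in component_predictions.values() for ec in ecs
    )
    return {ec for ec, n in votes.items() if n >= min_votes}
```

            <p>Adding components to a union rule can only enlarge the set of reported EC numbers, which is consistent with the pattern of increased recall and decreased precision observed when the two SVM-based components are added.</p>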
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Prediction performance of different EFICAz implementations</p>
               </caption>
               <text>
                  <p><b>Prediction performance of different EFICAz implementations</b>. For three-field (A, B) or four-field EC number classifiers (C, D), the average recall (A, C) and average precision (B, D) of the original EFICAz (green columns), EFICAz plus the new SVM-based components (blue columns) and EFICAz<sup>2 </sup>(red columns) are plotted at different intervals of maximal test to training sequence identity (MTTSI). The average of each performance indicator is taken over all the EC numbers defined in the specified MTTSI interval (numbers at the bottom of each column). Details about the benchmark can be found in "Benchmarking of EFICAz<sup>2 </sup>version 10", in the Methods section. Statistically significant differences in performance are indicated by black lines under the corresponding columns (see "Statistical analyses", in the Methods section). Values on top of each column represent average +/- standard deviation.</p>
               </text>
               <graphic file="1471-2105-10-107-4"/>
            </fig>
            <p>In order to improve the precision of our approach, we decided to investigate more efficient ways to integrate the predictions generated by the six EFICAz component methods. We had demonstrated in our previous work that increased precision can be achieved by requiring the consensus of two or more components of EFICAz <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. Here, we decided to train decision tree models to find the optimal way to take advantage of consensual information from the different components. Decision trees are very effective tools in machine learning that produce accurate, highly interpretable predictions and have been successfully used in several computational biology and bioinformatics applications <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, including enzyme function prediction <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. For our particular case, we sought decision trees able to output a binary outcome (whether a given EC number is assigned or not to a protein sequence), based on the prediction results of each component. Decision trees that produce discrete outcomes are called classification trees <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. There are several possibilities to consider regarding the level of generalization of the classification trees, for example, whether or not they depend on the specific EC number type. In principle, EC number-specific classification trees could yield more accurate predictions. However, since not all the EC number types are represented in the set of test sequences, we opted for an EC number-independent solution.</p>
            <p>After the training procedure detailed in "Decision tree learning model" in the Methods section, we obtained the four classification trees shown in Figure <figr fid="F5">5</figr>, one for each combination of three-field or four-field EC number classifiers and low (&lt; 30%) or high (&#8805; 30%) MTTSI. Inspection of the questions associated with the nodes of the classification trees indicates that the SVM-based components are the most informative ones; for example, CHIEFc family based SVM evaluation plays a role in all four trees (Figure <figr fid="F5">5</figr>). The version of our approach that employs these classification trees to integrate the information from the six possible component methods was termed EFICAz<sup>2</sup>.</p>
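            <p>The way one of four trees is selected and then applied to the six component results can be sketched as follows. This is only an illustration: the rule inside <b>toy_tree</b> is a hypothetical placeholder and does not reproduce any of the trees in Figure <figr fid="F5">5</figr>, and the sketch assumes that each component contributes a simple yes/no result. The component labels follow the abbreviations used in the legend of Figure <figr fid="F5">5</figr>.</p>

```python
# Abbreviations of the six components, as in the Figure 5 legend.
COMPONENTS = ("CH_FDR", "PF_FDR", "CH_SIT", "Prst", "CH_svm", "PF_svm")


def toy_tree(hits):
    """Hypothetical placeholder rule, NOT one of the published trees."""
    if hits["CH_svm"]:
        return hits["CH_FDR"] or hits["PF_svm"]
    return hits["CH_SIT"] and hits["Prst"]


# One tree per combination of EC number level (3 or 4 fields) and MTTSI
# regime. Here the same placeholder is reused four times; in practice each
# entry would hold a separately trained tree.
TREES = {
    (fields, regime): toy_tree
    for fields in (3, 4)
    for regime in ("low", "high")
}


def mttsi_regime(mttsi_percent):
    """Classify MTTSI as 'high' (30% or more) or 'low' (below 30%)."""
    return "high" if mttsi_percent >= 30 else "low"


def assign_ec(hits, ec_fields, mttsi_percent):
    """Return True if the EC number is assigned to the query sequence.

    hits maps each component abbreviation to True/False, meaning that the
    component did or did not recognize the EC number (assumed encoding).
    """
    missing = [name for name in COMPONENTS if name not in hits]
    if missing:
        raise ValueError("missing component results: " + ", ".join(missing))
    tree = TREES[(ec_fields, mttsi_regime(mttsi_percent))]
    return bool(tree(hits))


# Hypothetical query: only the two SVM-based components fire.
example_hits = dict.fromkeys(COMPONENTS, False)
example_hits.update({"CH_svm": True, "PF_svm": True})
example_call = assign_ec(example_hits, ec_fields=4, mttsi_percent=25.0)
```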
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Predictive models for EFICAz<sup>2 </sup>based on classification trees</p>
               </caption>
               <text>
                  <p><b>Predictive models for EFICAz<sup>2 </sup>based on classification trees</b>. Classification trees corresponding to three-field (A, B) and four-field EC numbers (C, D) to integrate predictions from each of the six EFICAz<sup>2 </sup>components for protein sequences that exhibit MTTSI &lt; 30% (A, C) or MTTSI &#8805; 30% (B, D). CH<sub>FDR </sub>= CHIEFc family based FDR recognition; PF<sub>FDR </sub>= Multiple Pfam family based FDR recognition; CH<sub>SIT </sub>= CHIEFc family specific SIT evaluation; Prst = High specificity multiple PROSITE pattern recognition; CH<sub>svm </sub>= CHIEFc family based SVM evaluation; PF<sub>svm </sub>= Multiple Pfam family based SVM evaluation.</p>
               </text>
               <graphic file="1471-2105-10-107-5"/>
            </fig>
            <p>We compared the performance of EFICAz<sup>2 </sup>(Figure <figr fid="F4">4</figr>, red columns) to that of the original EFICAz with four components and of the updated version with six components. Compared to the original EFICAz, EFICAz<sup>2 </sup>displays a statistically significant decrease in average recall at MTTSI &lt; 30% (a difference in recall of 5% and 10% for three- and four-field EC numbers, respectively, Figure <figr fid="F4">4AC</figr>) and at a few other MTTSI intervals, although the difference in recall is less than 5% in these latter cases. More importantly, EFICAz<sup>2 </sup>shows a dramatic increase in average precision at MTTSI &lt; 30% (a difference in precision of 25% and 55% for three- and four-field EC numbers, respectively, Figure <figr fid="F4">4BD</figr>). Similar tendencies, with average recall decreases and average precision increases of higher magnitude, can be observed when EFICAz<sup>2 </sup>is compared to EFICAz updated to six components. In summary, we first shifted the precision-recall trade-off towards higher recall and lower precision by adding the SVM-based components to the original EFICAz implementation. Then, by making more efficient use of consensus between predictions from different components via classification tree models, we achieved acceptable levels of average precision at low MTTSI, with low impact on the average recall. The EFICAz<sup>2 </sup>code is available upon request to academic and non-profit users. In addition, we have made EFICAz<sup>2 </sup>available as a web service <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> that allows the submission of query protein sequences and returns the output via email. If an enzyme function inference is made, the output consists of the four-field or three-field EC number prediction/s, the predictive component/s that recognized the EC number/s, the MTTSI interval associated with the query sequence and the mean and standard deviation of the precision obtained from benchmarks.</p>
            <p>EFICAz<sup>2 </sup>exhibits an average precision of at least 90% for MTTSI &#8805; 40% (Figure <figr fid="F4">4B, D</figr>), a non-trivial achievement, considering that MTTSI &#8805; 60% is required to achieve this level of precision from a sequence similarity criterion alone <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. Moreover, we significantly improved the prediction precision at MTTSI &lt; 30%, compared to the original implementation of EFICAz. Nevertheless, the recall in this regime still requires additional improvement (average recall of 33% and 23% for three-field and four-field EC numbers at MTTSI &lt; 30%, respectively, Figure <figr fid="F4">4AC</figr>). One possibility to overcome this limitation of EFICAz<sup>2 </sup>is to include methods that do not depend on sequence information. Some protein features that have been used before for enzyme function prediction include protein-protein interaction <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, phylogenetic distribution, tissue specificity and subcellular localization <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Although we will explore the possibility of including non-sequence-dependent features of proteins in future versions of EFICAz, their implementation may be impaired by the low availability or inconsistency of this kind of annotation in current databases.</p>
         </sec>
         <sec>
            <st>
               <p>Enzyme function annotation of the human proteome by EFICAz<sup>2</sup></p>
            </st>
            <p>We carried out an enzyme function reannotation of the human proteome (24,305 protein sequences) using EFICAz<sup>2 </sup>version 13 (see Methods section, "Datasets for the training of different EFICAz<sup>2 </sup>versions") and compared our annotations with those available in a recent release of KEGG (see Methods section, "Enzyme function annotation of the human proteome"). We decided to use KEGG annotations rather than other sources to compare against our EFICAz<sup>2 </sup>predictions because of the emphasis that this database puts on detailed EC number information, a fundamental requirement for the correct mapping of metabolic pathways. Two different levels of detail of the enzyme function assignment (given by three-field and four-field EC numbers) were considered separately for the analysis. Table <tblr tid="T1">1</tblr> summarizes the results of the comparison. A single protein may have more than one enzymatic activity; therefore, multiple EC numbers can be assigned to the same protein. Where pertinent, both the number of protein sequences and the number of annotations (which can be higher than the number of sequences) are reported.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Comparative enzyme function annotation of the human proteome<sup>(1)</sup></p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c cspan="7" ca="center">
                        <p>Level of detail of the enzyme function assignment: Three-field EC numbers</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="5" ca="center">
                        <p>EFICAz<sup>2 </sup>predictions<sup>(2)</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2" ca="center">
                        <p>Annotation source</p>
                     </c>
                     <c ca="center">
                        <p>EC numbers with less than three fields<sup>(4)</sup>: <b>20,889</b></p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Three-field EC numbers: 3,508/<b>3,416</b><sup>(5)</sup></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>EC numbers with less than three fields<sup>(4)</sup>: <b>21,398</b></p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>20,608</b>
                        </p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>EFICAz<sup>2 </sup>novels: 798/<b>790</b></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Level of EC annotation agreement<sup>(6)</sup></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>KEGG annotations<sup>(3)</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Annotation source</p>
                     </c>
                     <c ca="center">
                        <p>None</p>
                     </c>
                     <c ca="center">
                        <p>Partial</p>
                     </c>
                     <c ca="center">
                        <p>Full</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Three-field EC numbers: 2,954/<b>2,907</b></p>
                     </c>
                     <c ca="center">
                        <p>KEGG novels: 309/<b>281</b></p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>18/<b>18</b></p>
                     </c>
                     <c ca="center">
                        <p>138/<b>67</b></p>
                     </c>
                     <c ca="center">
                        <p>2,554/<b>2,541</b></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>KEGG</p>
                     </c>
                     <c ca="center">
                        <p>18/<b>18</b></p>
                     </c>
                     <c ca="center">
                        <p>73/<b>67</b></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="center">
                        <p>Level of detail of the enzyme function assignment: Four-field EC numbers</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="5" ca="center">
                        <p>EFICAz<sup>2 </sup>predictions<sup>(2)</sup></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="2" ca="center">
                        <p>Annotation source</p>
                     </c>
                     <c ca="center">
                        <p>EC numbers with less than four fields<sup>(4)</sup>: <b>21,660</b></p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Four-field EC numbers: 2,850/<b>2,645</b></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>EC numbers with less than four fields<sup>(4)</sup>: <b>21,833</b></p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>21,350</b>
                        </p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>EFICAz<sup>2 </sup>novels: 522/<b>483</b></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Level of EC annotation agreement<sup>(6)</sup></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>KEGG annotations<sup>(3)</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Annotation source</p>
                     </c>
                     <c ca="center">
                        <p>None</p>
                     </c>
                     <c ca="center">
                        <p>Partial</p>
                     </c>
                     <c ca="center">
                        <p>Full</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Four-field EC numbers: 2,523/<b>2,472</b></p>
                     </c>
                     <c ca="center">
                        <p>KEGG novels: 338/<b>310</b></p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2</sup></p>
                     </c>
                     <c ca="center">
                        <p>49/<b>46</b></p>
                     </c>
                     <c ca="center">
                        <p>260/<b>117</b></p>
                     </c>
                     <c ca="center">
                        <p>2,019/<b>1,999</b></p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>KEGG</p>
                     </c>
                     <c ca="center">
                        <p>46/<b>46</b></p>
                     </c>
                     <c ca="center">
                        <p>120/<b>117</b></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>(1) </sup>The source of the 24,305 human protein sequences is the KEGG Genes database Release 47.0+/06-26, of June 26, 2008.</p>
                  <p><sup>(2) </sup>Predictions made by EFICAz<sup>2 </sup>version 13.</p>
                  <p><sup>(3) </sup>Annotations obtained from the KEGG Brite database Release 47.0+/06-26, of June 26, 2008.</p>
                  <p><sup>(4) </sup>Includes non-enzymes, considered as having zero-field EC numbers.</p>
                  <p><sup>(5) </sup>Non-bolded font indicates number of annotations while bolded font refers to the number of annotated protein sequences (a single protein can display more than one enzymatic activity, thus, multiple EC numbers can be assigned to the same protein sequence).</p>
                  <p><sup>(6) </sup>Here, we compare the agreement between annotations from KEGG and EFICAz<sup>2 </sup>that have the same level of detail, whether three-field or four-field EC numbers. Three different levels of agreement are considered: 1) Full: all EC numbers assigned to the protein by KEGG and EFICAz<sup>2 </sup>are identical, 2) Partial: at least one but not all the EC numbers assigned to the protein by KEGG and EFICAz<sup>2 </sup>agree, and 3) None: none of the EC numbers assigned to the protein by KEGG and EFICAz<sup>2 </sup>coincides.</p>
               </tblfn>
            </tbl>
            <p>Table <tblr tid="T1">1</tblr> shows that, although both KEGG and EFICAz<sup>2 </sup>provide unique annotations, the novel assignments made by EFICAz<sup>2 </sup>significantly exceed those from KEGG. At the level of detail of three-field EC numbers, there are 798 novel annotations by EFICAz<sup>2 </sup>corresponding to 790 proteins versus 309 unique annotations for 281 proteins from KEGG. Similarly, for four-field EC numbers, there are 522 novel annotations for 483 proteins by EFICAz<sup>2 </sup>versus 338 unique annotations for 310 proteins from KEGG. We analyzed the agreement between EFICAz<sup>2 </sup>and KEGG assignments for the 2,626 sequences that were annotated with a level of detail of at least one three-field EC number by both sources. For a given annotated protein, we distinguished among three possibilities: i) full agreement, where all the EC number/s assigned to the protein by EFICAz<sup>2 </sup>and KEGG coincide, ii) partial agreement, where at least one but not all the EC numbers assigned to the protein by these sources agree, and iii) no agreement, where none of the EC numbers assigned to the protein by these sources agree. For the 2,626 common sequences annotated with three-field EC numbers, the level of full agreement is 96.8%, while the level of partial agreement or better is 99.3%. Similarly, for the 2,162 sequences annotated with four-field EC numbers by both sources, the full and at least partial agreement is 92.5% and 97.9%, respectively. The matching of EC numbers is done at the stated level of detail, i.e. when comparing three-field or four-field EC numbers, only the first three fields or the full four fields are considered, respectively.</p>
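            <p>The three agreement levels defined above can be expressed as a short set comparison. The sketch below assumes that EC numbers are given as dot-separated strings and that every input already has at least the requested number of fields; the function names are illustrative only.</p>

```python
def truncate(ec, fields):
    """Keep the first `fields` fields of an EC number such as '2.7.11.1'."""
    return ".".join(ec.split(".")[:fields])


def agreement(eficaz_ecs, kegg_ecs, fields):
    """Classify the agreement for one protein as 'full', 'partial' or 'none'.

    Both arguments are collections of EC numbers assigned to the same
    protein; matching is done on the first `fields` fields only.
    """
    eficaz_set = {truncate(ec, fields) for ec in eficaz_ecs}
    kegg_set = {truncate(ec, fields) for ec in kegg_ecs}
    if not eficaz_set or not kegg_set:
        raise ValueError("both sources must annotate the protein")
    if eficaz_set == kegg_set:
        return "full"
    if eficaz_set.intersection(kegg_set):
        return "partial"
    return "none"


# Hypothetical protein: the sources agree on three fields but not on four.
level_three = agreement(["2.7.11.1"], ["2.7.11.22"], fields=3)
level_four = agreement(["2.7.11.1"], ["2.7.11.22"], fields=4)
```

            <p>Because the comparison is made on sets, a protein counts as fully consistent only when neither source assigns an EC number that the other lacks.</p>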
            <p>The level of agreement between KEGG and EFICAz<sup>2 </sup>can also be assessed on the basis of the total number of EC number predictions by one or the other source, rather than by the total number of annotated proteins. The number of annotations and the number of proteins may differ because a single protein may have more than one enzymatic activity; therefore, more than one EC number may be associated with it. In this case, we only distinguish between agreement and lack of it. The number of annotations by EFICAz<sup>2 </sup>and KEGG for the 2,626 sequences annotated with three-field EC numbers by both sources is 2,710 and 2,645, respectively. Thus, the level of agreement is 96.7% ([67+2,554]/2,710) and 99.1% ([67+2,554]/2,645) when expressed in terms of the number of EFICAz<sup>2 </sup>and KEGG three-field EC number annotations, respectively. The number of annotations by EFICAz<sup>2 </sup>and KEGG for the 2,162 sequences annotated with four-field EC numbers by both sources is 2,328 and 2,185, respectively. Therefore, the level of agreement is 91.7% ([117+2,019]/2,328) and 97.8% ([117+2,019]/2,185), when expressed in terms of the number of EFICAz<sup>2 </sup>and KEGG four-field EC number annotations, respectively.</p>
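            <p>The annotation-based percentages quoted above follow directly from the counts in Table <tblr tid="T1">1</tblr>. The snippet below simply repeats that arithmetic; the dictionary layout and the key names are assumptions made for this illustration.</p>

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Counts taken from Table 1 and from the text, keyed by EC number level.
# "partial" is the bolded entry of the Partial column, "full" is the number
# of annotations in the Full column.
TABLE1 = {
    3: {"partial": 67, "full": 2554,
        "eficaz_annotations": 2710, "kegg_annotations": 2645},
    4: {"partial": 117, "full": 2019,
        "eficaz_annotations": 2328, "kegg_annotations": 2185},
}


def annotation_agreement(level):
    """Return agreement (%) relative to EFICAz2 and to KEGG annotations."""
    row = TABLE1[level]
    agreeing = row["partial"] + row["full"]
    return (
        percent(agreeing, row["eficaz_annotations"]),
        percent(agreeing, row["kegg_annotations"]),
    )


for level in (3, 4):
    vs_eficaz, vs_kegg = annotation_agreement(level)
    print(f"{level}-field: {vs_eficaz:.2f}% of EFICAz2 annotations, "
          f"{vs_kegg:.2f}% of KEGG annotations")
```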
            <p>This comparative analysis indicates that when both sources make EC number assignments for the same protein sequence, there is a high chance that these assignments are consistent. On the other hand, at the level of detail of three-field EC numbers, EFICAz<sup>2 </sup>generates more than double the number of unique assignments (i.e., assignments for proteins annotated as non-enzymes by the other compared source), while it provides more than 50% additional unique assignments when four-field EC numbers are considered. The unique EC number assignments made by EFICAz<sup>2 </sup>can be found in Additional file <supplr sid="S2">2</supplr>: Novel enzyme function annotations of the human proteome by EFICAz<sup>2</sup>.</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Novel enzyme function annotations of the human proteome by EFICAz<sup>2</sup></b>. Excel spreadsheet listing all the three-field or four-field EC numbers assigned by EFICAz<sup>2 </sup>version 13 to human proteins that were not annotated as enzymes in the Release 47.0 of the KEGG database.</p>
               </text>
               <file name="1471-2105-10-107-S2.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In this work, we described, implemented and tested EFICAz<sup>2</sup>, a new version of EFICAz <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, our automated approach for enzyme function prediction, enhanced by means of machine learning techniques. We increased the number of EFICAz components from four to six by adding two methods based on the evaluation of Pfam and CHIEFc enzyme families by SVM classifiers. The SVM-based components showed statistically significant performance improvements compared to their counterpart methods based on the detection of FDRs. We generated a set of classification trees to integrate and take advantage of the complementarity between the predictions from the six component methods, and achieved a remarkable increase in average precision at low MTTSI, with only moderate impact on average recall. When we applied EFICAz<sup>2 </sup>to the enzyme function reannotation of the human proteome, we found that for proteins annotated as enzymes by both EFICAz<sup>2 </sup>and KEGG, the assigned EC numbers were highly consistent. Moreover, the number of unique enzyme assignments generated by EFICAz<sup>2 </sup>is significantly higher than the number of unique enzyme annotations in KEGG. Thus, the results of the performance benchmark and the comparison with KEGG demonstrate that EFICAz<sup>2 </sup>is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Datasets for the training of different EFICAz<sup>2 </sup>versions</p>
            </st>
            <p>The training of EFICAz<sup>2 </sup>requires a source of protein sequences with high-quality functional annotations; for this purpose, we employ the UniProt Knowledgebase (UniProt) <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. From the UniProtKB/Swiss-Prot component of UniProt (Swiss-Prot), we extract a set of enzyme sequences and a set of non-enzyme sequences, according to the criteria described in the original EFICAz article <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. These reference sets are employed for the training of all the EFICAz<sup>2 </sup>predictive components. Table <tblr tid="T2">2</tblr> shows the number of sequences included in the "enzymes" and "non-enzymes" sets corresponding to versions 10 and 13 of EFICAz<sup>2</sup>, as well as the number of sequences with three- and four-field EC number annotations in the "enzymes" sets. To train EFICAz<sup>2 </sup>versions 10 and 13, we used Releases 10 (March 2007) and 13 (February 2008) of UniProt, respectively. For training of the predictive components "Multiple Pfam family based FDR recognition" and "Multiple Pfam family based SVM evaluation" of both EFICAz<sup>2 </sup>versions, we used the Pfam database <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> Release 22. Finally, for the training of the "High specificity multiple PROSITE pattern recognition" component of EFICAz<sup>2 </sup>versions 10 and 13, we used Releases 20.26 and 20.30 of the PROSITE database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, respectively. For EFICAz<sup>2 </sup>versions 10 and 13, Table <tblr tid="T3">3</tblr> shows the number of Pfam enzyme families, CHIEFc enzyme families and PROSITE patterns as well as the number of different three-field and four-field EC numbers associated with them.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Number of sequences in reference sets used for EFICAz<sup>2 </sup>training</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Reference sequence set</p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 10</p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 13</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>"non enzymes"</p>
                     </c>
                     <c ca="center">
                        <p>132,342</p>
                     </c>
                     <c ca="center">
                        <p>174,898</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>"enzymes" (all)</p>
                     </c>
                     <c ca="center">
                        <p>94,028</p>
                     </c>
                     <c ca="center">
                        <p>136,167</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>"enzymes" (three-field EC number)</p>
                     </c>
                     <c ca="center">
                        <p>90,801</p>
                     </c>
                     <c ca="center">
                        <p>131,503</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>"enzymes" (four-field EC number)</p>
                     </c>
                     <c ca="center">
                        <p>76,698</p>
                     </c>
                     <c ca="center">
                        <p>111,577</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Number of families and EC number types associated with different EFICAz<sup>2 </sup>predictive components</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Type of EFICAz<sup>2 </sup>component</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Three-field EC numbers</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Four-field EC numbers</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 10</p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 13</p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 10</p>
                     </c>
                     <c ca="center">
                        <p>EFICAz<sup>2 </sup>version 13</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                     <c cspan="1">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PFAM families</p>
                     </c>
                     <c ca="center">
                        <p>2294/<b>202</b><sup>(1)</sup></p>
                     </c>
                     <c ca="center">
                        <p>2294/<b>201</b></p>
                     </c>
                     <c ca="center">
                        <p>2022/<b>1987</b></p>
                     </c>
                     <c ca="center">
                        <p>2153/<b>2069</b></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CHIEFc families</p>
                     </c>
                     <c ca="center">
                        <p>2932/<b>208</b></p>
                     </c>
                     <c ca="center">
                        <p>2947/<b>209</b></p>
                     </c>
                     <c ca="center">
                        <p>3548/<b>2248</b></p>
                     </c>
                     <c ca="center">
                        <p>3607/<b>2354</b></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PROSITE patterns</p>
                     </c>
                     <c ca="center">
                        <p>807/<b>102</b></p>
                     </c>
                     <c ca="center">
                        <p>1949/<b>128</b></p>
                     </c>
                     <c ca="center">
                        <p>527/<b>228</b></p>
                     </c>
                     <c ca="center">
                        <p>1368/<b>437</b></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>All EFICAz<sup>2 </sup>components</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>208</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>209</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2248</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2354</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>(1) </sup>Non-bolded font indicates number of families or patterns while bolded font refers to the number of different EC number types recognized by the indicated category of EFICAz<sup>2 </sup>predictive component.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Benchmarking of EFICAz<sup>2 </sup>version 10</p>
            </st>
            <p>To evaluate the effect of the modifications introduced into EFICAz, we performed a benchmark using annotated Swiss-Prot sequences that were not used for training EFICAz<sup>2 </sup>version 10. First, we generated (as described above) "enzymes" and "non-enzymes" reference sets from all the newly added Swiss-Prot sequences in UniProt Release 12.6 that were not included in Release 10 of this database. The test sequences used to evaluate three-field EC number prediction performance consist of all the 16,430 members of the "non-enzymes" set plus 9,397 members of the "enzymes" set annotated with at least one of the 208 three-field EC number types recognized by EFICAz<sup>2 </sup>version 10. Similarly, the test sequences used to evaluate four-field EC number prediction performance include the 16,430 non-enzymes plus 6,996 members of the "enzymes" set annotated with at least one of the 2,248 four-field EC number types recognized by EFICAz<sup>2 </sup>version 10. Figure <figr fid="F6">6</figr> shows the distribution of the number of test sequences per enzyme type. Then, we compared the functional annotations of each test sequence in UniProt 12.6 with our functional predictions using EFICAz<sup>2 </sup>version 10, which is based on Release 10 of UniProt.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Distribution of the number of test sequences per enzyme type</p>
               </caption>
               <text>
                  <p><b>Distribution of the number of test sequences per enzyme type</b>. Distribution of 9,397 test enzyme sequences into 145 types of three-field EC numbers (green columns) and 6,996 test enzyme sequences into 614 types of four-field EC numbers (red columns).</p>
               </text>
               <graphic file="1471-2105-10-107-6"/>
            </fig>
            <p>For a given enzyme function <b><it>f </it></b>described by a three-field or four-field EC number, we calculate: <b>precision<sub><it>f </it></sub>= TP<sub><it>f</it></sub>/(TP<sub><it>f</it></sub>+FP<sub><it>f</it></sub>)</b>, and <b>recall<sub><it>f </it></sub>= TP<sub><it>f</it></sub>/(TP<sub><it>f </it></sub>+ FN<sub><it>f</it></sub>)</b>, where (i) <b>TP<sub><it>f </it></sub></b>(number of true positives) is the number of test sequences for which the function <b><it>f </it></b>is assigned by both EFICAz<sup>2 </sup>and UniProt 12.6, (ii) <b>FP<sub><it>f </it></sub></b>(number of false positives) is the number of test sequences for which the function <b><it>f </it></b>is assigned by EFICAz<sup>2 </sup>but not by UniProt 12.6, and (iii) <b>FN<sub><it>f </it></sub></b>(number of false negatives) is the number of test sequences for which the function <b><it>f </it></b>is assigned by UniProt 12.6 but not by EFICAz<sup>2</sup>.</p>
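            <p>As an illustration of these definitions, the following Python sketch counts TP, FP and FN per EC number and derives precision and recall from them. It is not taken from the EFICAz<sup>2</sup> code; the function names and the input format (a dictionary mapping each sequence identifier to a set of EC numbers) are assumptions made for this example.</p>
            <p>
```python
from collections import defaultdict


def per_function_counts(predicted, reference):
    """Count TP, FP and FN for every enzyme function f.

    predicted, reference: dict mapping sequence id -> set of EC numbers
    (hypothetical input format chosen for this sketch).
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for seq_id in set(predicted) | set(reference):
        pred = predicted.get(seq_id, set())
        ref = reference.get(seq_id, set())
        for f in pred.intersection(ref):   # assigned by both
            counts[f]["TP"] += 1
        for f in pred - ref:               # predicted only
            counts[f]["FP"] += 1
        for f in ref - pred:               # reference only
            counts[f]["FN"] += 1
    return dict(counts)


def precision(c):
    """TP / (TP + FP), or None when no sequence was predicted as f."""
    denom = c["TP"] + c["FP"]
    return c["TP"] / denom if denom else None


def recall(c):
    """TP / (TP + FN), or None when no reference sequence has f."""
    denom = c["TP"] + c["FN"]
    return c["TP"] / denom if denom else None
```
            </p>
            <p>Returning None when a denominator is zero keeps undefined values out of any later average.</p>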
            <p>In UniProt, as well as in most protein sequence databases, the distribution of different EC classes is non-uniform, i.e. some enzyme functions are overrepresented while others are underrepresented (see Figure <figr fid="F6">6</figr>). To reduce the bias towards the most represented enzyme functions, we evaluate precision and recall for each individual enzyme function <b><it>f</it></b>, and then calculate average values. On the other hand, it is clear that test sequences with higher sequence identity to training enzymes are easier to predict than those exhibiting lower sequence identity. This correlation, together with the fact that the sequence identities of the test sequences to the training enzymes are, in general, not uniformly distributed, introduces another potential source of bias. To reduce this second type of bias, we evaluate EFICAz<sup>2</sup>'s performance at different levels of maximal test to training sequence identity (MTTSI). We define MTTSI as the maximal sequence identity between a given test sequence whose predicted function is <b><it>f </it></b>and any training enzyme whose true function is <b><it>f</it></b>.</p>
            <p>Given an MTTSI interval <b><it>m </it></b>and an enzyme function <b><it>f</it></b>, we first select the test sequences whose EFICAz<sup>2 </sup>predicted function is <b><it>f </it></b>and whose MTTSI falls into the interval <b><it>m</it></b>. Then, based on the selected test sequences, we calculate the precision and recall of EFICAz<sup>2 </sup>for enzyme function <b><it>f </it></b>and MTTSI bin <b><it>m</it></b>. For each MTTSI bin, we calculate and report the average precision and recall across all enzyme functions for which these performance indicators are defined (i.e., where (TP<sub><it>f </it></sub>+ FP<sub><it>f</it></sub>) > 0 for precision calculation and where (TP<sub><it>f </it></sub>+ FN<sub><it>f</it></sub>) > 0 for recall calculation). Note that in previous benchmarks of EFICAz <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B27">27</abbr></abbrgrp>, we calculated the average precision per MTTSI bin only across the EC number types that were represented in the test sequences. In this work, we decided to average the performance of all possible EC number types, which translates into a decreased average precision (because, by definition, all the additional enzyme functions considered for the average will have zero true positives) but provides a more realistic estimation of our method's performance.</p>
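            <p>The per-bin averaging of precision can be sketched as follows. This is a minimal example under stated assumptions, not the benchmark code itself: the function name, the tuple layout of the input and the choice of bin edges are hypothetical, and only functions with at least one prediction in a bin (TP + FP > 0) contribute to that bin's average.</p>
            <p>
```python
from bisect import bisect_right
from collections import defaultdict


def average_precision_per_bin(predictions, edges):
    """Average per-function precision within each MTTSI bin.

    predictions: iterable of (predicted_ec, mttsi_percent, is_correct),
                 one tuple per (test sequence, predicted function) pair.
    edges: ascending bin edges, e.g. [0, 30, 100]; a bin includes its
           lower edge, and the last bin also includes the final edge.
    Returns {bin_index: mean precision over functions with TP + FP > 0}.
    """
    tp = defaultdict(int)     # (bin, f) -> true positives
    total = defaultdict(int)  # (bin, f) -> TP + FP
    for f, mttsi, is_correct in predictions:
        b = min(bisect_right(edges, mttsi) - 1, len(edges) - 2)
        total[(b, f)] += 1
        if is_correct:
            tp[(b, f)] += 1

    per_bin = defaultdict(list)
    for (b, f), n in total.items():
        per_bin[b].append(tp[(b, f)] / n)   # precision of f in bin b
    return {b: sum(v) / len(v) for b, v in per_bin.items()}
```
            </p>
            <p>Because each function contributes one value per bin regardless of how many sequences it has, abundant EC numbers do not dominate the average.</p>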
            <p>In this work, we evaluated two more versions of EFICAz, besides EFICAz<sup>2</sup>: i) the original implementation of EFICAz where predictions from four component methods are combined without integration by classification tree models, and ii) a version that combines the previous four components and the two new SVM-based components, also lacking the benefit of classification tree predictive models. These versions only differ from EFICAz<sup>2 </sup>in the number of utilized component methods, or the way the predictions from different components are combined. Thus, the procedures for training of the individual components described above for EFICAz<sup>2 </sup>also apply to these two other versions of EFICAz.</p>
         </sec>
         <sec>
            <st>
               <p>Statistical analyses</p>
            </st>
            <p>We performed two-tailed t-tests to determine the significance of the differences in the average recall and precision at specific MTTSI intervals observed between different pairs of predictive methods. Our null hypothesis was that there is no significant change in these performance indicators (critical alpha level = 0.05). To evaluate differences in average recall, we used correlated t-tests because the recall values from each of the two compared methods can be matched according to their specific EC numbers. Conversely, to evaluate differences in average precision, we used t-tests for unpaired data because the prediction precision values associated with each method are not defined for the same set of EC numbers. In this case, assuming that the random variables had different (heteroscedastic t-test) or the same variance (homoscedastic t-test) yielded the same results at the set critical alpha level of 0.05.</p>
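            <p>To make the distinction between the two kinds of test concrete, the sketch below computes the t statistic and degrees of freedom for matched samples (as used here for recall) and for unpaired samples with unequal variances (one of the two variants used here for precision). It relies only on the Python standard library; the function names are assumptions, and converting the statistic to a two-tailed p-value would additionally require the t distribution, which is not shown.</p>
            <p>
```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(x, y):
    """t statistic and df for matched samples (values paired by EC number)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


def welch_t(x, y):
    """Unequal-variance (heteroscedastic) t statistic and
    Welch-Satterthwaite degrees of freedom for unpaired samples."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df
```
            </p>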
         </sec>
         <sec>
            <st>
               <p>Support vector machine models</p>
            </st>
            <p>We built an SVM model for each particular Pfam and CHIEFc enzyme family, whether the family is associated with a three-field or with a four-field EC number. Each enzyme family consists of a multiple sequence alignment of homo- and hetero-functional members; the goal of each SVM model is to discriminate between them. For classification purposes, homo- and hetero-functional members of an enzyme family are considered as positives and negatives, respectively. To transform the aligned protein sequences into a data matrix suitable for machine learning, a particular amino acid encoding scheme needs to be selected. Several methods for amino acid encoding have been proposed in the literature <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>. Here, we adopt an encoding method where each amino acid is represented by five highly interpretable continuous variables derived from multivariate statistical analysis of 494 physicochemical attributes <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Thus, for training and evaluation of the SVM models, each aligned position of a member sequence is regarded as a five-dimensional vector, and a multiple sequence alignment with <it>M </it>proteins and <it>N </it>aligned positions is converted to a data matrix with <it>M </it>samples and <it>N</it>*5 input features. Therefore, a different SVM model is associated with each enzyme family, each model having a different number of features, depending on the number of aligned positions. We implemented the SVM models using the libSVM package <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> (kernel function = Radial Basis Function (RBF), &#947; = 1/k, where k is the number of attributes in the input data, and C = 1).</p>
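            <p>The conversion of an alignment into an <it>M</it> by <it>N</it>*5 data matrix can be sketched as below. This is an illustrative example only: the function name is hypothetical, the descriptor values in toy_factors are placeholders rather than the published five-factor values, and the treatment of gaps (five zeros) is an assumption, since the text does not specify how gap positions are encoded.</p>
            <p>
```python
def encode_alignment(aligned_seqs, factors,
                     gap_vector=(0.0, 0.0, 0.0, 0.0, 0.0)):
    """Turn M aligned sequences of length N into M rows of N*5 features.

    factors: dict mapping a one-letter amino acid code to 5 descriptors.
    Symbols missing from `factors` (including '-') receive gap_vector,
    which is an assumption of this sketch.
    """
    n = len(aligned_seqs[0])
    matrix = []
    for seq in aligned_seqs:
        if len(seq) != n:
            raise ValueError("all aligned sequences must have equal length")
        row = []
        for residue in seq:
            row.extend(factors.get(residue, gap_vector))
        matrix.append(row)
    return matrix


# Placeholder descriptors for illustration only; NOT the published values.
toy_factors = {
    "A": (1.0, 0.0, 0.0, 0.0, 0.0),
    "G": (0.0, 1.0, 0.0, 0.0, 0.0),
}
X = encode_alignment(["AG-", "GGA"], toy_factors)  # M = 2, N = 3
gamma = 1.0 / len(X[0])  # RBF gamma = 1/k, with k = N*5 features
```
            </p>
            <p>Because k depends on the alignment length, the value of &#947; differs from one family model to another.</p>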
         </sec>
         <sec>
            <st>
               <p>Decision tree learning model</p>
            </st>
            <p>Decision trees are predictive models that classify data by mapping features of the data items to inferences about their target values, by means of a hierarchy of questions about such features <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. Decision trees can be implemented as classification trees when the outcome is discrete, or regression trees when the outcome is continuous <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. In this work, we have used classification trees to integrate the predictions generated by each of the six EFICAz component methods (C1 to C6) into a final, more precise EC number prediction. The source for training and testing of our classification tree predictive models is the dataset described in "Benchmarking of EFICAz<sup>2 </sup>version 10", in the Methods section. Our training samples are (<b>p</b>, <b>z</b>) pairs, where <b>p </b>denotes a protein sequence and <b>z </b>indicates its EC number. The features considered for the classification are the prediction statuses of the six EFICAz components. We encode the feature information for a given sample (<b>p</b>, <b>z</b>) in a six-dimensional binary vector. Thus, "1" in a given dimension of the vector means that the corresponding EFICAz component predicts that protein sequence <b>p </b>exhibits the enzymatic activity associated with EC number <b>z</b>, while "0" indicates the opposite. The outcome of the predictive model is a logical variable indicating whether or not <b>z </b>is assigned to <b>p</b>.</p>
            <p>We generated classification trees for two levels of enzyme function description (three- and four-field EC numbers) in two variants each, one for protein sequences with MTTSI &lt; 30% and the other for protein sequences with MTTSI &#8805; 30%. The 30% MTTSI threshold was empirically determined and optimized to achieve a biologically useful trade-off between the prediction performance for sequences inside and outside the "Twilight Zone" of function prediction, as evaluated in our benchmarks. To create the classification trees, we used the rpart package version 3.1&#8211;41 from the statistical analysis tool R <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. The fitting of the models was done using the default parameters of the rpart function, with the exception of the <it>weights </it>argument. We opted for an EC number-dependent case weight equal to the harmonic mean of 1 and 1/N, i.e. 2/(N+1), where N is the number of training sequences that belong to a given EC number. The rationale of this weighting scheme is that it is a halfway balance between two extreme situations: i) implementing a weight = 1/N and thus completely ignoring the natural biases in enzyme abundance that might be partially reflected in databases (all EC number types are treated equally, whether represented by only one or by a large number of sequences), and ii) using a weight = 1 for all cases (no weighting), with the risk of excessively biasing the models towards the EC numbers most abundantly represented in our training set of sequences.</p>
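            <p>The case weight 2/(N+1) can be computed per training sample as in the short sketch below, whose output could then be supplied as case weights to a tree-fitting routine. The function name and the list-of-labels input are assumptions made for this example.</p>
            <p>
```python
from collections import Counter


def case_weights(ec_labels):
    """Weight each training case by 2 / (N + 1), where N is the number
    of training cases sharing its EC number (the harmonic mean of 1
    and 1/N)."""
    n_per_ec = Counter(ec_labels)
    return [2.0 / (n_per_ec[ec] + 1) for ec in ec_labels]


labels = ["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.7.1.1"]
weights = case_weights(labels)
```
            </p>
            <p>In this toy example the three cases of the abundant EC number each receive a weight of 0.5, while the single case of the rare EC number keeps a weight of 1.0.</p>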
         </sec>
         <sec>
            <st>
               <p>Enzyme function annotation of the human proteome</p>
            </st>
            <p>The sources for the human protein sequences and their enzyme function annotations were the KEGG Genes and Brite databases (Release 47.0+/06-26, of June 26, 2008), respectively.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>AKA and YH participated in the design of EFICAz<sup>2</sup>. AKA conceived of the study, analyzed the results of the performance benchmarks, performed the reannotation of the human proteome, designed the web server and drafted the manuscript. YH implemented the machine learning enhancements of EFICAz<sup>2 </sup>and helped to draft the manuscript. JS conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This research was supported by grant No. GM-48835 of the Division of General Medical Sciences of the NIH.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>High precision multi-genome scale reannotation of enzyme function by EFICAz</p>
            </title>
            <aug>
               <au>
                  <snm>Arakaki</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Tian</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Skolnick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>315</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1764738</pubid>
                  <pubid idtype="pmpid" link="fulltext">17166279</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-7-315</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The complement of enzymatic sets in different species</p>
            </title>
            <aug>
               <au>
                  <snm>Freilich</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Spriggs</snm>
                  <fnm>RV</fnm>
               </au>
               <au>
                  <snm>George</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Al-Lazikani</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Swindells</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2005</pubdate>
            <volume>349</volume>
            <issue>4</issue>
            <fpage>745</fpage>
            <lpage>763</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2005.04.027</pubid>
                  <pubid idtype="pmpid" link="fulltext">15896806</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes</p>
            </title>
            <aug>
               <au>
                  <snm>Webb</snm>
                  <fnm>EC</fnm>
               </au>
            </aug>
            <publisher>San Diego: Published for the International Union of Biochemistry and Molecular Biology by Academic Press</publisher>
            <pubdate>1992</pubdate>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Evolution of enzyme superfamilies</p>
            </title>
            <aug>
               <au>
                  <snm>Glasner</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Gerlt</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Babbitt</snm>
                  <fnm>PC</fnm>
               </au>
            </aug>
            <source>Curr Opin Chem Biol</source>
            <pubdate>2006</pubdate>
            <volume>10</volume>
            <issue>5</issue>
            <fpage>492</fpage>
            <lpage>497</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cbpa.2006.08.012</pubid>
                  <pubid idtype="pmpid" link="fulltext">16935022</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium</p>
            </title>
            <aug>
               <au>
                  <snm>Ginsburg</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Trends Parasitol</source>
            <pubdate>2008</pubdate>
            <volume>25</volume>
            <issue>1</issue>
            <fpage>37</fpage>
            <lpage>43</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.pt.2008.08.012</pubid>
                  <pubid idtype="pmpid" link="fulltext">18986839</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Genome-scale reconstruction of the metabolic network in Staphylococcus aureus N315: an initial draft to the two-dimensional annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Becker</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Palsson</snm>
                  <fnm>BO</fnm>
               </au>
            </aug>
            <source>BMC Microbiol</source>
            <pubdate>2005</pubdate>
            <volume>5</volume>
            <fpage>12</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1079850</pubid>
                  <pubid idtype="pmpid" link="fulltext">15766389</pubid>
                  <pubid idtype="doi">10.1186/1471-2180-5-8</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>A network-based method for target selection in metabolic networks</p>
            </title>
            <aug>
               <au>
                  <snm>Guimera</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sales-Pardo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Amaral</snm>
                  <fnm>LAN</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>13</issue>
            <fpage>1616</fpage>
            <lpage>1622</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2149892</pubid>
                  <pubid idtype="pmpid" link="fulltext">17463022</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/btm150</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Metabolic reconstruction and analysis for parasite genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Pinney</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Papp</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hyland</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wambua</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Westhead</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>McConkey</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Trends Parasitol</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>11</issue>
            <fpage>548</fpage>
            <lpage>554</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.pt.2007.08.013</pubid>
                  <pubid idtype="pmpid" link="fulltext">17950669</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Identification of metabolites with anticancer properties by Computational Metabolomics</p>
            </title>
            <aug>
               <au>
                  <snm>Arakaki</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mezencev</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bowen</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>McDonald</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Skolnick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Mol Cancer</source>
            <pubdate>2008</pubdate>
            <volume>7</volume>
            <issue>1</issue>
            <fpage>57</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2453147</pubid>
                  <pubid idtype="pmpid" link="fulltext">18559081</pubid>
                  <pubid idtype="doi">10.1186/1476-4598-7-57</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Human metabolic network reconstruction and its impact on drug discovery and development</p>
            </title>
            <aug>
               <au>
                  <snm>Ma</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Goryanin</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Drug Discov Today</source>
            <pubdate>2008</pubdate>
            <volume>13</volume>
            <issue>9&#8211;10</issue>
            <fpage>402</fpage>
            <lpage>408</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.drudis.2008.02.002</pubid>
                  <pubid idtype="pmpid" link="fulltext">18468557</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>The past, present and future of genome-wide re-annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Karp</snm>
                  <fnm>PD</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>2</issue>
            <note>COMMENT2001.</note>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">139008</pubid>
                  <pubid idtype="pmpid" link="fulltext">11864365</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function</p>
            </title>
            <aug>
               <au>
                  <snm>Punta</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ofran</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2008</pubdate>
            <volume>4</volume>
            <issue>10</issue>
            <fpage>e1000160</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2518264</pubid>
                  <pubid idtype="pmpid" link="fulltext">18974821</pubid>
                  <pubid idtype="doi">10.1371/journal.pcbi.1000160</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Can sequence determine function?</p>
            </title>
            <aug>
               <au>
                  <snm>Gerlt</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Babbitt</snm>
                  <fnm>PC</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2000</pubdate>
            <volume>1</volume>
            <issue>5</issue>
            <fpage>REVIEWS0005</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">138884</pubid>
                  <pubid idtype="pmpid" link="fulltext">11178260</pubid>
                  <pubid idtype="doi">10.1186/gb-2000-1-5-reviews0005</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>How well is enzyme function conserved as a function of pairwise sequence identity?</p>
            </title>
            <aug>
               <au>
                  <snm>Tian</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Skolnick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2003</pubdate>
            <volume>333</volume>
            <issue>4</issue>
            <fpage>863</fpage>
            <lpage>882</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2003.08.057</pubid>
                  <pubid idtype="pmpid" link="fulltext">14568541</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Whole-genome sequence annotation: 'Going wrong with confidence'</p>
            </title>
            <aug>
               <au>
                  <snm>Kyrpides</snm>
                  <fnm>NC</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>1999</pubdate>
            <volume>32</volume>
            <issue>4</issue>
            <fpage>886</fpage>
            <lpage>887</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2958.1999.01380.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">10361291</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Annotation transfer for genomics: measuring functional divergence in multi-domain proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Hegyi</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <issue>10</issue>
            <fpage>1632</fpage>
            <lpage>1640</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">311165</pubid>
                  <pubid idtype="pmpid" link="fulltext">11591640</pubid>
                  <pubid idtype="doi">10.1101/gr.183801</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption</p>
            </title>
            <aug>
               <au>
                  <snm>Galperin</snm>
                  <fnm>MY</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>In Silico Biol</source>
            <pubdate>1998</pubdate>
            <volume>1</volume>
            <issue>1</issue>
            <fpage>55</fpage>
            <lpage>67</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Intrinsic errors in genome annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Devos</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>8</issue>
            <fpage>429</fpage>
            <lpage>431</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(01)02348-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">11485799</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Errors in genome annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <issue>4</issue>
            <fpage>132</fpage>
            <lpage>133</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(99)01706-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">10203816</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Modeling the percolation of annotation errors in a database of protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Gilks</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>De Angelis</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tsoka</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>12</issue>
            <fpage>1641</fpage>
            <lpage>1649</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.12.1641</pubid>
                  <pubid idtype="pmpid" link="fulltext">12490449</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Estimating the annotation error rate of curated GO database sequence annotations</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Baumann</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>170</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779800</pubid>
                  <pubid idtype="pmpid" link="fulltext">17214880</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-170</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Large-scale assessment of the utility of low-resolution protein structures for biochemical function assignment</p>
            </title>
            <aug>
               <au>
                  <snm>Arakaki</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Skolnick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>7</issue>
            <fpage>1087</fpage>
            <lpage>1096</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth044</pubid>
                  <pubid idtype="pmpid" link="fulltext">14764543</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Prediction of enzyme function based on 3D templates of evolutionarily important amino acids</p>
            </title>
            <aug>
               <au>
                  <snm>Kristensen</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Ward</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Lisewski</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Erdin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>BY</fnm>
               </au>
               <au>
                  <snm>Fofanov</snm>
                  <fnm>VY</fnm>
               </au>
               <au>
                  <snm>Kimmel</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kavraki</snm>
                  <fnm>LE</fnm>
               </au>
               <au>
                  <snm>Lichtarge</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>17</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2219985</pubid>
                  <pubid idtype="pmpid" link="fulltext">18190718</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-17</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Automated discovery of 3D motifs for protein function annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Polacco</snm>
                  <fnm>BJ</fnm>
               </au>
               <au>
                  <snm>Babbitt</snm>
                  <fnm>PC</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>6</issue>
            <fpage>723</fpage>
            <lpage>730</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btk038</pubid>
                  <pubid idtype="pmpid" link="fulltext">16410325</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Enzyme function prediction with interpretable models</p>
            </title>
            <aug>
               <au>
                  <snm>Syed</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Yona</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Computational Systems Biology</source>
            <publisher>Totowa, NJ: Humana Press</publisher>
            <editor>McDermott J, Samudrala R, Bumgarner R, Montgomery K, Ireton R</editor>
            <pubdate>2009</pubdate>
            <volume>541</volume>
            <fpage>187</fpage>
            <lpage>199</lpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Identifying metabolic enzymes with multiple types of association evidence</p>
            </title>
            <aug>
               <au>
                  <snm>Kharchenko</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Freund</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Vitkup</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>177</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1450304</pubid>
                  <pubid idtype="pmpid" link="fulltext">16571130</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-177</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference</p>
            </title>
            <aug>
               <au>
                  <snm>Tian</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Arakaki</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Skolnick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>21</issue>
            <fpage>6226</fpage>
            <lpage>6239</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">535665</pubid>
                  <pubid idtype="pmpid" link="fulltext">15576349</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh956</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Support-vector networks</p>
            </title>
            <aug>
               <au>
                  <snm>Cortes</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Mach Learn</source>
            <pubdate>1995</pubdate>
            <volume>20</volume>
            <issue>3</issue>
            <fpage>273</fpage>
            <lpage>297</lpage>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Classification and regression trees</p>
            </title>
            <aug>
               <au>
                  <snm>Breiman</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <publisher>Belmont, Calif.: Wadsworth International Group</publisher>
            <pubdate>1984</pubdate>
         </bibl>
         <bibl id="B30">
            <title>
               <p>KEGG: Kyoto Encyclopedia of Genes and Genomes</p>
            </title>
            <url>ftp://ftp.genome.jp/pub/kegg/</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>PROSITE Database</p>
            </title>
            <url>ftp://us.expasy.org/databases/prosite/</url>
         </bibl>
         <bibl id="B32">
            <title>
               <p>PROSITE: a documented database using patterns and profiles as motif descriptors</p>
            </title>
            <aug>
               <au>
                  <snm>Sigrist</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Gattiker</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Falquet</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>3</issue>
            <fpage>265</fpage>
            <lpage>274</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bib/3.3.265</pubid>
                  <pubid idtype="pmpid" link="fulltext">12230035</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>UniProt Knowledgebase Database</p>
            </title>
            <url>ftp://us.expasy.org/databases/uniprot/</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>What are decision trees?</p>
            </title>
            <aug>
               <au>
                  <snm>Kingsford</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Salzberg</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2008</pubdate>
            <volume>26</volume>
            <issue>9</issue>
            <fpage>1011</fpage>
            <lpage>1013</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt0908-1011</pubid>
                  <pubid idtype="pmpid" link="fulltext">18779814</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>EFICAz<sup>2</sup> webservice</p>
            </title>
            <url>http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html</url>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Prediction of enzyme function by combining sequence similarity and protein interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Espadaler</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Eswar</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Querol</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Avil&#233;s</snm>
                  <fnm>FX</fnm>
               </au>
               <au>
                  <snm>Sali</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Marti-Renom</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Oliva</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>249</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2430716</pubid>
                  <pubid idtype="pmpid" link="fulltext">18505562</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-249</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Pfam Database</p>
            </title>
            <url>ftp://ftp.sanger.ac.uk/pub/databases/Pfam/</url>
         </bibl>
         <bibl id="B38">
            <title>
               <p>A method to predict functional residues in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Casari</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nat Struct Biol</source>
            <pubdate>1995</pubdate>
            <volume>2</volume>
            <issue>2</issue>
            <fpage>171</fpage>
            <lpage>178</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nsb0295-171</pubid>
                  <pubid idtype="pmpid">7749921</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Solving the protein sequence metric problem</p>
            </title>
            <aug>
               <au>
                  <snm>Atchley</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fernandes</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Dr&#252;ke</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <issue>18</issue>
            <fpage>6395</fpage>
            <lpage>6400</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1088356</pubid>
                  <pubid idtype="pmpid" link="fulltext">15851683</pubid>
                  <pubid idtype="doi">10.1073/pnas.0408677102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Application of support vector machines for T-cell epitopes prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Zhao</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Pinilla</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Valmori</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>15</issue>
            <fpage>1978</fpage>
            <lpage>1984</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg255</pubid>
                  <pubid idtype="pmpid" link="fulltext">14555632</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>LIBSVM: a library for support vector machines</p>
            </title>
            <url>http://www.csie.ntu.edu.tw/~cjlin/libsvm</url>
         </bibl>
         <bibl id="B42">
            <title>
               <p>R: A Language and Environment for Statistical Computing</p>
            </title>
            <aug>
               <au>
                  <cnm>R Development Core Team</cnm>
               </au>
            </aug>
            <publisher>Vienna, Austria: R Foundation for Statistical Computing</publisher>
            <pubdate>2008</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
