<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-284</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Correspondence</dochead>
      <bibl>
         <title>
            <p>Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Andorf</snm>
               <fnm>Carson</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>andorfc@iastate.edu</email>
            </au>
            <au id="A2">
               <snm>Dobbs</snm>
               <fnm>Drena</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <email>ddobbs@iastate.edu</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Honavar</snm>
               <fnm>Vasant</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <email>honavar@cs.iastate.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, Iowa, 50011, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa, 50011, USA</p>
            </ins>
            <ins id="I3">
               <p>Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa, 50011, USA</p>
            </ins>
            <ins id="I4">
               <p>Center for Computational Intelligence, Learning, and Discovery, Iowa State University, Ames, Iowa, 50011, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>284</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/284</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17683567</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-284</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>14</day>
               <month>12</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>03</day>
               <month>8</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>03</day>
               <month>8</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Andorf et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be <it>inconsistent </it>with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were <it>consistent </it>with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects.</p>
               <p>Editor's Note: Authors of the original publication (Okazaki et al.: <it>Nature </it>2002, <b>420</b>:563&#8211;73) have provided their response to Andorf et al., directly following the correspondence.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>As more genomic sequences become available, functional annotation of genes presents one of the most important challenges in bioinformatics. Because experimental determination of protein structure and function is expensive and time-consuming, there is an increasing reliance on automated approaches to assignment of Gene Ontology (GO) <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> functional categories to protein sequences. An advantage of such automated methods is that they can be used to annotate hundreds or thousands of proteins in a matter of minutes, which makes their use especially attractive &#8211; if not unavoidable &#8211; in large-scale genome-wide annotation efforts.</p>
         <p>Most automated approaches to protein function annotation rely on transfer of annotations from previously annotated proteins, based on sequence or structural similarity. Such annotations are susceptible to several sources of error, including errors in the original annotations from which new annotations are inferred, errors in the algorithms, bugs in the programs or scripts used to process the data, and clerical errors on the part of human curators. The effect of such errors can be magnified because they can propagate from one set of annotated sequences to another through widespread use of automated techniques for genome-wide functional annotation of proteins <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>. Once introduced, such errors can go undetected for a long time. Because biologists and computational biologists increasingly depend on reliable functional annotations for formulation of hypotheses, design of experiments, and interpretation of results, incorrect annotations can lead to wasted effort and erroneous conclusions. Computational approaches to checking automatically inferred annotations against independent sources of evidence and detecting potential annotation errors offer a potential solution to this problem <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>Previous work by several groups, including our own <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>, has demonstrated the usefulness of machine learning approaches for assigning putative functions to proteins based on their amino acid sequences. On the specific problem of predicting the catalytic activity of proteins from amino acid sequence, we showed that machine learning approaches outperform methods based on sequence homology <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. This is especially true when sequence identity among proteins with a specified function is below 10%; the accuracy of predictions by our HDTree classifier was 8%&#8211;16% better than that of PSI-BLAST <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The discriminatory power of machine learning approaches thus suggests they should be valuable for detecting potential annotation errors in functional genomics databases.</p>
         <p>Here we demonstrate that a machine learning approach, designed to predict GO functional classifications for proteins, can be used to identify and correct potential annotation errors. In this study, we focused on a small but clinically important subset of protein kinases, for which we "stumbled upon" potential annotation errors while evaluating the performance of protein function classification algorithms. We chose a set of protein kinases categorized under the GO class GO0004672, Protein Kinase Activity, which includes proteins with serine/threonine (Ser/Thr) kinase activity (GO0004674) and tyrosine (Tyr) kinase activity (GO0004713). Post-translational modification of proteins by phosphorylation plays an important regulatory role in virtually every signaling pathway in eukaryotic cells, modulating key biological processes associated with development and diseases including cancer, diabetes, hyperlipidemia and inflammation <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. One would naturally expect such well-studied and functionally significant families of protein kinases to be correctly annotated by genome-wide annotation efforts.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>The initial aim of our experiments was to evaluate the effectiveness of machine learning approaches to automate sequence-based classification of protein kinases into subfamilies. Because both the Ser/Thr and Tyr subfamilies contain highly divergent members, some of which share less than 10% sequence identity with other members, they offer a rigorous test case for evaluating the potential general utility of this approach. Previously, we developed HDTree <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, a two-stage approach that combines a classifier based on amino acid <it>k</it>-gram composition of a protein sequence, with a classifier that relies on transfer of annotation from PSI-BLAST hits (see Methods for details). A protein kinase classifier was trained on a set of 330 human protein kinases from the Ser/Thr protein kinase (GO0004674) and Tyr protein kinase (GO0004713) functional classes based on direct and indirect annotations assigned by AmiGO <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, a valuable and widely used tool for retrieving GO functional annotations of proteins. Performance of the classifier was evaluated, using 10-fold cross-validation, on two datasets: i) the dataset of 330 <it>human </it>protein kinases, and ii) a dataset of 244 <it>mouse </it>protein kinases drawn from the same GO functional classes. The initial datasets were not filtered based on evidence codes or sequence identity cutoffs.</p>
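         <p>The <it>k</it>-gram composition mentioned above can be illustrated with a minimal sketch. It assumes overlapping <it>k</it>-grams normalized to relative frequencies; the function name and the example sequence are hypothetical, and the features actually used by HDTree are those defined in Methods.</p>

```python
from collections import Counter


def kgram_composition(sequence, k=2):
    """Return the relative frequency of each overlapping k-gram in a sequence.

    Assumption: overlapping windows and frequency normalization; the paper's
    Methods section defines the actual feature representation.
    """
    seq = sequence.upper()
    # A sequence shorter than k (or a non-positive k) yields no k-grams.
    if k > len(seq) or 1 > k:
        return {}
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical four-residue fragment, used only to show the output shape.
features = kgram_composition("MKKL", k=2)
print(sorted(features.items()))
```

         <p>Each protein is thereby reduced to a fixed vocabulary of <it>k</it>-gram frequencies that does not depend on an alignment, which is why such a representation could remain informative when sequence identity is low.</p>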
         <p>Using the AmiGO annotations as reference, the resulting HDTree classifier correctly distinguished between Ser/Thr kinases and Tyr kinases in the human kinase dataset with an overall accuracy of 89.1% and a kappa coefficient of 0.76. In striking contrast, the accuracy of the classifier on the mouse kinase dataset was only 15.1%; the correlation between the GO functional categories predicted by the classifier and the AmiGO reference labels was an alarming -0.40: 72 of the 244 mouse kinases were classified as Ser/Thr kinases, 105 as Tyr kinases, and 67 as "dual specificity" kinases (belonging to both GO0004674 and GO0004713 classes) (see Table <tblr tid="T1">1</tblr>).</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Performance of classifiers trained on human kinases and evaluated on human versus mouse kinases in predicting AmiGO annotations. The performance measures accuracy, kappa coefficient, correlation coefficient, precision, and recall are reported for two of the HDTree classifiers. The first classifier was trained on 330 human kinases, and its performance is based on 10-fold cross-validation. The second classifier was trained on the 330 human kinases and tested on 244 mouse kinases. The annotations for the mouse and human kinases were obtained from AmiGO.</p>
            </caption>
            <tblbdy cols="12">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Correlation Coefficient</b>
                     </p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Precision</b>
                     </p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Recall</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>
                        <b>Classifier</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Accuracy</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Kappa Coefficient</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="12">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Human</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>89.1</p>
                  </c>
                  <c ca="center">
                     <p>0.76</p>
                  </c>
                  <c ca="center">
                     <p>0.82</p>
                  </c>
                  <c ca="center">
                     <p>0.86</p>
                  </c>
                  <c ca="center">
                     <p>0.30</p>
                  </c>
                  <c ca="center">
                     <p>0.97</p>
                  </c>
                  <c ca="center">
                     <p>1.00</p>
                  </c>
                  <c ca="center">
                     <p>0.15</p>
                  </c>
                  <c ca="center">
                     <p>0.95</p>
                  </c>
                  <c ca="center">
                     <p>0.74</p>
                  </c>
                  <c ca="center">
                     <p>0.71</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Mouse</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>15.1</p>
                  </c>
                  <c ca="center">
                     <p>-0.40</p>
                  </c>
                  <c ca="center">
                     <p>-0.40</p>
                  </c>
                  <c ca="center">
                     <p>-0.43</p>
                  </c>
                  <c ca="center">
                     <p>-0.01</p>
                  </c>
                  <c ca="center">
                     <p>0.17</p>
                  </c>
                  <c ca="center">
                     <p>0.11</p>
                  </c>
                  <c ca="center">
                     <p>0.25</p>
                  </c>
                  <c ca="center">
                     <p>0.41</p>
                  </c>
                  <c ca="center">
                     <p>0.07</p>
                  </c>
                  <c ca="center">
                     <p>0.01</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
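         <p>For readers unfamiliar with the per-class measures in Table 1, the following minimal sketch shows how precision and recall could be computed for each kinase class from paired reference and predicted labels. The function name and the six toy labels are hypothetical and are not data from this study.</p>

```python
def per_class_precision_recall(reference, predicted):
    """Return {class: (precision, recall)} for paired label lists."""
    scores = {}
    for cls in sorted(set(reference) | set(predicted)):
        # True positives: items of this class that were also predicted as it.
        true_pos = sum(1 for r, p in zip(reference, predicted)
                       if r == cls and p == cls)
        n_predicted = sum(1 for p in predicted if p == cls)
        n_reference = sum(1 for r in reference if r == cls)
        # By convention here, an empty denominator gives a score of 0.0.
        precision = true_pos / n_predicted if n_predicted else 0.0
        recall = true_pos / n_reference if n_reference else 0.0
        scores[cls] = (precision, recall)
    return scores


# Toy labels for illustration only.
reference = ["Ser/Thr", "Ser/Thr", "Ser/Thr", "Tyr", "Tyr", "Dual"]
predicted = ["Ser/Thr", "Ser/Thr", "Tyr", "Tyr", "Tyr", "Ser/Thr"]
for cls, (precision, recall) in per_class_precision_recall(reference, predicted).items():
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f}")
```

         <p>Precision is the fraction of proteins predicted to be in a class that carry that reference label, and recall is the fraction of proteins with that reference label that the classifier recovers.</p>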
         <p>Assuming the AmiGO annotations were correct, these results suggested that either this particular machine learning approach is extremely ineffective for classifying mouse protein kinases, or that human and mouse protein kinases have so little in common that a classifier trained on the human proteins is doomed to fail miserably on the mouse proteins. In light of the demonstrated effectiveness of machine learning approaches on a broad range of classification tasks that arise in bioinformatics <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, and the well-documented high degree of homology between human and mouse proteins <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, neither of these conclusions seemed warranted. Could this discrepancy be explained by the AmiGO annotations for mouse protein kinases? We proceeded to investigate this possibility.</p>
         <p>A comparison of the distribution of Ser/Thr, Tyr, and dual specificity kinases in mouse versus human (Figure <figr fid="F1">1a</figr>) reveals a striking discordance: based on AmiGO annotations, mouse has many more Tyr and dual specificity kinases than human and only 40% as many Ser/Thr protein kinases. In contrast, as explained below, the fractions of Ser/Thr, Tyr, and dual specificity kinases based on UniProt annotations are very similar in mouse and human (Figure <figr fid="F1">1b</figr>). Furthermore, the predictions of our two-stage machine learning algorithm are in good agreement with the UniProt annotations for both human and mouse protein kinases (Figures <figr fid="F1">1b</figr> and <figr fid="F1">1c</figr>, and Additional File <supplr sid="S9">9</supplr>).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Distribution of Ser/Thr, Tyr, and dual specificity kinases among annotated protein kinases in human versus mouse genomes [see Additional file <supplr sid="S9">9</supplr>]</p>
            </caption>
            <text>
               <p><b>Distribution of Ser/Thr, Tyr, and dual specificity kinases among annotated protein kinases in human versus mouse genomes </b>[see Additional file <supplr sid="S9">9</supplr><b>]</b>. Pie charts illustrate the functional family distribution of protein kinases in human (top) versus mouse (bottom), based on: <b>a. AmiGO functional classifications</b>: Ser/Thr (GO0004674) [Blue]; Tyr (GO0004713) [Red] or "dual specificity" (proteins with both GO classifications) [Yellow]. <b>b. UniProt annotations</b>: classification based on UniProt records containing the key words Ser/Thr [Blue], Tyr [Red], or dual specificity [Yellow] [see Additional file <supplr sid="S2">2</supplr>]. <b>c. Predicted annotations by the HDTree classifier</b>: The classifier was built on human proteins with functional labels Ser/Thr (GO0004674) [Blue], Tyr (GO0004713) [Red] or "dual specificity" [Yellow] derived from AmiGO and verified by UniProt [see Additional file <supplr sid="S4">4</supplr>].</p>
            </text>
            <graphic file="1471-2105-8-284-1"/>
         </fig>
         <p>Examination of the GO evidence codes for the mouse protein kinases revealed that 211 of 244 mouse protein kinases included the evidence code "RCA" ("inferred from reviewed computational analysis") [see Additional file <supplr sid="S1">1</supplr>], indicating that these annotations had been assigned using computational tools and reviewed by a human curator before being deposited in the database used by AmiGO. Notably, 28 of 33 (85%) mouse protein kinases with an evidence code other than RCA (e.g., "inferred from direct assay") were assigned "correct" labels, relative to the AmiGO reference, by the classifier trained on the human protein kinase data. Each of the 211 proteins with the RCA evidence code had at least one annotation that could be traced to the FANTOM Consortium and RIKEN Genome Exploration Research Group <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, a source of protein function annotations in the Mouse Genome Database (MGD) <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. To further examine each of these 211 mouse protein kinases, we used the gene IDs obtained from AmiGO to extract information about each protein from UniProt <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. We searched the UniProt records for mention of "Serine/Threonine" or "Tyrosine" (or their synonyms) in fields for protein name, synonyms, references, similarity, keywords, or function, and created a dataset in which each protein kinase had one of the corresponding UniProt labels: "Ser/Thr kinase," "Tyr kinase," or "dual specificity kinase" if both keywords were found. Results of our comparison of UniProt labels with AmiGO annotations for each class in this dataset of 211 mouse protein kinases are shown in Figure <figr fid="F2">2a</figr>: for 201 of the 211 cases with an RCA evidence code, the UniProt and AmiGO labels were inconsistent. The full breakdown by class is given in Table <tblr tid="T2">2</tblr> [see Additional files <supplr sid="S2">2</supplr> and <supplr sid="S3">3</supplr>].</p>
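         <p>The keyword-based labeling described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used in the study: the synonym patterns, the field names, the function name, and the example records are all hypothetical.</p>

```python
import re

# Assumed synonym patterns; the exact synonym list used in the study
# is not given in the text.
SER_THR = re.compile(r"serine\s*/\s*threonine|\bser\s*/\s*thr\b", re.IGNORECASE)
TYR = re.compile(r"\btyrosine\b|\btyr\b", re.IGNORECASE)


def uniprot_label(record):
    """Assign a kinase label from the free-text fields of one record.

    `record` maps field names (e.g. protein name, keywords, function)
    to strings. Returns None when neither keyword is present.
    """
    text = " ".join(record.values())
    has_ser_thr = bool(SER_THR.search(text))
    has_tyr = bool(TYR.search(text))
    if has_ser_thr and has_tyr:
        return "dual specificity kinase"
    if has_ser_thr:
        return "Ser/Thr kinase"
    if has_tyr:
        return "Tyr kinase"
    return None


# Hypothetical records, written only to exercise each branch.
print(uniprot_label({"name": "Serine/threonine-protein kinase (example)"}))
print(uniprot_label({"name": "Tyrosine-protein kinase (example)"}))
print(uniprot_label({"function": "Phosphorylates Ser/Thr and Tyr residues"}))
```

         <p>A keyword match of this kind is deliberately simple; a record that mentions both residue types in passing would be labeled dual specificity, so labels obtained this way would still merit manual review.</p>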
         <suppl id="S1">
            <title>
               <p>Additional file 1</p>
            </title>
            <text>
               <p><b>Supplementary Table 1</b>: Evidence Codes for AmiGO annotations. A table displaying the Evidence Codes for AmiGO annotations of the mouse protein kinases used in this study.</p>
            </text>
            <file name="1471-2105-8-284-S1.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional file 2</p>
            </title>
            <text>
               <p><b>Supplementary Table 2</b>: AmiGO annotations versus UniProt annotations (with UniProt Evidence). A table comparing the annotations found in the AmiGO server with the annotations found in UniProt.</p>
            </text>
            <file name="1471-2105-8-284-S2.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional file 3</p>
            </title>
            <text>
               <p><b>Supplementary Table 3</b>: AmiGO labels, UniProt labels, and Predicted Labels for each mouse kinase protein. A table comparing the predicted annotations from our three machine learning classifiers with the annotations of AmiGO and UniProt.</p>
            </text>
            <file name="1471-2105-8-284-S3.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Comparison of AmiGO and UniProt annotations for 211 mouse protein kinases with RCA Evidence code. Each of the 211 mouse kinase proteins with an RCA evidence code used in this study has both an AmiGO and a UniProt annotation. This table shows the number of proteins that have each of the nine possible combinations of AmiGO and UniProt annotations. Each row of the table represents one of the three possible UniProt labels and each column represents each of the three AmiGO annotations. Each entry of the table shows the number of proteins with the corresponding annotation. Note that all entries along the diagonal (in bold) show the number of proteins for which the AmiGO and UniProt annotations were in agreement. All other entries show the number of proteins where AmiGO and UniProt were in disagreement [see Additional files <supplr sid="S2">2</supplr> and <supplr sid="S3">3</supplr>].</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="center">
                     <p>
                        <b>KINASE FAMILY</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>AmiGO Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>AmiGO Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>AmiGO Dual specificity</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>UniProt Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>10</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>105</p>
                  </c>
                  <c ca="center">
                     <p>35</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>UniProt Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>54</p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>0</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>3</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>UniProt Dual specificity</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>0</b>
                     </p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
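         <p>As a quick check of the counts in Table 2, the sketch below tallies agreement between the two label sources and computes Cohen's kappa from the same matrix. The counts are copied from Table 2; the function name is hypothetical, and the kappa for AmiGO versus UniProt labels is a derived quantity that is not reported in the text.</p>

```python
# Rows: UniProt label; columns: AmiGO label.
# Order in both: Ser/Thr, Tyr, dual specificity (counts from Table 2).
TABLE_2 = [
    [10, 105, 35],
    [54, 0, 3],
    [0, 4, 0],
]


def agreement_and_kappa(matrix):
    """Return (agreeing count, total count, Cohen's kappa) for a square matrix."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(matrix)))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    observed = agree / total
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return agree, total, kappa


agree, total, kappa = agreement_and_kappa(TABLE_2)
print(f"{agree} of {total} labels agree; {total - agree} disagree")
print(f"Cohen's kappa (AmiGO vs. UniProt): {kappa:.2f}")
```

         <p>The diagonal sums to 10 and the off-diagonal entries to 201, matching the counts given in the text. A negative kappa indicates agreement below what the marginal label frequencies alone would produce by chance.</p>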
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Comparison of UniProt annotations of mouse protein kinase sequences with annotations from AmiGO or predicted by HDTree</p>
            </caption>
            <text>
               <p><b>Comparison of UniProt annotations of mouse protein kinase sequences with annotations from AmiGO or predicted by HDTree</b>. The bar charts illustrate the number of proteins that were in agreement (blue)/disagreement (red) with the annotations found in UniProt. Proteins that belong to each of the three functional classes found in the UniProt records are represented by two bars. The blue bar represents the number of proteins in which UniProt and the given method share the same annotation (<it>agreement</it>) for that function. The red bar represents the number of proteins in which UniProt and the given method have different annotations (<it>disagreement</it>) for that function. <b>a</b>. AmiGO vs. UniProt annotations <b>b</b>. HDTree predictions vs. UniProt annotations [see Additional files <supplr sid="S3">3</supplr> and <supplr sid="S4">4</supplr>].</p>
            </text>
            <graphic file="1471-2105-8-284-2"/>
         </fig>
         <p>This result led us to test the ability of the HDTree classifier trained on the human kinase dataset to correctly predict the family classifications for proteins in the mouse kinase dataset, this time using UniProt instead of AmiGO annotations as the "correct" reference labels. Strikingly, the classifier (trained on the human kinase dataset) achieved a classification accuracy of 97.2%, with a kappa coefficient of 0.93, on the mouse kinase dataset. As illustrated in Figure <figr fid="F2">2b</figr>, the classifier correctly classified 205 out of the 211 mouse kinases into Ser/Thr, Tyr or dual specificity classes, compared with 10 out of 211 for AmiGO. A direct comparison of classifiers based on UniProt annotations and AmiGO annotations can be seen in Table <tblr tid="T3">3</tblr>. This performance actually exceeded that of the same classifier tested on the human kinase dataset, for which an overall classification accuracy of 89.1%, with a kappa coefficient of 0.76, was obtained [see Table <tblr tid="T1">1</tblr> and Additional file <supplr sid="S4">4</supplr>].</p>
         <suppl id="S4">
            <title>
               <p>Additional file 4</p>
            </title>
            <text>
               <p><b>Supplementary Data</b>: Machine learning approaches to predict Gene Ontology and/or UniProt Functional labels. The data provided represent the results and performance of all the machine learning approaches used in this study.</p>
            </text>
            <file name="1471-2105-8-284-S4.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <tbl id="T3">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>Comparison of performance of classifiers based on AmiGO annotations and UniProt annotations. The performance measures accuracy, kappa coefficient, correlation coefficient, precision, and recall are reported for two of the HDTree classifiers. Both classifiers were trained on 330 human kinases and tested on 211 mouse kinases with RCA evidence codes in AmiGO. The first classifier was trained and tested with annotations provided by UniProt and the second classifier used annotations obtained from AmiGO.</p>
            </caption>
            <tblbdy cols="12">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Correlation Coefficient</b>
                     </p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Precision</b>
                     </p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>
                        <b>Recall</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c cspan="9">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Classifier</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Accuracy</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Kappa Coefficient</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Ser/Thr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Tyr</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Dual</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="12">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>UniProt</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>97.1</p>
                  </c>
                  <c ca="center">
                     <p>0.93</p>
                  </c>
                  <c ca="center">
                     <p>0.98</p>
                  </c>
                  <c ca="center">
                     <p>0.94</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.97</p>
                  </c>
                  <c ca="center">
                     <p>0.97</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.99</p>
                  </c>
                  <c ca="center">
                     <p>1.00</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>AmiGO</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>4.2</p>
                  </c>
                  <c ca="center">
                     <p>-0.37</p>
                  </c>
                  <c ca="center">
                     <p>-0.64</p>
                  </c>
                  <c ca="center">
                     <p>-0.85</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.06</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.14</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
                  <c ca="center">
                     <p>0.00</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
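         <p>For readers who wish to reproduce summary measures of the kind reported in Table 3, the following Python sketch computes accuracy and a kappa coefficient from a confusion matrix. It assumes that the kappa coefficient is the standard Cohen's kappa; the function name and the counts are hypothetical and are not the data underlying Table 3.</p>
         <p>
```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] counts items whose reference label is class i and
    whose predicted label is class j.
    """
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of items on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for illustration only (rows: reference, columns: predicted).
example = [
    [40, 10],
    [5, 45],
]
acc, kappa = accuracy_and_kappa(example)
print(round(acc, 3), round(kappa, 3))
```
         </p>
         <p>A kappa near zero (or below zero, as in the AmiGO row of Table 3) indicates agreement no better than (or worse than) that expected by chance.</p>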
         <p>The HDTree method uses a decision tree built from the outputs of eight individual classifiers. A decision tree is built by selecting, in a greedy fashion, the individual classifier that provides the maximum information about the class label at each step <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. By examining the decision tree, it is easy to identify the individual classifiers that have the greatest influence on the classification. In the case of the kinase datasets used in this study, the classifiers constructed by the NB(k) algorithms using trimers and quadmers, NB(3) and NB(4), were found to provide the most information regarding class labels. This suggests that the biological "signals" detected by these classifiers are groups of 3&#8211;4 residues, not necessarily contiguous in the primary amino acid sequence, but often in close proximity or interacting within three-dimensional structures to form functional sites (e.g., catalytic sites, binding sites), an idea supported by the results of our previous work <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Notably, the NB(3) and NB(4) classifiers appear to contribute more than PSI-BLAST to the ability to distinguish proteins with very closely related enzymatic activities. The PSI-BLAST results influenced the final classification, however, when the NB(3) and NB(4) classifiers disagreed on the classification.</p>
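         <p>The greedy selection step can be illustrated with a short Python sketch. It ranks base classifiers by plain information gain, which is a simplifying assumption (C4.5 uses the related gain ratio criterion by default); the function names and the toy votes are hypothetical and are not taken from the kinase datasets.</p>
         <p>
```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_classifier(outputs, labels):
    """Name of the base classifier whose outputs are most informative."""
    return max(outputs, key=lambda name: information_gain(outputs[name], labels))


# Toy data: binary votes of three hypothetical base classifiers on six sequences.
labels = [1, 1, 1, 0, 0, 0]
outputs = {
    "NB(3)": [1, 1, 1, 0, 0, 0],
    "PSI-BLAST": [1, 1, 0, 0, 0, 1],
    "NB": [1, 0, 1, 0, 1, 0],
}
print(best_classifier(outputs, labels))
```
         </p>
         <p>The classifier chosen at the root of the tree is the one with the largest gain; the same criterion is then applied recursively to each branch.</p>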
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Examination of the Mouse Kinome Database <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> reveals that the majority of annotated mouse kinases have a human ortholog with sequence identity > 90% [see Additional files <supplr sid="S5">5</supplr> and <supplr sid="S6">6</supplr>]. The results summarized in Figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr>, together with the assumption that the relative proportions of Ser/Thr, Tyr and dual specificity kinases should not be significantly different in human and mouse, led us to conclude that UniProt-derived annotations are more likely to be correct than those returned by AmiGO for this group of mouse protein kinases with the RCA evidence code. We have shared our findings with the Mouse Genome Database <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, which is in the process of identifying and rectifying the source of potential problems with these annotations.</p>
         <suppl id="S5">
            <title>
               <p>Additional file 5</p>
            </title>
            <text>
               <p><b>Supplementary Table 4</b>: Mouse kinases having a human ortholog. A table displaying the human orthologs for the mouse kinases used in this study. The table also displays the identity between these orthologs.</p>
            </text>
            <file name="1471-2105-8-284-S5.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S6">
            <title>
               <p>Additional file 6</p>
            </title>
            <text>
               <p><b>Supplementary Table 5</b>: Number of mouse kinases having a specified level of sequence identity with their human orthologs. A table displaying the summary statistics of Supplementary Table <tblr tid="T4">4</tblr>.</p>
            </text>
            <file name="1471-2105-8-284-S6.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>Identifying potential annotation errors in a specific dataset such as the mouse kinase dataset solves only a part of a larger problem. Because annotation errors can propagate across multiple databases through the widespread &#8211; and often necessary &#8211; use of information derived from available annotations, it is important to track and correct errors in other databases that rely on the erroneous source. For example, using AmiGO, we retrieved 136 rat protein kinases for which annotations had been transferred from mouse protein kinases based on homology (indicated by the evidence code "ISS," 'inferred from sequence or structural similarity') with one of the 201 erroneously annotated mouse protein kinases. Examination of the UniProt records for these 136 rat protein kinases revealed that 94 of those labeled as "Ser/Thr" kinases by UniProt had AmiGO annotations of "Tyr" or "dual specificity" kinase, and 42 of those labeled as "Tyr" kinases by UniProt had AmiGO annotations of "Ser/Thr" or "dual specificity" kinase [see Additional files <supplr sid="S7">7</supplr> and <supplr sid="S8">8</supplr>].</p>
         <suppl id="S7">
            <title>
               <p>Additional file 7</p>
            </title>
            <text>
               <p><b>Supplementary Note</b>. Because there is only a non-curated reference to the work done on "Rat ISS GO annotations from MGI's mouse gene data," we provide the abstract and a link to the original reference report in this file.</p>
            </text>
            <file name="1471-2105-8-284-S7.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S8">
            <title>
               <p>Additional file 8</p>
            </title>
            <text>
               <p><b>Supplementary Table 6</b>: The UniProt and AmiGO annotations for the rat kinase proteins with mouse orthologs. This table displays the UniProt and AmiGO annotations for rat kinase proteins that were annotated based on a mouse ortholog.</p>
            </text>
            <file name="1471-2105-8-284-S8.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>A recent study found that GO annotations with the ISS (inferred from sequence or structural similarity) evidence code could have error rates as high as 49% <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. This argues for the development and large-scale application of a suite of computational tools for identifying and flagging potentially erroneous annotations in functional genomics databases. Our results suggest the utility of including machine learning methods among such a suite of tools. Large-scale application of machine learning tools to protein annotation has to overcome several challenges. Because many proteins are multi-functional, classifiers should be able to assign a sequence to multiple, not mutually exclusive, classes (the <it>multi-label </it>classification problem), or more generally, to a subset of nodes in a directed acyclic graph, e.g., the GO hierarchy (the <it>structured label </it>classification problem). Fortunately, a number of research groups have developed machine learning algorithms for multi-label and structured label classification and demonstrated their application in large-scale protein function classification <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>. We can draw on recent advances in machine learning methods for hierarchical multi-label classification of large sequence datasets to adapt our method to work in such a setting. For example, a binary classifier can be trained to determine membership of a given sequence in the class represented by each node of the GO hierarchy, starting with the root node (to which trivially the entire dataset is assigned). Binary classifiers at each node in the hierarchy can then be trained recursively, focusing on the dataset passed to that node from its parent(s) in the GO hierarchy.</p>
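         <p>One possible reading of this top-down scheme is sketched below in Python. The sketch covers only the preparation of per-node training sets and assumes that a sequence is passed to a node when it is annotated with at least one of that node's parents; the function name, node names and annotations are hypothetical, and no classifier is trained.</p>
         <p>
```python
def node_training_sets(parents, annotations):
    """Build (positives, negatives) training sets for every non-root node.

    parents: dict mapping each node to the list of its parent nodes
             (the root maps to an empty list).
    annotations: dict mapping each sequence id to the set of nodes it
                 belongs to, already closed under the ancestor relation.
    """
    training = {}
    for node, node_parents in parents.items():
        if not node_parents:
            continue  # the root trivially contains every sequence
        # Sequences passed down to this node: members of at least one parent.
        passed = {
            seq for seq, nodes in annotations.items()
            if any(p in nodes for p in node_parents)
        }
        positives = {seq for seq in passed if node in annotations[seq]}
        training[node] = (positives, passed - positives)
    return training


# Hypothetical three-node fragment of a hierarchy; labels are illustrative.
parents = {
    "kinase": [],
    "Ser/Thr kinase": ["kinase"],
    "Tyr kinase": ["kinase"],
}
annotations = {
    "seqA": {"kinase", "Ser/Thr kinase"},
    "seqB": {"kinase", "Tyr kinase"},
    "seqC": {"kinase", "Ser/Thr kinase", "Tyr kinase"},
}
sets = node_training_sets(parents, annotations)
print(sorted(sets["Tyr kinase"][0]), sorted(sets["Tyr kinase"][1]))
```
         </p>
         <p>Because a sequence may be a positive example at several sibling nodes (seqC above), this construction accommodates multi-label annotations without further changes.</p>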
         <p>In this study, we have limited our attention to <it>sequence-based </it>machine learning methods for annotation of protein sequences. With the increasing availability of other types of data (protein structure, gene expression profiles, etc.), there is a growing interest in machine learning and other computational methods for genome-wide prediction of protein function using diverse types of information <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>. Such techniques can be applied in a manner similar to our use of sequence-based machine learning to identify potentially erroneous annotations in existing databases.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The increasing reliance on automated tools in genome-wide functional annotation of proteins has led to a corresponding increase in the risk of propagation of annotation errors across genome databases. Short of direct experimental validation of every annotation, it is impossible to ensure that the annotations are accurate. The results presented here and in recent related studies <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp> underscore the need for checking the consistency of annotations against multiple sources of information and carefully exploring the sources of any detected inconsistencies. Addressing this problem requires the use of machine readable metadata that capture precise descriptions of all data sources, data provenance, background assumptions, and algorithms used to infer the derived information. There is also a need for computational tools that can detect annotation inconsistencies and alert data sources and their users regarding potential errors. Expertly curated databases such as the Mouse Genome Database are indispensable for research in functional genomics and systems biology, and it is important to emphasize that several measures for finding and correcting inconsistent annotations are already in place at MGD <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The present study suggests that additional measures, especially in the case of protein annotations with RCA evidence code, can further increase the reliability of these valuable resources.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Classification Strategy</p>
            </st>
            <p>We constructed an HDTree binary classifier, described below, for each of the three kinase families. The first two kinase families correspond to the GO labels GO0004674 (Ser/Thr kinases) or GO0004713 (Tyr kinases) but not both; the third family corresponds to dual-specificity kinases that belong to both GO0004674 and GO0004713. Classifier #1 distinguishes between Ser/Thr kinases and the rest (Tyr and dual-specificity kinases). Similarly, classifier #2 distinguishes between Tyr kinases and the rest (Ser/Thr and dual-specificity kinases). Classifier #3 distinguishes dual-specificity kinases from the rest (those with only Ser/Thr or Tyr activity), based on the predictions generated by classifier #1 and classifier #2 as follows: If only classifier #1 generates a positive prediction, the corresponding sequence is classified as (exclusively) a Ser/Thr kinase. If only classifier #2 generates a positive prediction, the corresponding sequence is classified as (exclusively) a Tyr kinase. If both classifiers generate a positive prediction or if both classifiers generate a negative prediction, the corresponding sequence is classified as a dual-specificity kinase. We interpret this disagreement between the classifiers as evidence that the protein is neither exclusively a Ser/Thr kinase nor exclusively a Tyr kinase, and hence is likely to have dual specificity. More sophisticated evidence combination methods could be used instead. However, this simple technique worked sufficiently well in the case of this dataset (see Table <tblr tid="T4">4</tblr>).</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Classification schema for Classifier #3 (Method for predicting dual specificity kinases). HDTree Classifier #3 uses the outputs from HDTree Classifier #1 and HDTree Classifier #2 to distinguish between dual-specificity kinases, Ser/Thr kinases, and Tyr kinases. There are four possible labelings from the binary classifiers #1 and #2. 'Yes' or 'No' votes from Classifier #1 correspond to predictions of Ser/Thr or Tyr labels, respectively, for the protein. 'Yes' or 'No' votes from Classifier #2 correspond to predictions of Tyr or Ser/Thr labels, respectively. When both classifiers predict the protein to be Ser/Thr (that is, Classifier #1 votes 'Yes' and Classifier #2 votes 'No'), Classifier #3 labels the protein as "exclusively Ser/Thr" (and hence, not Tyr). Similarly, when both classifiers predict the protein to be Tyr, Classifier #3 labels the protein as "exclusively Tyr" (and hence not Ser/Thr). When both classifiers vote 'Yes' or when both vote 'No,' Classifier #3 labels the protein as having "Dual" catalytic activity. See Methods section for details on each classifier.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>Prediction of classifier #1 (Ser/Thr)</p>
                     </c>
                     <c ca="center">
                        <p>Prediction of classifier #2 (Tyr)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>New Prediction of classifier #3 (Dual, Ser/Thr, Tyr)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Dual</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>exclusively Ser/Thr</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>exclusively Tyr</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Dual</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
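            <p>The combination rule of Table 4 is simple enough to state as a few lines of Python. The sketch below is an illustration only; the function name is hypothetical, and the two Boolean arguments stand for the 'Yes'/'No' votes of classifier #1 and classifier #2.</p>
            <p>
```python
def combine_predictions(is_ser_thr, is_tyr):
    """Combine the two binary votes according to the rule in Table 4."""
    if is_ser_thr and not is_tyr:
        return "exclusively Ser/Thr"
    if is_tyr and not is_ser_thr:
        return "exclusively Tyr"
    # Both 'Yes' or both 'No': neither exclusive label is supported.
    return "Dual"


# Enumerate the four possible vote combinations.
for vote1 in (True, False):
    for vote2 in (True, False):
        print(vote1, vote2, combine_predictions(vote1, vote2))
```
            </p>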
         </sec>
         <sec>
            <st>
               <p>HDTree Method</p>
            </st>
            <p>As noted above, an HDTree binary classifier <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> is constructed for each of the three kinase families. Each HDTree binary classifier is a decision tree classifier that assigns a class label to a target sequence based on the binary class labels output by the Na&#239;ve Bayes, NB k-gram, NB(k), and PSI-BLAST classifiers for the corresponding kinase families. Because there are eight classifiers (Na&#239;ve Bayes, NB 2-gram, NB 3-gram, NB 4-gram, NB(2), NB(3), NB(4), and PSI-BLAST), the input to an HDTree binary classifier for each kinase family consists of an 8-tuple of class labels assigned to the sequence by the corresponding eight classifiers. The output of the HDTree classifier for kinase family <it>c </it>is a binary class label (1 if the predicted class is <it>c</it>; 0 otherwise). Thus, each HDTree classifier is a decision tree classifier that is trained to predict the binary class label of a query sequence based on the 8-tuple of class labels predicted by the eight individual classifiers. Because HDTree is a decision tree, it is easy to determine which individual classifier(s) provided the most information with regard to the predicted class label: in the resulting tree, nodes near the top of the tree provide the most information about the class label. Thus, HDTree can also facilitate identification of the determinative biological sequence signals. We used the Weka version 3.4.4 implementation <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> (J4.8) of the C4.5 decision tree learning algorithm <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>.</p>
            <p>Below, we describe a class of probabilistic models for sequence classification.</p>
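            <p>To make the input representation concrete, the following Python sketch assembles the 8-tuple of binary labels for one sequence. It is an illustration under stated assumptions: the names BASE_CLASSIFIERS, meta_features and stub_predict are hypothetical, and the stub predictors merely stand in for the trained base classifiers.</p>
            <p>
```python
BASE_CLASSIFIERS = (
    "NB", "NB 2-gram", "NB 3-gram", "NB 4-gram",
    "NB(2)", "NB(3)", "NB(4)", "PSI-BLAST",
)


def meta_features(sequence, base_predict):
    """Return the 8-tuple of binary labels that the decision tree receives.

    base_predict maps each base classifier name to a function that takes
    a sequence and returns 1 (in family c) or 0 (not in family c).
    """
    return tuple(base_predict[name](sequence) for name in BASE_CLASSIFIERS)


# Placeholder predictors: real ones would be the trained base classifiers.
stub_predict = {name: (lambda seq: int("Y" in seq)) for name in BASE_CLASSIFIERS}
stub_predict["PSI-BLAST"] = lambda seq: 0

print(meta_features("MKYLV", stub_predict))
```
            </p>
            <p>Tuples of this form, paired with the reference label for family <it>c</it>, would constitute the training set for the decision tree learner.</p>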
         </sec>
         <sec>
            <st>
               <p>Classification Using a Probabilistic Model</p>
            </st>
            <p>We start by introducing the general procedure for building a classifier from a probabilistic generative model.</p>
            <p>Suppose we can specify a probabilistic model <it>&#945; </it>for sequences defined over some alphabet &#931; (which in our case is the 20-letter amino acid alphabet). The model <it>&#945; </it>specifies for any sequence <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula> = <it>s</it><sub>1</sub>, ..., <it>s</it><sub><it>n</it></sub>, the probability <it>P</it><sub><it>&#945;</it></sub>(<inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula> = <it>s</it><sub>1</sub>, ..., <it>s</it><sub><it>n</it></sub>) of generating the sequence <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula>. Suppose we assume that sequences belonging to class <it>c</it><sub><it>j </it></sub>are generated by the probabilistic generative model <it>&#945; </it>(<it>c</it><sub><it>j</it></sub>).</p>
            <p>Then, <inline-formula><m:math name="1471-2105-8-284-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>&#945;</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover><m:mo>=</m:mo><m:msub><m:mi>s</m:mi><m:mn>1</m:mn></m:msub><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:msub><m:mi>s</m:mi><m:mi>n</m:mi></m:msub><m:mo>|</m:mo><m:msub><m:mi>c</m:mi><m:mi>j</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:msub><m:mi>P</m:mi><m:mrow><m:mi>&#945;</m:mi><m:mo stretchy="false">(</m:mo><m:msub><m:mi>c</m:mi><m:mi>j</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:msub><m:mo stretchy="false">(</m:mo><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover><m:mo>=</m:mo><m:msub><m:mi>s</m:mi><m:mn>1</m:mn></m:msub><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:msub><m:mi>s</m:mi><m:mi>n</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaacciGae8xSdegabeaakiabcIcaOmaanaaabaGaem4uamfaaiabg2da9iabdohaZnaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4Cam3aaSbaaSqaaiabd6gaUbqabaGccqGG8baFcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabg2da9iabdcfaqnaaBaaaleaacqWFXoqycqGGOaakcqWGJbWydaWgaaadbaGaemOAaOgabeaaliabcMcaPaqabaGccqGGOaakdaqdaaqaaiabdofatbaacqGH9aqpcqWGZbWCdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdohaZnaaBaaaleaacqWGUbGBaeqaaOGaeiykaKcaaa@58AE@</m:annotation></m:semantics></m:math></inline-formula> is the probability of <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula> given that the class is <it>c</it><sub><it>j</it></sub>. Therefore, given the probabilistic generative model for each of the classes in <it>C </it>(the set of possible mutually exclusive class labels) for sequences over the alphabet &#931;, we can compute the most likely class label <it>c</it>(<inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula>) for any given sequence <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula> = <it>s</it><sub>1</sub>, ..., <it>s</it><sub><it>n </it></sub>as follows: <inline-formula><m:math name="1471-2105-8-284-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>c</m:mi><m:mo stretchy="false">(</m:mo><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mi>arg</m:mi><m:mo>&#8289;</m:mo><m:munder><m:mrow><m:mi>max</m:mi><m:mo>&#8289;</m:mo></m:mrow><m:mrow><m:msub><m:mi>c</m:mi><m:mi>j</m:mi></m:msub><m:mo>&#8712;</m:mo><m:mi>C</m:mi></m:mrow></m:munder><m:msub><m:mi>P</m:mi><m:mi>&#945;</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover><m:mo>=</m:mo><m:msub><m:mi>s</m:mi><m:mn>1</m:mn></m:msub><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:msub><m:mi>s</m:mi><m:mi>n</m:mi></m:msub><m:mo>|</m:mo><m:msub><m:mi>c</m:mi><m:mi>j</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:msub><m:mi>c</m:mi><m:mi>j</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGJbWycqGGOaakdaqdaaqaaiabdofatbaacqGGPaqkcqGH9aqpcyGGHbqycqGGYbGCcqGGNbWzdaWfqaqaaiGbc2gaTjabcggaHjabcIha4bWcbaGaem4yam2aaSbaaWqaaiabdQgaQbqabaWccqGHiiIZcqWGdbWqaeqaaOGaemiuaa1aaSbaaSqaaGGaciab=f7aHbqabaGccqGGOaakdaqdaaqaaiabdofatbaacqGH9aqpcqWGZbWCdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdohaZnaaBaaaleaacqWGUbGBaeqaaOGaeiiFaWNaem4yam2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqWGqbaucqGGOaakcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcMcaPaaa@5B08@</m:annotation></m:semantics></m:math></inline-formula>. Hence, the goal of a machine learning algorithm for sequence classification is to estimate the parameters that describe the corresponding probabilistic models from data. Different classifiers differ with regard to their ability to capture the dependencies among the elements of a sequence.</p>
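            <p>The decision rule above can be sketched in Python as follows. The sketch works with log-probabilities, which leaves the maximizing class unchanged and avoids numerical underflow on long sequences; the function names, the toy independent-residue model and all probabilities are hypothetical and serve only to illustrate the rule.</p>
            <p>
```python
from math import log


def classify(sequence, class_models, class_priors):
    """Return the class c_j maximising P(sequence | c_j) * P(c_j).

    class_models maps each class to a function returning the log-probability
    of a sequence under that class's generative model; class_priors maps each
    class to its prior probability P(c_j).
    """
    def log_posterior(c):
        return class_models[c](sequence) + log(class_priors[c])

    return max(class_priors, key=log_posterior)


def make_unigram_model(residue_probs, default=1e-3):
    """Toy generative model: residues are drawn independently."""
    def log_prob(sequence):
        return sum(log(residue_probs.get(s, default)) for s in sequence)
    return log_prob


# Hypothetical two-class example with made-up residue probabilities.
models = {
    "c1": make_unigram_model({"S": 0.6, "T": 0.3, "Y": 0.1}),
    "c2": make_unigram_model({"S": 0.1, "T": 0.1, "Y": 0.8}),
}
priors = {"c1": 0.5, "c2": 0.5}
print(classify("SSTY", models, priors), classify("YYSY", models, priors))
```
            </p>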
            <p>In what follows, we use the following notations.</p>
            <p><it>n </it>= |<inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula>| = the length of the sequence <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula></p>
            <p><it>k </it>= the size of the k-gram (k-mer) used in the model</p>
            <p><it>s</it><sub><it>i </it></sub>= the <it>i</it><sup><it>th </it></sup>element in the sequence <inline-formula><m:math name="1471-2105-8-284-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>S</m:mi><m:mo stretchy="true">&#175;</m:mo></m:mover></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqdaaqaaiabdofatbaaaaa@2DEC@</m:annotation></m:semantics></m:math></inline-formula></p>
            <p><it>c</it><sub><it>j </it></sub>= the <it>j</it><sup><it>th </it></sup>class in the class set <it>C</it></p>
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes Classifier</p>
            </st>
            <p>The Na&#239;ve Bayes classifier assumes that each element of the sequence is independent of the other elements given the class label. Consequently,</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-284-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>c</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mover accent="true">
                              <m:mi>S</m:mi>
                              <m:mo stretchy="true">&#175;</m:mo>
                           </m:mover>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mi>arg</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:munder>
                              <m:mrow>
                                 <m:mi>max</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo>&#8712;</m:mo>
                                 <m:mi>C</m:mi>
                              </m:mrow>
                           </m:munder>
                           <m:msub>
                              <m:mi>P</m:mi>
                              <m:mi>&#945;</m:mi>
                           </m:msub>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8719;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>n</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>&#945;</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mn>1</m:mn>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mo>&#8901;</m:mo>
                                 <m:mo>&#8901;</m:mo>
                                 <m:mo>&#8901;</m:mo>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>&#945;</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>n</m:mi>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>c</m:mi>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGJbWycqGGOaakdaqdaaqaaiabdofatbaacqGGPaqkcqGH9aqpcyGGHbqycqGGYbGCcqGGNbWzdaWfqaqaaiGbc2gaTjabcggaHjabcIha4bWcbaGaem4yam2aaSbaaWqaaiabdQgaQbqabaWccqGHiiIZcqWGdbWqaeqaaOGaemiuaa1aaSbaaSqaaGGaciab=f7aHbqabaGcdaqeWbqaaiabdcfaqnaaBaaaleaacqWFXoqyaeqaaOGaeiikaGIaem4Cam3aaSbaaSqaaiabigdaXaqabaGccqGG8baFcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabgwSixlabgwSixlabgwSixlabdcfaqnaaBaaaleaacqWFXoqyaeqaaOGaeiikaGIaem4Cam3aaSbaaSqaaiabd6gaUbqabaGccqGG8baFcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOBa4ganiabg+GivdGccqWGqbaucqGGOaakcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcMcaPaaa@6E2B@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Note that the Na&#239;ve Bayes classifier for sequences treats each sequence as though it were simply a <it>bag </it>of letters. We now consider two Na&#239;ve Bayes-like models based on <it>k</it>-grams.</p>
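            <p>The decision rule above can be illustrated with a minimal sketch. The function names, the toy two-letter alphabet, and the use of add-one smoothing at this point are assumptions made for illustration, not the implementation used in this study. Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on long sequences.</p>

```python
import math
from collections import Counter

def train_naive_bayes(sequences, labels, alphabet):
    """Estimate log P(c_j) and log P(s | c_j) with add-one smoothing (assumed here)."""
    class_counts = Counter(labels)
    letter_counts = {c: Counter() for c in class_counts}
    for seq, label in zip(sequences, labels):
        letter_counts[label].update(seq)  # bag of letters: order is ignored
    total = len(labels)
    log_prior = {c: math.log(class_counts[c] / total) for c in class_counts}
    log_cond = {}
    for c, counts in letter_counts.items():
        denom = sum(counts[a] for a in alphabet) + len(alphabet)
        log_cond[c] = {a: math.log((counts[a] + 1) / denom) for a in alphabet}
    return log_prior, log_cond

def classify(seq, log_prior, log_cond):
    """Return the class maximising log P(c_j) + sum_i log P(s_i | c_j)."""
    def score(c):
        return log_prior[c] + sum(log_cond[c][s] for s in seq)
    return max(log_prior, key=score)

# Toy data (hypothetical): class "x" is rich in A, class "y" is rich in B.
train_seqs = ["AAAB", "AABA", "BBBA", "BABB"]
train_labels = ["x", "x", "y", "y"]
log_prior, log_cond = train_naive_bayes(train_seqs, train_labels, "AB")
```

            <p>In this sketch every letter of the query must belong to the supplied alphabet; handling of ambiguous residues is left out.</p>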
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes <it>k</it>-grams Classifier</p>
            </st>
            <p>The Na&#239;ve Bayes <it>k</it>-grams (NB <it>k</it>-grams) <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B41">41</abbr></abbrgrp> method slides a window of size <it>k </it>along each sequence to generate a <it>bag </it>of <it>k</it>-grams representation of the sequence. As in the Na&#239;ve Bayes classifier described above, each <it>k</it>-gram in the bag is treated as independent of the others given the class label for the sequence. Given this probabilistic model, the standard method for classification using a probabilistic model can be applied. The probability model associated with Na&#239;ve Bayes <it>k</it>-grams is:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-284-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>P</m:mi>
                              <m:mi>&#945;</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mover accent="true">
                              <m:mi>S</m:mi>
                              <m:mo stretchy="true">&#175;</m:mo>
                           </m:mover>
                           <m:mo>=</m:mo>
                           <m:mo stretchy="false">[</m:mo>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mn>1</m:mn>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mn>1</m:mn>
                           </m:msub>
                           <m:mo>,</m:mo>
                           <m:mn>...</m:mn>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mi>n</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>n</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">]</m:mo>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mi>arg</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:munder>
                              <m:mrow>
                                 <m:mi>max</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo>&#8712;</m:mo>
                                 <m:mi>C</m:mi>
                              </m:mrow>
                           </m:munder>
                           <m:msub>
                              <m:mi>P</m:mi>
                              <m:mi>&#945;</m:mi>
                           </m:msub>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8719;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>n</m:mi>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mi>k</m:mi>
                                    <m:mo>+</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                              </m:munderover>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>&#945;</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>S</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>,</m:mo>
                                 <m:mn>...</m:mn>
                                 <m:mo>,</m:mo>
                                 <m:msub>
                                    <m:mi>S</m:mi>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>c</m:mi>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaacciGae8xSdegabeaakiabcIcaOmaanaaabaGaem4uamfaaiabg2da9iabcUfaBjabdofatnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0Jaem4Cam3aaSbaaSqaaiabigdaXaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGtbWudaWgaaWcbaGaemOBa4gabeaakiabg2da9iabdohaZnaaBaaaleaacqWGUbGBaeqaaOGaeiyxa0LaeiykaKIaeyypa0JagiyyaeMaeiOCaiNaei4zaC2aaCbeaeaacyGGTbqBcqGGHbqycqGG4baEaSqaaiabdogaJnaaBaaameaacqWGQbGAaeqaaSGaeyicI4Saem4qameabeaakiabdcfaqnaaBaaaleaacqWFXoqyaeqaaOWaaebCaeaacqWGqbaudaWgaaWcbaGae8xSdegabeaakiabcIcaOiabdofatnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0Jaem4Cam3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGtbWudaWgaaWcbaGaemyAaKMaey4kaSIaem4AaSMaeyOeI0IaeGymaedabeaakiabg2da9iabdohaZnaaBaaaleaacqWGPbqAcqGHRaWkcqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeiiFaWNaem4yam2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6gaUjabgkHiTiabdUgaRjabgUcaRiabigdaXaqdcqGHpis1aOGaemiuaaLaeiikaGIaem4yam2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkaaa@8D59@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>A problem with the NB <it>k</it>-grams approach is that successive <it>k</it>-grams extracted from a sequence share <it>k </it>&#8722; 1 elements. This grossly and systematically violates the independence assumption of Na&#239;ve Bayes.</p>
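            <p>The overlap between successive <it>k</it>-grams is easy to see in a short sketch. The helper name and the example peptide below are hypothetical and serve only to illustrate the sliding window.</p>

```python
def kgrams(seq, k):
    """Return the n - k + 1 overlapping k-grams of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical 5-residue sequence, k = 3.
grams = kgrams("MKVLA", 3)

# Each k-gram shares its last k - 1 letters with the start of the next one.
overlaps = [a[1:] == b[:-1] for a, b in zip(grams, grams[1:])]
```

            <p>For this example the window yields MKV, KVL and VLA, and every adjacent pair shares two letters, so the <it>k</it>-grams in the bag cannot be independent of one another.</p>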
         </sec>
         <sec>
            <st>
               <p>Na&#239;ve Bayes (<it>k</it>)</p>
            </st>
            <p>We introduce the Naive Bayes (<it>k</it>) or the NB(<it>k</it>) model <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B41">41</abbr></abbrgrp> to explicitly model the dependencies that arise as a consequence of the overlap between successive <it>k</it>-grams in a sequence. We represent the dependencies in a graphical form by drawing edges between the elements that are directly dependent on each other.</p>
            <p>Using the Junction Tree Theorem for graphical models <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, it can be proved <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> that the correct probability model <it>&#945; </it>that captures the dependencies among overlapping <it>k</it>-grams is given by:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-284-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>P</m:mi>
                              <m:mi>&#945;</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mover accent="true">
                              <m:mi>S</m:mi>
                              <m:mo stretchy="true">&#175;</m:mo>
                           </m:mover>
                           <m:mo>=</m:mo>
                           <m:mo stretchy="false">[</m:mo>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mn>1</m:mn>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mn>1</m:mn>
                           </m:msub>
                           <m:mo>,</m:mo>
                           <m:mn>...</m:mn>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mi>n</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>s</m:mi>
                              <m:mi>n</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">]</m:mo>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8719;</m:mo>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mi>n</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mi>k</m:mi>
                                          <m:mo>+</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>P</m:mi>
                                          <m:mi>&#945;</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>,</m:mo>
                                       <m:mn>...</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8719;</m:mo>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>2</m:mn>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mi>n</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mi>k</m:mi>
                                          <m:mo>+</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>P</m:mi>
                                          <m:mi>&#945;</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>,</m:mo>
                                       <m:mn>...</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:msub>
                                          <m:mi>S</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>2</m:mn>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>2</m:mn>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaacciGae8xSdegabeaakiabcIcaOmaanaaabaGaem4uamfaaiabg2da9iabcUfaBjabdofatnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0Jaem4Cam3aaSbaaSqaaiabigdaXaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGtbWudaWgaaWcbaGaemOBa4gabeaakiabg2da9iabdohaZnaaBaaaleaacqWGUbGBaeqaaOGaeiyxa0LaeiykaKIaeyypa0ZaaSaaaeaadaqeWaqaaiabdcfaqnaaBaaaleaacqWFXoqyaeqaaOGaeiikaGIaem4uam1aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdofatnaaBaaaleaacqWGPbqAcqGHRaWkcqWGRbWAcqGHsislcqaIXaqmaeqaaOGaeyypa0Jaem4Cam3aaSbaaSqaaiabdMgaPjabgUcaRiabdUgaRjabgkHiTiabigdaXaqabaGccqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6gaUjabgkHiTiabdUgaRjabgUcaRiabigdaXaqdcqGHpis1aaGcbaWaaebmaeaacqWGqbaudaWgaaWcbaGae8xSdegabeaakiabcIcaOiabdofatnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0Jaem4Cam3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGtbWudaWgaaWcbaGaemyAaKMaey4kaSIaem4AaSMaeyOeI0IaeGOmaidabeaakiabg2da9iabdohaZnaaBaaaleaacqWGPbqAcqGHRaWkcqWGRbWAcqGHsislcqaIYaGmaeqaaOGaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIYaGmaeaacqWGUbGBcqGHsislcqWGRbWAcqGHRaWkcqaIXaqma0Gaey4dIunaaaaaaa@9BCD@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Given this probabilistic model, we can apply the standard approach to classification. When <it>k </it>= 1, both Na&#239;ve Bayes 1-grams and Na&#239;ve Bayes (1) reduce to the Na&#239;ve Bayes model.</p>
            <p>The probabilities required to specify the above models can be estimated from the training data using Laplace estimators <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>.</p>
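            <p>A minimal sketch of how an NB(<it>k</it>) score might be computed is given below. It assumes that the <it>k</it>-gram and (<it>k </it>&#8722; 1)-gram probabilities are estimated separately for each class with add-one (Laplace) counts, and that the class prior is the class frequency in the training set; the function names and the toy data are hypothetical and are not taken from the software used in this study. The score is the log of the ratio in the NB(<it>k</it>) model plus the log prior.</p>

```python
import math
from collections import Counter

def kgrams(seq, k):
    """Overlapping k-grams of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train_nbk(sequences, labels, k):
    """Collect per-class k-gram and (k-1)-gram counts plus class priors."""
    model = {}
    for c in sorted(set(labels)):
        members = [s for s, l in zip(sequences, labels) if l == c]
        num = Counter(g for s in members for g in kgrams(s, k))
        den = Counter(g for s in members for g in kgrams(s, k - 1)) if k > 1 else Counter()
        model[c] = {"prior": len(members) / len(labels), "num": num, "den": den}
    return model

def laplace(counts, gram, n_outcomes):
    """Add-one estimate of P(gram) over n_outcomes possible grams (assumed form)."""
    return (counts[gram] + 1) / (sum(counts.values()) + n_outcomes)

def log_score(seq, c, model, k, alphabet_size):
    """log P(c) + sum of log k-gram terms - sum of log interior (k-1)-gram terms."""
    m = model[c]
    score = math.log(m["prior"])
    for g in kgrams(seq, k):                    # numerator: i = 1 .. n-k+1
        score += math.log(laplace(m["num"], g, alphabet_size ** k))
    if k > 1:
        for g in kgrams(seq, k - 1)[1:-1]:      # denominator: i = 2 .. n-k+1
            score -= math.log(laplace(m["den"], g, alphabet_size ** (k - 1)))
    return score

def classify_nbk(seq, model, k, alphabet_size):
    """Return the class with the highest NB(k) log score."""
    return max(model, key=lambda c: log_score(seq, c, model, k, alphabet_size))

# Toy data (hypothetical) over a two-letter alphabet.
train_seqs = ["AAAB", "AABA", "BBBA", "BABB"]
train_labels = ["x", "x", "y", "y"]
model2 = train_nbk(train_seqs, train_labels, 2)
model1 = train_nbk(train_seqs, train_labels, 1)
```

            <p>With <it>k </it>= 1 the denominator is empty and the score is the ordinary Na&#239;ve Bayes score, which matches the reduction noted above.</p>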
         </sec>
         <sec>
            <st>
               <p>PSI-BLAST</p>
            </st>
            <p>We used PSI-BLAST (from the latest release of BLAST) <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> to construct a binary classifier for each class, and used the binary class label predicted by this classifier as an additional input to our HD-Tree classifier. Given a query sequence to be classified, we run PSI-BLAST with the query against a reference protein sequence database, namely the training set used in the cross-validation process. We assign to the query sequence the functional class of the top-scoring hit (the sequence with the lowest e-value) in the PSI-BLAST results. The binary prediction of the PSI-BLAST classifier for class <it>c </it>is 1 if the class label of the top-scoring hit is <it>c</it>, and 0 otherwise. An e-value cut-off of 0.0001 was used for PSI-BLAST, with all other parameters set to their default values.</p>
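            <p>The top-hit rule can be sketched as follows. The sketch does not run PSI-BLAST; it assumes the hits for one query have already been parsed into (class label, e-value) pairs, which is a hypothetical representation. How the cut-off interacts with PSI-BLAST itself is not specified here, so the sketch simply discards hits above the cut-off and predicts 0 when none remain.</p>

```python
def psiblast_binary_prediction(hits, target_class, evalue_cutoff=1e-4):
    """Binary prediction for target_class from parsed hits of one query.

    hits: iterable of (class_label, evalue) pairs (hypothetical input format).
    Returns 1 if the lowest-e-value hit passing the cut-off has the target class.
    """
    significant = [(label, e) for label, e in hits if evalue_cutoff >= e]
    if not significant:
        return 0  # assumption: no significant hit means a negative prediction
    top_label, _ = min(significant, key=lambda hit: hit[1])
    return 1 if top_label == target_class else 0

# Hypothetical hits for a single query sequence.
example_hits = [("ligase", 1e-10), ("kinase", 1e-30), ("kinase", 0.5)]
```

            <p>Calling the function once per class with the same hits gives the per-class binary labels that are passed on as additional inputs.</p>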
         </sec>
         <sec>
            <st>
               <p>Performance Evaluation</p>
            </st>
            <p>The performance measures <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> used to evaluate each of the different classifiers trained using machine learning algorithms are summarized in Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr>.</p>
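            <p>As a small illustration of the binary setting, the sketch below tallies the confusion-matrix counts <it>TP, TN, FP, FN </it>from 0/1 labels and computes accuracy as (TP + TN) / (TP + FP + TN + FN). The function names and the example labels are hypothetical.</p>

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for paired lists of 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical labels for five examples.
actual_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 1]
counts = confusion_counts(actual_labels, predicted_labels)
```

            <p>The remaining measures can be computed from the same four counts.</p>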
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Performance measure definitions for binary classification. The performance measures <it>accuracy, precision, recall, correlation coefficient, and kappa coefficient </it>are used to evaluate the performance of our machine learning approaches [45]. <it>Accuracy </it>is the fraction of overall predictions that are correct. <it>Precision </it>is the ratio of predicted true positive examples to the total number of actual positive examples. <it>Recall </it>is the ratio of predicted true positives to the total number of examples predicted as positive. <it>Correlation coefficient </it>measures the correlation between predictions and actual class labels. <it>Kappa coefficient </it>is used as a measure of agreement between two random variables (predictions and actual class labels). The table summarizes the definitions of performance measures in the 2-class setting (binary classification), where <it>M </it>= the total number of classes and <it>N </it>= the total number of examples. <it>TP, TN, FP, FN </it>are the true positives, true negatives, false positives, and false negatives for the given confusion matrix.</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>Performance Measure</p>
                     </c>
                     <c ca="center">
                        <p>Definition</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdsfaujabdcfaqjabgUcaRiabdsfaujabd6eaobqaaiabdsfaujabdcfaqjabgUcaRiabdAeagjabdcfaqjabgUcaRiabdsfaujabd6eaojabgUcaRiabdAeagjabd6eaobaaaaa@3E1C@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdsfaujabdcfaqbqaaiabdsfaujabdcfaqjabgUcaRiabdAeagjabd6eaobaaaaa@348C@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Recall</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbwvMCKfMBHbqedmvETj2BSbqee0evGueE0jxyaibaieIgFLIOYR2NHOxjYhrPYhrPYpI8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbbG8FasPYRqj0=yi0lXdbba9pGe9qqFf0dXdHuk9fr=xfr=xfrpiWZqaaeaabiGaaiaacaqabeaabeqacmaaaOqaamaalaaabaGaamivaiaadcfaaeaacaWGubGaamiuaiabgUcaRiaadAeacaWGqbaaaaaa@3B6B@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Correlation Coefficient</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>*</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>*</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msqrt>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>T</m:mi>
                                                   <m:mi>P</m:mi>
                                                   <m:mo>+</m:mo>
                                                   <m:mi>F</m:mi>
                                                   <m:mi>N</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>T</m:mi>
                                                   <m:mi>P</m:mi>
                                                   <m:mo>+</m:mo>
                                                   <m:mi>F</m:mi>
                                                   <m:mi>P</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>T</m:mi>
                                                   <m:mi>N</m:mi>
                                                   <m:mo>+</m:mo>
                                                   <m:mi>F</m:mi>
                                                   <m:mi>P</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>T</m:mi>
                                                   <m:mi>N</m:mi>
                                                   <m:mo>+</m:mo>
                                                   <m:mi>F</m:mi>
                                                   <m:mi>N</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:msqrt>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdsfaujabdcfaqjabcQcaQiabdsfaujabd6eaojabgkHiTiabdAeagjabdcfaqjabcQcaQiabdAeagjabd6eaobqaamaakaaabaGaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemOta4KaeiykaKIaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemiuaaLaeiykaKIaeiikaGIaemivaqLaemOta4Kaey4kaSIaemOrayKaemiuaaLaeiykaKIaeiikaGIaemivaqLaemOta4Kaey4kaSIaemOrayKaemOta4KaeiykaKcaleqaaaaaaaa@5544@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Kappa Coefficient</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>N</m:mi>
                                             <m:mo>*</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>*</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>*</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mi>N</m:mi>
                                                <m:mn>2</m:mn>
                                             </m:msup>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>*</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>*</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>T</m:mi>
                                             <m:mi>N</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabcIcaOiabdsfaujabdcfaqjabcQcaQiabgUcaRiabdsfaujabd6eaojabcMcaPiabgkHiTiabcIcaOiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdAeagjabd6eaojabcMcaPiabcQcaQiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdAeagjabdcfaqjabcMcaPiabgUcaRiabcIcaOiabdsfaujabd6eaojabgUcaRiabdAeagjabd6eaojabcMcaPiabcQcaQiabcIcaOiabdsfaujabd6eaojabgUcaRiabdAeagjabdcfaqjabcMcaPiabcMcaPaqaaiabd6eaojabgkHiTiabcIcaOiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdAeagjabd6eaojabcMcaPiabcQcaQiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdAeagjabdcfaqjabcMcaPiabgUcaRiabcIcaOiabdsfaujabd6eaojabgUcaRiabdAeagjabd6eaojabcMcaPiabcQcaQiabcIcaOiabdsfaujabd6eaojabgUcaRiabdAeagjabdcfaqjabcMcaPiabcMcaPaaaaaa@79B3@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
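            <p>As an illustration of the binary definitions tabulated above, the following minimal Python sketch computes the five measures from the four confusion-matrix counts. It is not the authors' implementation. The function names binary_measures and _ratio are hypothetical, and returning NaN for a zero denominator is an assumption of this sketch. Precision and recall follow the naming used in the table (precision divides by TP + FN, recall by TP + FP), which is the reverse of the naming in many other texts. The kappa coefficient is computed in the standard Cohen form, (N(TP + TN) - E) / (N^2 - E), where E = (TP + FN)(TP + FP) + (TN + FN)(TN + FP) and N is the total number of examples.</p>

```python
import math


def _ratio(numerator, denominator):
    # Assumption of this sketch: an undefined ratio is reported as NaN.
    if denominator == 0:
        return float("nan")
    return numerator / denominator


def binary_measures(tp, fp, tn, fn):
    """Return the five binary performance measures as a dictionary."""
    n = tp + fp + tn + fn

    accuracy = _ratio(tp + tn, n)
    # Naming follows the table: precision uses TP + FN, recall uses TP + FP.
    precision = _ratio(tp, tp + fn)
    recall = _ratio(tp, tp + fp)

    # Correlation coefficient between predictions and actual labels.
    cc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    correlation = _ratio(tp * tn - fp * fn, cc_denominator)

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # written with counts so that both terms are scaled by N squared.
    chance = (tp + fn) * (tp + fp) + (tn + fn) * (tn + fp)
    kappa = _ratio(n * (tp + tn) - chance, n * n - chance)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "correlation": correlation,
        "kappa": kappa,
    }


# Made-up counts, for illustration only.
measures = binary_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in measures.items():
    print(name, round(value, 4))
```

            <p>For these made-up counts the accuracy is 0.85 and the kappa coefficient is 0.7, which matches the probability form (0.85 - 0.5) / (1 - 0.5).</p>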
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Performance measure definitions for multi-class classification. The performance measures <it>accuracy, precision, recall, correlation coefficient, and kappa coefficient </it>are used to evaluate the performance of our machine learning approaches [45]. <it>Accuracy </it>is the fraction of all predictions that are correct. <it>Precision </it>is the ratio of predicted true positive examples to the total number of actual positive examples. <it>Recall </it>is the ratio of predicted true positives to the total number of examples predicted as positive. <it>Correlation coefficient </it>measures the correlation between predictions and actual class labels. <it>Kappa coefficient </it>measures the agreement between two random variables (here, the predictions and the actual class labels). The table gives the general definition of each measure, where <it>M </it>is the total number of classes, <it>N </it>is the total number of examples, and <it>x</it><sub><it>ik </it></sub>is the number of examples in row <it>i </it>and column <it>k </it>of the given confusion matrix.</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>Performance Measure</p>
                     </c>
                     <c ca="center">
                        <p>Definition</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>i</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>i</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mi>N</m:mi>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaamaaqadabaGaemiEaG3aaSbaaSqaaiabdMgaPjabdMgaPbqabaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemyta0eaniabggHiLdaakeaacqWGobGtaaaaaa@38B1@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Precision (class <it>i</it>)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>k</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>k</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdIha4naaBaaaleaacqWGPbqAcqWGPbqAaeqaaaGcbaWaaabmaeaacqWG4baEdaWgaaWcbaGaem4AaSMaemyAaKgabeaaaeaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaaaaaa@3BEF@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Recall (class <it>i</it>)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>k</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>i</m:mi>
                                                         <m:mi>k</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdIha4naaBaaaleaacqWGPbqAcqWGPbqAaeqaaaGcbaWaaabmaeaacqWG4baEdaWgaaWcbaGaemyAaKMaem4AaSgabeaaaeaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaaaaaa@3BEF@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Correlation Coefficient (class <it>i</it>)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                             <m:mo>*</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:msub>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>h</m:mi>
                                                      <m:mo>&#8800;</m:mo>
                                                      <m:mi>i</m:mi>
                                                   </m:mrow>
                                                </m:msub>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>h</m:mi>
                                                         <m:mi>h</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mstyle>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>k</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>k</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>*</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>j</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>i</m:mi>
                                                               <m:mi>j</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                         <m:mo stretchy="false">)</m:mo>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msqrt>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>i</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>+</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>k</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>k</m:mi>
                                                               <m:mi>i</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>i</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>+</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>j</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>i</m:mi>
                                                               <m:mi>j</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msub>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>h</m:mi>
                                                            <m:mo>&#8800;</m:mo>
                                                            <m:mi>i</m:mi>
                                                         </m:mrow>
                                                      </m:msub>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>h</m:mi>
                                                               <m:mi>h</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo>+</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>k</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>k</m:mi>
                                                               <m:mi>i</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msub>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>h</m:mi>
                                                            <m:mo>&#8800;</m:mo>
                                                            <m:mi>i</m:mi>
                                                         </m:mrow>
                                                      </m:msub>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>h</m:mi>
                                                               <m:mi>h</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo>+</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>j</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>i</m:mi>
                                                               <m:mi>j</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:msqrt>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabcIcaOiabdIha4naaBaaaleaacqWGPbqAcqWGPbqAaeqaaOGaeiOkaOYaaabeaeaacqWG4baEdaWgaaWcbaGaemiAaGMaemiAaGgabeaaaeaacqWGObaAcqGHGjsUcqWGPbqAaeqaniabggHiLdGccqGGPaqkcqGHsislcqGGOaakdaaeWaqaaiabdIha4naaBaaaleaacqWGRbWAcqWGPbqAaeqaaOGaeiOkaOYaaabmaeaacqWG4baEdaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemyta0eaniabggHiLdaaleaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaOqaamaakaaabaGaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPjabdMgaPbqabaGccqGHRaWkdaaeWaqaaiabdIha4naaBaaaleaacqWGRbWAcqWGPbqAaeqaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aOGaeiykaKIaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPjabdMgaPbqabaGccqGHRaWkdaaeWaqaaiabdIha4naaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aOGaeiykaKIaeiikaGYaaabeaeaacqWG4baEdaWgaaWcbaGaemiAaGMaemiAaGgabeaaaeaacqWGObaAcqGHGjsUcqWGPbqAaeqaniabggHiLdGccqGHRaWkdaaeWaqaaiabdIha4naaBaaaleaacqWGRbWAcqWGPbqAaeqaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aOGaeiykaKIaeiikaGYaaabeaeaacqWG4baEdaWgaaWcbaGaemiAaGMaemiAaGgabeaaaeaacqWGObaAcqGHGjsUcqWGPbqAaeqaniabggHiLdGccqGHRaWkdaaeWaqaaiabdIha4naaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aOGaeiykaKcaleqaaaaaaaa@AB9F@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Kappa Coefficient</p>
                     </c>
                     <c ca="center">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-8-284-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>i</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mrow>
                                                         <m:mi>i</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>h</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:mo stretchy="false">(</m:mo>
                                                         <m:mstyle displaystyle="true">
                                                            <m:msubsup>
                                                               <m:mo>&#8721;</m:mo>
                                                               <m:mrow>
                                                                  <m:mi>k</m:mi>
                                                                  <m:mo>=</m:mo>
                                                                  <m:mn>1</m:mn>
                                                               </m:mrow>
                                                               <m:mi>M</m:mi>
                                                            </m:msubsup>
                                                            <m:mrow>
                                                               <m:msub>
                                                                  <m:mi>x</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>k</m:mi>
                                                                     <m:mi>h</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                               <m:mo>*</m:mo>
                                                               <m:mstyle displaystyle="true">
                                                                  <m:msubsup>
                                                                     <m:mo>&#8721;</m:mo>
                                                                     <m:mrow>
                                                                        <m:mi>j</m:mi>
                                                                        <m:mo>=</m:mo>
                                                                        <m:mn>1</m:mn>
                                                                     </m:mrow>
                                                                     <m:mi>M</m:mi>
                                                                  </m:msubsup>
                                                                  <m:mrow>
                                                                     <m:msub>
                                                                        <m:mi>x</m:mi>
                                                                        <m:mrow>
                                                                           <m:mi>h</m:mi>
                                                                           <m:mi>j</m:mi>
                                                                        </m:mrow>
                                                                     </m:msub>
                                                                     <m:mo stretchy="false">)</m:mo>
                                                                  </m:mrow>
                                                               </m:mstyle>
                                                            </m:mrow>
                                                         </m:mstyle>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>N</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>h</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>M</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mstyle displaystyle="true">
                                                      <m:msubsup>
                                                         <m:mo>&#8721;</m:mo>
                                                         <m:mrow>
                                                            <m:mi>k</m:mi>
                                                            <m:mo>=</m:mo>
                                                            <m:mn>1</m:mn>
                                                         </m:mrow>
                                                         <m:mi>M</m:mi>
                                                      </m:msubsup>
                                                      <m:mrow>
                                                         <m:msub>
                                                            <m:mi>x</m:mi>
                                                            <m:mrow>
                                                               <m:mi>k</m:mi>
                                                               <m:mi>h</m:mi>
                                                            </m:mrow>
                                                         </m:msub>
                                                         <m:mo>*</m:mo>
                                                         <m:mstyle displaystyle="true">
                                                            <m:msubsup>
                                                               <m:mo>&#8721;</m:mo>
                                                               <m:mrow>
                                                                  <m:mi>j</m:mi>
                                                                  <m:mo>=</m:mo>
                                                                  <m:mn>1</m:mn>
                                                               </m:mrow>
                                                               <m:mi>M</m:mi>
                                                            </m:msubsup>
                                                            <m:mrow>
                                                               <m:msub>
                                                                  <m:mi>x</m:mi>
                                                                  <m:mrow>
                                                                     <m:mi>h</m:mi>
                                                                     <m:mi>j</m:mi>
                                                                  </m:mrow>
                                                               </m:msub>
                                                               <m:mo stretchy="false">)</m:mo>
                                                            </m:mrow>
                                                         </m:mstyle>
                                                      </m:mrow>
                                                   </m:mstyle>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaamaaqadabaGaemiEaG3aaSbaaSqaaiabdMgaPjabdMgaPbqabaGccqGHsisldaaeWaqaaiabcIcaOmaaqadabaGaemiEaG3aaSbaaSqaaiabdUgaRjabdIgaObqabaGccqGGQaGkdaaeWaqaaiabdIha4naaBaaaleaacqWGObaAcqWGQbGAaeqaaOGaeiykaKcaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaSqaaiabdUgaRjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aaWcbaGaemiAaGMaeyypa0JaeGymaedabaGaemyta0eaniabggHiLdaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaOqaaiabd6eaojabgkHiTmaaqadabaGaeiikaGYaaabmaeaacqWG4baEdaWgaaWcbaGaem4AaSMaemiAaGgabeaakiabcQcaQmaaqadabaGaemiEaG3aaSbaaSqaaiabdIgaOjabdQgaQbqabaGccqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyta0eaniabggHiLdaaleaacqWGObaAcqGH9aqpcqaIXaqmaeaacqWGnbqta0GaeyyeIuoaaaaaaa@7820@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
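            <p>As an illustration of the Kappa coefficient row above, the following minimal Python sketch computes Cohen's kappa from an M x M confusion matrix of counts. It uses the standard definition (observed agreement minus chance agreement, divided by one minus chance agreement), which is not necessarily the authors' Java implementation. The function name cohen_kappa and the example matrix are hypothetical. If the entries x_ij in the table are read as proportions, so that N = 1, the tabulated expression coincides with this definition; that reading is an assumption.</p>

```python
def cohen_kappa(x):
    """Cohen's kappa for a square confusion matrix x of counts.

    x[i][j] is the number of items falling in row class i and column
    class j. Kappa is symmetric in rows and columns, so it does not
    matter which axis holds the true labels (hypothetical helper, not
    the authors' code).
    """
    m = len(x)
    if any(len(row) != m for row in x):
        raise ValueError("confusion matrix must be square")

    n = sum(sum(row) for row in x)  # total number of items, N
    if n == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: fraction of items on the diagonal.
    observed = sum(x[i][i] for i in range(m)) / n

    # Chance agreement: sum over classes of (row total * column total) / N^2.
    row_totals = [sum(x[h]) for h in range(m)]
    col_totals = [sum(x[k][h] for k in range(m)) for h in range(m)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up two-class example, for illustration only.
    example = [[20, 5], [10, 15]]
    print(cohen_kappa(example))
```

            <p>A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance; the sketch raises an error in the degenerate case where chance agreement equals 1.</p>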
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>CA conceived of and designed the study, carried out the data analysis and visualization, developed the Java computer code, and drafted the manuscript. DD and VH contributed to the design of the study, analysis and interpretation of results, and writing of the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Response from original authors</p>
         </st>
         <p>Masaaki Furuno<sup>1,4</sup>, David Hill<sup>2,5</sup>, Judith Blake<sup>2,5</sup>, Richard Baldarelli<sup>2</sup>, Piero Carninci<sup>3,4</sup>, and Yoshihide Hayashizaki<sup>1,3,4</sup></p>
         <p>Addresses (<sup>1</sup>Functional RNA Research Program, RIKEN Frontier Research System, RIKEN Wako Institute, Wako, Japan. <sup>2</sup>Mouse Genome Informatics Consortium, The Jackson Laboratory, Bar Harbor, Maine, United States of America. <sup>3</sup>Genome Science Laboratory, Discovery Research Institute, RIKEN Wako Institute, Wako, Japan. <sup>4</sup>Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Kanagawa, Japan. <sup>5</sup>Gene Ontology Consortium, The Jackson Laboratory, Bar Harbor, Maine, United States of America)</p>
         <p>In this paper, the authors checked for potential Gene Ontology (GO) annotation errors using a machine learning approach. The authors' method identified a set of errors in GO annotations that relate to a very small subset of results from the 2001/2002 FANTOM2 analysis. These have subsequently been corrected.</p>
         <p>We agree with the authors' point about the importance of detecting annotation errors. However, we believe that the importance of the errors the authors describe is exaggerated, both by their selection of datasets and by the small set of genes that they studied. We will explain why they obtained these results, and we have identified a data curation change that has been implemented. We note, however, that such updates and revisions are a daily part of the work of large bioinformatics resources and of the genome informatics community.</p>
         <p>The strategy employed in FANTOM2 was appropriate and reflected the best strategy for mining large-scale functional information available at the time. In the computational analysis published in 2002 by the FANTOM2 Consortium, protein sequences were compared to other protein sequences, and GO annotations were inferred from identical or highly similar proteins. GO annotations were also inferred from InterPro domains that were found in the coding regions of the proteins. This advanced analysis resulted in GO predictions for many proteins about which nothing was known at that time. A subset of the results of this landmark analysis was integrated into Mouse Genome Informatics after the FANTOM2 publication. This data set is important because it was the first analysis of this scale and complexity performed in mouse.</p>
         <p>By retrieving annotations from AmiGO, Andorf <it>et al.</it> restricted themselves to the subset of aggressively predicted FANTOM2 GO annotations and did not consider the high-quality FANTOM2 GO annotations that are represented in MGI using other automated methods. This is because AmiGO by policy does not display annotations inferred from automated methods. Much of the FANTOM data does not appear in AmiGO because it entered the regular MGI annotation stream and receives regular refreshing. As a result, this analysis casts a small subset of the FANTOM2 GO annotations in an unfair light. To obtain a fair analysis of all GO terms annotated at the time of FANTOM2, the original FANTOM2 data are available <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>.</p>
         <p>The results reported by Andorf <it>et al.</it> remind us that conclusions based on a particular data set must be viewed in the context of a thorough understanding of how the data were generated and what is represented in the set used for the analysis. The errors in GO annotation found by the authors are not due to generally poor quality of FANTOM2 annotation. Rather, unique annotations from FANTOM2, as data associated with a publication, were not being comprehensively updated. We are reminded that any annotations based on computational methods must be regularly re-evaluated. MGI curators have now screened and updated the annotations for genes associated with protein tyrosine and protein serine/threonine kinase activities.</p>
         <suppl id="S9">
            <title>
               <p>Additional file 9</p>
            </title>
            <text>
               <p><b>Supplementary Table 7</b>: Distribution of protein classes for human and mouse proteins annotated by AmiGO, UniProt, and HDTree. This table presents the data used in Figure <figr fid="F1">1</figr>, a pie chart showing the distribution of human and mouse protein classes based on annotations found in AmiGO and UniProt and on predictions made by HDTree.</p>
            </text>
            <file name="1471-2105-8-284-S9.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The acknowledgements made by Andorf et al. are as follows.</p>
            <p>The authors wish to thank Masaaki Furuno, David Hill, Judith Blake, Richard Baldarelli, Piero Carninci, Yoshihide Hayashizaki and the other members of Mouse Genome Informatics, the FANTOM2 project, and AmiGO. Their work has provided invaluable resources, data, and tools to the public. We appreciate their prompt attention to the potential errors identified in this work (among thousands of correctly annotated proteins). We also would like to thank Shankar Subramaniam of the University of California, San Diego and Pierre Baldi of the University of California, Irvine for helpful comments on an earlier draft of this paper. This research was supported in part by grants from the National Science Foundation (0219699) and the National Institutes of Health (GM066387) to Vasant Honavar and Drena Dobbs. Carson Andorf has been supported in part by a fellowship funded by an Integrative Graduate Education and Research Training (IGERT) award (9972653) from the National Science Foundation. The authors are grateful to members of their research groups for helpful comments throughout the progress of this research.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Gene ontology: tool for the unification of biology</p>
            </title>
            <aug>
               <au>
                  <cnm>The Gene Ontology Consortium</cnm>
               </au>
            </aug>
            <source>Nature Genet</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <fpage>25</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/75556</pubid>
                  <pubid idtype="pmpid" link="fulltext">10802651</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Protein annotation: detective work for function prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Doerks</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>1998</pubdate>
            <volume>14</volume>
            <fpage>248</fpage>
            <lpage>250</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(98)01486-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">9635409</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Predicting functions from protein sequences &#8211; where are the bottlenecks?</p>
            </title>
            <aug>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1998</pubdate>
            <volume>18</volume>
            <issue>4</issue>
            <fpage>313</fpage>
            <lpage>318</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng0498-313</pubid>
                  <pubid idtype="pmpid" link="fulltext">9537411</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Percolation of annotation errors through hierarchically structured protein sequence databases</p>
            </title>
            <aug>
               <au>
                  <snm>Gilks</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>de Angelis</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tsoka</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Math Biosci</source>
            <pubdate>2005</pubdate>
            <volume>193</volume>
            <issue>2</issue>
            <fpage>223</fpage>
            <lpage>234</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.mbs.2004.08.001</pubid>
                  <pubid idtype="pmpid" link="fulltext">15748731</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Modeling the percolation of annotation errors in a database of protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Gilks</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>De Angelis</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tsoka</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>1641</fpage>
            <lpage>1649</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.12.1641</pubid>
                  <pubid idtype="pmpid" link="fulltext">12490449</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase</p>
            </title>
            <aug>
               <au>
                  <snm>Naumoff</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Glansdorff</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Labedan</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>52</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">514541</pubid>
                  <pubid idtype="pmpid" link="fulltext">15287962</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-5-52</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers</p>
            </title>
            <aug>
               <au>
                  <snm>Green</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Karp</snm>
                  <fnm>PD</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>4035</fpage>
            <lpage>4039</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1179732</pubid>
                  <pubid idtype="pmpid" link="fulltext">16034025</pubid>
                  <pubid idtype="doi">10.1093/nar/gki711</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A procedure for assessing GO annotation consistency</p>
            </title>
            <aug>
               <au>
                  <snm>Dolan</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Ni</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Camon</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Blake</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>136</fpage>
            <lpage>143</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/bioinformatics/bti1019</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>GOChase: correcting errors from gene ontology-based annotations for gene products</p>
            </title>
            <aug>
               <au>
                  <snm>Park</snm>
                  <fnm>YR</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>829</fpage>
            <lpage>831</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti106</pubid>
                  <pubid idtype="pmpid" link="fulltext">15513987</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Practical limits of function prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Devos</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2000</pubdate>
            <volume>41</volume>
            <issue>1</issue>
            <fpage>98</fpage>
            <lpage>107</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/1097-0134(20001001)41:1&lt;98::AID-PROT120>3.0.CO;2-S</pubid>
                  <pubid idtype="pmpid" link="fulltext">10944397</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Probabilistic annotation of protein sequences based on functional classifications</p>
            </title>
            <aug>
               <au>
                  <snm>Levy</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Gilks</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>302</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1361783</pubid>
                  <pubid idtype="pmpid" link="fulltext">16354297</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-302</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Learning classifiers for assigning protein sequences to gene ontology functional families</p>
            </title>
            <aug>
               <au>
                  <snm>Andorf</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Silvescu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Fifth Int Conf Knowledge Based Computer Systems, India</source>
            <pubdate>2004</pubdate>
            <fpage>256</fpage>
            <lpage>265</lpage>
            <url>http://www.cs.iastate.edu/~honavar/Papers/nbk.pdf</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Learning classifiers for assigning protein sequences to Gene Ontology functional families: combining of function annotation using sequence homology with that based on amino acid k-gram composition yields more accurate classifiers than either of the individual approaches</p>
            </title>
            <aug>
               <au>
                  <snm>Andorf</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Silvescu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <publisher>Department of Computer Science, Iowa State University</publisher>
            <pubdate>2004</pubdate>
            <url>http://www.cs.iastate.edu/~andorfc/hdtree/HDtree2006.pdf</url>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">534519</pubid>
                  <pubid idtype="pmpid" link="fulltext">15539463</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Remote homology detection: a motif based approach</p>
            </title>
            <aug>
               <au>
                  <snm>Ben-Hur</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Brutlag</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>i26</fpage>
            <lpage>i33</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg1002</pubid>
                  <pubid idtype="pmpid" link="fulltext">12855434</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Gotrees: predicting GO associations from protein domain composition using decision trees</p>
            </title>
            <aug>
               <au>
                  <snm>Hayete</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Bienkowska</snm>
                  <fnm>JR</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2005</pubdate>
            <fpage>127</fpage>
            <lpage>138</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15759620</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Martin</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Berriman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Barton</snm>
                  <fnm>GJ</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>178</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">535938</pubid>
                  <pubid idtype="pmpid" link="fulltext">15550167</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-178</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Prediction of protein functional domains from sequences using artificial neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Murvai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vlahovicek</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Szepesvari</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Pongor</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>1410</fpage>
            <lpage>1417</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">311121</pubid>
                  <pubid idtype="pmpid" link="fulltext">11483582</pubid>
                  <pubid idtype="doi">10.1101/gr.168701</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>GOPET: a tool for automated predictions of Gene Ontology terms</p>
            </title>
            <aug>
               <au>
                  <snm>Vinayagam</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>del Val</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Schubert</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Eils</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Glatting</snm>
                  <fnm>KH</fnm>
               </au>
               <au>
                  <snm>Suhai</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Konig</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>161</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1434778</pubid>
                  <pubid idtype="pmpid" link="fulltext">16549020</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-161</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Globally predicting protein functions based on co-expressed protein-protein interaction networks and ontology taxonomy similarities</p>
            </title>
            <aug>
               <au>
                  <snm>Zhu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gao</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Guo</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2007</pubdate>
            <volume>391</volume>
            <issue>1&#8211;2</issue>
            <fpage>113</fpage>
            <lpage>119</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.gene.2006.12.008</pubid>
                  <pubid idtype="pmpid" link="fulltext">17289301</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Protein serine/threonine phosphatases: life, death, and sleeping</p>
            </title>
            <aug>
               <au>
                  <snm>Gallego</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Virshup</snm>
                  <fnm>DM</fnm>
               </au>
            </aug>
            <source>Curr Opin Cell Biol</source>
            <pubdate>2005</pubdate>
            <volume>17</volume>
            <fpage>197</fpage>
            <lpage>202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ceb.2005.01.002</pubid>
                  <pubid idtype="pmpid" link="fulltext">15780597</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Cytoplasmic protein tyrosine phosphatases, regulation and function: the roles of PTP1B and TC-PTP</p>
            </title>
            <aug>
               <au>
                  <snm>Bourdeau</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Dube</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Tremblay</snm>
                  <fnm>ML</fnm>
               </au>
            </aug>
            <source>Curr Opin Cell Biol</source>
            <pubdate>2005</pubdate>
            <volume>17</volume>
            <fpage>203</fpage>
            <lpage>209</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ceb.2005.02.001</pubid>
                  <pubid idtype="pmpid" link="fulltext">15780598</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>The Gene Ontology (GO) project in 2006</p>
            </title>
            <aug>
               <au>
                  <cnm>Gene Ontology Consortium</cnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>Database issue</issue>
            <fpage>D322</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1347384</pubid>
                  <pubid idtype="pmpid" link="fulltext">16381878</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj021</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Machine learning in bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Larranaga</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Calvo</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Santana</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>86</fpage>
            <lpage>112</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bib/bbk007</pubid>
                  <pubid idtype="pmpid" link="fulltext">16761367</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>The Mouse Genome Database (MGD): from genes to mice &#8211; a community resource for mouse biology</p>
            </title>
            <aug>
               <au>
                  <snm>Eppig</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Bult</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Kadin</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>471</fpage>
            <lpage>475</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gki113</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs</p>
            </title>
            <aug>
               <au>
                  <snm>Okazaki</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Furuno</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <fpage>563</fpage>
            <lpage>573</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01266</pubid>
                  <pubid idtype="pmpid" link="fulltext">12466851</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The Universal Protein Resource (UniProt)</p>
            </title>
            <aug>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>154</fpage>
            <lpage>159</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gki070</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <aug>
               <au>
                  <snm>Quinlan</snm>
                  <fnm>JR</fnm>
               </au>
            </aug>
            <source>C4.5: Programs for Machine Learning</source>
            <publisher>Morgan Kaufmann</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B28">
            <title>
               <p>The mouse kinome: discovery and comparative genomics of all mouse protein kinases</p>
            </title>
            <aug>
               <au>
                  <snm>Caenepeel</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Charydczak</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Sudarsanam</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hunter</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Manning</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>PNAS</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <fpage>11707</fpage>
            <lpage>11712</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">511041</pubid>
                  <pubid idtype="pmpid" link="fulltext">15289607</pubid>
                  <pubid idtype="doi">10.1073/pnas.0306880101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Estimating the annotation error rate of curated GO database sequence annotations</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Baumann</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <issue>1</issue>
            <fpage>170</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1892569</pubid>
                  <pubid idtype="pmpid" link="fulltext">17519041</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-170</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Multi-label classification: An overview</p>
            </title>
            <aug>
               <au>
                  <snm>Tsoumakas</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Katakis</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Int J Data Warehousing and Mining</source>
            <pubdate>2007</pubdate>
            <volume>3</volume>
            <issue>3</issue>
            <fpage>1</fpage>
            <lpage>13</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Hierarchical multi-label prediction of gene function</p>
            </title>
            <aug>
               <au>
                  <snm>Barutcuoglu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Schapire</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Troyanskaya</snm>
                  <fnm>OG</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>7</issue>
            <fpage>830</fpage>
            <lpage>836</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btk048</pubid>
                  <pubid idtype="pmpid" link="fulltext">16410319</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Kernel-Based Learning of Hierarchical Multilabel Classification Models</p>
            </title>
            <aug>
               <au>
                  <snm>Rousu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Saunders</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Szedmak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shawe-Taylor</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mach Learn Res</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>1601</fpage>
            <lpage>1626</lpage>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Blockeel</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Schietgat</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Struyf</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dzeroski</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Clare</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases</source>
            <publisher>Berlin: Springer, Lecture Notes in Computer Science</publisher>
            <pubdate>2006</pubdate>
            <volume>4213</volume>
            <fpage>18</fpage>
            <lpage>29</lpage>
         </bibl>
         <bibl id="B34">
            <title>
               <p>A combined algorithm for genome-wide prediction of protein function</p>
            </title>
            <aug>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thompson</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1999</pubdate>
            <volume>402</volume>
            <fpage>83</fpage>
            <lpage>86</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/47048</pubid>
                  <pubid idtype="pmpid" link="fulltext">10573421</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Assigning protein functions by comparative genome analysis: protein phylogenetic profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Thompson</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>4285</fpage>
            <lpage>4288</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">16324</pubid>
                  <pubid idtype="pmpid" link="fulltext">10200254</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.8.4285</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Cluster analysis and display of genome-wide expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>14863</fpage>
            <lpage>14868</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">24541</pubid>
                  <pubid idtype="pmpid" link="fulltext">9843981</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.25.14863</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Whole-genome annotation by using evidence integration in functional-linkage networks</p>
            </title>
            <aug>
               <au>
                  <snm>Karaoz</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Murali</snm>
                  <fnm>TM</fnm>
               </au>
               <au>
                  <snm>Letovsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Zheng</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Ding</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Cantor</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kasif</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <fpage>2888</fpage>
            <lpage>2893</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">365715</pubid>
                  <pubid idtype="pmpid" link="fulltext">14981259</pubid>
                  <pubid idtype="doi">10.1073/pnas.0307326101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Probabilistic protein function prediction from heterogeneous genome-wide data</p>
            </title>
            <aug>
               <au>
                  <snm>Nariai</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kolaczyk</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Kasif</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PLoS ONE</source>
            <pubdate>2007</pubdate>
            <volume>2</volume>
            <fpage>e337</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1828618</pubid>
                  <pubid idtype="pmpid" link="fulltext">17396164</pubid>
                  <pubid idtype="doi">10.1371/journal.pone.0000337</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration</p>
            </title>
            <aug>
               <au>
                  <snm>Xiong</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rayner</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luo</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>268</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1481625</pubid>
                  <pubid idtype="pmpid" link="fulltext">16725034</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-268</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Data mining in bioinformatics using Weka</p>
            </title>
            <aug>
               <au>
                  <snm>Witten</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Data Mining: Practical machine learning tools and techniques</source>
            <publisher>San Francisco: Morgan Kaufmann</publisher>
            <edition>2</edition>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Inter-element dependency models for sequence classification. Technical report</p>
            </title>
            <aug>
               <au>
                  <snm>Silvescu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Andorf</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Dobbs</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Honavar</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <publisher>Department of Computer Science, Iowa State University</publisher>
            <pubdate>2004</pubdate>
            <url>http://www.cs.iastate.edu/~silvescu/papers/nbktr/nbktr.ps</url>
         </bibl>
         <bibl id="B42">
            <aug>
               <au>
                  <snm>Cowell</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Dawid</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lauritzen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Spiegelhalter</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Probabilistic Networks and Expert Systems</source>
            <publisher>Springer</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B43">
            <aug>
               <au>
                  <snm>Mitchell</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <publisher>New York, USA: McGraw Hill</publisher>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics: The Machine Learning Approach</source>
            <publisher>Cambridge, MA: MIT Press</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B46">
            <title>
               <p>FANTOM</p>
            </title>
            <url>http://fantom2.gsc.riken.jp</url>
         </bibl>
      </refgrp>
   </bm>
</art>
