<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-491</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Lei</snm>
               <fnm>Zhengdeng</fnm>
               <insr iid="I1"/>
               <email>zlei2@uic.edu</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Dai</snm>
               <fnm>Yang</fnm>
               <insr iid="I1"/>
               <email>yangdai@uic.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Bioengineering (MC063), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>491</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/491</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17090318</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-491</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>11</day>
               <month>7</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>07</day>
               <month>11</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>07</day>
               <month>11</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Lei and Dai; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The completion of various genome sequencing projects has resulted in the accumulation of a massive amount of gene sequence information. This calls for large-scale computational methods for predicting protein localization from sequence. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates. Predicting the localization of a protein at the subnuclear level is a challenging task. In our previous work, we proposed an SVM-based system that uses protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a nearest neighbor classifier module that uses a similarity measure derived from the GO annotation terms of protein sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The performance of the new system proposed here was compared with that of our previous system using a set of proteins residing within six localizations collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation, and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at <url>http://array.bioengr.uic.edu/subnuclear.htm</url>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The prediction of protein subnuclear localization can be strongly influenced by the definition of similarity chosen for a pair of proteins and by the underlying similarity measure for GO terms. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome. Substantial improvement in predicting protein subnuclear localization has been achieved by combining Gene Ontology with sequence information.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>With the completion of genomic sequencing projects, the need for automated prediction of protein subcellular or subnuclear localizations has become increasingly important. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. The bulk of past work has focused on protein subcellular localizations <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>, and has achieved high accuracy. However, the prediction of protein localization at the subnuclear level is far more challenging. We previously developed the first SVM-based system that uses protein sequence information for this task, with considerable predictive accuracy <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. In this work, we attempt to improve the performance of the system by incorporating information obtained from Gene Ontology (GO).</p>
         <p>GO has been developed to help manage the overwhelming mass of current biological data that are difficult to tie together into a cohesive whole from a computational perspective <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. It has become a <it>de facto </it>standard tool to annotate gene products for various databases. GO is a controlled vocabulary of terms split into three related ontologies: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). Molecular function describes activities, such as catalytic or binding activities, at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. A cellular component is a component of a cell, with the proviso that it is part of some larger object, such as an anatomical structure or a gene product group. A gene product might be associated with or located in one or more cellular components <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. It is active in one or more biological processes, during which it performs one or more molecular functions.</p>
         <p>Each category of GO terms is structured as a directed acyclic graph (DAG). Currently there are over 20,000 GO terms <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The relationships between GO terms have been extensively explored and applied to various biological problems, such as searching for genes with similar function. One of the key problems in these applications is how to define the similarity between two GO terms. Lord <it>et al</it>. <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> proposed a measure based on information content for the semantic similarity of GO terms. They showed that semantic similarity is correlated with protein sequence similarity and that this correlation is more marked in Molecular Function annotation. However, their similarity measure relies on a particular database, e.g. SWISS-PROT. Zhang <it>et al</it>. <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> used a recursive procedure to define a statistical measure, the D-value (distribution value), for each GO term in the GO DAG to avoid the dependency on a single annotation database, and developed a gene functional similarity search tool. Gentleman <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> proposed two measures based on graph similarity: simUI and simLP. The former is the ratio of the number of common nodes in the two graphs induced from the GO DAG to the number of nodes in their union. The latter is defined as the depth of the longest shared path from the root node. Wu <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> predicted functional modules encoded in microbial genomes using a similarity measure similar to simLP.</p>
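         <p>The two graph-based measures can be illustrated with a minimal sketch on a toy DAG. The term names, the PARENTS table and the helper functions below are hypothetical and are not taken from GO or from Bioconductor; each term's graph is taken to be the term plus all of its ancestors, and path depth is counted in edges, which is an assumption of this sketch.</p>
         <p>
```python
from functools import lru_cache

# Hypothetical toy DAG: each term maps to its parent terms; "root" has none.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
    "F": ["C", "D"],
}


def induced_nodes(term):
    """Return the term together with all of its ancestors."""
    nodes = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in nodes:
                nodes.add(parent)
                stack.append(parent)
    return nodes


@lru_cache(maxsize=None)
def longest_depth(term):
    """Length in edges (assumed convention) of the longest path from the root."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + max(longest_depth(p) for p in parents)


def sim_ui(term_a, term_b):
    """Shared nodes divided by the nodes in the union of the two graphs."""
    graph_a, graph_b = induced_nodes(term_a), induced_nodes(term_b)
    return len(graph_a.intersection(graph_b)) / len(graph_a.union(graph_b))


def sim_lp(term_a, term_b):
    """Depth of the longest path from the root that lies in both graphs."""
    shared = induced_nodes(term_a).intersection(induced_nodes(term_b))
    return max(longest_depth(term) for term in shared)


print(sim_ui("E", "F"))  # 3 shared nodes out of 7 in the union
print(sim_lp("E", "F"))  # longest shared path is root, A, C
```
         </p>
         <p>In this toy example the terms E and F share the nodes root, A and C, so the union-intersection ratio is 3/7 and the longest shared path has depth 2.</p>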
         <p>Although the semantic similarity between two GO terms has been extensively investigated, how to define the similarity between two gene products based on GO annotations for a specific application remains unclear. Suppose that each gene product is annotated by a set of GO terms. Each GO term from one set is paired with all GO terms in the other set. There are three general ways of defining the similarity of two gene products from those GO term pairs: (1) to take the maximum value of the similarity scores of the GO term pairs <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>, (2) to take the average over all the similarity scores of the GO term pairs <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>, and (3) to count the number of identical GO terms in the two GO term sets <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B25">25</abbr></abbrgrp>. We are particularly interested in identifying an appropriate definition of protein similarity for the prediction of protein subnuclear localization. To do so, it is necessary to investigate how various combinations of GO term similarity measures and protein similarity definitions affect predictive performance. This evaluation was carried out with our new predictive system, which extends the previous SVM module <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> with a nearest neighbor classification module constructed from a similarity definition between a pair of proteins.</p>
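         <p>The three general definitions listed above can be sketched as follows. The term identifiers, the TOY_SCORES table and the term_sim function are hypothetical placeholders for whichever GO term similarity measure is chosen; they do not reproduce any published scores.</p>
         <p>
```python
from itertools import product

# Hypothetical similarity scores for pairs of distinct terms (symmetric).
TOY_SCORES = {
    frozenset(("T1", "T2")): 0.5,
    frozenset(("T1", "T3")): 0.25,
}


def term_sim(s, t):
    """Placeholder term similarity: 1.0 for identical terms, table lookup otherwise."""
    if s == t:
        return 1.0
    return TOY_SCORES.get(frozenset((s, t)), 0.0)


def pair_scores(terms_a, terms_b):
    """Score every term of one protein against every term of the other."""
    return [term_sim(s, t) for s, t in product(terms_a, terms_b)]


def sim_max(terms_a, terms_b):
    """Definition (1): maximum over all GO term pair scores."""
    return max(pair_scores(terms_a, terms_b))


def sim_avg(terms_a, terms_b):
    """Definition (2): average over all GO term pair scores."""
    scores = pair_scores(terms_a, terms_b)
    return sum(scores) / len(scores)


def sim_identical(terms_a, terms_b):
    """Definition (3): number of GO terms the two annotation sets share."""
    return len(set(terms_a).intersection(terms_b))


protein_a = ["T1", "T2"]
protein_b = ["T2", "T3"]
print(sim_max(protein_a, protein_b))
print(sim_avg(protein_a, protein_b))
print(sim_identical(protein_a, protein_b))
```
         </p>
         <p>For these two toy annotation sets the four pair scores are 0.5, 0.25, 1.0 and 0.0, so the three definitions give 1.0, 0.4375 and 1 respectively, which shows how differently the same pair of proteins can be scored.</p>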
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Dataset</p>
            </st>
            <p>To provide a valid comparison with our previous system, the same dataset as in <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> was used for evaluation of the new system. The dataset was extracted from the Nuclear Protein Database (NPD) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> using a Perl script. The NPD is a curated database that stores information on more than 2000 vertebrate proteins, chiefly from human and mouse, which are reported in the literature to be localized in the cell nucleus. Since certain proteins are associated with more than one compartment, a test dataset consisting of proteins with multiple localizations was extracted. These proteins have the same SwissProt or Entrez Protein accession numbers although they are localized in different compartments. This preparative procedure resulted in 92 proteins that are localized within the six compartments described below. The majority are localized in two compartments, and the remainder in three or four compartments. After excluding the multi-localization proteins, a non-redundant dataset was further constructed by PROSET <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> to ensure low sequence identity (&lt;50%). To have a sufficient number of proteins for training and testing, only six localizations were selected for evaluation. These are PML BODY (38), Nuclear Lamina (55), Nuclear Splicing Speckles (56), Chromatin (61), Nucleoplasm (75), and Nucleolus (219). Each of these proteins has a single localization, and the total number is 504. The 92 multi-localization proteins are not included in the set of 504 single-localization proteins used for the leave-one-out cross-validation (LOOCV). Therefore, the multi-localization dataset is an independent testing set. A summary of the datasets is presented in Table <tblr tid="T1">1</tblr>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The summary of the nuclear proteins</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>Class label</p>
                     </c>
                     <c ca="center">
                        <p>Compartment</p>
                     </c>
                     <c ca="center">
                        <p>Number of sequences</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>PML BODY</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="center">
                        <p>55</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="center">
                        <p>56</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>Chromatin</p>
                     </c>
                     <c ca="center">
                        <p>61</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="center">
                        <p>75</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="center">
                        <p>219</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>Multiple localizations</p>
                     </c>
                     <c ca="center">
                        <p>92</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Predictive system and evaluation criteria</p>
            </st>
            <p>Given a test protein with GO annotations, the similarity scores between this protein and all the proteins in the training set are calculated from the similarity scores of GO term pairs (see Methods). The training protein with the highest similarity score is designated as the nearest neighbor of the test protein, and its class label is assigned to the test protein. If multiple proteins in different localizations attain the same highest score, or if the test protein does not have any GO annotation, then the test protein is labeled "unpredicted". The unpredicted proteins are passed on to the SVM module, which uses sequence information <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, to provide full coverage of prediction.</p>
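            <p>The decision rule described above can be sketched as follows. This is a minimal illustration rather than the implementation of the system: the function names, the toy training set, the shared_terms similarity and the svm_fallback stand-in are all hypothetical, and treating a tie among proteins of the same localization as a valid prediction is an assumption of this sketch.</p>
            <p>
```python
def shared_terms(terms_a, terms_b):
    """Placeholder protein similarity: number of GO terms in common."""
    return len(set(terms_a).intersection(terms_b))


def nearest_neighbor_label(test_terms, training, protein_sim):
    """Return the label of the nearest neighbor, or None for 'unpredicted'."""
    if not test_terms or not training:
        return None  # no GO annotation (or nothing to compare against)
    scored = [(protein_sim(test_terms, terms), label) for terms, label in training]
    best = max(score for score, _ in scored)
    best_labels = {label for score, label in scored if score == best}
    # A tie between different localizations leaves the protein unpredicted.
    return best_labels.pop() if len(best_labels) == 1 else None


def predict(test_terms, training, protein_sim, fallback):
    """Use the nearest neighbor module first, then the fallback classifier."""
    label = nearest_neighbor_label(test_terms, training, protein_sim)
    return fallback() if label is None else label


def svm_fallback():
    """Hypothetical stand-in for the sequence-based SVM module."""
    return "label from SVM module"


# Hypothetical training proteins: (GO term set, localization label).
TRAINING = [
    ({"T1", "T2"}, "Nucleolus"),
    ({"T2", "T3"}, "Chromatin"),
    ({"T4"}, "Nucleoplasm"),
]

print(predict({"T1", "T2"}, TRAINING, shared_terms, svm_fallback))
print(predict({"T2"}, TRAINING, shared_terms, svm_fallback))
print(predict(set(), TRAINING, shared_terms, svm_fallback))
```
            </p>
            <p>The first test protein has a unique nearest neighbor and receives its label, the second ties between two localizations, and the third has no GO annotation, so the last two are handed to the fallback classifier.</p>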
            <p>Since the numbers of proteins for the six localizations are unbalanced, the Matthews correlation coefficient (MCC) was employed for the optimization of parameters and the evaluation of performance <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The overall accuracy for multi-class classification proposed by Rost <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> was also used for the evaluation of our system. Definitions of the MCC and overall accuracy are detailed in the Methods section.</p>
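            <p>As a rough illustration of these criteria, the sketch below computes the commonly used one-vs-rest form of the MCC for a single class and the fraction of correctly predicted proteins. The exact definitions used in this work are those given in the Methods section and may differ in detail; the function names, the toy labels and the convention of returning 0 when the denominator vanishes are assumptions of this sketch.</p>
            <p>
```python
import math


def mcc_for_class(true_labels, predicted_labels, cls):
    """One-vs-rest Matthews correlation coefficient for the class `cls`."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == cls and predicted == cls:
            tp += 1
        elif truth != cls and predicted != cls:
            tn += 1
        elif predicted == cls:
            fp += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of counts is empty
    return (tp * tn - fp * fn) / denominator


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of proteins whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy labels, not data from this study.
truth = ["A", "A", "B", "B"]
guess = ["A", "B", "B", "B"]
print(round(mcc_for_class(truth, guess, "A"), 3))
print(overall_accuracy(truth, guess))
```
            </p>
            <p>Unlike accuracy alone, the MCC uses all four counts of the confusion table, so a classifier that favors the largest class is not rewarded for doing so.</p>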
         </sec>
         <sec>
            <st>
               <p>Comparison of various similarity measures for GO term pairs</p>
            </st>
            <p>Three different similarity measures for GO term pairs were compared: (1) Lord's method <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, (2) SimLP as described in Bioconductor <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, and (3) Exact Match. For Lord's method, the GO term frequencies were extracted from UniProtKB/Swiss-Prot <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. For a GO term pair, Exact Match defines the similarity score as 1 if the two GO terms are identical and 0 otherwise. SUM_Match was used to compute the similarity score between two proteins from the similarity scores of GO term pairs. It takes the sum of the similarity scores for all matched GO terms from the two proteins. Note that the SUM_Match score is equivalent to the inner product of two GO term vectors if Exact Match is used for GO term similarity (see Methods for details). As shown in Table <tblr tid="T2">2</tblr>, no significant difference in performance can be observed among these three similarity measures for GO term pairs. Surprisingly, the Exact Match method, which does not use the DAG structure of GO at all, achieved competitive performance in comparison with the other two methods.</p>
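            <p>The equivalence noted above for the Exact Match case can be checked with a small sketch. The term identifiers and function names are hypothetical, and the general SUM_Match definition for the other term similarity measures is the one given in Methods, which this sketch does not attempt to reproduce.</p>
            <p>
```python
def exact_match(s, t):
    """Exact Match term similarity: 1 for identical terms, 0 otherwise."""
    return 1 if s == t else 0


def sum_exact_match(terms_a, terms_b):
    """Sum of Exact Match scores over all term pairs of two proteins."""
    return sum(exact_match(s, t) for s in set(terms_a) for t in set(terms_b))


def indicator_vector(terms, vocabulary):
    """Binary GO term vector: 1 where the protein carries the term."""
    return [1 if term in terms else 0 for term in vocabulary]


def inner_product(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical annotation sets for two proteins.
protein_a = {"T1", "T2", "T3"}
protein_b = {"T2", "T3", "T4"}
vocabulary = sorted(protein_a.union(protein_b))

score = sum_exact_match(protein_a, protein_b)
dot = inner_product(indicator_vector(protein_a, vocabulary),
                    indicator_vector(protein_b, vocabulary))
print(score, dot)
```
            </p>
            <p>Both quantities count the GO terms that the two proteins share, which is why only pairs of identical terms contribute to the score under Exact Match.</p>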
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Predictive results obtained by using different similarity measures for GO term pairs</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>Semantic similarity method</p>
                     </c>
                     <c ca="center">
                        <p>Lord</p>
                     </c>
                     <c ca="center">
                        <p>SimLP</p>
                     </c>
                     <c ca="center">
                        <p>Exact Match</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Compartment</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>MCC (Accuracy %)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PML BODY</p>
                     </c>
                     <c ca="center">
                        <p>0.223 (31.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.253 (34.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.250 (31.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="center">
                        <p>0.579 (60.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.578 (63.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.578 (63.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="center">
                        <p>0.598 (66.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.607 (62.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.63 (62.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chromatin</p>
                     </c>
                     <c ca="center">
                        <p>0.511 (59.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.518 (60.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.509 (57.4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="center">
                        <p>0.411 (50.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.504 (56.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.483 (54.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="center">
                        <p>0.615 (75.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.656 (79.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.642 (80.8)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overall for Single-localization</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.489 (63.7)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.519 (66.5)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.515 (66.5)</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>(Based on SUM_Match: The similarity of two proteins is defined as the sum of similarity scores over all matched GO term pairs.)</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Comparison of various similarity definitions for proteins</p>
            </st>
            <p>Very few studies have explored how to define the similarity of proteins based on GO terms. Two simple approaches are usually employed to define the similarity between two proteins annotated by GO terms. One is to take the maximum value of the similarity scores of the GO term pairs. The other is to take the average over all the similarity scores of the GO term pairs. However, for the prediction of protein subnuclear localization, these two methods produced poor results, especially when the proteins were annotated by many GO terms. Consequently, an extensive investigation of various similarity definitions obtained from the similarity scores of GO terms was warranted. As shown in Table <tblr tid="T3">3</tblr>, the similarity definition has a profound impact on the quality of prediction. The overall accuracy ranges from 27.0% to 66.5% and the overall MCC ranges from 0.141 to 0.519 for single-localization proteins. It seems that using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produces the best predictive outcome for this prediction task.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Predictive results obtained by using various similarity definitions for proteins</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Similarity Definition</p>
                     </c>
                     <c ca="center">
                        <p>MAX</p>
                     </c>
                     <c ca="center">
                        <p>AVG</p>
                     </c>
                     <c ca="center">
                        <p>SUM</p>
                     </c>
                     <c ca="center">
                        <p>AVG_BestPairs</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Compartment</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>MCC (Accuracy %)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PML BODY</p>
                     </c>
                     <c ca="center">
                        <p>0.189 (28.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.153 (34.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.129 (76.3)</p>
                     </c>
                     <c ca="center">
                        <p>-0.031 (0.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="center">
                        <p>0.344 (45.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.535 (63.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.455 (45.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.315 (61.8)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="center">
                        <p>0.377 (35.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.251 (71.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.289 (33.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.013 (12.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chromatin</p>
                     </c>
                     <c ca="center">
                        <p>0.236 (19.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.218 (16.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.112 (4.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.142 (8.2)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="center">
                        <p>0.272 (29.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.039 (9.3)</p>
                     </c>
                     <c ca="center">
                        <p>-0.079 (4.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.118 (6.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="center">
                        <p>0.367 (75.8)</p>
                     </c>
                     <c ca="center">
                        <p>0.431 (44.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.214 (26.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.289 (75.3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overall for Single-localization</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.298 (50.8)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.271 (40.3)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.187 (27.0)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.141 (42.9)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Similarity Definition</p>
                     </c>
                     <c ca="center">
                        <p>SUM_BestPairs</p>
                     </c>
                     <c ca="center">
                        <p>AVG_Match</p>
                     </c>
                     <c ca="center">
                        <p>SUM_Match</p>
                     </c>
                     <c ca="center">
                        <p>MAX_Match</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Compartment</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>MCC (Accuracy %)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PML BODY</p>
                     </c>
                     <c ca="center">
                        <p>0.242 (44.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.187 (34.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.253 (34.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.211 (31.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="center">
                        <p>0.53 (67.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.586 (60.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.578 (63.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.344 (45.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="center">
                        <p>0.438 (46.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.397 (66.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.607 (62.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.487 (46.4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chromatin</p>
                     </c>
                     <c ca="center">
                        <p>0.325 (36.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.467 (45.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.518 (60.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.263 (21.3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="center">
                        <p>0.284 (36.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.332 (32.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.504 (56.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.298 (32.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="center">
                        <p>0.512 (66.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.615 (72.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.656 (79.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.407 (76.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overall for Single-localization</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.388 (54.6)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.431 (58.3)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.519 (66.5)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.335 (53.2)</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>(Based on SimLP: The GO term similarity is defined on the longest path shared by two GO terms [22].)</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Effect of using GO terms from homologs</p>
            </st>
            <p>Lord <it>et al</it>. <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> reported that many GO term pairs have identical similarity values. This problem stems from two sources: (1) proteins are represented by a relatively small number of GO terms; (2) the similarity measure considers only the information content <it>p</it><sub><it>ms </it></sub>(probability of the minimum subsumer) of the shared parents of the query terms, so the semantic distances of many different GO term pairs are identical. To alleviate this problem, GO terms of homologs retrieved by BLAST were used to represent a query protein. The E-value threshold in BLAST is crucial for both the quality and the number of candidate homologs. If the threshold is too large, homologs of low quality may be retrieved; if it is too small, too few candidate homologs are retrieved. We tested the following E-value thresholds: 10<sup>0</sup>, 10<sup>-1</sup>, 10<sup>-2</sup>, ..., 10<sup>-10</sup>, 10<sup>-15</sup>, 10<sup>-20</sup>, 10<sup>-30</sup>, 10<sup>-50</sup>, 10<sup>-100</sup>, 10<sup>-200</sup>, and found that E-value = 10<sup>-9 </sup>was a good trade-off. Even with this threshold, BLAST retrieved different numbers of hits for different query proteins. We found that using up to 5 homologs was suitable for representing the query protein (see Table <tblr tid="T4">4</tblr>).</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Results obtained by using different numbers of homolog(s)</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Number of homologs (up to n)</p>
                     </c>
                     <c ca="center">
                        <p>n = 1</p>
                     </c>
                     <c ca="center">
                        <p>n = 5</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Compartment</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>MCC (Accuracy %)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PML BODY</p>
                     </c>
                     <c ca="center">
                        <p>0.262 (39.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.253 (34.2)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="center">
                        <p>0.395 (43.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.578 (63.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="center">
                        <p>0.566 (57.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.607 (62.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chromatin</p>
                     </c>
                     <c ca="center">
                        <p>0.474 (47.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.518 (60.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="center">
                        <p>0.457 (53.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.504 (56.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="center">
                        <p>0.606 (79.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.656 (79.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overall for Single-localization</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.460 (62.3)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.519 (66.5)</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Predictive performance of the new system</p>
            </st>
            <p>As demonstrated above, the predictive outcome is greatly influenced by the way the similarity scores of GO term pairs are combined to give the similarity between two proteins. With an appropriate similarity definition, the current system performs substantially better than the previous SVM system. As seen in Table <tblr tid="T5">5</tblr>, the overall MCC is elevated from 0.284 to 0.519 (accuracy from 50.0% to 66.5%) for single-localization proteins in the leave-one-out cross-validation (LOOCV), and from 0.420 to 0.541 (accuracy unchanged at 65.2%) for an independent set of multi-localization proteins. More specifically, 401 (281 true predictions and 120 false predictions) out of 504 proteins were predicted by the GO module in the LOOCV, and the remaining 103 were passed on to the SVM module. For the independent test set of proteins with multiple localizations, 82 (55 true predictions and 27 false predictions) out of 92 proteins were predicted by the GO module, and the remaining 10 were passed on to the SVM module.</p>
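            <p>The counts quoted above can be turned into coverage and accuracy figures for the GO module alone. The following is a minimal arithmetic sketch; the helper name <it>share</it> and the dictionary layout are illustrative assumptions, and only the counts themselves come from the text.</p>

```python
def share(part, whole):
    """Percentage of whole accounted for by part, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Counts quoted in the text; the dictionary keys are illustrative names.
loocv = {"total": 504, "go_predicted": 401, "go_correct": 281}
multi = {"total": 92, "go_predicted": 82, "go_correct": 55}

for name, c in (("LOOCV", loocv), ("multi-localization", multi)):
    coverage = share(c["go_predicted"], c["total"])        # share handled by the GO module
    accuracy = share(c["go_correct"], c["go_predicted"])   # accuracy within that share
    passed_on = c["total"] - c["go_predicted"]             # left for the SVM module
    print(f"{name}: GO module covers {coverage}% with accuracy {accuracy}%; "
          f"{passed_on} proteins are passed on to the SVM module")
```

            <p>For the LOOCV set, 281 correct predictions out of 401 correspond to 70.1%, which agrees with the overall accuracy reported for the GO module in Table <tblr tid="T6">6</tblr>.</p>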
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Results obtained from the previous and current systems</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Method</p>
                     </c>
                     <c ca="left">
                        <p>AA (ver1)</p>
                     </c>
                     <c ca="left">
                        <p>GO-AA (ver2)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Compartment</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>MCC (Accuracy %)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PML BODY</p>
                     </c>
                     <c ca="left">
                        <p>0.172 (29.0)</p>
                     </c>
                     <c ca="left">
                        <p>0.253 (34.2)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Lamina</p>
                     </c>
                     <c ca="left">
                        <p>0.338 (43.6)</p>
                     </c>
                     <c ca="left">
                        <p>0.578 (63.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear Splicing Speckles</p>
                     </c>
                     <c ca="left">
                        <p>0.363 (35.7)</p>
                     </c>
                     <c ca="left">
                        <p>0.607 (62.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chromatin</p>
                     </c>
                     <c ca="left">
                        <p>0.260 (19.7)</p>
                     </c>
                     <c ca="left">
                        <p>0.518 (60.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleoplasm</p>
                     </c>
                     <c ca="left">
                        <p>0.206 (22.7)</p>
                     </c>
                     <c ca="left">
                        <p>0.504 (56.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleolus</p>
                     </c>
                     <c ca="left">
                        <p>0.367 (76.7)</p>
                     </c>
                     <c ca="left">
                        <p>0.656 (79.0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overall for Single-localization</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>0.284 (50.0)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>0.519 (66.5)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Multi-localization</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>0.420 (65.2)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>0.541 (65.2)</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>It should also be noted that our system is currently designed to predict only one localization. In fact, the results shown for the proteins with multiple localizations are somewhat overestimated, as a prediction is considered correct if any one of the localizations of a protein is correctly predicted.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>GO terms have been used in the prediction of protein subcellular localization <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B25">25</abbr></abbrgrp>. In those studies, the similarity of two proteins was defined as the number of GO terms shared exactly by the two proteins or, equivalently, as the inner product of the GO term vectors representing the two proteins (see Methods). The inner product of two GO term vectors can be considered a special case of the similarity definition SUM_Match used in this work. SUM_Match is essentially a weighted sum over the matched GO term pairs, where the weight is the depth of the term if SimLP is the GO term similarity, whereas the inner product assigns a uniform weight of 1 to all matched GO term pairs. Consequently, under SUM_Match, the more specific the two matched GO terms are, the greater the weight and the higher the contribution to the similarity.</p>
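         <p>The relation between the inner product and SUM_Match can be illustrated with a small sketch. The term identifiers and depth values below are made up for illustration and are not real GO data; the function names are likewise assumptions rather than part of the described system.</p>

```python
# Hypothetical depths for made-up term identifiers (not real GO data).
DEPTH = {"t1": 2, "t2": 5, "t3": 7, "t4": 3}


def sum_match(terms_a, terms_b, weight):
    """Sum a per-term weight over the terms shared exactly by both proteins."""
    return sum(weight(t) for t in set(terms_a).intersection(terms_b))


def inner_product(terms_a, terms_b, vocabulary):
    """Inner product of the binary term-indicator vectors of two proteins."""
    a, b = set(terms_a), set(terms_b)
    return sum(int(t in a) * int(t in b) for t in vocabulary)


protein_a = ["t1", "t2", "t3"]
protein_b = ["t2", "t3", "t4"]

# Uniform weight 1: every matched term counts equally (the inner-product case).
uniform = sum_match(protein_a, protein_b, lambda t: 1)
# Depth weight: the SimLP score of a term with itself is its depth.
by_depth = sum_match(protein_a, protein_b, DEPTH.__getitem__)
```

         <p>In this toy case the uniform weighting equals the inner product of the two indicator vectors, while the depth weighting rewards the deeper, more specific shared terms.</p>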
         <p>Including the similarity scores of all GO term pairs appears, in general, not to be a good strategy for defining the similarity between two protein sequences. The same conclusion can be drawn for the use of the scores of all best GO term pairs (see Methods). A possible reason is as follows. If two GO terms are only remotely related but share a common ancestor, they still have a positive score, which contributes to the similarity of the two proteins. In contrast, the similarity of protein pairs based on the matched GO terms receives zero contribution from unmatched GO terms. It seems that the unmatched terms add noise to the data and thus weaken the discriminative ability of the nearest neighbour module in our system. In our study, the best performance was attained when the similarity of two protein sequences was defined as SUM_Match. The similarity scores of ~20,000 matched GO term pairs can be pre-computed and stored in a hash table to reduce the computation time.</p>
         <p>A question that needs to be clarified in the GO-based approach is whether the prediction accuracy could be artificially inflated if the proteins in the training or testing sets have their specific subnuclear class annotated in GO. We examined this issue as follows. In this study, there are six GO terms associated with the subnuclear compartments: PML body (GO:0016605), Nuclear lamina (GO:0005652), Nuclear speck (GO:0016607, with synonyms Nuclear speckle, Splicing speckle), Chromatin (GO:0000785), Nucleoplasm (GO:0005654), and Nucleolus (GO:0005730). All proteins annotated by any of the above six GO terms are listed in the supplementary file [see <supplr sid="S1">Additional file 1</supplr>]. A relatively large number of proteins are correctly annotated in only two localizations, Chromatin and Nucleolus, and some proteins are mis-annotated for their subnuclear compartments. Most of the latter are mistakenly labelled as Nucleoplasm (GO:0005654).</p>
         <suppl id="S1">
            <title>
               <p>Additional File 1</p>
            </title>
            <text>
               <p>GO annotation for single-localization proteins. The data provides single-localization proteins annotated by six subnuclear compartment GO terms.</p>
            </text>
            <file name="1471-2105-7-491-S1.xls">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>To assess whether these specific GO terms are influential in the prediction, the performance of the GO module was compared before and after the removal of the six GO terms from the annotation list. As shown in Table <tblr tid="T6">6</tblr>, the accuracies for the compartments Nuclear Lamina, Chromatin and Nucleolus decreased slightly, those for the compartments Nuclear Splicing Speckles and Nucleoplasm increased slightly, and there was no change for the compartment PML body. The GO terms of the subnuclear compartments therefore do not appear to be decisive in the identification of the subnuclear compartment of a protein. Rather, the information from the overall set of annotated GO terms, that is, the similarity of two proteins defined from the GO term pairs, is more important.</p>
         <tbl id="T6">
            <title>
               <p>Table 6</p>
            </title>
            <caption>
               <p>Results obtained with and without the use of the six GO terms related to subnuclear compartments.</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>GO Module with BLAST homologs</p>
                  </c>
                  <c ca="center">
                     <p>With the subnuclear compartment GO terms</p>
                  </c>
                  <c ca="center">
                     <p>Without the subnuclear compartment GO terms</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Compartment</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>MCC (Accuracy %)</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>PML BODY</p>
                  </c>
                  <c ca="center">
                     <p>0.291 (40.0)</p>
                  </c>
                  <c ca="center">
                     <p>0.290 (40.0)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Nuclear Lamina</p>
                  </c>
                  <c ca="center">
                     <p>0.626 (67.4)</p>
                  </c>
                  <c ca="center">
                     <p>0.609 (65.1)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Nuclear Splicing Speckles</p>
                  </c>
                  <c ca="center">
                     <p>0.657 (70.0)</p>
                  </c>
                  <c ca="center">
                     <p>0.640 (73.7)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Chromatin</p>
                  </c>
                  <c ca="center">
                     <p>0.544 (63.5)</p>
                  </c>
                  <c ca="center">
                     <p>0.543 (61.5)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Nucleoplasm</p>
                  </c>
                  <c ca="center">
                     <p>0.543 (58.5)</p>
                  </c>
                  <c ca="center">
                     <p>0.548 (60.0)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Nucleolus</p>
                  </c>
                  <c ca="center">
                     <p>0.744 (82.5)</p>
                  </c>
                  <c ca="center">
                     <p>0.723 (80.1)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Overall for Single-localization</p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>0.568 (70.1)</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>0.559 (69.2)</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Number of proteins predicted by the GO module</p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>401 out of 504</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>399 out of 504</b>
                     </p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>The incorporation of the GO module has substantially improved the system performance. However, the module still makes a relatively high number of incorrect predictions. These errors cannot be corrected by the subsequent SVM module. Therefore, it would be desirable for the system to integrate the outcomes of the two modules whenever two predictions are available. We are investigating this possibility.</p>
         <p>Our system can be combined with other subcellular localization predictors, e.g. WoLF PSORT <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, PA-SUB <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> and pTARGET <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, for genome-scale prediction of protein localizations. Our system can take a list of predicted nuclear proteins obtained from the subcellular localization predictors and make a refined prediction at the subnuclear level.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Gene Ontology terms have been effectively incorporated into our previous SVM-based system for the prediction of protein subnuclear localization through a nearest neighbour classification module. The improvement in the performance of the new system is substantial. Various similarity definitions for a pair of proteins, built from different similarity measures of GO terms, have been examined for their effect on prediction. Using the sum of similarity scores over the matched GO term pairs of two proteins as the similarity definition produced the best predictive outcome in our study. The extensive investigation conducted in this work may provide some guidance on the choice of similarity definition for protein pairs based on GO terms in other applications.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Retrieval of GO terms</p>
            </st>
            <p>Given a protein sequence, we first BLASTed it against the Swiss-Prot database with a threshold E-value = 10<sup>-9</sup>. We selected up to 5 homologs, and submitted the Swiss-Prot accession numbers of the homologs to the QuickGO server <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> for the retrieval of predicted GO terms. The retrieved GO terms were used to represent the given protein.</p>
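            <p>The retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions: BLAST hits are taken to be already parsed into (accession, E-value) pairs, the homologs are chosen best-first by E-value (the text does not state how the up to 5 homologs are ordered), and the accessions, term names and function names are made up rather than taken from the described system.</p>

```python
def select_homologs(hits, max_evalue=1e-9, max_homologs=5):
    """Keep hits at or below the E-value cutoff, best first, at most max_homologs."""
    kept = [hit for hit in hits if max_evalue >= hit[1]]
    kept.sort(key=lambda hit: hit[1])  # assumption: smaller E-value is preferred
    return [accession for accession, _ in kept[:max_homologs]]


def go_representation(accessions, go_annotations):
    """Union of the GO terms annotated to the selected homologs."""
    terms = set()
    for accession in accessions:
        terms.update(go_annotations.get(accession, ()))
    return terms


# Made-up parsed BLAST output: (accession, E-value).
hits = [("P1", 1e-50), ("P2", 1e-3), ("P3", 1e-12), ("P4", 1e-9),
        ("P5", 1e-30), ("P6", 1e-100), ("P7", 1e-20), ("P8", 1e-10)]
# Made-up annotation lookup standing in for a GO annotation service.
go_annotations = {"P6": {"term_a", "term_b"}, "P1": {"term_b"}, "P5": {"term_c"}}

homologs = select_homologs(hits)
query_terms = go_representation(homologs, go_annotations)
```

            <p>Homologs without any annotation simply contribute nothing to the representation, so the number of GO terms per query protein can still vary.</p>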
         </sec>
         <sec>
            <st>
               <p>Definitions of similarity between two GO terms</p>
            </st>
            <p>First we define the depth for each GO term as follows.</p>
            <p>Depth(<it>g</it><sub><it>i</it></sub>) = the length of the longest path from GO term <it>g</it><sub><it>i </it></sub>to the root of Gene_Ontology, i.e., GO:0003673.</p>
            <p>Fig. <figr fid="F1">1</figr> shows an example of some GO depths, e.g. Depth(GO:0001838) = 7.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Depth of GO terms</p>
               </caption>
               <text>
                  <p>Depth of GO terms.</p>
               </text>
               <graphic file="1471-2105-7-491-1"/>
            </fig>
            <p>The similarity of two GO terms <it>g</it><sub>1 </sub>and <it>g</it><sub>2 </sub>can be defined as the depth of their most recent common ancestor (MRCA):</p>
            <p>
               <m:math name="1471-2105-7-491-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>S</m:mi>
                        <m:mi>i</m:mi>
                        <m:mi>m</m:mi>
                        <m:mo>_</m:mo>
                        <m:mi>G</m:mi>
                        <m:mi>O</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:msub>
                           <m:mi>g</m:mi>
                           <m:mn>1</m:mn>
                        </m:msub>
                        <m:mo>,</m:mo>
                        <m:msub>
                           <m:mi>g</m:mi>
                           <m:mn>2</m:mn>
                        </m:msub>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:munder>
                           <m:mrow>
                              <m:mtext>max</m:mtext>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mtext>g</m:mtext>
                                 <m:mtext>c</m:mtext>
                              </m:msub>
                              <m:mo>&#8712;</m:mo>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:msub>
                                 <m:mi>g</m:mi>
                                 <m:mn>1</m:mn>
                              </m:msub>
                              <m:mo>,</m:mo>
                              <m:msub>
                                 <m:mi>g</m:mi>
                                 <m:mn>2</m:mn>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                        </m:munder>
                        <m:mo>{</m:mo>
                        <m:mi>D</m:mi>
                        <m:mi>e</m:mi>
                        <m:mi>p</m:mi>
                        <m:mi>t</m:mi>
                        <m:mi>h</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:msub>
                           <m:mi>g</m:mi>
                           <m:mi>c</m:mi>
                        </m:msub>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>}</m:mo>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWucqWGPbqAcqWGTbqBcqGGFbWxcqWGhbWrcqWGpbWtcqGGOaakcqWGNbWzdaWgaaWcbaGaemymaedabeaakiabcYcaSiabdEgaNnaaBaaaleaacqWGYaGmaeqaaOGaeiykaKIaeyypa0ZaaCbeaeaacqqGTbqBcqqGHbqycqqG4baEaSqaaiabbEgaNnaaBaaameaacqqGJbWyaeqaaSGaeyicI4SaemiuaaLaeiikaGIaem4zaC2aaSbaaWqaaiabdgdaXaqabaWccqGGSaalcqWGNbWzdaWgaaadbaGaemOmaidabeaaliabcMcaPaqabaGccqGG7bWEcqWGebarcqWGLbqzcqWGWbaCcqWG0baDcqWGObaAcqGGOaakcqWGNbWzdaWgaaWcbaGaem4yamgabeaakiabcMcaPiabc2ha9jabc6caUiaaxMaacaWLjaWaaeWaaeaacqaIXaqmaiaawIcacaGLPaaaaaa@60E0@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>P</it>(<it>g</it><sub>1</sub><it>, g</it><sub>2</sub>) is the set of ancestral GO terms shared by <it>g</it><sub>1 </sub>and <it>g</it><sub>2</sub>, including the terms themselves. When <it>g</it><sub>1 </sub>= <it>g</it><sub>2</sub>, the maximum is attained at the term itself, so that <it>Depth</it>(<it>g</it><sub><it>c</it></sub>) = <it>Depth</it>(<it>g</it><sub>1</sub>) = <it>Depth</it>(<it>g</it><sub>2</sub>). For two GO terms from different ontologies (MF, BP, CC), the most recent common ancestor (MRCA) is the root GO:0003673, whose depth is zero. Consequently, two GO terms from different ontologies have zero similarity.</p>
            <p>The GO term similarity described here is the same as the method simLP implemented by Gentleman <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> in Bioconductor.</p>
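            <p>The definition in equation (1) can be illustrated with a minimal sketch. The toy ontology, the term names and the helper functions below are hypothetical and are not part of the system described here; depth is assumed to be the length of the longest path from the root.</p>
            <p>
```python
from functools import lru_cache

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
    "e": ["root"],
}


@lru_cache(maxsize=None)
def depth(term):
    """Length of the longest path from the root to `term` (assumed definition)."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + max(depth(parent) for parent in parents)


def ancestors(term):
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sim_go(g1, g2):
    """Equation (1): maximum depth over the shared ancestor set P(g1, g2)."""
    shared = ancestors(g1).intersection(ancestors(g2))
    return max(depth(term) for term in shared)
```
            </p>
            <p>In this toy graph, two terms that share only the root, such as "d" and "e", receive a similarity of zero, which mirrors the treatment of terms from different ontologies.</p>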
         </sec>
         <sec>
            <st>
               <p>Definitions of similarity between two protein sequences</p>
            </st>
            <p>Consider two proteins that are represented respectively by the sets of GO terms <it>G</it><sub>1 </sub>and <it>G</it><sub>2</sub>. The similarity Sim_Pro between the two proteins can be defined as a function of Sim_GO.</p>
            <p>For example, consider protein A (Entrez protein accession number: CAC84554) and protein B (SwissProt accession number: P46055), annotated by 3 GO terms (GO:0005488; GO:0005515; GO:0006412) and 4 GO terms (GO:0005737; GO:0006412; GO:0006415; GO:0016149), respectively. The simLP score for each GO term pair is listed in Table <tblr tid="T7">7</tblr>. The following eight functions for combining the similarity scores of GO term pairs were examined in this work:</p>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>The simLP scores for GO term pairs</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>GO:0005737</p>
                     </c>
                     <c ca="center">
                        <p>GO:0006412</p>
                     </c>
                     <c ca="center">
                        <p>GO:0006415</p>
                     </c>
                     <c ca="center">
                        <p>GO:0016149</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005488</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005515</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0006412</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>(a) MAX: take the maximum similarity score over all pairs of GO terms. Sim_Pro = 7.</p>
            <p>(b) AVG: take the average similarity score over all pairs of GO terms. Sim_Pro = (7+7+2+2)/12 = 1.5.</p>
            <p>(c) SUM: take the sum of the similarity scores over all pairs of GO terms. Sim_Pro = 7+7+2+2 = 18.</p>
            <p>(d) MAX_Match: same as (a), except that only matched GO term pairs, i.e. terms that appear in both annotation lists (here GO:0006412), are considered. Sim_Pro = 7.</p>
            <p>(e) AVG_Match: same as (b), except that only matched GO term pairs (here GO:0006412) are considered. Sim_Pro = 7/1 = 7.</p>
            <p>(f) SUM_Match: same as (c), except that only matched GO term pairs (here GO:0006412) are considered. Sim_Pro = 7.</p>
            <p>(g) AVG_BestPairs: average similarity between the best-paired GO terms, computed with the following pseudocode:</p>
            <p>NumofBestPairs &#8592; min {|<it>G<sub>1</sub></it>|, |<it>G<sub>2</sub></it>|}</p>
            <p>Sim_Pro &#8592; 0</p>
            <p>While (|<it>G<sub>1</sub></it>|>0 and |<it>G<sub>2</sub></it>|>0)</p>
            <p>Max_sim_GO &#8592; max{Sim_GO(g<sub><it>i</it></sub>, g<sub><it>j</it></sub>)}, <it>g</it><sub><it>i </it></sub>&#8712; <it>G</it><sub>1</sub>, <it>g</it><sub><it>j </it></sub>&#8712; <it>G</it><sub>2</sub></p>
            <p>Sim_Pro &#8592; Sim_Pro + Max_sim_GO</p>
            <p>Delete the pair attaining Max_sim_GO: remove <it>g</it><sub><it>i </it></sub>from <it>G</it><sub>1</sub>, and <it>g</it><sub><it>j </it></sub>from <it>G</it><sub>2 </sub></p>
            <p>End while</p>
            <p>Sim_Pro &#8592; Sim_Pro/NumofBestPairs</p>
            <p>Sim_Pro = (7+2+0)/3 = 3.</p>
            <p>(h) SUM_BestPairs: same as (g), except that Sim_Pro is not divided by NumofBestPairs, i.e., the final division step of the pseudocode in (g) is omitted. Sim_Pro = 7+2+0 = 9.</p>
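            <p>The eight combination functions can be checked against the worked example with a short sketch. The scores are those of Table 7; the function names, the first-found tie-breaking in the greedy pairing, and the value 0 returned when no terms match are assumptions made for illustration only.</p>
            <p>
```python
# simLP scores from Table 7: rows are protein A's terms, columns protein B's.
G1 = ["GO:0005488", "GO:0005515", "GO:0006412"]
G2 = ["GO:0005737", "GO:0006412", "GO:0006415", "GO:0016149"]
SCORES = [
    [0, 0, 0, 2],
    [0, 0, 0, 2],
    [0, 7, 7, 0],
]


def sim_go(g1, g2):
    """Look up the tabulated simLP score of a GO term pair."""
    return SCORES[G1.index(g1)][G2.index(g2)]


def all_pairs(terms1, terms2):
    """Scores of every pair (g_i, g_j)."""
    return [sim_go(a, b) for a in terms1 for b in terms2]


def matched_pairs(terms1, terms2):
    """Scores of the matched pairs, i.e. terms present in both lists."""
    return [sim_go(g, g) for g in terms1 if g in terms2]


def best_pairs(terms1, terms2):
    """Greedy pairing from the pseudocode in (g); ties go to the first pair found."""
    left, right = list(terms1), list(terms2)
    picked = []
    while left and right:
        score, a, b = max(
            ((sim_go(a, b), a, b) for a in left for b in right),
            key=lambda item: item[0],
        )
        picked.append(score)
        left.remove(a)
        right.remove(b)
    return picked


def combine(terms1, terms2):
    """Return Sim_Pro under each of the functions (a) to (h)."""
    pairs = all_pairs(terms1, terms2)
    matched = matched_pairs(terms1, terms2)
    best = best_pairs(terms1, terms2)  # len(best) equals NumofBestPairs
    return {
        "MAX": max(pairs),
        "AVG": sum(pairs) / len(pairs),
        "SUM": sum(pairs),
        "MAX_Match": max(matched, default=0),  # 0 without matches is assumed
        "AVG_Match": sum(matched) / len(matched) if matched else 0,
        "SUM_Match": sum(matched),
        "AVG_BestPairs": sum(best) / len(best),
        "SUM_BestPairs": sum(best),
    }
```
            </p>
            <p>Calling combine(G1, G2) on the Table 7 data gives the values quoted in (a) to (h) for proteins A and B.</p>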
            <p>In this work, the similarity Sim_Pro of two proteins employed in the final system is based on function (f) SUM_Match:</p>
            <p>
               <m:math name="1471-2105-7-491-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mtable>
                           <m:mtr>
                              <m:mtd>
                                 <m:mrow>
                                    <m:mi>S</m:mi>
                                    <m:mi>i</m:mi>
                                    <m:mi>m</m:mi>
                                    <m:mo>_</m:mo>
                                    <m:mi>P</m:mi>
                                    <m:mi>R</m:mi>
                                    <m:mi>O</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mn>1</m:mn>
                                    </m:msub>
                                    <m:mo>,</m:mo>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mn>2</m:mn>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo>=</m:mo>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>g</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                             <m:mo>=</m:mo>
                                             <m:msub>
                                                <m:mi>g</m:mi>
                                                <m:mi>j</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:munder>
                                       <m:mrow>
                                          <m:mi>S</m:mi>
                                          <m:mi>i</m:mi>
                                          <m:mi>m</m:mi>
                                          <m:mo>_</m:mo>
                                          <m:mi>G</m:mi>
                                          <m:mi>O</m:mi>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mi>g</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msub>
                                          <m:mo>,</m:mo>
                                          <m:msub>
                                             <m:mi>g</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:msub>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                    </m:mstyle>
                                    <m:mtext>,</m:mtext>
                                 </m:mrow>
                              </m:mtd>
                              <m:mtd>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>g</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mo>&#8712;</m:mo>
                                    <m:msub>
                                       <m:mi>G</m:mi>
                                       <m:mn>1</m:mn>
                                    </m:msub>
                                    <m:mo>,</m:mo>
                                    <m:mi/>
                                    <m:mi/>
                                    <m:msub>
                                       <m:mi>g</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                    <m:mo>&#8712;</m:mo>
                                    <m:msub>
                                       <m:mi>G</m:mi>
                                       <m:mn>2</m:mn>
                                    </m:msub>
                                 </m:mrow>
                              </m:mtd>
                           </m:mtr>
                        </m:mtable>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>2</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaGaem4uamLaemyAaKMaemyBa0Maei4xa8LaemiuaaLaemOuaiLaem4ta8KaeiikaGIaemiCaa3aaSbaaSqaaiabdgdaXaqabaGccqGGSaalcqWGWbaCdaWgaaWcbaGaemOmaidabeaakiabcMcaPiabg2da9maaqafabaGaem4uamLaemyAaKMaemyBa0Maei4xa8Laem4raCKaem4ta8KaeiikaGIaem4zaC2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWGNbWzdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaem4zaC2aaSbaaWqaaiabdMgaPbqabaWccqGH9aqpcqWGNbWzdaWgaaadbaGaemOAaOgabeaaaSqab0GaeyyeIuoakiabbYcaSaqaaiabdEgaNnaaBaaaleaacqWGPbqAaeqaaOGaeyicI4Saem4raC0aaSbaaSqaaiabdgdaXaqabaGccqWGSaalcqWGGaaicqWGGaaicqWGNbWzdaWgaaWcbaGaemOAaOgabeaakiabgIGiolabdEeahnaaBaaaleaacqWGYaGmaeqaaaaakiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@6B52@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where Sim_GO is defined in (1). If Sim_GO is instead set to the constant 1 for every matched pair, Sim_Pro reduces exactly to the Inner Product of the two GO term vectors (see below).</p>
         </sec>
         <sec>
            <st>
               <p>Inner product of two GO term vectors</p>
            </st>
            <p>The Inner Product of two GO term vectors has been used in previous studies for the prediction of protein subcellular localization <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B25">25</abbr></abbrgrp>. For a given protein, a vector is prepared whose length equals the total number of GO terms that appear. An entry is assigned the value 1 if the corresponding GO term is used in the annotation of the protein, and 0 otherwise, so that each protein is represented by a binary vector. The similarity between two proteins is defined as the inner product of the two corresponding GO term vectors. Equivalently, the Inner Product equals the number of GO terms shared by the annotation lists of the two proteins.</p>
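            <p>As a brief sketch of this representation, the following code builds binary GO term vectors for proteins A and B of the example above and confirms that their inner product equals the number of shared terms. The function names are hypothetical, and the vocabulary is restricted to the terms of these two proteins purely for illustration.</p>
            <p>
```python
def go_vector(annotation, vocabulary):
    """Binary vector: 1 where the protein is annotated with the term, else 0."""
    return [1 if term in annotation else 0 for term in vocabulary]


def inner_product(v1, v2):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(v1, v2))


protein_a = {"GO:0005488", "GO:0005515", "GO:0006412"}
protein_b = {"GO:0005737", "GO:0006412", "GO:0006415", "GO:0016149"}

# Illustrative vocabulary: only the terms used by the two proteins.
vocabulary = sorted(protein_a.union(protein_b))

similarity = inner_product(
    go_vector(protein_a, vocabulary),
    go_vector(protein_b, vocabulary),
)
```
            </p>
            <p>Here the two proteins share only GO:0006412, so the inner product is 1.</p>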
         </sec>
         <sec>
            <st>
               <p>Nearest neighbor classification</p>
            </st>
            <p>Our system includes a K-Nearest Neighbor (KNN) model. The best result was achieved with K = 1. A protein is assigned the localization label of its nearest neighbor, i.e. the protein with the highest similarity score Sim_Pro. If the protein has no associated GO terms, or has multiple nearest neighbors in different classes, the second module, an SVM built on sequence information <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, is called to give a prediction.</p>
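            <p>The decision rule for K = 1 can be sketched as follows. This is not the implementation of the system: the function names, the toy training set and its class labels are hypothetical, the similarity used is a simple count of shared terms, and the SVM module is replaced by a placeholder callable.</p>
            <p>
```python
def predict_label(query_terms, training_set, sim_pro, fallback):
    """1-nearest-neighbor rule with a fallback predictor.

    training_set is a non-empty list of (go_terms, label) pairs. The
    fallback callable stands in for the sequence-based SVM module.
    """
    if not query_terms:
        return fallback()  # no GO annotation available
    scored = [(sim_pro(query_terms, terms), label) for terms, label in training_set]
    best = max(score for score, _ in scored)
    labels = {label for score, label in scored if score == best}
    if len(labels) == 1:
        return labels.pop()
    return fallback()  # nearest neighbors disagree on the class


def shared_term_count(terms1, terms2):
    """Illustrative Sim_Pro: the number of GO terms shared by both proteins."""
    return len(set(terms1).intersection(terms2))


# Hypothetical training proteins and class labels.
TRAINING = [
    ({"GO:0006412", "GO:0006415"}, "class_1"),
    ({"GO:0005515"}, "class_2"),
    ({"GO:0005515", "GO:0005488"}, "class_3"),
]


def svm_placeholder():
    """Stand-in for the SVM module; returns a marker instead of a prediction."""
    return "deferred_to_svm"
```
            </p>
            <p>With this toy data, a query annotated only with GO:0006412 receives the label of the first training protein, whereas a query annotated only with GO:0005515 ties between two classes and is passed to the placeholder.</p>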
         </sec>
         <sec>
            <st>
               <p>The SVM module</p>
            </st>
            <p>In our previous work <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, we built an SVM system for the prediction of protein subnuclear localizations based solely on protein sequence information. New SVM kernel functions were introduced to measure sequence similarity. The k-peptide vectors are first mapped by a matrix of high-scoring pairs of k-peptides, which are scored with BLOSUM62. The kernels, which measure the similarity between sequences, are then defined on the mapped vectors. By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations was established.</p>
         </sec>
         <sec>
            <st>
               <p>Evaluation</p>
            </st>
            <p>Since the numbers of proteins in the various localizations are unbalanced, the Matthews correlation coefficient (MCC) was employed for the optimization of parameters and the evaluation of performance <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>:</p>
            <p>
               <m:math name="1471-2105-7-491-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>M</m:mi>
                        <m:mi>C</m:mi>
                        <m:msub>
                           <m:mi>C</m:mi>
                           <m:mi>n</m:mi>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>p</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:msub>
                                 <m:mi>s</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:mo>&#8722;</m:mo>
                              <m:msub>
                                 <m:mi>u</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:msub>
                                 <m:mi>o</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msqrt>
                                 <m:mrow>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo>+</m:mo>
                                    <m:msub>
                                       <m:mi>u</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo>+</m:mo>
                                    <m:msub>
                                       <m:mi>o</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>s</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo>+</m:mo>
                                    <m:msub>
                                       <m:mi>u</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>s</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo>+</m:mo>
                                    <m:msub>
                                       <m:mi>o</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:msqrt>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>,</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGnbqtcqWGdbWqcqWGdbWqdaWgaaWcbaGaemOBa4gabeaakiabg2da9maalaaabaGaemiCaa3aaSbaaSqaaiabd6gaUbqabaGccqWGZbWCdaWgaaWcbaGaemOBa4gabeaakiabgkHiTiabdwha1naaBaaaleaacqWGUbGBaeqaaOGaem4Ba82aaSbaaSqaaiabd6gaUbqabaaakeaadaGcaaqaaiabcIcaOiabdchaWnaaBaaaleaacqWGUbGBaeqaaOGaey4kaSIaemyDau3aaSbaaSqaaiabd6gaUbqabaGccqGGPaqkcqGGOaakcqWGWbaCdaWgaaWcbaGaemOBa4gabeaakiabgUcaRiabd+gaVnaaBaaaleaacqWGUbGBaeqaaOGaeiykaKIaeiikaGIaem4Cam3aaSbaaSqaaiabd6gaUbqabaGccqGHRaWkcqWG1bqDdaWgaaWcbaGaemOBa4gabeaakiabcMcaPiabcIcaOiabdohaZnaaBaaaleaacqWGUbGBaeqaaOGaey4kaSIaem4Ba82aaSbaaSqaaiabd6gaUbqabaGccqGGPaqkaSqabaaaaOGaeiilaWcaaa@633A@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>p</it><sub><it>n </it></sub>is the number of correctly predicted proteins of the location <it>n</it>, <it>s</it><sub><it>n </it></sub>is the number of correctly predicted proteins not in the location <it>n</it>, <it>u</it><sub><it>n </it></sub>is the number of under-predicted proteins, and <it>o</it><sub><it>n </it></sub>the number of over-predicted proteins.</p>
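            <p>A minimal sketch of the MCC computation for one location is given below. It assumes that under-predicted proteins are those in location n that were assigned elsewhere and that over-predicted proteins are those assigned to n but located elsewhere. The function names and the value returned for a zero denominator are likewise assumptions.</p>
            <p>
```python
import math


def class_counts(true_labels, predicted_labels, n):
    """Return (p, s, u, o) for location n under the assumed reading of u and o."""
    p = s = u = o = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == n and guess == n:
            p += 1  # correctly predicted in location n
        elif truth != n and guess != n:
            s += 1  # correctly predicted not in location n
        elif truth == n:
            u += 1  # under-predicted: in n, assigned elsewhere
        else:
            o += 1  # over-predicted: assigned to n, located elsewhere
    return p, s, u, o


def mcc(p, s, u, o):
    """Matthews correlation coefficient for one location."""
    denominator = math.sqrt((p + u) * (p + o) * (s + u) * (s + o))
    if denominator == 0:
        return 0.0  # assumed convention; the text does not cover this case
    return (p * s - u * o) / denominator
```
            </p>
            <p>A perfect prediction for a location gives an MCC of 1, and a prediction that is wrong for every protein gives an MCC of -1.</p>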
            <p>In addition, the overall accuracy for multi-class classification proposed by Rost <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> was used to evaluate our system. Suppose there are <it>m </it>= <it>m</it><sub>1 </sub>+ <it>m</it><sub>2 </sub>+ &#8230; + <it>m</it><sub><it>N </it></sub>test proteins, where <it>m</it><sub><it>n </it></sub>is the number of proteins belonging to class <it>n </it>(<it>n </it>= 1,...,<it>N</it>). Suppose further that, of the <it>m</it><sub><it>n </it></sub>proteins in class <it>n</it>, <it>p</it><sub><it>n </it></sub>are correctly predicted to belong to class <it>n</it>. Then <it>p </it>= <it>p</it><sub>1 </sub>+ <it>p</it><sub>2 </sub>+ &#8230; + <it>p</it><sub><it>N </it></sub>is the total number of correctly predicted proteins. The accuracy for class <it>n </it>is</p>
            <p>
               <m:math name="1471-2105-7-491-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>a</m:mi>
                        <m:mi>c</m:mi>
                        <m:msub>
                           <m:mi>c</m:mi>
                           <m:mi>n</m:mi>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>p</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>m</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>,</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGHbqycqWGJbWycqWGJbWydaWgaaWcbaGaemOBa4gabeaakiabg2da9maalaaabaGaemiCaa3aaSbaaSqaaiabd6gaUbqabaaakeaacqWGTbqBdaWgaaWcbaGaemOBa4gabeaaaaGccqGGSaalaaa@3A28@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>and the overall accuracy, denoted by Q<sub>acc</sub>, is defined as</p>
            <p>
               <m:math name="1471-2105-7-491-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>Q</m:mi>
                           <m:mrow>
                              <m:mi>a</m:mi>
                              <m:mi>c</m:mi>
                              <m:mi>c</m:mi>
                           </m:mrow>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>n</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mi>N</m:mi>
                           </m:munderover>
                           <m:mrow>
                              <m:mi>a</m:mi>
                              <m:mi>c</m:mi>
                              <m:msub>
                                 <m:mi>c</m:mi>
                                 <m:mi>n</m:mi>
                              </m:msub>
                              <m:mo>&#215;</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>m</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mi>m</m:mi>
                              </m:mfrac>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>n</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mi>N</m:mi>
                           </m:munderover>
                           <m:mrow>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>p</m:mi>
                                       <m:mi>n</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mi>m</m:mi>
                              </m:mfrac>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mi>p</m:mi>
                                 <m:mi>m</m:mi>
                              </m:mfrac>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>.</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaemyyaeMaem4yamMaem4yamgabeaakiabg2da9maaqahabaGaemyyaeMaem4yamMaem4yam2aaSbaaSqaaiabd6gaUbqabaGccqGHxdaTdaWcaaqaaiabd2gaTnaaBaaaleaacqWGUbGBaeqaaaGcbaGaemyBa0gaaaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdGccqGH9aqpdaaeWbqaamaalaaabaGaemiCaa3aaSbaaSqaaiabd6gaUbqabaaakeaacqWGTbqBaaGaeyypa0ZaaSaaaeaacqWGWbaCaeaacqWGTbqBaaaaleaacqWGUbGBcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoakiabc6caUaaa@56E3@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
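            <p>The per-class and overall accuracies can be computed as in the following sketch, which takes the true and predicted class labels of the test proteins. The function name and the toy labels in the usage remark are hypothetical.</p>
            <p>
```python
from collections import Counter


def accuracies(true_labels, predicted_labels):
    """Return ({class: acc_n}, Q_acc) for the given test proteins."""
    m = len(true_labels)
    m_n = Counter(true_labels)  # number of proteins in each class
    p_n = Counter(
        truth
        for truth, guess in zip(true_labels, predicted_labels)
        if truth == guess
    )  # correctly predicted proteins in each class
    acc = {n: p_n[n] / m_n[n] for n in m_n}
    # Weighted sum of the class accuracies, which equals p / m.
    q_acc = sum(acc[n] * m_n[n] / m for n in m_n)
    return acc, q_acc
```
            </p>
            <p>For true labels a, a, b, b and predictions a, b, b, b, the class accuracies are 0.5 and 1.0, and the overall accuracy is 3/4, the fraction of correctly predicted proteins.</p>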
         </sec>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>Project name: Subnuclear Compartments Prediction System (Version 2.0)</p>
         <p>Project home page: <url>http://array.bioengr.uic.edu/subnuclear.htm</url></p>
         <p>Operating system(s): Linux</p>
         <p>Programming language: Perl</p>
         <p>License: None</p>
         <p>Any restrictions to use by non-academics: None</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>LZ designed the system, implemented the programs and carried out the detailed study. YD conceived the idea of this work, supervised the project and participated in manuscript preparation. All authors have read and approved the final manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Note</p>
         </st>
         <p>AA: SVM module based on protein sequence information</p>
         <p>GO-AA: Combination of Gene Ontology module and sequence information module</p>
         <p>Lord: The GO term similarity is defined on information content by Lord <it>et al</it>. <abbrgrp><abbr bid="B20">20</abbr></abbrgrp></p>
         <p>SimLP: The GO term similarity is defined as the longest path shared by two GO terms <abbrgrp><abbr bid="B22">22</abbr></abbrgrp></p>
         <p>Exact Match: The GO term similarity is defined as 1 if two GO terms are identical, 0 otherwise.</p>
         <p>MAX: The similarity of two proteins is defined as the maximum of the similarity scores of all GO term pairs</p>
         <p>AVG: The similarity of two proteins is defined as the average of the similarity scores of all GO term pairs</p>
         <p>SUM: The similarity of two proteins is defined as the sum of similarity scores over all GO term pairs</p>
         <p>MAX_Match: The similarity of two proteins is defined as the maximum of similarity scores of all matched GO term pairs</p>
         <p>AVG_Match: The similarity of two proteins is defined as the average of similarity scores of all matched GO term pairs</p>
         <p>SUM_Match: The similarity of two proteins is defined as the sum of similarity scores over all matched GO term pairs</p>
         <p>AVG_BestPairs: The similarity of two proteins is defined as the average of similarity scores of the best paired GO terms</p>
         <p>SUM_BestPairs: The similarity of two proteins is defined as the sum of similarity scores over all best paired GO terms</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This research is supported in part by the National Science Foundation (EIA-022-0301) and the Naval Research Laboratory (N00173-03-1-G016). The authors would like to thank Peter Larsen for discussions on Gene Ontology and for careful reading of the manuscript. We thank the anonymous referees for their valuable suggestions.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Significance of subnuclear localization of key players of inositol lipid cycle</p>
            </title>
            <aug>
               <au>
                  <snm>Cocco</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Manzoli</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Barnabei</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Martelli</snm>
                  <fnm>AM</fnm>
               </au>
            </aug>
            <source>Adv Enzyme Regul</source>
            <pubdate>2004</pubdate>
            <volume>44</volume>
            <fpage>51</fpage>
            <lpage>60</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15581482</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Nuclear localization is required for Dishevelled function in Wnt/beta-catenin signaling</p>
            </title>
            <aug>
               <au>
                  <snm>Itoh</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Brott</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Bae</snm>
                  <fnm>GU</fnm>
               </au>
               <au>
                  <snm>Ratcliffe</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Sokol</snm>
                  <fnm>SY</fnm>
               </au>
            </aug>
            <source>J Biol</source>
            <pubdate>2005</pubdate>
            <volume>4</volume>
            <issue>1</issue>
            <fpage>3</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">551520</pubid>
                  <pubid idtype="pmpid" link="fulltext">15720724</pubid>
                  <pubid idtype="doi">10.1186/jbiol20</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization</p>
            </title>
            <aug>
               <au>
                  <snm>Nakai</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Horton</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Trends Biochem Sci</source>
            <pubdate>1999</pubdate>
            <volume>24</volume>
            <issue>1</issue>
            <fpage>34</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0968-0004(98)01336-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">10087920</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>An overview on predicting the subcellular location of a protein</p>
            </title>
            <aug>
               <au>
                  <snm>Feng</snm>
                  <fnm>ZP</fnm>
               </au>
            </aug>
            <source>In Silico Biol</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <issue>3</issue>
            <fpage>291</fpage>
            <lpage>303</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12542414</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria</p>
            </title>
            <aug>
               <au>
                  <snm>Gardy</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Spencer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ester</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Tusnady</snm>
                  <fnm>GE</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Hua</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>deFays</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lambert</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Nakai</snm>
                  <fnm>K</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3613</fpage>
            <lpage>3617</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">169008</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824378</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg602</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Better prediction of sub-cellular localization by combining evolutionary and structural information</p>
            </title>
            <aug>
               <au>
                  <snm>Nair</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2003</pubdate>
            <volume>53</volume>
            <issue>4</issue>
            <fpage>917</fpage>
            <lpage>930</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.10507</pubid>
                  <pubid idtype="pmpid" link="fulltext">14635133</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Automatic annotation of protein motif function with Gene Ontology terms</p>
            </title>
            <aug>
               <au>
                  <snm>Lu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Zhai</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Gopalakrishnan</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Buchanan</snm>
                  <fnm>BG</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>122</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">517493</pubid>
                  <pubid idtype="pmpid" link="fulltext">15345032</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-122</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Learnability-based further prediction of gene functions in Gene Ontology</p>
            </title>
            <aug>
               <au>
                  <snm>Tu</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Guo</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>X</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2004</pubdate>
            <volume>84</volume>
            <issue>6</issue>
            <fpage>922</fpage>
            <lpage>928</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ygeno.2004.08.005</pubid>
                  <pubid idtype="pmpid" link="fulltext">15533709</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Predicting 22 protein localizations in budding yeast</p>
            </title>
            <aug>
               <au>
                  <snm>Cai</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Chou</snm>
                  <fnm>KC</fnm>
               </au>
            </aug>
            <source>Biochem Biophys Res Commun</source>
            <pubdate>2004</pubdate>
            <volume>323</volume>
            <issue>2</issue>
            <fpage>425</fpage>
            <lpage>428</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.bbrc.2004.08.113</pubid>
                  <pubid idtype="pmpid" link="fulltext">15369769</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Gardy</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Rey</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Walsh</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Ester</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Brinkman</snm>
                  <fnm>FS</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>5</issue>
            <fpage>617</fpage>
            <lpage>623</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti057</pubid>
                  <pubid idtype="pmpid" link="fulltext">15501914</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>PSLpred: prediction of subcellular localization of bacterial proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Bhasin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Garg</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Raghava</snm>
                  <fnm>GP</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>10</issue>
            <fpage>2522</fpage>
            <lpage>2524</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti309</pubid>
                  <pubid idtype="pmpid" link="fulltext">15699023</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties</p>
            </title>
            <aug>
               <au>
                  <snm>Sarda</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Chua</snm>
                  <fnm>GH</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>KB</fnm>
               </au>
               <au>
                  <snm>Krishnan</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>1</issue>
            <fpage>152</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1182350</pubid>
                  <pubid idtype="pmpid" link="fulltext">15963230</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-152</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sung</snm>
                  <fnm>WK</fnm>
               </au>
               <au>
                  <snm>Krishnan</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>KB</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>1</issue>
            <fpage>174</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1190155</pubid>
                  <pubid idtype="pmpid" link="fulltext">16011808</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-174</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Domain rearrangements in protein evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Bjorklund</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Ekman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Light</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Frey-Skott</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Elofsson</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2005</pubdate>
            <volume>353</volume>
            <issue>4</issue>
            <fpage>911</fpage>
            <lpage>923</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2005.08.067</pubid>
                  <pubid idtype="pmpid" link="fulltext">16198373</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Mimicking cellular sorting improves prediction of subcellular localization</p>
            </title>
            <aug>
               <au>
                  <snm>Nair</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2005</pubdate>
            <volume>348</volume>
            <issue>1</issue>
            <fpage>85</fpage>
            <lpage>100</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2005.02.025</pubid>
                  <pubid idtype="pmpid" link="fulltext">15808855</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>An SVM-based system for predicting protein subnuclear localizations</p>
            </title>
            <aug>
               <au>
                  <snm>Lei</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>291</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1325059</pubid>
                  <pubid idtype="pmpid" link="fulltext">16336650</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-291</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>The Gene Ontology (GO) database and informatics resource</p>
            </title>
            <aug>
               <au>
                  <snm>Harris</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ireland</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lomax</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ashburner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Foulger</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Eilbeck</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Mungall</snm>
                  <fnm>C</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D258</fpage>
            <lpage>D261</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308770</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681407</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh066</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <url>http://www.geneontology.org/</url>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Semantic similarity measures as tools for exploring the gene ontology</p>
            </title>
            <aug>
               <au>
                  <snm>Lord</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Stevens</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Brass</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Goble</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2003</pubdate>
            <fpage>601</fpage>
            <lpage>612</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12603061</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Lord</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Stevens</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Brass</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Goble</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>10</issue>
            <fpage>1275</fpage>
            <lpage>1283</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg153</pubid>
                  <pubid idtype="pmpid" link="fulltext">12835272</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Gene functional similarity search tool (GFSST)</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sheng</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Russo</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Osborne</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Buetow</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>135</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1421445</pubid>
                  <pubid idtype="pmpid" link="fulltext">16536867</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-135</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Visualizing and Distances Using GO</p>
            </title>
            <aug>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <pubdate>2005</pubdate>
            <url>http://www.bioconductor.org/repository/devel/vignette/GOvis.pdf</url>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Prediction of functional modules based on comparative genome analysis and Gene Ontology application</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Su</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Mao</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Olman</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>9</issue>
            <fpage>2822</fpage>
            <lpage>2837</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1130488</pubid>
                  <pubid idtype="pmpid" link="fulltext">15901854</pubid>
                  <pubid idtype="doi">10.1093/nar/gki573</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Guo</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>DY</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>7</issue>
            <fpage>2137</fpage>
            <lpage>2150</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1449908</pubid>
                  <pubid idtype="pmpid" link="fulltext">16641319</pubid>
                  <pubid idtype="doi">10.1093/nar/gkl219</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Predicting protein localization in budding yeast</p>
            </title>
            <aug>
               <au>
                  <snm>Chou</snm>
                  <fnm>KC</fnm>
               </au>
               <au>
                  <snm>Cai</snm>
                  <fnm>YD</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>7</issue>
            <fpage>944</fpage>
            <lpage>950</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti104</pubid>
                  <pubid idtype="pmpid" link="fulltext">15513989</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The Nuclear Protein Database (NPD): sub-nuclear localisation and functional annotation of the nuclear proteome</p>
            </title>
            <aug>
               <au>
                  <snm>Dellaire</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Farrall</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bickmore</snm>
                  <fnm>WA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>328</fpage>
            <lpage>330</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165465</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520015</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg018</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>PROSET &#8211; a fast procedure to create non-redundant sets of protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Brendel</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Mathl Comput Modelling</source>
            <pubdate>1992</pubdate>
            <volume>16</volume>
            <fpage>37</fpage>
            <lpage>43</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0895-7177(92)90150-J</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Comparison of the predicted and observed secondary structure of T4 phage lysozyme</p>
            </title>
            <aug>
               <au>
                  <snm>Matthews</snm>
                  <fnm>BW</fnm>
               </au>
            </aug>
            <source>Biochim Biophys Acta</source>
            <pubdate>1975</pubdate>
            <volume>405</volume>
            <issue>2</issue>
            <fpage>442</fpage>
            <lpage>451</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1180967</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Prediction of protein secondary structure at better than 70% accuracy</p>
            </title>
            <aug>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1993</pubdate>
            <volume>232</volume>
            <issue>2</issue>
            <fpage>584</fpage>
            <lpage>599</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1993.1413</pubid>
                  <pubid idtype="pmpid" link="fulltext">8345525</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <url>http://www.pir.uniprot.org/database/download.shtml</url>
         </bibl>
         <bibl id="B31">
            <url>http://www.ebi.ac.uk/ego/</url>
         </bibl>
         <bibl id="B32">
            <url>http://wolfpsort.seq.cbrc.jp/</url>
         </bibl>
         <bibl id="B33">
            <url>http://www.cs.ualberta.ca/~bioinfo/PA/Sub/</url>
         </bibl>
         <bibl id="B34">
            <url>http://bioinformatics.albany.edu/~ptarget</url>
         </bibl>
      </refgrp>
   </bm>
</art>
