<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-S15-S8</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Proceedings</dochead>
      <bibl>
         <title>
            <p>Protein subcellular localization prediction of eukaryotes using a knowledge-based approach</p>
         </title>
         <aug>
            <au id="A1"><snm>Lin</snm><fnm>Hsin-Nan</fnm><insr iid="I1"/><insr iid="I2"/><insr iid="I3"/><email>arith@iis.sinica.edu.tw</email></au>
            <au id="A2"><snm>Chen</snm><fnm>Ching-Tai</fnm><insr iid="I1"/><insr iid="I2"/><insr iid="I3"/><email>caster@iis.sinica.edu.tw</email></au>
            <au id="A3"><snm>Sung</snm><fnm>Ting-Yi</fnm><insr iid="I2"/><email>tsung@iis.sinica.edu.tw</email></au>
            <au id="A4"><snm>Ho</snm><fnm>Shinn-Ying</fnm><insr iid="I3"/><email>syho@mail.nctu.edu.tw</email></au>
            <au ca="yes" id="A5"><snm>Hsu</snm><fnm>Wen-Lian</fnm><insr iid="I2"/><email>hsu@iis.sinica.edu.tw</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan, Republic of China</p></ins>
            <ins id="I2"><p>Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan, Republic of China</p></ins>
            <ins id="I3"><p>Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan, Republic of China</p></ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <supplement>
            <title>
               <p>Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics</p>
            </title>
            <editor>Shoba Ranganathan, Frank Eisenhaber, Joo Chuan Tong and Tin Wee Tan</editor>
            <note>Proceedings</note>
         </supplement>
         <conference>
            <title>
               <p>Asia Pacific Bioinformatics Network (APBioNet) Eighth International Conference on Bioinformatics (InCoB2009)</p>
            </title>
            <location>Singapore</location>
            <date-range>7-11 September 2009</date-range>
            <url>http://incob.apbionet.org/incob09/</url>
         </conference>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>Suppl 15</issue>
         <fpage>S8</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/S15/S8</url>
         <xrefbib><pubidlist><pubid idtype="pmpid">19958518</pubid><pubid idtype="doi">10.1186/1471-2105-10-S15-S8</pubid></pubidlist></xrefbib>
      </bibl>
      <history><pub><date><day>3</day><month>12</month><year>2009</year></date></pub></history>
      <cpyrt><year>2009</year><collab>Lin et al; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most of the PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized into multiple subcellular organelles. Many studies have shown that proteins may simultaneously locate or move between different cellular compartments and be involved in different biological processes with different roles.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this study, we propose a knowledge-based method, called KnowPred<sub>site</sub>, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on local similarity, we identify the "related sequences" used for prediction. We construct a knowledge base to record the possible sequence variations of protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which covers ten distinct subcellular organelles from 1923 species, and performed ten-fold cross-validation experiments to evaluate KnowPred<sub>site</sub>'s performance. The experimental results show that KnowPred<sub>site </sub>achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPred<sub>site </sub>is 91.7%. For multi-localized proteins, the overall accuracy of KnowPred<sub>site </sub>is 72.1%, which is significantly higher than that of ngLOC (by 12.4%). Notably, half of the proteins in the dataset for which no Blast hit above a specified threshold can be found are still correctly predicted by KnowPred<sub>site</sub>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>KnowPred<sub>site </sub>demonstrates the power of identifying related sequences in the knowledge base. The experimental results show that local similarity is effective for prediction even when the overall sequence similarity is low, and that KnowPred<sub>site </sub>is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPred<sub>site </sub>is transparent and biologically interpretable, since it shows the set of template sequences used to generate the prediction result. The KnowPred<sub>site </sub>prediction server is available at <url>http://bio-cluster.iis.sinica.edu.tw/kbloc/</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Protein subcellular localization (PSL) is important to elucidate protein functions as proteins cooperate towards a common function in the same subcellular compartment <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. It is also essential to annotate genomes, to design proteomics experiments, and to identify potential diagnostic, drug and vaccine targets <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Determining the localization sites of a protein through experiments can be time-consuming and labor-intensive. With the large number of sequences that continue to emerge from the genome sequencing projects, computational methods for protein subcellular localization at a proteome scale become increasingly important.</p>
         <p>Most existing PSL predictors are based on machine learning algorithms. They can be categorized by the feature sets used for building prediction models. A group of methods use features derived from primary sequence <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>; some utilize various biological features extracted from literature or public databases <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Other features are also used in different methods, e.g., phylogenetic profiling <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, domain projection <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, sequence homology <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, and compartment-specific features <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>.</p>
         <p>A simple and reliable way to predict a localization site is to inherit the subcellular localization of homologous proteins. Therefore, in <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> a hybrid method was proposed, which combined an SVM-based method with a sequence comparison tool that finds homologues to improve the performance. However, some homologous proteins are similar in structure but not in sequence. For example, the sequence identity between proteins <it>1aab </it>and <it>1j46 </it>is only 16.7%, but they are structurally homologous and classified into the same family (<it>HMG-box</it>) in the SCOP classification. For such cases, it is difficult to discover the homologous relationship using sequence comparison methods. Profile-profile alignment methods <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> are capable of identifying remote homology; nevertheless, they are relatively slow.</p>
         <p>Most PSL prediction systems are designed specifically for single-localized proteins. A significant number of eukaryotic proteins are, however, known to be localized into multiple subcellular organelles <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. In fact, proteins may simultaneously locate or move between different cellular compartments and be involved in different biological processes with different roles. Such proteins may account for a high proportion of all proteins, even more than 35% <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. In addition, the majority of existing computational methods have the following disadvantages <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>: 1) they only predict a limited number of locations; 2) they are limited to subsets of proteomes which contain signal peptide sequences or for which prior structural/functional information is available; 3) the datasets used for training are for specific species and are not sufficiently robust to represent entire proteomes. Thus, most of the computational methods are not sufficient for proteome-wide prediction of PSL across various species.</p>
         <p>Thus, in this study, we propose a knowledge-based approach, called KnowPred<sub>site</sub>, which uses local sequence similarity to find useful proteins as templates for site prediction of the query protein. It is designed to predict the localization site(s) of single- and multi-localized proteins and is applicable to proteome-wide prediction. Furthermore, it requires only protein sequence information; no functional or structural information is needed. Notably, prediction results can be explained by the template proteins that are used to vote for the localization sites. The knowledge-based prediction scheme has been shown to be effective in predicting protein secondary structure <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> and local structure <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. To evaluate our knowledge-based site prediction method, we used the ngLOC dataset <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> to perform ten-fold cross validation and to compare with existing methods. The dataset consists of ten subcellular proteomes from 1923 species with single- and multi-localized proteins. KnowPred<sub>site </sub>achieved 91.7% accuracy for single-localized proteins and 72.1% accuracy, with both sites correctly predicted, for multi-localized proteins.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>The main idea behind KnowPred<sub>site</sub></p>
            </st>
            <p>KnowPred<sub>site </sub>predicts PSL based on a knowledge base, which is constructed to capture the local sequence similarity of two proteins even when their sequence identity is less than 25%. However, such local similarity is difficult to detect with traditional alignment algorithms because of the low sequence similarity. Therefore, we adopt the transitivity relationship, which was first used in <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> for clustering protein sequences, to capture local similarity between protein sequences. Transitivity refers to deducing a possible similarity between protein <it>A </it>and protein <it>C </it>from the existence of a third protein <it>B </it>such that <it>A </it>and <it>B</it>, as well as <it>B </it>and <it>C</it>, are homologues, i.e., the sequence identity between <it>A </it>and <it>B </it>and that between <it>B </it>and <it>C </it>are both above a predefined threshold. Figure <figr fid="F1">1(a)</figr> shows an example of the transitivity relationship among protein <it>A</it>, protein <it>B</it>, and protein <it>C</it>. Protein <it>A </it>and protein <it>B </it>share sequence identity of 34%, and protein <it>B </it>and protein <it>C </it>share sequence identity of 27%, whereas protein <it>A </it>and protein <it>C </it>only share sequence identity of 12%. Using the transitivity relationship, the remote homologous relationship and local similarity between protein <it>A </it>and protein <it>C </it>can be detected.</p>
            <fig id="F1"><title><p>Figure 1</p></title><caption><p>Two different transitivity relationships</p></caption><text>
   <p><b>Two different transitivity relationships</b>. (a) Protein <it>A </it>and protein <it>B </it>share sequence identity of 34%, and protein <it>B </it>and protein <it>C </it>share sequence identity of 27%, whereas protein <it>A </it>and protein <it>C </it>only share sequence identity of 12%. We infer the homologous relationship between protein <it>A </it>and protein <it>C </it>through protein <it>B</it>. (b) Protein <it>A </it>and protein <it>C </it>are aligned with protein <it>B1 </it>and protein <it>B2</it>. Because the peptide fragments of <it>B1 </it>and <it>B2 </it>enclosed by the rectangles are identical, the two corresponding peptide fragments of <it>A </it>and <it>C </it>are considered to be similar.</p>
</text><graphic file="1471-2105-10-S15-S8-1"/></fig>
            <p>In this paper, we apply the transitivity concept to peptide fragments instead of whole protein sequences to obtain local similarities between remote homologues. Protein <it>A </it>and protein <it>C </it>share local similarity if there is a peptide fragment <it>similar </it>(the formal definition of peptide similarity is given in the next subsection) to subsequences in protein <it>A </it>and protein <it>C</it>. Figure <figr fid="F1">1(b)</figr> illustrates the idea, in which protein <it>A </it>and <it>C </it>are aligned with protein <it>B1 </it>and protein <it>B2 </it>(<it>B1 </it>and <it>B2 </it>can be identical, homologous or non-homologous). If there is a peptide fragment shared by both <it>B1 </it>and <it>B2</it>, the corresponding peptide fragments in protein <it>A </it>and protein <it>C </it>are inferred to be locally similar. The shared peptide may represent a possible sequence variation in evolution. Moreover, if protein <it>A </it>and protein <it>C </it>are remotely homologous, there are likely to be more "shared" sequence fragments in different protein <it>B</it>'s that characterize their similarity. However, not all such proteins <it>A </it>and <it>C </it>which share local similarity are homologous. Some local similarities may arise without common ancestry. Short sequences may be similar by chance, and sequences may be similar because both are selected to bind to a particular protein. To avoid ambiguity, we refer to such proteins <it>A </it>and <it>C </it>which share local similarity as "<it>related sequences</it>".</p>
         </sec>
         <sec>
            <st>
               <p>Construction of the knowledge base <it>SPKB</it></p>
            </st>
            <p>Given a dataset of proteins with known localization sites, we construct a knowledge base, called <it>Similar-Peptide Knowledge Base </it>(<it>SPKB </it>for short). The dataset used to construct <it>SPKB </it>is described in the Results section. To construct the knowledge base, we first extract fixed-length peptide fragments from the native sequence of each protein in the dataset using a sliding window of length <it>w</it>. Each peptide sequence, together with its protein source and localization site information, is stored in <it>SPKB</it>. Since the performance of knowledge-based methods relies on the size of the knowledge base, we then perform a PSI-BLAST search with parameters <it>j </it>= 3, <it>e </it>= 0.001 for each protein in the dataset against the NCBInr database to find similar sequences. Since the NCBInr database contains only protein sequence information, the localization annotation of peptides generated from similar sequences is determined as follows. Given a query protein sequence <it>q</it>, PSI-BLAST generates a large number of significant local pairwise alignments called <it>high-scoring segment pairs </it>(HSPs) between <it>q </it>and its similar proteins. An example of an HSP is shown in Figure <figr fid="F2">2</figr>. Statistically significant BLAST hits usually signify sequence homology. We assume that in an HSP, the <it>similar peptide sequences </it>in the counterpart sequence (denoted by "<it>Sbjct</it>") represent the possible sequence variations of the corresponding peptide in the query (denoted by "<it>Query</it>"), i.e., the protein <it>q</it>. We use the same sliding window of length <it>w </it>to generate all peptide fragments in each HSP. Two amino acids aligned together in an HSP are said to be <it>interchangeable </it>if they have a positive score in the BLOSUM62 matrix (an interchangeable residue pair is represented as an amino acid letter or a plus symbol in an HSP). 
The number of interchangeable amino acid pairs within a sliding window represents the <it>similarity level </it>of the two peptide fragments. A peptide in <it>Sbjct </it>is called a <it>similar peptide </it>if at least <it>k </it>of its residues are interchangeable with those of the corresponding peptide in <it>Query</it>. A similar peptide signifies local sequence similarity between <it>Sbjct </it>and <it>Query </it>and is therefore assigned the localization annotation of the protein <it>q</it>.</p>
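            <p>The window-based extraction of similar peptides can be sketched in Python as follows. This is a minimal illustration and not the authors' C++ implementation: the function name <it>similar_peptides</it>, the representation of an HSP as three equal-length strings (Query, midline, Sbjct), and the decision to skip windows containing gaps are all assumptions made for the example. It relies only on the convention stated above that an interchangeable pair appears as a letter or a plus symbol in the HSP midline.</p>
            <p>
```python
def similar_peptides(query, midline, sbjct, w=7, k=6):
    """Return (query_peptide, sbjct_peptide, similarity_level) for each window.

    query, midline and sbjct are the three aligned rows of one HSP.
    A non-blank midline character (letter or '+') marks an interchangeable pair.
    """
    if not (len(query) == len(midline) == len(sbjct)):
        raise ValueError("aligned rows of an HSP must have equal length")
    results = []
    for start in range(len(query) - w + 1):
        q = query[start:start + w]
        s = sbjct[start:start + w]
        m = midline[start:start + w]
        # Assumption for this sketch: windows that contain alignment gaps are skipped.
        if "-" in q or "-" in s:
            continue
        # Similarity level: number of interchangeable residue pairs in the window.
        level = sum(1 for ch in m if ch != " ")
        if level >= k:
            results.append((q, s, level))
    return results


# The Figure 2 example: MYSKILL against MYKKILY has similarity level 5.
example = similar_peptides("MYKKILY", "MY KIL ", "MYSKILL", w=7, k=5)
```
            </p>
            <p>With the threshold <it>k </it>= 5 the example window is kept; with a stricter threshold of 6 it would be discarded, which illustrates the trade-off between the number and the reliability of similar peptides.</p>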
            <fig id="F2"><title><p>Figure 2</p></title><caption><p>A real example of HSP found by PSI-BLAST</p></caption><text>
   <p><b>A real example of HSP found by PSI-BLAST</b>. We define that MYSKILL (assuming that the window size is 7) is a similar peptide of MYKKILY and we treat it as an extended sequence feature of the query protein. The similarity level of MYSKILL and MYKKILY is 5 since there are five interchangeable residue pairs within that window. We can generate multiple similar peptides from protein gi|2622094 (Sbjct) for the query protein.</p>
</text><graphic file="1471-2105-10-S15-S8-2"/></fig>
            <p>By performing PSI-BLAST searches for all proteins in the dataset, we can generate a huge number, possibly several million, of similar peptides with localization annotation. Each record in the knowledge base is indexed by a similar peptide, and stores its similar peptide sequences and protein sources (those that are used as query proteins in the PSI-BLAST searches), similarity level and localization site information (inferred from the corresponding protein sources). Note that a similar peptide may occur multiple times in different HSPs of a single PSI-BLAST search result, i.e., it may be derived from different similar proteins found in the PSI-BLAST search. We cluster these occurrences together and store the frequency in the peptide record. Table <tblr tid="T1">1</tblr> shows a record of the similar peptide MYSKILL (assuming that the window size is 7), which is generated by performing PSI-BLAST searches on three proteins (<it>A</it>, <it>B</it>, and <it>C</it>) with known localization sites. The frequencies of MYSKILL in the PSI-BLAST search results of proteins <it>A</it>, <it>B</it>, and <it>C </it>are 21, 12, and 17, respectively. The localization site information is inherited from the three protein sources.</p>
            <tbl id="T1"><title><p>Table 1</p></title><caption><p>A similar peptide example.</p></caption><tblbdy cols="5">
      <r>
         <c ca="left" cspan="5">
            <p>
               <b>Similar Peptide: MYSKILL</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Protein Source</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Localization Sites</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Native Peptide Sequence</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Similarity Level</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Frequency</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>A</it>
            </p>
         </c>
         <c ca="left">
            <p>Cytoplasm</p>
         </c>
         <c ca="left">
            <p>MYKKILY</p>
         </c>
         <c ca="left">
            <p>5</p>
         </c>
         <c ca="left">
            <p>21</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>B</it>
            </p>
         </c>
         <c ca="left">
            <p>Nuclear</p>
         </c>
         <c ca="left">
            <p>MYSSIIL</p>
         </c>
         <c ca="left">
            <p>4</p>
         </c>
         <c ca="left">
            <p>12</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <it>C</it>
            </p>
         </c>
         <c ca="left">
            <p>Cytoplasm Extracellular</p>
         </c>
         <c ca="left">
            <p>MYSSILY</p>
         </c>
         <c ca="left">
            <p>5</p>
         </c>
         <c ca="left">
            <p>17</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Three protein sources with known localization sites contain peptides that are aligned with and similar to the peptide MYSKILL in their HSPs. The similarity level indicates the number of amino acid pairs that are interchangeable between the native peptide sequence and the similar peptide sequence. The frequency is the number of times the two peptides are aligned in HSPs.</p>
   </tblfn></tbl>
         </sec>
         <sec>
            <st>
               <p>KnowPred<sub>site</sub>: a localization prediction method using <it>SPKB</it></p>
            </st>
            <p>The main idea of KnowPred<sub>site </sub>is illustrated in Figure <figr fid="F3">3</figr>. Given a target protein <it>t </it>whose localization annotation is unknown and is to be predicted, we perform a PSI-BLAST search and use the same procedure as described in the previous subsection for knowledge base construction to generate all similar peptides of <it>t </it>and their frequencies from its native sequence and HSPs. Each similar peptide <it>hp </it>is then matched against <it>SPKB</it>, and the peptide record with index <it>hp </it>is called a <it>hit</it>. Two types of scores are associated with each localization site <it>i</it>: the voting score <it>s</it><sub><it>i</it></sub>, calculated for each hit, and the confidence score <it>CS</it>(<it>i</it>), aggregated over all hits. The voting score <it>s</it><sub><it>i </it></sub>is calculated as follows. Let <it>f </it>denote the frequency of <it>hp </it>in all of <it>t</it>'s HSPs. For the hit record in <it>SPKB</it>, we calculate the score <it>w</it><sub><it>i </it></sub>associated with each localization site by summing the frequencies of the protein sources that are annotated with that site. For example, for the peptide record MYSKILL shown in Table <tblr tid="T1">1</tblr>, the score of cytoplasm is 38 (21+17, since protein sources <it>A </it>and <it>C </it>are both localized into cytoplasm), and those of nuclear and extracellular are 12 and 17, respectively. The voting score <it>s</it><sub><it>i </it></sub>is then defined as <it>f </it>multiplied by (<it>w</it><sub><it>i</it></sub>/total frequency in that record). For example, if MYSKILL is a similar peptide of <it>t </it>and its frequency is 10 in <it>t's </it>HSPs, then the voting scores of cytoplasm, nuclear, and extracellular are 7.6 (=10 &#215; 38/50), 2.4 (=10 &#215; 12/50), and 3.4 (=10 &#215; 17/50), respectively, while those of all other localization sites are 0.</p>
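            <p>The voting-score arithmetic of the MYSKILL example can be reproduced with a short Python sketch. The record layout (a list of pairs of localization sites and frequency) and the function names <it>site_weights </it>and <it>voting_scores </it>are hypothetical choices made for illustration; only the numbers come from Table 1 and the worked example in the text.</p>
            <p>
```python
from collections import defaultdict


def site_weights(record):
    """Sum the frequencies of all protein sources annotated with each site (w_i)."""
    weights = defaultdict(int)
    for sites, frequency in record:
        for site in sites:
            weights[site] += frequency
    return dict(weights)


def voting_scores(record, f):
    """Voting score s_i = f * w_i / (total frequency stored in the record)."""
    total = sum(frequency for _, frequency in record)
    return {site: f * w / total for site, w in site_weights(record).items()}


# Record of the similar peptide MYSKILL from Table 1:
# (localization sites of the protein source, frequency).
MYSKILL_RECORD = [
    (("Cytoplasm",), 21),                  # protein A
    (("Nuclear",), 12),                    # protein B
    (("Cytoplasm", "Extracellular"), 17),  # protein C
]

# Frequency of MYSKILL in the HSPs of the target protein t.
scores = voting_scores(MYSKILL_RECORD, f=10)
```
            </p>
            <p>Because a multi-localized source such as protein <it>C </it>contributes its frequency to every site it is annotated with, the voting scores of one hit may sum to more than <it>f</it>.</p>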
            <fig id="F3"><title><p>Figure 3</p></title><caption><p>The main algorithm of KnowPred<sub>site</sub></p></caption><text>
   <p><b>The main algorithm of KnowPred<sub>site</sub></b>.</p>
</text><graphic file="1471-2105-10-S15-S8-3"/></fig>
            <p>The localization site prediction of the protein <it>t </it>is determined by the confidence score <it>CS</it>(<it>i</it>), which is the total voting score of site <it>i </it>aggregated over all hit records. Each <it>CS</it>(<it>i</it>) is then divided by the sum of the frequencies <it>f </it>of all of <it>t'</it>s hits and multiplied by 100, which normalizes the confidence score to the range from 0 to 100. KnowPred<sub>site </sub>predicts <it>t </it>to be localized into the site with the highest confidence score for single-localized proteins, or into the sites with the two highest confidence scores for multi-localized proteins (all multi-localized proteins in the ngLOC dataset have two localization sites).</p>
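            <p>The aggregation and normalization of confidence scores might be sketched as below. The input format (one pair of frequency and voting-score dictionary per hit) and the function names <it>confidence_scores </it>and <it>predict_sites </it>are assumptions of this example, and the second hit in the usage lines is invented purely to show the aggregation.</p>
            <p>
```python
from collections import defaultdict


def confidence_scores(hits):
    """Aggregate voting scores over all hits and normalize to the range 0 to 100.

    hits is a list of (f, voting) pairs, where f is the frequency of the
    similar peptide in the target's HSPs and voting maps site to voting score.
    """
    totals = defaultdict(float)
    total_f = 0
    for f, voting in hits:
        total_f += f
        for site, score in voting.items():
            totals[site] += score
    if total_f == 0:
        return {}  # no hit in the knowledge base, so no site can be scored
    return {site: 100.0 * score / total_f for site, score in totals.items()}


def predict_sites(cs, n_sites=1):
    """Return the n_sites localization sites with the highest confidence scores."""
    return sorted(cs, key=cs.get, reverse=True)[:n_sites]


# Two hypothetical hits: the MYSKILL hit from the text and an invented second hit.
hits = [
    (10, {"Cytoplasm": 7.6, "Nuclear": 2.4, "Extracellular": 3.4}),
    (10, {"Nuclear": 10.0}),
]
cs = confidence_scores(hits)
single = predict_sites(cs, n_sites=1)  # single-localized prediction
double = predict_sites(cs, n_sites=2)  # multi-localized prediction
```
            </p>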
            <p>To differentiate single-localized proteins from those that are multi-localized, we followed King and Guda's method <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> to calculate the multi-localized confidence score (<it>MLCS</it>) associated with a protein <it>t</it>, which gives a relative measure of the likelihood that the protein <it>t </it>is multi-localized. It is derived from the two highest confidence scores (denoted as <it>CS</it><sub>1 </sub>and <it>CS</it><sub>2</sub>) and is defined as follows:</p>
            <p>
               <display-formula>
                  <graphic file="1471-2105-10-S15-S8-i1.gif"/>
               </display-formula>
            </p>
            <p>and <it>MLCS</it>(<it>t</it>) is bounded by 100, i.e., when the calculated <it>MLCS</it>(<it>t</it>) is over 100, it is assigned 100.</p>
         </sec>
         <sec>
            <st>
               <p>BLAST-hit prediction method</p>
            </st>
            <p>Since BLAST is the most popular method for sequence comparison, we implemented a simple prediction method based on the BLAST search result. Given a dataset of proteins with known localization site(s), to predict the localization site(s) of a test protein <it>t </it>we first perform a BLAST search against the dataset and then assign the localization annotations of the best BLAST hit to the protein <it>t</it>. If there is no hit at the e-value cutoff of 0.001, no annotation is assigned to the protein <it>t</it>. As reported by Jones and Swindells, an e-value of 0.001 generally yields a safe search <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The performance of a BLAST-based prediction method is usually treated as the baseline against which other methods are compared <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
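            <p>The decision rule of this baseline can be sketched as follows, assuming the BLAST search has already been run and its hits parsed into pairs of subject identifier and e-value. The function name <it>blast_hit_predict</it>, the input format, the choice of the lowest e-value as the "best" hit, and the treatment of an e-value exactly equal to the cutoff as acceptable are assumptions of this sketch, not details given in the text.</p>
            <p>
```python
def blast_hit_predict(hits, annotations, cutoff=0.001):
    """Transfer the localization annotation of the best BLAST hit, if any.

    hits is a list of (subject_id, evalue) pairs from a search against the
    annotated dataset; annotations maps subject_id to a tuple of sites.
    Returns None when no hit satisfies the e-value cutoff.
    """
    if not hits:
        return None
    # Assumption: the best hit is the one with the lowest e-value.
    best_id, best_evalue = min(hits, key=lambda hit: hit[1])
    if best_evalue > cutoff:
        return None
    return annotations.get(best_id)


# Hypothetical identifiers and values, for illustration only.
annotations = {"protA": ("Cytoplasm",), "protC": ("Cytoplasm", "Extracellular")}
predicted = blast_hit_predict([("protA", 1e-5), ("protC", 1e-20)], annotations)
unassigned = blast_hit_predict([("protA", 0.5)], annotations)
```
            </p>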
         </sec>
         <sec>
            <st>
               <p>Evaluation measure</p>
            </st>
            <p>The performance is estimated using the following measures. To assess the performance for each localization site, precision, accuracy and the Matthews correlation coefficient (<it>MCC</it>) are calculated by Equations (1) to (3), respectively. The overall accuracy is defined in Equation (4).</p>
            <p>
               <display-formula id="M1">
                  <graphic file="1471-2105-10-S15-S8-i2.gif"/>
               </display-formula>
            </p>
            <p>
               <display-formula id="M2">
                  <graphic file="1471-2105-10-S15-S8-i3.gif"/>
               </display-formula>
            </p>
            <p>
               <display-formula id="M3">
                  <graphic file="1471-2105-10-S15-S8-i4.gif"/>
               </display-formula>
            </p>
            <p>
               <display-formula id="M4">
                  <graphic file="1471-2105-10-S15-S8-i5.gif"/>
               </display-formula>
            </p>
            <p>where <it>TP</it><sub><it>i</it></sub>, <it>TN</it><sub><it>i</it></sub>, <it>FP</it><sub><it>i</it></sub>, <it>FN</it><sub><it>i</it></sub>, and <it>N</it><sub><it>i </it></sub>are, respectively, the number of true positives, true negatives, false positives, false negatives, and proteins in localization site <it>i</it>. <it>MCC</it>, which considers both under- and over-predictions, provides a complementary measure of the predictive performance, where <it>MCC </it>= 1 indicates a perfect prediction, <it>MCC </it>= 0 indicates a completely random assignment, and <it>MCC </it>= -1 indicates a perfectly reverse correlation.</p>
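            <p>As an illustration of the per-site measures, the sketch below computes precision and <it>MCC </it>from the four counts. Since Equations (1) to (4) are available here only as images, the code uses the standard textbook definitions of these two measures, which is an assumption; the per-site and overall accuracy are omitted because their exact form cannot be confirmed from the text. Returning 0 when a denominator vanishes is likewise a convention chosen for this example.</p>
            <p>
```python
import math


def precision(tp, fp):
    """Standard precision: TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp, tn, fp, fn):
    """Standard Matthews correlation coefficient for one localization site."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention for this sketch: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


perfect = mcc(tp=50, tn=50, fp=0, fn=0)        # perfect prediction
random_like = mcc(tp=25, tn=25, fp=25, fn=25)  # no correlation
reverse = mcc(tp=0, tn=0, fp=50, fn=50)        # perfectly reversed prediction
```
            </p>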
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>KnowPred<sub>site </sub>was implemented as a parallel program in C++ with the MPICH library under the Linux environment. We used the ngLOC dataset <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> to construct the knowledge base and to test the performance of KnowPred<sub>site</sub>. The dataset is compiled from 1923 different species and contains 28056 protein sequences (listed in Additional file <supplr sid="S1">1</supplr>), including 25887 single-localized proteins and 2169 multi-localized proteins. These proteins cover ten different subcellular locations: Cytoplasm (CYT), Cytoskeleton (CSK), Endoplasmic Reticulum (END), Extracellular (EXC), Golgi Apparatus (GOL), Lysosome (LYS), Mitochondria (MIT), Nuclear (NUC), Plasma Membrane (PLA), and Peroxisome (POX).</p>
         <suppl id="S1">
            <title>
               <p>Additional file 1</p>
            </title>
            <text>
               <p><b>ngLOC dataset</b>. The file contains the whole ngLOC dataset. Each row starting with '&gt;' gives a protein name, and the following row gives its localization site. The localization sites are numbered from 1 to 10, denoting Cytoplasm (CYT), Cytoskeleton (CSK), Endoplasmic Reticulum (END), Extracellular (EXC), Golgi Apparatus (GOL), Lysosome (LYS), Mitochondria (MIT), Nuclear (NUC), Plasma Membrane (PLA), and Peroxisome (POX). The ngLOC dataset can also be downloaded via <url>http://bio-cluster.iis.sinica.edu.tw/kbloc/DataSet.htm</url>.</p>
            </text>
            <file name="1471-2105-10-S15-S8-S1.txt">
   <p>Click here for file</p>
</file>
         </suppl>
         <p>We conducted two types of experiments on the dataset. First, in order to take advantage of local similarities from as many proteins as possible, we conducted a leave-one-out cross-validation experiment to determine the parameters and to evaluate the performance of KnowPred<sub>site</sub>. In this experiment, each protein was used in turn as the test protein and the remaining 28055 proteins were used to construct the knowledge base. Second, we compared the performance of KnowPred<sub>site </sub>with existing methods. Since the dataset is from ngLOC, and ngLOC has been shown to be better than PSORT <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, pTARGET <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and PLOC <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> on the same dataset, we directly compared KnowPred<sub>site </sub>against ngLOC using ten-fold cross validation. In this experiment, all proteins were partitioned into 10 subsets; each subset was used in turn as the test set and the remaining nine subsets were used to construct the knowledge base.</p>
         <sec>
            <st>
               <p>Determining window size <it>w </it>and similarity threshold <it>k </it>for KnowPred<sub>site</sub></p>
            </st>
            <p>KnowPred<sub>site </sub>aims to utilize the localization annotations of similar peptides. Which peptides count as similar depends on the window size <it>w </it>and the similarity-level threshold <it>k</it>, and this choice can affect the performance of KnowPred<sub>site</sub>. With a smaller <it>w</it>, peptides are more likely to find matches in the knowledge base; however, shorter peptides are also likely to appear in many unrelated proteins. For a fixed <it>w</it>, there is also a trade-off in choosing <it>k</it>: a smaller <it>k </it>produces looser similarity relations, which yields more, but less reliable, similar peptides. To select appropriate values of <it>w </it>and <it>k</it>, we conducted leave-one-out cross validation experiments on only the single-localized proteins in the ngLOC dataset, for <it>w </it>ranging from 3 to 11 and <it>k </it>ranging from 0 to <it>w</it>.</p>
            <p>Figure <figr fid="F4">4</figr> shows the overall accuracies of KnowPred<sub>site </sub>for different window sizes <it>w </it>with a fixed similarity threshold (<it>k </it>= 0). It indicates that the appropriate window size is 7 or 8. We then investigated the performance for different similarity-level thresholds. Table <tblr tid="T2">2</tblr> shows that the overall accuracies range from 90.9% to 92.0% over all combinations of window size (<it>w </it>= 7, 8) and similarity threshold. Based on these results, we chose <it>w </it>= 7 and <it>k </it>= 6 for the following experiments because this combination provided the best accuracy, 92.0%.</p>
            <tbl id="T2"><title><p>Table 2</p></title><caption><p>The overall accuracies using different thresholds of similarity levels for window sizes 7 and 8.</p></caption><tblbdy cols="10">
      <r>
         <c ca="left">
            <p>
               <b>Similarity Level Threshold <it>k</it></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>0</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>1</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>2</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>3</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>4</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>5</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>6</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>7</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>8</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>Overall Accuracy </it>(%)</p>
            <p>w = 7</p>
         </c>
         <c ca="center">
            <p>91.2</p>
         </c>
         <c ca="center">
            <p>91.2</p>
         </c>
         <c ca="center">
            <p>91.3</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.5</p>
         </c>
         <c ca="center">
            <p>91.8</p>
         </c>
         <c ca="center">
            <p>92.0</p>
         </c>
         <c ca="center">
            <p>91.6</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><it>Overall Accuracy </it>(%)</p>
            <p>w = 8</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.4</p>
         </c>
         <c ca="center">
            <p>91.5</p>
         </c>
         <c ca="center">
            <p>91.6</p>
         </c>
         <c ca="center">
            <p>91.7</p>
         </c>
         <c ca="center">
            <p>90.9</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The combination of w = 7 and k = 6 provides the best accuracy. Some entries show identical overall accuracies because the values are rounded to one decimal place.</p>
   </tblfn></tbl>
            <fig id="F4"><title><p>Figure 4</p></title><caption><p>The overall accuracies of KnowPred<sub>site </sub>for different lengths of similar peptides</p></caption><text>
   <p><b>The overall accuracies of KnowPred<sub>site </sub>for different lengths of similar peptides</b>.</p>
</text><graphic file="1471-2105-10-S15-S8-4"/></fig>
         </sec>
         <sec>
            <st>
               <p>Prediction performance of KnowPred<sub>site</sub></p>
            </st>
            <p>After the best parameters had been determined, we conducted a ten-fold cross validation experiment on the entire dataset to compare KnowPred<sub>site </sub>with ngLOC and Blast-hit prediction. We used the top <it>N </it>accuracy for evaluation, where <it>N </it>ranges from 1 to 4. A protein is considered correctly predicted when its real localization site(s) rank among the top <it>N </it>predicted sites. (Top 1 accuracy is simply the <it>Accuracy </it>defined in Equation (4).) Notably, for multi-localized proteins, the accuracy is measured in two ways: first, with at least one site correctly predicted, and second, with both sites correctly predicted. Under the first measurement, a true positive is a multi-localized protein with at least one localization site correctly predicted, whereas under the second measurement a true positive is a multi-localized protein with both sites correctly predicted.</p>
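            <p>The top <it>N </it>evaluation described above, including the "at least one correct" and "both correct" criteria for multi-localized proteins, can be illustrated with the following minimal Python sketch. The function names, the dictionary-based score representation, and the tie-breaking behaviour (ties keep the insertion order of the score dictionary) are assumptions made for illustration and are not taken from the KnowPred<sub>site </sub>implementation.</p>
            <p>
```python
def top_n_hit(scores, true_sites, n, require_all=False):
    """Check whether the annotated sites rank among the top n predicted sites.

    scores: dict mapping site code (e.g. "CYT") to a confidence score.
    true_sites: set of annotated localization sites (one or two sites).
    require_all=False gives the "at least one correct" criterion,
    require_all=True gives the "both correct" criterion.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    hits = [site in ranked for site in true_sites]
    return all(hits) if require_all else any(hits)


def top_n_accuracy(predictions, n, require_all=False):
    """Return the top n accuracy in percent.

    predictions: list of (scores, true_sites) pairs, one per test protein.
    """
    correct = sum(top_n_hit(s, t, n, require_all) for s, t in predictions)
    return 100.0 * correct / len(predictions)
```
            </p>
            <p>Under this sketch, the "both correct" criterion cannot be met for <it>N </it>= 1 when a protein has two annotated sites, so it is only meaningful from top 2 onward.</p>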
            <p>The prediction performance of KnowPred<sub>site</sub>, ngLOC, and Blast-hit is summarized in Table <tblr tid="T3">3</tblr>, in which KnowPred<sub>site </sub>performance is reported with ten-fold cross validation and leave-one-out cross validation as denoted by <sup>#</sup>KnowPred<sub>site </sub>and *KnowPred<sub>site</sub>, respectively. It is observed that KnowPred<sub>site </sub>outperforms ngLOC and Blast-hit. (The prediction results of single- and multi-localized proteins by KnowPred<sub>site </sub>can be found in Additional file <supplr sid="S2">2</supplr> to Additional file <supplr sid="S5">5</supplr>. Additional file <supplr sid="S2">2</supplr> lists the prediction results for single-localized proteins using leave-one-out cross validation; Additional file <supplr sid="S3">3</supplr> lists the prediction results for single-localized proteins using ten-fold cross validation; Additional file <supplr sid="S4">4</supplr> lists the prediction results for multi-localized proteins using leave-one-out cross validation; Additional file <supplr sid="S5">5</supplr> lists the prediction results for multi-localized proteins using ten-fold cross validation.)</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>KnowPred<sub>site </sub>prediction results for single-localized proteins using leave-one-out cross validation</b>. Each row is a prediction result for a protein sequence. Columns A and B give the protein name and the localization site annotation, respectively. Columns C to L are the confidence scores corresponding to each localization site. Columns N to Q are the Top 1 to Top 4 accuracies.</p>
               </text>
               <file name="1471-2105-10-S15-S8-S2.csv">
   <p>Click here for file</p>
</file>
            </suppl>
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>KnowPred<sub>site </sub>prediction results for single-localized proteins using ten-fold cross validation</b>. The column definitions are the same as those for Additional file <supplr sid="S2">2</supplr>.</p>
               </text>
               <file name="1471-2105-10-S15-S8-S3.csv">
   <p>Click here for file</p>
</file>
            </suppl>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p><b>KnowPred<sub>site </sub>prediction results for multi-localized proteins using leave-one-out cross validation</b>. Each row is a prediction result for a protein sequence. Columns A to L are the same as in Additional file <supplr sid="S2">2</supplr>. Columns N to Q are the Top 1 to Top 4 accuracies based on the "at least one correct" criterion. Columns S to U are the Top 2 to Top 4 accuracies based on the "both correct" criterion.</p>
               </text>
               <file name="1471-2105-10-S15-S8-S4.csv">
   <p>Click here for file</p>
</file>
            </suppl>
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p><b>KnowPred<sub>site </sub>prediction results for multi-localized proteins using ten-fold cross validation</b>. The column definitions are the same as those for Additional file <supplr sid="S4">4</supplr>.</p>
               </text>
               <file name="1471-2105-10-S15-S8-S5.csv">
   <p>Click here for file</p>
</file>
            </suppl>
            <tbl id="T3"><title><p>Table 3</p></title><caption><p>Prediction performance of KnowPred<sub>site</sub>, ngLOC, and Blast-hit</p></caption><tblbdy cols="6">
      <r>
         <c ca="left">
            <p>
               <b><it>Overall Accuracy </it>(%)</b>
            </p>
         </c>
         <c ca="right">
            <p>
               <b>Methods</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Top 1</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Top 2</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Top 3</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Top 4</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Single-localized</p>
         </c>
         <c ca="right">
            <p>*KnowPred<sub>site</sub></p>
         </c>
         <c ca="center">
            <p>92.0</p>
         </c>
         <c ca="center">
            <p>95.7</p>
         </c>
         <c ca="center">
            <p>96.8</p>
         </c>
         <c ca="center">
            <p>98.1</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p><sup>#</sup>KnowPred<sub>site</sub></p>
         </c>
         <c ca="center">
            <p>91.7</p>
         </c>
         <c ca="center">
            <p>95.4</p>
         </c>
         <c ca="center">
            <p>96.6</p>
         </c>
         <c ca="center">
            <p>97.9</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>ngLOC</p>
         </c>
         <c ca="center">
            <p>88.8</p>
         </c>
         <c ca="center">
            <p>92.2</p>
         </c>
         <c ca="center">
            <p>94.5</p>
         </c>
         <c ca="center">
            <p>96.3</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>Blast-hit</p>
         </c>
         <c ca="center">
            <p>86.0</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Multi-localized</p>
            <p>(at least 1 correct)</p>
         </c>
         <c ca="right">
            <p>*KnowPred<sub>site</sub></p>
         </c>
         <c ca="center">
            <p>90.8</p>
         </c>
         <c ca="center">
            <p>96.4</p>
         </c>
         <c ca="center">
            <p>98.2</p>
         </c>
         <c ca="center">
            <p>98.9</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p><sup>#</sup>KnowPred<sub>site</sub></p>
         </c>
         <c ca="center">
            <p>90.1</p>
         </c>
         <c ca="center">
            <p>96.1</p>
         </c>
         <c ca="center">
            <p>98.1</p>
         </c>
         <c ca="center">
            <p>98.9</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>ngLOC</p>
         </c>
         <c ca="center">
            <p>81.9</p>
         </c>
         <c ca="center">
            <p>92.0</p>
         </c>
         <c ca="center">
            <p>96.1</p>
         </c>
         <c ca="center">
            <p>97.4</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>Blast-hit</p>
         </c>
         <c ca="center">
            <p>78.8</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Multi-localized</p>
            <p>(both correct)</p>
         </c>
         <c ca="right">
            <p>*KnowPred<sub>site</sub></p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>74.3</p>
         </c>
         <c ca="center">
            <p>83.3</p>
         </c>
         <c ca="center">
            <p>88.7</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p><sup>#</sup>KnowPred<sub>site</sub></p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>72.1</p>
         </c>
         <c ca="center">
            <p>82.2</p>
         </c>
         <c ca="center">
            <p>87.5</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>ngLOC</p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>59.7</p>
         </c>
         <c ca="center">
            <p>73.8</p>
         </c>
         <c ca="center">
            <p>83.2</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="right">
            <p>Blast-hit</p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>45.7</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>*KnowPred<sub>site </sub>represents the experiment result using leave-one-out cross validation; <sup>#</sup>KnowPred<sub>site </sub>represents the experiment result using 10-fold cross validation.</p>
   </tblfn></tbl>
            <p>For single-localized proteins, the overall accuracies of KnowPred<sub>site </sub>range from 91.7% (top 1, ten-fold cross validation) to 98.1% (top 4, leave-one-out cross validation) when a prediction is counted as correct within the top 1 to top 4 most probable sites. Those of ngLOC range from 88.8% to 96.3%. The accuracy of Blast-hit is 86.0%, which means that 86.0% of single-localized proteins could be correctly predicted by BLAST searches. Notably, for 2114 of the single-localized proteins the Blast-hit method found no significantly similar proteins; however, 58.8% of them were correctly predicted by KnowPred<sub>site</sub>. This shows that local similarity helps identify related sequences for subcellular localization prediction.</p>
            <p>The experimental results show that KnowPred<sub>site </sub>has much higher accuracy on multi-localized proteins than the other methods. Using the first accuracy measurement, i.e., at least one site correctly predicted, KnowPred<sub>site </sub>achieves a top 1 accuracy of more than 90%, which is 8.2 percentage points higher than ngLOC. Using the stricter second accuracy measurement, KnowPred<sub>site </sub>achieves a top 2 accuracy of 72.1%, which is 12.4 percentage points higher than ngLOC. Looking further at the top <it>N </it>accuracies, we find that KnowPred<sub>site </sub>is better able than ngLOC to reduce the number of false positives.</p>
            <p>The top 1 and top 2 accuracies of the Blast-hit method are 78.8% and 45.7% for the two accuracy measurements. Notably, no significant Blast hit was found for 318 of the multi-localized proteins; however, 73.3% and 49.7% of them were correctly predicted by KnowPred<sub>site </sub>under the two accuracy measurements, respectively.</p>
         </sec>
         <sec>
            <st>
               <p>Site-specific prediction performance</p>
            </st>
            <p>In addition to the overall accuracy on the dataset reported in Table <tblr tid="T3">3</tblr>, we analyzed the prediction performance on each of the 10 distinct localization sites. The results are summarized in Table <tblr tid="T4">4</tblr>. Among the 10 localization sites, the precision ranges from 75.7% to 98.5% and the accuracy<sub><it>i </it></sub>ranges from 52.0% to 96.4%. We observe that a higher occurrence of a localization site, e.g., EXC (29.1%) and PLA (25.2%), tends to lead to better prediction; for example, the precision and accuracy on EXC are 98.5% and 93.9%, respectively. Low occurrence of a localization site can degrade prediction; for example, CSK (1.0%) and GOL (1.1%) have MCC<sub>i</sub> of 0.645 and 0.746, respectively. However, if the similar peptide records of a site have higher specificity, prediction performance can be good despite low occurrence. For example, the precision and accuracy are 87.2% and 81.9% on LYS (0.6%), and 87.3% and 85.1% on POX (0.8%). Furthermore, although CYT represents 11.1% of the dataset, its MCC<sub>i</sub> is 0.774, much lower than those of the other frequently occurring sites. This low MCC<sub>i</sub> is due to low precision, since KnowPred<sub>site </sub>yields more false positives for CYT. Many false positives usually occur when the similar peptide records of a site have lower specificity and higher diversity; as a result, proteins of other localization sites are misclassified as CYT.</p>
            <tbl id="T4"><title><p>Table 4</p></title><caption><p>Prediction performance of KnowPred<sub>site </sub>for each site using precision, accuracy, and MCC.</p></caption><tblbdy cols="5">
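            <p>For readers who want to reproduce site-specific measures of this kind, the following Python sketch computes precision, a per-site accuracy, and MCC from the confusion counts of one site. The definitions used here are assumptions, since the corresponding equations are not part of this section: precision is taken as TP/(TP+FP), the per-site accuracy as TP/(TP+FN), and MCC as the standard Matthews correlation coefficient. The function name is hypothetical.</p>
            <p>
```python
import math


def site_metrics(tp, fp, fn, tn):
    """Return (precision, accuracy_i, mcc) for a single localization site.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives for that site.
    Assumed definitions (not confirmed by this section of the paper):
      precision  = tp / (tp + fp)
      accuracy_i = tp / (tp + fn)
      mcc        = standard Matthews correlation coefficient
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy_i = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, accuracy_i, mcc
```
            </p>
            <p>Because MCC uses all four counts, a site with many false positives relative to its true positives receives a low MCC even when its per-site accuracy is high.</p>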
      <r>
         <c ca="left">
            <p>
               <b>Site <it>i</it></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Occurrence in the dataset (%)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b><it>Precision </it>(%)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b><it>Accuracy</it><sub><it>i </it></sub>(%)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>MCC</it>
               </b>
               <sub>
                  <it>i</it>
               </sub>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>CYT</p>
         </c>
         <c ca="center">
            <p>11.1</p>
         </c>
         <c ca="center">
            <p>75.7</p>
         </c>
         <c ca="center">
            <p>84.4</p>
         </c>
         <c ca="center">
            <p>0.774</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>CSK</p>
         </c>
         <c ca="center">
            <p>1.0</p>
         </c>
         <c ca="center">
            <p>81.1</p>
         </c>
         <c ca="center">
            <p>52.0</p>
         </c>
         <c ca="center">
            <p>0.645</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>END</p>
         </c>
         <c ca="center">
            <p>3.6</p>
         </c>
         <c ca="center">
            <p>92.9</p>
         </c>
         <c ca="center">
            <p>84.1</p>
         </c>
         <c ca="center">
            <p>0.88</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EXC</p>
         </c>
         <c ca="center">
            <p>29.1</p>
         </c>
         <c ca="center">
            <p>98.5</p>
         </c>
         <c ca="center">
            <p>93.9</p>
         </c>
         <c ca="center">
            <p>0.946</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>GOL</p>
         </c>
         <c ca="center">
            <p>1.1</p>
         </c>
         <c ca="center">
            <p>79.1</p>
         </c>
         <c ca="center">
            <p>70.9</p>
         </c>
         <c ca="center">
            <p>0.746</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>LYS</p>
         </c>
         <c ca="center">
            <p>0.6</p>
         </c>
         <c ca="center">
            <p>87.2</p>
         </c>
         <c ca="center">
            <p>81.9</p>
         </c>
         <c ca="center">
            <p>0.844</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>MIT</p>
         </c>
         <c ca="center">
            <p>9.4</p>
         </c>
         <c ca="center">
            <p>96.7</p>
         </c>
         <c ca="center">
            <p>86.9</p>
         </c>
         <c ca="center">
            <p>0.907</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>NUC</p>
         </c>
         <c ca="center">
            <p>18.0</p>
         </c>
         <c ca="center">
            <p>87.3</p>
         </c>
         <c ca="center">
            <p>93.8</p>
         </c>
         <c ca="center">
            <p>0.884</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>PLA</p>
         </c>
         <c ca="center">
            <p>25.2</p>
         </c>
         <c ca="center">
            <p>94.4</p>
         </c>
         <c ca="center">
            <p>96.4</p>
         </c>
         <c ca="center">
            <p>0.938</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>POX</p>
         </c>
         <c ca="center">
            <p>0.8</p>
         </c>
         <c ca="center">
            <p>87.3</p>
         </c>
         <c ca="center">
            <p>85.1</p>
         </c>
         <c ca="center">
            <p>0.861</p>
         </c>
      </r>
   </tblbdy></tbl>
            <p>Figure <figr fid="F5">5</figr> shows the site-specific comparison between KnowPred<sub>site </sub>and ngLOC in terms of accuracy and MCC. In terms of MCC, KnowPred<sub>site </sub>outperforms ngLOC at eight localization sites (CSK, END, EXC, GOL, MIT, NUC, PLA, POX). The two sites where ngLOC performs better are CYT (0.777 for ngLOC and 0.774 for KnowPred<sub>site</sub>) and LYS (0.902 for ngLOC and 0.844 for KnowPred<sub>site</sub>). In terms of accuracy, KnowPred<sub>site </sub>outperforms ngLOC at all sites except LYS (representing around 0.6% of the whole dataset), where ngLOC and KnowPred<sub>site </sub>yield accuracies of 85.5% and 81.9%, respectively.</p>
            <fig id="F5"><title><p>Figure 5</p></title><caption><p>Matthews correlation coefficient (<it>MCC</it>) and accuracy comparison between KnowPred<sub>site </sub>and ngLOC</p></caption><text>
   <p><b>Matthews correlation coefficient (<it>MCC</it>) and accuracy comparison between KnowPred<sub>site </sub>and ngLOC</b>.</p>
</text><graphic file="1471-2105-10-S15-S8-5"/></fig>
         </sec>
         <sec>
            <st>
               <p>Evaluation of the multi-localized confidence score (MLCS)</p>
            </st>
            <p>A significant number of eukaryotic proteins are known to be localized to multiple subcellular organelles; therefore, it is important to differentiate single-localized proteins from multi-localized proteins. We used the entire ngLOC dataset to compare how well different MLCS thresholds distinguish single-localized from multi-localized proteins. Specifically, we used the proportion of true positives among the multi-localized proteins and the proportion of true negatives among the single-localized proteins as the performance measures. A true positive is a multi-localized protein whose MLCS is above the threshold, and a true negative is a single-localized protein whose MLCS is below the threshold.</p>
            <p>Figure <figr fid="F6">6</figr> plots the cumulative percentages of true positives and true negatives against the MLCS threshold. The true negative curve increases along the MLCS axis, whereas the true positive curve decreases. If the MLCS threshold is set to 40, 60.7% of multi-localized proteins are true positives and 96.5% of single-localized proteins are true negatives. In other words, 60.7% of multi-localized proteins obtained an MLCS of 40 or better, whereas only 3.5% of single-localized proteins fall within this range. If the MLCS threshold is set to 20, 86.3% of multi-localized proteins are true positives and 82.8% of single-localized proteins are true negatives. For ngLOC, the best result is that 76% of multi-localized proteins are true positives and 81% of single-localized proteins are true negatives when an MLCS threshold of 40 is applied. These results show that KnowPred<sub>site </sub>better differentiates multi-localized proteins from single-localized ones.</p>
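            <p>The threshold analysis above can be expressed as a short Python sketch. It takes the MLCS values of the multi-localized and single-localized proteins and returns the two percentages for one threshold. The function name and the list-based inputs are hypothetical, and treating a score equal to the threshold as a positive ("40 or better") is an assumption about how boundary values are handled.</p>
            <p>
```python
def mlcs_rates(multi_scores, single_scores, threshold):
    """Return (true_positive_pct, true_negative_pct) for one MLCS threshold.

    multi_scores: MLCS values of multi-localized proteins.
    single_scores: MLCS values of single-localized proteins.
    A multi-localized protein counts as a true positive when its MLCS
    reaches the threshold; a single-localized protein counts as a true
    negative when its MLCS stays below the threshold.
    """
    tp = sum(1 for s in multi_scores if s >= threshold)
    tn = sum(1 for s in single_scores if threshold > s)
    return (100.0 * tp / len(multi_scores),
            100.0 * tn / len(single_scores))
```
            </p>
            <p>Evaluating this function over a range of thresholds would give the two curves of the kind plotted against the MLCS axis: raising the threshold can only lower the true positive percentage and raise the true negative percentage.</p>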
            <fig id="F6"><title><p>Figure 6</p></title><caption><p>MLCS analysis</p></caption><text>
   <p><b>MLCS analysis</b>. A true positive represents a multi-localized protein whose MLCS is above the threshold and a true negative represents a single-localized protein whose MLCS is below the threshold. We compare the ratio of true positives/true negatives to the total number of multi-/single-localized proteins.</p>
</text><graphic file="1471-2105-10-S15-S8-6"/></fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Unlike most machine learning methods, whose model parameters are not biologically interpretable, KnowPred<sub>site </sub>produces interpretable prediction results through a transparent and traceable prediction process. To predict the localization sites of a query protein, KnowPred<sub>site</sub> reports the template sequences and their contributions to the confidence scores. Such information is useful for interpreting the prediction results. In this section, we select four sequences from the ngLOC dataset, EF1A2_RABIT, RASH_HUMAN, MCA3_MOUSE, and CFDP2_BOVIN, to demonstrate the interpretation of KnowPred<sub>site </sub>prediction results.</p>
         <p>The prediction results of the first three proteins, together with the template sequences extracted from the knowledge base for each prediction, are shown in Tables <tblr tid="T5">5</tblr>, <tblr tid="T6">6</tblr>, and <tblr tid="T7">7</tblr>, respectively. In each table, the prediction result shows the MLCS and the confidence score of each localization site for the query protein. The table also lists the template proteins used to vote for the localization sites. We list only the top eight template proteins, i.e., those that contribute most to the confidence scores of the query sequence. For each template sequence, we show its contribution to the confidence score of each localization site and its sequence identity to the query protein calculated by ClustalW (denoted by SI).</p>
         <tbl id="T5"><title><p>Table 5</p></title><caption><p>Prediction result of EF1A2_RABIT.</p></caption><tblbdy cols="12">
      <r>
         <c ca="left">
            <p>
               <b>Query</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CYT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CSK</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>END</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>EXC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>GOL</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>LYS</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MIT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>NUC*</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PLA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>POX</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MLCS</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A2_RABIT</p>
         </c>
         <c ca="center">
            <p>95.45</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>1.45</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0.04</p>
         </c>
         <c ca="center">
            <p>2.97</p>
         </c>
         <c ca="center">
            <p>0.05</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>7.40</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Template</p>
         </c>
         <c ca="center">
            <p>CYT</p>
         </c>
         <c ca="center">
            <p>CSK</p>
         </c>
         <c ca="center">
            <p>END</p>
         </c>
         <c ca="center">
            <p>EXC</p>
         </c>
         <c ca="center">
            <p>GOL</p>
         </c>
         <c ca="center">
            <p>LYS</p>
         </c>
         <c ca="center">
            <p>MIT</p>
         </c>
         <c ca="center">
            <p>NUC</p>
         </c>
         <c ca="center">
            <p>PLA</p>
         </c>
         <c ca="center">
            <p>POX</p>
         </c>
         <c ca="center">
            <p>SI</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A2_RAT</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>2.94</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>99.78</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A_CHICK</p>
         </c>
         <c ca="center">
            <p>2.77</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>92.22</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A1_HUMAN</p>
         </c>
         <c ca="center">
            <p>2.75</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>92.22</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A1_RAT</p>
         </c>
         <c ca="center">
            <p>2.75</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>92.22</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A0_XENLA</p>
         </c>
         <c ca="center">
            <p>2.69</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>90.06</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A_BRARE</p>
         </c>
         <c ca="center">
            <p>2.64</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>90.06</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A2_XENLA</p>
         </c>
         <c ca="center">
            <p>2.64</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>88.79</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1A3_XENLA</p>
         </c>
         <c ca="center">
            <p>2.60</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>88.55</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>*: correct answer; SI: sequence identity.</p>
   </tblfn></tbl>
         <tbl id="T6"><title><p>Table 6</p></title><caption><p>Prediction result of RASH_HUMAN.</p></caption><tblbdy cols="12">
      <r>
         <c ca="left">
            <p>
               <b>Query</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CYT*</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CSK</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>END</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>EXC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>GOL*</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>LYS</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MIT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>NUC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PLA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>POX</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MLCS</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RASH_HUMAN</p>
         </c>
         <c ca="center">
            <p>18.95</p>
         </c>
         <c ca="center">
            <p>0.06</p>
         </c>
         <c ca="center">
            <p>0.09</p>
         </c>
         <c ca="center">
            <p>0.09</p>
         </c>
         <c ca="center">
            <p>13.74</p>
         </c>
         <c ca="center">
            <p>0.04</p>
         </c>
         <c ca="center">
            <p>0.24</p>
         </c>
         <c ca="center">
            <p>0.25</p>
         </c>
         <c ca="center">
            <p>83.61</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>36.24</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Template</p>
         </c>
         <c ca="center">
            <p>CYT</p>
         </c>
         <c ca="center">
            <p>CSK</p>
         </c>
         <c ca="center">
            <p>END</p>
         </c>
         <c ca="center">
            <p>EXC</p>
         </c>
         <c ca="center">
            <p>GOL</p>
         </c>
         <c ca="center">
            <p>LYS</p>
         </c>
         <c ca="center">
            <p>MIT</p>
         </c>
         <c ca="center">
            <p>NUC</p>
         </c>
         <c ca="center">
            <p>PLA</p>
         </c>
         <c ca="center">
            <p>POX</p>
         </c>
         <c ca="center">
            <p>SI</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RASK_HUMAN</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>13.88</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>86.32</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RASK_MOUSE</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>13.81</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>86.32</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RASN_HUMAN</p>
         </c>
         <c ca="center">
            <p>13.19</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>13.19</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>85.19</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>LET60_CAEEL</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>10.55</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>74.07</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RAS3_RHIRA</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>5.05</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>57.07</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RAS1_RHIRA</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>4.88</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>58.62</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RAS2_RHIRA</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>4.33</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>35.20</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>RAS_LIMLI</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>4.15</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>46.03</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>*: correct answer; SI: sequence identity.</p>
   </tblfn></tbl>
         <tbl id="T7"><title><p>Table 7</p></title><caption><p>Prediction result of MCA3_MOUSE. Templates marked with '+' have the same localization annotation as the query protein.</p></caption><tblbdy cols="12">
      <r>
         <c ca="left">
            <p>
               <b>Query</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CYT*</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CSK</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>END</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>EXC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>GOL</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>LYS</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MIT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>NUC*</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PLA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>POX</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MLCS</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>MCA3_MOUSE</p>
         </c>
         <c ca="center">
            <p>95.46</p>
         </c>
         <c ca="center">
            <p>0.3</p>
         </c>
         <c ca="center">
            <p>0.27</p>
         </c>
         <c ca="center">
            <p>0.36</p>
         </c>
         <c ca="center">
            <p>0.2</p>
         </c>
         <c ca="center">
            <p>0.01</p>
         </c>
         <c ca="center">
            <p>1.13</p>
         </c>
         <c ca="center">
            <p>93.59</p>
         </c>
         <c ca="center">
            <p>1.82</p>
         </c>
         <c ca="center">
            <p>0.22</p>
         </c>
         <c ca="center">
            <p>100</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Template</p>
         </c>
         <c ca="center">
            <p>CYT</p>
         </c>
         <c ca="center">
            <p>CSK</p>
         </c>
         <c ca="center">
            <p>END</p>
         </c>
         <c ca="center">
            <p>EXC</p>
         </c>
         <c ca="center">
            <p>GOL</p>
         </c>
         <c ca="center">
            <p>LYS</p>
         </c>
         <c ca="center">
            <p>MIT</p>
         </c>
         <c ca="center">
            <p>NUC</p>
         </c>
         <c ca="center">
            <p>PLA</p>
         </c>
         <c ca="center">
            <p>POX</p>
         </c>
         <c ca="center">
            <p>SI</p>
         </c>
      </r>
      <r>
         <c cspan="12">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>MCA3_HUMAN<sup>+</sup></p>
         </c>
         <c ca="center">
            <p>89.16</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>89.16</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>88.51</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1G1_YEAST<sup>+</sup></p>
         </c>
         <c ca="center">
            <p>2.74</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>2.47</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>8.67</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>EF1G2_YEAST</p>
         </c>
         <c ca="center">
            <p>0.49</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0.49</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>8.50</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>GSTA_PLEPL</p>
         </c>
         <c ca="center">
            <p>0.35</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>15.86</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>SYEC_YEAST</p>
         </c>
         <c ca="center">
            <p>0.16</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>3.86</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>CCNA1_MOUSE</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0.15</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>7.36</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>NU155_RAT<sup>+</sup></p>
         </c>
         <c ca="center">
            <p>0.14</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0.14</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>3.17</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>GCYB2_HUMAN</p>
         </c>
         <c ca="center">
            <p>0.14</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>0</p>
         </c>
         <c ca="center">
            <p>4.86</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>*: correct answer; SI: sequence identity.</p>
   </tblfn></tbl>
         <p>In the example of EF1A2_RABIT shown in Table <tblr tid="T5">5</tblr>, KnowPred<sub>site</sub> predicts that it is single-localized in the cytoplasm (CYT), since its MLCS is very low (7.40) and CYT has the highest confidence score. However, the localization site of EF1A2_RABIT reported in the ngLOC dataset is the nucleus (NUC). Examining the eight template proteins, we find that they all have high sequence identities with EF1A2_RABIT and that all of them are localized to CYT except EF1A2_RAT, which is localized to NUC. According to the Gene Ontology annotation, EF1A2_RABIT is localized to both CYT and NUC, which are the two sites with the highest confidence scores in KnowPred<sub>site</sub>'s prediction.</p>
         <p>In the example of RASH_HUMAN shown in Table <tblr tid="T6">6</tblr>, KnowPred<sub>site</sub> predicts that RASH_HUMAN is localized to the plasma membrane (PLA) and the cytoplasm (CYT). However, the correct localization sites are the cytoplasm and the Golgi apparatus (GOL). In the prediction result, the confidence score of PLA is much higher than those of CYT and GOL, and most of the template proteins are localized to PLA. According to the annotations in Gene Ontology and SwissProt, RASH_HUMAN is localized to PLA and GOL, and the template protein RASN_HUMAN is also localized to PLA and GOL. If the new annotation data are applied, KnowPred<sub>site</sub> can predict RASH_HUMAN correctly.</p>
         <p>As for MCA3_MOUSE shown in Table <tblr tid="T7">7</tblr>, KnowPred<sub>site</sub> assigns it an MLCS of 100 and correctly predicts that it is localized to the cytoplasm (CYT) and the nucleus (NUC). Examining the template proteins, we observe that KnowPred<sub>site</sub> identifies some related proteins, i.e., proteins that have the same localization as the query protein, such as EF1G1_YEAST and NU155_RAT, even though they share very low sequence identity with the query protein (8.67% and 3.17%, respectively). Notably, these two template proteins rank second and seventh, respectively, among all template proteins. Furthermore, although GSTA_PLEPL has higher sequence identity (15.86%) with the query protein than EF1G1_YEAST, the confidence score contributed by EF1G1_YEAST is much higher than that contributed by GSTA_PLEPL (2.74 vs. 0.35). This shows that the contributed confidence score is not necessarily positively correlated with sequence identity when template sequences are dissimilar to the query sequence. In this example, EF1G1_YEAST shares more local similarities (peptide fragments) with the query protein than GSTA_PLEPL does. Even if MCA3_HUMAN, which shares 88.51% sequence identity with the query protein, is removed from the template pool, KnowPred<sub>site</sub> can still predict MCA3_MOUSE correctly.</p>
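         <p>The decision pattern visible in these three examples can be sketched in a few lines of Python. This is only an illustrative sketch, not the KnowPred<sub>site</sub> implementation: the function name predict_sites and the cutoff value 20.0 are assumptions introduced here (the examples only show that an MLCS of 7.40 led to a single-site prediction, while 36.24 and 100 led to two-site predictions), and the confidence scores are copied from the query rows of Tables <tblr tid="T5">5</tblr>, <tblr tid="T6">6</tblr> and <tblr tid="T7">7</tblr>.</p>

```python
# Illustrative sketch only; not the authors' implementation.
# ASSUMPTION: a single hypothetical MLCS cutoff separates single-site from
# two-site predictions. The value 20.0 is arbitrary; any value between
# 7.40 and 36.24 is consistent with the three examples in Tables 5-7.
HYPOTHETICAL_MLCS_CUTOFF = 20.0

# Site codes in the column order used by the tables.
SITES = ["CYT", "CSK", "END", "EXC", "GOL", "LYS", "MIT", "NUC", "PLA", "POX"]


def predict_sites(scores, mlcs, cutoff=HYPOTHETICAL_MLCS_CUTOFF):
    """Return the top site, or the top two sites when MLCS reaches the cutoff."""
    ranked = sorted(SITES, key=lambda site: scores[site], reverse=True)
    n_sites = 2 if mlcs >= cutoff else 1
    return ranked[:n_sites]


def as_scores(values):
    """Pair a row of ten confidence scores with the site codes."""
    return dict(zip(SITES, values))


# Query rows copied from Tables 5, 6 and 7: (confidence scores, MLCS).
queries = {
    "EF1A2_RABIT": (as_scores([95.45, 0, 0, 1.45, 0, 0, 0.04, 2.97, 0.05, 0]), 7.40),
    "RASH_HUMAN": (as_scores([18.95, 0.06, 0.09, 0.09, 13.74, 0.04, 0.24, 0.25, 83.61, 0]), 36.24),
    "MCA3_MOUSE": (as_scores([95.46, 0.3, 0.27, 0.36, 0.2, 0.01, 1.13, 93.59, 1.82, 0.22]), 100),
}

for name, (scores, mlcs) in queries.items():
    print(name, predict_sites(scores, mlcs))
```

         <p>With these inputs the sketch returns CYT for EF1A2_RABIT, PLA and CYT for RASH_HUMAN, and CYT and NUC for MCA3_MOUSE, which matches the predictions discussed above.</p>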
         <p>Among the multi-localized proteins, there are 318 proteins for which the Blast-hit method cannot find similar sequences. However, the localization sites of around half of them can be correctly predicted by KnowPred<sub>site</sub>. We randomly choose one example, CFDP2_BOVIN, to demonstrate KnowPred<sub>site</sub>'s capability of identifying related sequences from the template pool. The two highest confidence scores of CFDP2_BOVIN are 32.07 (CYT) and 41.18 (NUC). Among the top 100 templates (ranked by their contribution to the confidence scores), 12 are localized to both CYT and NUC, 18 are localized to CYT only, and 32 are localized to NUC only. Their sequence identities with CFDP2_BOVIN are very low, ranging from 3.47% to 13.8%. This result suggests that the local similarity captured by our method is beneficial for PSL prediction when global sequence similarity is very low.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In this paper, we propose a highly accurate subcellular localization prediction method for single- and multi-localized proteins, called KnowPred<sub>site</sub>, which is based on a knowledge base instead of the frequently used machine learning approaches. The knowledge base, called <it>SPKB</it>, is constructed from a given dataset of proteins with known localization site annotations. It captures local similarity between proteins so that related proteins with the same localization can be identified. Using the related proteins obtained from the knowledge base, the localization site of a query protein can be better predicted.</p>
         <p>We used the ngLOC dataset to evaluate the performance of KnowPred<sub>site</sub>. The dataset consists of 25887 single-localized proteins and 2169 multi-localized proteins of ten subcellular proteomes from 1923 species. To compare KnowPred<sub>site </sub>with ngLOC and the baseline Blast-hit method, we performed ten-fold cross-validation on the dataset. The experimental results show that KnowPred<sub>site </sub>achieves higher prediction accuracy than ngLOC and Blast-hit. In particular, on multi-localized sequences KnowPred<sub>site </sub>outperforms ngLOC by 8.2% in accuracy when a prediction is counted as correct if at least one site is correctly identified, and by 12.4% in accuracy when a prediction is counted as correct only if both sites are correctly identified.</p>
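         <p>The two correctness criteria for multi-localized proteins can be made concrete with a minimal sketch. It assumes that each protein has exactly two annotated sites and that the predictor returns its two top-scoring sites; the function name and the toy annotations below are illustrative and are not taken from the evaluation data.</p>

```python
def multi_site_accuracy(true_sites, predicted_sites):
    """Return (at_least_one, both) accuracy over dual-localized proteins.

    true_sites and predicted_sites are parallel lists of two-element sets
    of localization labels.
    """
    total = len(true_sites)
    pairs = list(zip(true_sites, predicted_sites))
    # Criterion 1: correct if the prediction shares at least one site with the annotation.
    at_least_one = sum(1 for truth, pred in pairs if truth.intersection(pred))
    # Criterion 2: correct only if both annotated sites are predicted.
    both = sum(1 for truth, pred in pairs if truth == pred)
    return at_least_one / total, both / total


# Toy annotations and predictions (hypothetical).
truth = [{"CYT", "NUC"}, {"CYT", "NUC"}, {"MIT", "CYT"}, {"EXC", "PLA"}]
predicted = [{"CYT", "NUC"}, {"CYT", "MIT"}, {"NUC", "EXC"}, {"EXC", "PLA"}]

acc_one, acc_both = multi_site_accuracy(truth, predicted)
print(acc_one, acc_both)  # 0.75 0.5
```

         <p>The second criterion is never more lenient than the first, so the accuracy it yields cannot exceed the accuracy obtained under the first criterion.</p>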
         <p>A major advantage of knowledge-based approaches is that the prediction process is transparent and interpretable. We can examine the prediction process to see how KnowPred<sub>site </sub>generates each prediction. Furthermore, on close examination of the prediction results of our experiments, as described in the Discussion section, we find that KnowPred<sub>site </sub>can efficiently use local similarity to identify related sequences even when their sequence identity is low, and can thus predict localization sites with high accuracy.</p>
         <p>When more proteins have known localization sites, most machine learning based methods need to retrain their prediction models. In contrast, KnowPred<sub>site </sub>can be easily improved by incrementally expanding the knowledge base, i.e., by adding new peptide records or updating existing records with new protein sources and their localization site information. This feature makes the KnowPred<sub>site </sub>prediction system easy to extend and efficient to maintain.</p>
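         <p>Incremental expansion of a peptide knowledge base can be sketched as follows. The record layout (a mapping from each peptide to its protein sources and their localization sites), the fixed peptide length, and all names are assumptions made for illustration; they are not the actual <it>SPKB</it> format.</p>

```python
from collections import defaultdict


def add_protein(kb, name, sequence, sites, k=7):
    """Add one annotated protein to the knowledge base in place.

    New peptides create new records; peptides already present gain an
    additional protein source together with its localization sites.
    Returns the number of newly created peptide records.
    """
    created = 0
    for i in range(len(sequence) - k + 1):
        peptide = sequence[i:i + k]
        if peptide not in kb:
            created += 1
        kb[peptide][name] = frozenset(sites)
    return created


# peptide -> {protein name: localization sites}; toy sequences are hypothetical.
kb = defaultdict(dict)
first_records = add_protein(kb, "protein_1", "MKTAYIAKQR", {"CYT"})
new_records = add_protein(kb, "protein_2", "TAYIAKQRQI", {"CYT", "NUC"})

print(first_records, new_records, len(kb))  # 4 2 6
print(sorted(kb["TAYIAKQ"]))                # ['protein_1', 'protein_2']
```

         <p>Adding the second protein touches only the records of its own peptides; nothing built from the first protein has to be recomputed, which is the property that the paragraph above contrasts with model retraining.</p>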
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>Hsin-Nan Lin developed the method and carried out the computational predictions. Ching-Tai Chen and Hsin-Nan Lin were involved in the literature survey, result interpretation, statistical analysis, and manuscript writing. Ting-Yi Sung, Shinn-Ying Ho and Wen-Lian Hsu coordinated the study and revised the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Note</p>
         </st>
         <p>Other papers from the meeting have been published as part of <it>BMC Genomics </it>Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology, available online at <url>http://www.biomedcentral.com/1471-2164/10?issue=S3</url>.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank all those who developed PSI-BLAST and made it publicly available. We also thank Chia-Yu Su for helpful discussions.</p>
            <p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/10?issue=S15</url>.</p>
         </sec>
      </ack>
      <refgrp><bibl id="B1"><title><p>Better prediction of sub-cellular localization by combining evolutionary and structural information</p></title><aug><au><snm>Nair</snm><fnm>R</fnm></au><au><snm>Rost</snm><fnm>B</fnm></au></aug><source>Proteins</source><pubdate>2003</pubdate><volume>53</volume><issue>4</issue><fpage>917</fpage><lpage>930</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/prot.10507</pubid><pubid idtype="pmpid" link="fulltext">14635133</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis</p></title><aug><au><snm>Gardy</snm><fnm>JL</fnm></au><au><snm>Laird</snm><fnm>MR</fnm></au><au><snm>Chen</snm><fnm>F</fnm></au><au><snm>Rey</snm><fnm>S</fnm></au><au><snm>Walsh</snm><fnm>CJ</fnm></au><au><snm>Ester</snm><fnm>M</fnm></au><au><snm>Brinkman</snm><fnm>FS</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>5</issue><fpage>617</fpage><lpage>623</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti057</pubid><pubid idtype="pmpid" link="fulltext">15501914</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition</p></title><aug><au><snm>Hoglund</snm><fnm>A</fnm></au><au><snm>Donnes</snm><fnm>P</fnm></au><au><snm>Blum</snm><fnm>T</fnm></au><au><snm>Adolph</snm><fnm>HW</fnm></au><au><snm>Kohlbacher</snm><fnm>O</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>10</issue><fpage>1158</fpage><lpage>1165</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl002</pubid><pubid idtype="pmpid" link="fulltext">16428265</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Protein subcellular localization prediction for Gram-negative bacteria using 
amino acid subalphabets and a combination of multiple support vector machines</p></title><aug><au><snm>Wang</snm><fnm>JR</fnm></au><au><snm>Sung</snm><fnm>WK</fnm></au><au><snm>Krishnan</snm><fnm>A</fnm></au><au><snm>Li</snm><fnm>KB</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2005</pubdate><volume>6</volume><fpage>174</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-6-174</pubid><pubid idtype="pmcid">1190155</pubid><pubid idtype="pmpid" link="fulltext">16011808</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Prediction of protein subcellular localization</p></title><aug><au><snm>Yu</snm><fnm>CS</fnm></au><au><snm>Chen</snm><fnm>YC</fnm></au><au><snm>Lu</snm><fnm>CH</fnm></au><au><snm>Hwang</snm><fnm>JK</fnm></au></aug><source>Proteins</source><pubdate>2006</pubdate><volume>64</volume><issue>3</issue><fpage>643</fpage><lpage>651</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/prot.21018</pubid><pubid idtype="pmpid" link="fulltext">16752418</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions</p></title><aug><au><snm>Yu</snm><fnm>CS</fnm></au><au><snm>Lin</snm><fnm>CJ</fnm></au><au><snm>Hwang</snm><fnm>JK</fnm></au></aug><source>Protein Sci</source><pubdate>2004</pubdate><volume>13</volume><issue>5</issue><fpage>1402</fpage><lpage>1406</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1110/ps.03479604</pubid><pubid idtype="pmcid">2286765</pubid><pubid idtype="pmpid" link="fulltext">15096640</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic 
analysis</p></title><aug><au><snm>Chang</snm><fnm>JM</fnm></au><au><snm>Su</snm><fnm>EC</fnm></au><au><snm>Lo</snm><fnm>A</fnm></au><au><snm>Chiu</snm><fnm>HS</fnm></au><au><snm>Sung</snm><fnm>TY</fnm></au><au><snm>Hsu</snm><fnm>WL</fnm></au></aug><source>Proteins</source><pubdate>2008</pubdate><volume>72</volume><issue>2</issue><fpage>693</fpage><lpage>710</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/prot.21944</pubid><pubid idtype="pmpid" link="fulltext">18260102</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>PSLpred: prediction of subcellular localization of bacterial proteins</p></title><aug><au><snm>Bhasin</snm><fnm>M</fnm></au><au><snm>Garg</snm><fnm>A</fnm></au><au><snm>Raghava</snm><fnm>GP</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>10</issue><fpage>2522</fpage><lpage>2524</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti309</pubid><pubid idtype="pmpid" link="fulltext">15699023</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Predicting protein localization in budding yeast</p></title><aug><au><snm>Chou</snm><fnm>KC</fnm></au><au><snm>Cai</snm><fnm>YD</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>7</issue><fpage>944</fpage><lpage>950</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti104</pubid><pubid idtype="pmpid" link="fulltext">15513989</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>PSORT-B: Improving protein subcellular localization prediction for Gram-negative 
bacteria</p></title><aug><au><snm>Gardy</snm><fnm>JL</fnm></au><au><snm>Spencer</snm><fnm>C</fnm></au><au><snm>Wang</snm><fnm>K</fnm></au><au><snm>Ester</snm><fnm>M</fnm></au><au><snm>Tusnady</snm><fnm>GE</fnm></au><au><snm>Simon</snm><fnm>I</fnm></au><au><snm>Hua</snm><fnm>S</fnm></au><au><snm>deFays</snm><fnm>K</fnm></au><au><snm>Lambert</snm><fnm>C</fnm></au><au><snm>Nakai</snm><fnm>K</fnm></au><etal/></aug><source>Nucleic Acids Res</source><pubdate>2003</pubdate><volume>31</volume><issue>13</issue><fpage>3613</fpage><lpage>3617</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkg602</pubid><pubid idtype="pmcid">169008</pubid><pubid idtype="pmpid" link="fulltext">12824378</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>PLPD: reliable protein localization prediction from imbalanced and overlapped datasets</p></title><aug><au><snm>Lee</snm><fnm>K</fnm></au><au><snm>Kim</snm><fnm>DW</fnm></au><au><snm>Na</snm><fnm>D</fnm></au><au><snm>Lee</snm><fnm>KH</fnm></au><au><snm>Lee</snm><fnm>D</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2006</pubdate><volume>34</volume><issue>17</issue><fpage>4655</fpage><lpage>4666</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkl638</pubid><pubid idtype="pmcid">1636404</pubid><pubid idtype="pmpid" link="fulltext">16966337</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Mimicking cellular sorting improves prediction of subcellular localization</p></title><aug><au><snm>Nair</snm><fnm>R</fnm></au><au><snm>Rost</snm><fnm>B</fnm></au></aug><source>J Mol Biol</source><pubdate>2005</pubdate><volume>348</volume><issue>1</issue><fpage>85</fpage><lpage>100</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.jmb.2005.02.025</pubid><pubid idtype="pmpid" link="fulltext">15808855</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>ProLoc-GO: utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular 
localization</p></title><aug><au><snm>Huang</snm><fnm>WL</fnm></au><au><snm>Tung</snm><fnm>CW</fnm></au><au><snm>Ho</snm><fnm>SW</fnm></au><au><snm>Hwang</snm><fnm>SF</fnm></au><au><snm>Ho</snm><fnm>SY</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><fpage>80</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-9-80</pubid><pubid idtype="pmcid">2262056</pubid><pubid idtype="pmpid" link="fulltext">18241343</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Localizing proteins in the cell from their phylogenetic profiles</p></title><aug><au><snm>Marcotte</snm><fnm>EM</fnm></au><au><snm>Xenarios</snm><fnm>I</fnm></au><au><snm>Bliek</snm><mnm>van Der</mnm><fnm>AM</fnm></au><au><snm>Eisenberg</snm><fnm>D</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2000</pubdate><volume>97</volume><issue>22</issue><fpage>12115</fpage><lpage>12120</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.220399497</pubid><pubid idtype="pmcid">17303</pubid><pubid idtype="pmpid" link="fulltext">11035803</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Predicting protein cellular localization using a domain projection method</p></title><aug><au><snm>Mott</snm><fnm>R</fnm></au><au><snm>Schultz</snm><fnm>J</fnm></au><au><snm>Bork</snm><fnm>P</fnm></au><au><snm>Ponting</snm><fnm>CP</fnm></au></aug><source>Genome Res</source><pubdate>2002</pubdate><volume>12</volume><issue>8</issue><fpage>1168</fpage><lpage>1174</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.96802</pubid><pubid idtype="pmcid">186639</pubid><pubid idtype="pmpid" link="fulltext">12176924</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Protein subcellular localization prediction based on compartment-specific features and structure 
conservation</p></title><aug><au><snm>Su</snm><fnm>EC</fnm></au><au><snm>Chiu</snm><fnm>HS</fnm></au><au><snm>Lo</snm><fnm>A</fnm></au><au><snm>Hwang</snm><fnm>JK</fnm></au><au><snm>Sung</snm><fnm>TY</fnm></au><au><snm>Hsu</snm><fnm>WL</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2007</pubdate><volume>8</volume><fpage>330</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-8-330</pubid><pubid idtype="pmcid">2040162</pubid><pubid idtype="pmpid" link="fulltext">17825110</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Comparison of sequence profiles. Strategies for structural predictions using sequence information</p></title><aug><au><snm>Rychlewski</snm><fnm>L</fnm></au><au><snm>Jaroszewski</snm><fnm>L</fnm></au><au><snm>Li</snm><fnm>WZ</fnm></au><au><snm>Godzik</snm><fnm>A</fnm></au></aug><source>Protein Science</source><pubdate>2000</pubdate><volume>9</volume><issue>2</issue><fpage>232</fpage><lpage>241</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2144550</pubid><pubid idtype="pmpid" link="fulltext">10716175</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance</p></title><aug><au><snm>Sadreyev</snm><fnm>R</fnm></au><au><snm>Grishin</snm><fnm>N</fnm></au></aug><source>Journal of Molecular Biology</source><pubdate>2003</pubdate><volume>326</volume><issue>1</issue><fpage>317</fpage><lpage>336</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0022-2836(02)01371-2</pubid><pubid idtype="pmpid" link="fulltext">12547212</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments</p></title><aug><au><snm>Przybylski</snm><fnm>D</fnm></au><au><snm>Rost</snm><fnm>B</fnm></au></aug><source>Nucleic Acids 
Research</source><pubdate>2007</pubdate><volume>35</volume><issue>7</issue><fpage>2238</fpage><lpage>2246</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkm107</pubid><pubid idtype="pmcid">1874647</pubid><pubid idtype="pmpid" link="fulltext">17369271</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Searching databases of conserved sequence regions by aligning protein multiple-alignments</p></title><aug><au><snm>Pietrokovski</snm><fnm>S</fnm></au></aug><source>Nucleic Acids Research</source><pubdate>1996</pubdate><volume>24</volume><issue>19</issue><fpage>3836</fpage><lpage>3845</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/24.19.3836</pubid><pubid idtype="pmcid">146152</pubid><pubid idtype="pmpid" link="fulltext">8871566</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Within the twilight zone: A sensitive profile-profile comparison tool based on information theory</p></title><aug><au><snm>Yona</snm><fnm>G</fnm></au><au><snm>Levitt</snm><fnm>M</fnm></au></aug><source>Journal of Molecular Biology</source><pubdate>2002</pubdate><volume>315</volume><issue>5</issue><fpage>1257</fpage><lpage>1275</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1006/jmbi.2001.5293</pubid><pubid idtype="pmpid" link="fulltext">11827492</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>DBMLoc: a Database of proteins with multiple subcellular localizations</p></title><aug><au><snm>Zhang</snm><fnm>S</fnm></au><au><snm>Xia</snm><fnm>X</fnm></au><au><snm>Shen</snm><fnm>J</fnm></au><au><snm>Zhou</snm><fnm>Y</fnm></au><au><snm>Sun</snm><fnm>Z</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><fpage>127</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-9-127</pubid><pubid idtype="pmcid">2292141</pubid><pubid idtype="pmpid" link="fulltext">18304364</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>ngLOC: an n-gram-based Bayesian method for estimating the subcellular 
proteomes of eukaryotes</p></title><aug><au><snm>King</snm><fnm>BR</fnm></au><au><snm>Guda</snm><fnm>C</fnm></au></aug><source>Genome Biology</source><pubdate>2007</pubdate><volume>8</volume><issue>5</issue><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2007-8-5-r68</pubid><pubid idtype="pmcid">1929137</pubid><pubid idtype="pmpid" link="fulltext">17472741</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>HYPROSP II--a knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence</p></title><aug><au><snm>Lin</snm><fnm>HN</fnm></au><au><snm>Chang</snm><fnm>JM</fnm></au><au><snm>Wu</snm><fnm>KP</fnm></au><au><snm>Sung</snm><fnm>TY</fnm></au><au><snm>Hsu</snm><fnm>WL</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>15</issue><fpage>3227</fpage><lpage>3233</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti524</pubid><pubid idtype="pmpid" link="fulltext">15932901</pubid></pubidlist></xrefbib></bibl><bibl id="B25"><title><p>HYPROSP: a hybrid protein secondary structure prediction algorithm--a knowledge-based approach</p></title><aug><au><snm>Wu</snm><fnm>KP</fnm></au><au><snm>Lin</snm><fnm>HN</fnm></au><au><snm>Chang</snm><fnm>JM</fnm></au><au><snm>Sung</snm><fnm>TY</fnm></au><au><snm>Hsu</snm><fnm>WL</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2004</pubdate><volume>32</volume><issue>17</issue><fpage>5059</fpage><lpage>5065</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkh836</pubid><pubid idtype="pmcid">521652</pubid><pubid idtype="pmpid" link="fulltext">15448186</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>HYPLOSP: a knowledge-based approach to protein local structure prediction</p></title><aug><au><snm>Chen</snm><fnm>CT</fnm></au><au><snm>Lin</snm><fnm>HN</fnm></au><au><snm>Sung</snm><fnm>TY</fnm></au><au><snm>Hsu</snm><fnm>WL</fnm></au></aug><source>J Bioinform Comput 
Biol</source><pubdate>2006</pubdate><volume>4</volume><issue>6</issue><fpage>1287</fpage><lpage>1307</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1142/S0219720006002466</pubid><pubid idtype="pmpid" link="fulltext">17245815</pubid></pubidlist></xrefbib></bibl><bibl id="B27"><title><p>Clustering protein sequences-structure prediction by transitive homology</p></title><aug><au><snm>Bolten</snm><fnm>E</fnm></au><au><snm>Schliep</snm><fnm>A</fnm></au><au><snm>Schneckener</snm><fnm>S</fnm></au><au><snm>Schomburg</snm><fnm>D</fnm></au><au><snm>Schrader</snm><fnm>R</fnm></au></aug><source>Bioinformatics</source><pubdate>2001</pubdate><volume>17</volume><issue>10</issue><fpage>935</fpage><lpage>941</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/17.10.935</pubid><pubid idtype="pmpid" link="fulltext">11673238</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>Getting the most from PSI-BLAST</p></title><aug><au><snm>Jones</snm><fnm>DT</fnm></au><au><snm>Swindells</snm><fnm>MB</fnm></au></aug><source>Trends in Biochemical Sciences</source><pubdate>2002</pubdate><volume>27</volume><issue>3</issue><fpage>161</fpage><lpage>164</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0968-0004(01)02039-4</pubid><pubid idtype="pmpid" link="fulltext">11893514</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Predicting protein function from domain content</p></title><aug><au><snm>Forslund</snm><fnm>K</fnm></au><au><snm>Sonnhammer</snm><fnm>ELL</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>15</issue><fpage>1681</fpage><lpage>1687</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn312</pubid><pubid idtype="pmpid" link="fulltext">18591194</pubid></pubidlist></xrefbib></bibl><bibl id="B30"><title><p>PSORT: a program for detecting sorting signals in proteins and predicting their subcellular 
localization</p></title><aug><au><snm>Nakai</snm><fnm>K</fnm></au><au><snm>Horton</snm><fnm>P</fnm></au></aug><source>Trends Biochem Sci</source><pubdate>1999</pubdate><volume>24</volume><issue>1</issue><fpage>34</fpage><lpage>36</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0968-0004(98)01336-X</pubid><pubid idtype="pmpid" link="fulltext">10087920</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>pTARGET: a new method for predicting protein subcellular localization in eukaryotes</p></title><aug><au><snm>Guda</snm><fnm>C</fnm></au><au><snm>Subramaniam</snm><fnm>S</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>24</issue><fpage>4434</fpage><lpage>4434</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/bti758</pubid></xrefbib></bibl><bibl id="B32"><title><p>Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs</p></title><aug><au><snm>Park</snm><fnm>KJ</fnm></au><au><snm>Kanehisa</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2003</pubdate><volume>19</volume><issue>13</issue><fpage>1656</fpage><lpage>1663</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg222</pubid><pubid idtype="pmpid" link="fulltext">12967962</pubid></pubidlist></xrefbib></bibl></refgrp>
   </bm>
</art>