<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-310</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Computational identification of ubiquitylation sites from protein sequences</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Tung</snm>
               <fnm>Chun-Wei</fnm>
               <insr iid="I1"/>
               <email>cwtung@livemail.tw</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Ho</snm>
               <fnm>Shinn-Ying</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>syho@mail.nctu.edu.tw</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute of Bioinformatics, National Chiao Tung University, Hsinchu 300, Taiwan</p>
            </ins>
            <ins id="I2">
               <p>Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>310</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/310</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18625080</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-310</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>25</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>15</day>
               <month>7</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>15</day>
               <month>7</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Tung and Ho; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods were developed toward effective identification of ubiquitylation sites. To efficiently explore more undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method to identify promising ubiquitylation sites.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We established a ubiquitylation dataset consisting of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in the UbiProt database. This study first evaluates promising sequence-based features and classifiers for the prediction of ubiquitylation sites by assessing three kinds of features (amino acid identity, evolutionary information, and physicochemical property) and three classifiers (support vector machine, <it>k</it>-nearest neighbor, and Na&#239;veBayes). Results show that the full set of 531 physicochemical properties and the support vector machine (SVM) are the best kind of features and the best classifier, respectively; their combination has a prediction accuracy of 72.19% using leave-one-out cross-validation.</p>
               <p>Consequently, an informative physicochemical property mining algorithm (IPMA) is proposed to select an informative subset of the 531 physicochemical properties. A prediction system, UbiPred, was implemented using an SVM with a feature set of 31 informative physicochemical properties selected by IPMA, which improves the accuracy from 72.19% to 84.44%. To further analyze the informative physicochemical properties, the decision tree method C5.0 was used to acquire if-then rule-based knowledge for predicting ubiquitylation sites. UbiPred can screen promising ubiquitylation sites from putative non-ubiquitylation sites using prediction scores. By applying UbiPred, 23 promising ubiquitylation sites were identified from an independent dataset of 3424 putative non-ubiquitylation sites; these sites were also validated using the obtained prediction rules.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We have proposed an algorithm, IPMA, for mining informative physicochemical properties from protein sequences to build an SVM-based prediction system, UbiPred. UbiPred predicts ubiquitylation sites and assigns each a prediction score to help biologists identify promising sites for experimental verification. UbiPred has been implemented as a web server and is available at <url>http://iclab.life.nctu.edu.tw/ubipred</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Ubiquitylation (also called ubiquitination) is an important mechanism of post-translational modification in which ubiquitin is linked to specific lysine residues of target proteins by forming isopeptide bonds. Three enzymes, the activating enzyme (E1), the conjugating enzyme (E2), and the ubiquitin ligase (E3), are involved in the ubiquitylation process. Another enzyme, E4, can help to stabilize and extend the polyubiquitin chain <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. The first discovered function of ubiquitylation is to target proteins for subsequent degradation by the ATP-dependent ubiquitin-proteasome system. Subsequently, many regulatory functions of ubiquitylation were discovered, including the regulation of DNA repair and transcription, control of signal transduction, and involvement in endocytosis and sorting <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>Because of the important regulatory roles of ubiquitylation, numerous methods were developed to purify ubiquitylated proteins <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Also, the growing number of large-scale studies that identify ubiquitylated proteins and analyze the ubiquitin-related proteome reflects the importance of identifying ubiquitylated proteins and ubiquitylation sites <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Most of these studies applied three steps: affinity purification, proteolytic digestion, and analysis using mass spectrometry <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Such studies require considerable experimental effort. Therefore, this study focuses on the computational identification of ubiquitylation sites from protein sequences by developing an accurate prediction method.</p>
         <p>Using both informative features and an appropriate classifier is essential to design an effective system for predicting ubiquitylation sites. In the past, numerous sequence-derived features have been proposed to discriminate protein or residue functions. For example, the AutoMotif server utilized six kinds of features and a support vector machine (SVM) to predict post-translational modifications <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The POPI server used physicochemical properties as efficient features to predict peptide immunogenicity <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. In this study, three kinds of useful features that can be extracted from protein sequences are evaluated: conventional amino acid identity <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B13">13</abbr></abbrgrp>, evolutionary information <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>, and physicochemical property <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B16">16</abbr></abbrgrp>. Three machine learning classifiers, <it>k</it>-nearest neighbor, Na&#239;veBayes, and SVM, are also evaluated.</p>
         <p>We established a ubiquitylation dataset (UBIDATA) consisting of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in UbiProt, a database of ubiquitylated proteins <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. For predicting the function of a residue in a protein, it is well recognized that nearby residues influence the property and structure of the central residue. This environmental information, which has been used extensively in previous studies <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>, helps to enhance prediction accuracy. We constructed ten datasets with window sizes 11, 13,..., 29 from UBIDATA to evaluate all combinations of the evaluated features and classifiers. According to the prediction accuracies obtained using 10-fold cross-validation (10-CV), the physicochemical property and SVM are the best kind of features and the best classifier, respectively.</p>
         <p>In order to provide insights into the underlying mechanism of ubiquitylation and improve the prediction accuracy, an informative physicochemical property mining algorithm (IPMA) is proposed to further select an informative subset of the 531 physicochemical properties based on an inheritable bi-objective genetic algorithm <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. This approach, which identifies a problem-dependent set of informative physicochemical properties that serve as input features to an SVM, has been shown to be effective in predicting both protein subnuclear localization <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and immunogenicity of MHC class I binding peptides <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. By applying IPMA to mine informative physicochemical properties and tune SVM parameters while maximizing the 10-CV accuracy, a set of 31 informative physicochemical properties was obtained. Based on the informative physicochemical properties, the decision tree method C5.0 <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> was used to acquire if-then rule-based knowledge that helps biologists further understand the mechanism of ubiquitylation.</p>
         <p>A prediction system UbiPred for predicting ubiquitylation sites was implemented by utilizing the 31 informative physicochemical properties. UbiPred performs well with a prediction accuracy of 84.44% using leave-one-out cross-validation (LOOCV), compared with the SVM-based methods using amino acid identity (65.67%), evolutionary information (66.33%) and all physicochemical properties (72.19%). Besides the prediction accuracy, the receiver operating characteristic (ROC) curve is commonly used to evaluate the discrimination ability of a classifier. The larger the area under the ROC curve, the better discrimination ability a classifier has. The area under the ROC curve of UbiPred is as high as 0.85 by using the decision value of SVM as a tuning parameter. UbiPred has been implemented as a web server and is available online <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
         <p>Because there are still many ubiquitylation sites to be discovered <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, UbiPred assigns each predicted ubiquitylation site a prediction score (ranging from 0 to 1) to help biologists select the most promising sites for experimental verification. By selecting the sites with scores larger than 0.85 from an independent dataset of 3424 putative non-ubiquitylation sites, 23 promising ubiquitylation sites can be identified. The <it>in silico </it>validation using the prediction rules obtained from C5.0 provides further support for identifying the 23 promising sites as ubiquitylation sites.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Assessments of features and classifiers</p>
            </st>
            <p>The dataset UBIDATA consists of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in UbiProt <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Ten datasets with window sizes <it>w </it>= 11, 13,..., 29 were constructed from UBIDATA to assess three kinds of sequence-based features and three classifiers: IBk (<it>k</it>-nearest neighbor classifier), Na&#239;veBayes, and SVM (see the section Methods). In assessing the feature of physicochemical properties, all <it>n </it>= 531 properties available were used. Five versions of the classifier IBk with <it>k </it>= 1, 3,..., 9 were evaluated to find the best value of <it>k </it>for classification. For Na&#239;veBayes, both the normal distribution and the estimated distribution were applied to evaluate prediction performances.</p>
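            <p>The window construction described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name <it>lysine_windows</it>, the padding symbol "X" for positions beyond the sequence termini, and the toy sequence are hypothetical, because the text does not state how windows near the termini were handled.</p>

```python
def lysine_windows(sequence, w, pad="X"):
    """Return (1-based position, window) pairs for every lysine (K).

    The pad symbol is a hypothetical placeholder for positions that fall
    beyond the sequence termini; the source does not say how these were
    handled.
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd")
    half = w // 2
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # padded[i:i + w] is centred on sequence[i]
            windows.append((i + 1, padded[i:i + w]))
    return windows


# Window sizes used in the text: 11, 13, ..., 29
WINDOW_SIZES = list(range(11, 30, 2))

if __name__ == "__main__":
    toy = "MKTAYIAKQRQISFVK"  # made-up sequence, not taken from UbiProt
    for position, window in lysine_windows(toy, 11):
        print(position, window)
```

            <p>Each window would then be labelled as a positive or putative negative sample according to whether the central lysine is an annotated ubiquitylation site.</p>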
            <p>Figure <figr fid="F1">1</figr> shows the accuracies of 10-CV using IBk, Na&#239;veBayes, and SVM with the three kinds of features. For each kind of feature, the SVM performs best among the classifiers. The best performances of the SVM using amino acid identity (<it>w </it>= 13), evolutionary information (<it>w </it>= 13), and physicochemical property (<it>w </it>= 17) are 68.00%, 66.67%, and 72.85%, where the corresponding values of the SVM parameters (<it>C</it>, &#947;) are (2, 2<sup>-2</sup>), (1, 2<sup>-7</sup>) and (1, 2<sup>-4</sup>), respectively. The results reveal that the physicochemical property is the best kind of feature for the SVM in predicting ubiquitylation sites, compared with amino acid identity and evolutionary information.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Performance comparisons among various classifiers with the three kinds of features</p>
               </caption>
               <text>
                  <p>Performance comparisons among various classifiers with the three kinds of features. (a) physicochemical property, (b) amino acid identity, and (c) evolutionary information.</p>
               </text>
               <graphic file="1471-2105-9-310-1"/>
            </fig>
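            <p>As a rough illustration of how a cross-validated accuracy such as those in Figure 1 is computed, the sketch below runs <it>k</it>-fold cross-validation with a 1-nearest-neighbor classifier (the <it>k </it>= 1 case of a nearest-neighbor method such as IBk). The fold assignment, the Euclidean distance, the function names, and the toy data are all assumptions made for illustration; they do not reproduce the evaluation protocol or the data of this study.</p>

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)),
    )
    return train_y[best]


def cross_validated_accuracy(xs, ys, k=10, seed=0):
    """Fraction of samples predicted correctly while held out."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [j for j in range(len(xs)) if j not in held]
        train_x = [xs[j] for j in train]
        train_y = [ys[j] for j in train]
        for j in held_out:
            if nearest_neighbor_predict(train_x, train_y, xs[j]) == ys[j]:
                correct += 1
    return correct / len(xs)


if __name__ == "__main__":
    # Toy data: two well-separated clusters, not real feature vectors
    xs = [(0.01 * i, 0.0) for i in range(20)] + [(10 + 0.01 * i, 0.0) for i in range(20)]
    ys = [0] * 20 + [1] * 20
    print(cross_validated_accuracy(xs, ys, k=10))
```

            <p>Setting <it>k </it>to the number of samples makes every fold a single sample, which corresponds to leave-one-out cross-validation.</p>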
            <p>Figure <figr fid="F2">2</figr> shows the sequence logo of the 151 positive samples with <it>w </it>= 21 generated by the WebLogo tool <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The low information content of the sequence logo reveals the disadvantage of the SVM using the two position-based features, amino acid identity and evolutionary information, compared with the non-position-based features, the physicochemical properties, which use measurements averaged over the amino acids in a sequence.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>The sequence logo of the 151 positive samples with <it>w </it>= 21</p>
               </caption>
               <text>
                  <p>The sequence logo of the 151 positive samples with <it>w </it>= 21. (a) information content and (b) frequency plot.</p>
               </text>
               <graphic file="1471-2105-9-310-2"/>
            </fig>
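            <p>The averaged, non-position-based encoding mentioned above might look like the sketch below, in which each property contributes one feature: the mean of its per-residue values over the window. The Kyte-Doolittle hydropathy scale is used purely as a familiar example of a per-residue scale and is not one of the 31 properties in Table 1. Skipping symbols that have no value in the scale (for example, a padding character) is an assumption, as are the function names.</p>

```python
# Kyte-Doolittle hydropathy values, used only as an example scale
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(window, scale):
    """Mean property value over the residues of a window.

    Symbols absent from the scale are skipped (an assumption, not a
    rule stated in the source).
    """
    values = [scale[residue] for residue in window if residue in scale]
    if not values:
        raise ValueError("window contains no residue covered by the scale")
    return sum(values) / len(values)


def feature_vector(window, scales):
    """One averaged feature per property scale."""
    return [window_average(window, scale) for scale in scales]


if __name__ == "__main__":
    toy_window = "TAYIAKQRQIS"  # made-up 11-residue window
    print(feature_vector(toy_window, [KYTE_DOOLITTLE]))
```

            <p>Because the value is an average, the feature depends on the composition of the window rather than on the position of each residue, which is the contrast with the position-based features drawn in the text.</p>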
         </sec>
         <sec>
            <st>
               <p>Informative physicochemical properties</p>
            </st>
            <p>Many of the 531 physicochemical properties may be irrelevant features or may even interfere with the prediction of the SVM classifier. Therefore, it is important to mine informative physicochemical properties to improve the prediction accuracy. IPMA determines a feature set of <it>r </it>informative physicochemical properties and the values of the SVM parameters (<it>C </it>and &#947;) for a given window size <it>w</it>. Because IPMA is non-deterministic, the obtained solutions differ from run to run. To obtain features with robust performance, 30 independent runs of IPMA were performed for each window size <it>w</it>.</p>
            <p>The highest, mean, and lowest prediction accuracies of IPMA using 10-CV are shown in Fig. <figr fid="F3">3</figr>. For comparison, the decision tree method C5.0 <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> with the ability of feature selection based on information gain was also evaluated. The accuracies of C5.0 and SVM with the properties selected by C5.0 for various window sizes are also given in Fig. <figr fid="F3">3</figr>. For all window sizes, the accuracies of SVM using informative physicochemical properties mined by IPMA are better than those of C5.0, SVM using all 531 physicochemical properties, and SVM using the C5.0-selected properties. Considering the mean accuracies of SVM with informative physicochemical properties in Fig. <figr fid="F3">3</figr>, the best window size is <it>w </it>= 21.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Performance comparisons between the SVM with informative physicochemical properties (SVM+IPCP) and other compared classifiers</p>
               </caption>
               <text>
                  <p>Performance comparisons between the SVM with informative physicochemical properties (SVM+IPCP) and other compared classifiers.</p>
               </text>
               <graphic file="1471-2105-9-310-3"/>
            </fig>
            <p>Figure <figr fid="F4">4</figr> shows the best 10-CV accuracies of IPMA with <it>w </it>= 21 for various numbers of features from 30 independent runs. The accuracy for <it>w </it>= 21 can be improved from 69.87% to 85.43% by using <it>m </it>= 31 out of <it>n </it>= 531 physicochemical properties, where the values of the SVM parameters are <it>C </it>= 4 and &#947; = 0.5. The 31 informative physicochemical properties constitute a good feature set obtained by considering the inter-correlation among properties.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>The best 10-CV accuracies of prediction using SVM with the window size 21 for various numbers of features (properties) selected by IPMA from 30 independent runs</p>
               </caption>
               <text>
                  <p>The best 10-CV accuracies of prediction using SVM with the window size 21 for various numbers of features (properties) selected by IPMA from 30 independent runs.</p>
               </text>
               <graphic file="1471-2105-9-310-4"/>
            </fig>
            <p>The quantified effectiveness of individual physicochemical properties on prediction is useful to characterize the ubiquitylation mechanism by physicochemical properties. Orthogonal experimental design with factor analysis <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> can be used to estimate the individual effects of physicochemical properties according to the value of main effect difference (MED) <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B16">16</abbr></abbrgrp>. The property with the largest value of MED is the most effective in predicting ubiquitylation sites.</p>
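            <p>To make the idea of a main effect difference concrete, the sketch below assumes the usual two-level orthogonal-design formulation, in which the MED of a factor is the absolute difference between the summed responses at its two levels. This definition is an assumption here; the exact formulation used in this study is the one given in the cited references. The L4 orthogonal array is standard, whereas the response values and the reading of level 1 as "property included" are hypothetical.</p>

```python
# Standard L4 orthogonal array: 4 experiments, 3 two-level factors
L4 = [
    (1, 1, 1),
    (1, 2, 2),
    (2, 1, 2),
    (2, 2, 1),
]


def main_effect_differences(design, responses):
    """MED per factor, assumed to be |S1 - S2|.

    S1 and S2 are the sums of the responses over the experiments in
    which the factor is at level 1 and level 2, respectively.
    """
    meds = []
    for factor in range(len(design[0])):
        s1 = sum(y for row, y in zip(design, responses) if row[factor] == 1)
        s2 = sum(y for row, y in zip(design, responses) if row[factor] == 2)
        meds.append(abs(s1 - s2))
    return meds


if __name__ == "__main__":
    # Hypothetical accuracies (%) for the four experiments
    toy_accuracies = [80.0, 70.0, 60.0, 50.0]
    print(main_effect_differences(L4, toy_accuracies))
```

            <p>In this toy case the first factor changes the summed response the most and would therefore be ranked as the most effective, while the third factor has no net effect.</p>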
            <p>The 31 informative properties are ranked according to MED, and their descriptions are shown in Table <tblr tid="T1">1</tblr>. The most effective property with <it>MED </it>= 31.79 is NADH010102 denoting "hydropathy scale based on self-information values in the two-state model of 9% accessibility". The least effective properties with <it>MED </it>= 1.32 are NAKH900101 and QIAN880129 denoting "AA composition of total proteins" and "weights for coil at the window position of -4", respectively. The ranked informative physicochemical properties provide valuable information to biologists for further experimental verification.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The 31 informative physicochemical properties mined by IPMA.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>AAindex identity</p>
                     </c>
                     <c ca="left">
                        <p>Description</p>
                     </c>
                     <c ca="left">
                        <p>MED</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NADH010102</p>
                     </c>
                     <c ca="left">
                        <p>Hydropathy scale based on self-information values in the two-state model of 9% accessibility</p>
                     </c>
                     <c ca="left">
                        <p>31.79</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>BROC820102</p>
                     </c>
                     <c ca="left">
                        <p>Retention coefficient in HFBA</p>
                     </c>
                     <c ca="left">
                        <p>29.80</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MEIH800102</p>
                     </c>
                     <c ca="left">
                        <p>Average reduced distance for side chain</p>
                     </c>
                     <c ca="left">
                        <p>28.48</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LEVM780101</p>
                     </c>
                     <c ca="left">
                        <p>Normalized frequency of alpha-helix, with weights</p>
                     </c>
                     <c ca="left">
                        <p>25.17</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GUYH850104</p>
                     </c>
                     <c ca="left">
                        <p>Apparent partition energies calculated from Janin index</p>
                     </c>
                     <c ca="left">
                        <p>23.84</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CORJ870101</p>
                     </c>
                     <c ca="left">
                        <p>NNEIG index</p>
                     </c>
                     <c ca="left">
                        <p>23.18</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>RACS770102</p>
                     </c>
                     <c ca="left">
                        <p>Average reduced distance for side chain</p>
                     </c>
                     <c ca="left">
                        <p>22.52</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GEOR030108</p>
                     </c>
                     <c ca="left">
                        <p>Linker propensity from helical (annotated by DSSP) dataset</p>
                     </c>
                     <c ca="left">
                        <p>22.52</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>HARY940101</p>
                     </c>
                     <c ca="left">
                        <p>Mean volumes of residues buried in protein interiors</p>
                     </c>
                     <c ca="left">
                        <p>21.85</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GRAR740102</p>
                     </c>
                     <c ca="left">
                        <p>Polarity</p>
                     </c>
                     <c ca="left">
                        <p>19.87</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GUYH850105</p>
                     </c>
                     <c ca="left">
                        <p>Apparent partition energies calculated from Chothia index</p>
                     </c>
                     <c ca="left">
                        <p>19.87</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MEIH800103</p>
                     </c>
                     <c ca="left">
                        <p>Average side chain orientation angle</p>
                     </c>
                     <c ca="left">
                        <p>17.88</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>KRIW790102</p>
                     </c>
                     <c ca="left">
                        <p>Fraction of site occupied by water</p>
                     </c>
                     <c ca="left">
                        <p>17.88</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LEVM780106</p>
                     </c>
                     <c ca="left">
                        <p>Normalized frequency of reverse turn, unweighted</p>
                     </c>
                     <c ca="left">
                        <p>14.57</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>BULH740102</p>
                     </c>
                     <c ca="left">
                        <p>Apparent partial specific volume</p>
                     </c>
                     <c ca="left">
                        <p>13.25</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>FAUJ880101</p>
                     </c>
                     <c ca="left">
                        <p>Graph shape index</p>
                     </c>
                     <c ca="left">
                        <p>11.92</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PUNT030102</p>
                     </c>
                     <c ca="left">
                        <p>Knowledge-based membrane-propensity scale from 3D_Helix in MPtopo databases</p>
                     </c>
                     <c ca="left">
                        <p>10.60</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>HUTJ700103</p>
                     </c>
                     <c ca="left">
                        <p>Entropy of formation</p>
                     </c>
                     <c ca="left">
                        <p>9.93</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>EISD840101</p>
                     </c>
                     <c ca="left">
                        <p>Consensus normalized hydrophobicity scale</p>
                     </c>
                     <c ca="left">
                        <p>8.61</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CEDJ970105</p>
                     </c>
                     <c ca="left">
                        <p>Composition of amino acids in nuclear proteins (percent)</p>
                     </c>
                     <c ca="left">
                        <p>7.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ZIMJ680102</p>
                     </c>
                     <c ca="left">
                        <p>Bulkiness</p>
                     </c>
                     <c ca="left">
                        <p>7.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CEDJ970103</p>
                     </c>
                     <c ca="left">
                        <p>Composition of amino acids in membrane proteins (percent)</p>
                     </c>
                     <c ca="left">
                        <p>5.96</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CHOC760103</p>
                     </c>
                     <c ca="left">
                        <p>Proportion of residues 95% buried</p>
                     </c>
                     <c ca="left">
                        <p>5.30</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CEDJ970102</p>
                     </c>
                     <c ca="left">
                        <p>Composition of amino acids in anchored proteins (percent)</p>
                     </c>
                     <c ca="left">
                        <p>5.30</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ROSM880102</p>
                     </c>
                     <c ca="left">
                        <p>Side chain hydropathy, corrected for solvation</p>
                     </c>
                     <c ca="left">
                        <p>4.64</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>BROC820101</p>
                     </c>
                     <c ca="left">
                        <p>Retention coefficient in TFA</p>
                     </c>
                     <c ca="left">
                        <p>4.64</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>FAUJ830101</p>
                     </c>
                     <c ca="left">
                        <p>Hydrophobic parameter pi</p>
                     </c>
                     <c ca="left">
                        <p>1.99</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NAKH920101</p>
                     </c>
                     <c ca="left">
                        <p>AA composition of CYT of single-spanning proteins</p>
                     </c>
                     <c ca="left">
                        <p>1.99</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ZHOH040102</p>
                     </c>
                     <c ca="left">
                        <p>The relative stability scale extracted from mutation experiments</p>
                     </c>
                     <c ca="left">
                        <p>1.99</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NAKH900101</p>
                     </c>
                     <c ca="left">
                        <p>AA composition of total proteins</p>
                     </c>
                     <c ca="left">
                        <p>1.32</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>QIAN880129</p>
                     </c>
                     <c ca="left">
                        <p>Weights for coil at the window position of -4</p>
                     </c>
                     <c ca="left">
                        <p>1.32</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Knowledge of data mining</p>
            </st>
            <p>Although the prediction accuracy of the SVM is rather high compared with the other classifiers evaluated, its prediction rules are not easy for biologists to interpret. To acquire interpretable knowledge from the experimental data, C5.0 was applied to construct a compact decision tree using the 31 informative physicochemical properties selected by IPMA on the whole training dataset. Figure <figr fid="F5">5</figr> shows the decision tree constructed by C5.0. When this decision tree is used to classify the whole training dataset, the accuracy is 72.5%. The decision tree can be directly converted into a set of eight interpretable rules <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, consisting of three if-then rules for ubiquitylation sites and five for non-ubiquitylation sites.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The decision tree derived by C5.0 using the informative physicochemical properties as features for classification of ubiquitylation sites</p>
               </caption>
               <text>
                  <p>The decision tree derived by C5.0 using the informative physicochemical properties as features for classification of ubiquitylation sites.</p>
               </text>
               <graphic file="1471-2105-9-310-5"/>
            </fig>
            <p>For ease of interpretation, five concise if-then rules obtained from C5.0 are shown in Table <tblr tid="T2">2</tblr>. The first rule, which has the highest confidence value (0.96), can be interpreted as 'given a peptide with a central lysine residue (<it>w </it>= 21), if the average reduced distance for side chain <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> (property MEIH800102) is less than or equal to 0.95381, then the residue is a non-ubiquitylation site with a confidence value of 0.96'. This rule covers 23 sites in the training dataset, none of which is misclassified.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Five concise if-then rules with confidence larger than 0.5 obtained by using C5.0 and 31 informative physicochemical properties.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p>#</p>
                     </c>
                     <c ca="center">
                        <p>Rule</p>
                     </c>
                     <c ca="center">
                        <p>Confidence</p>
                     </c>
                     <c ca="center">
                        <p>Ubiquitylation sites</p>
                     </c>
                     <c ca="center">
                        <p>Covered samples</p>
                     </c>
                     <c ca="center">
                        <p>Misclassified samples</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>MEIH800102 &#8804; 0.95381</p>
                     </c>
                     <c ca="center">
                        <p>0.96</p>
                     </c>
                     <c ca="center">
                        <p>N</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>HARY940101 > 135.2 AND CORJ870101 > 49.70762</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                     <c ca="center">
                        <p>N</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>CEDJ970105 > 6.805556</p>
                     </c>
                     <c ca="center">
                        <p>0.85</p>
                     </c>
                     <c ca="center">
                        <p>N</p>
                     </c>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>GEOR030108 &#8804; 0.931333</p>
                     </c>
                     <c ca="center">
                        <p>0.75</p>
                     </c>
                     <c ca="center">
                        <p>N</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>5</p>
                     </c>
                     <c ca="left">
                        <p>MEIH800102 > 0.95381</p>
                     </c>
                     <c ca="center">
                        <p>0.54</p>
                     </c>
                     <c ca="center">
                        <p>Y</p>
                     </c>
                     <c ca="center">
                        <p>279</p>
                     </c>
                     <c ca="center">
                        <p>128</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>Only one of the five rules identifies ubiquitylation sites, and it has a moderate confidence value of 0.54. This rule states that if the average reduced distance for side chain is larger than 0.95381, then the residue is a ubiquitylation site with a confidence value of 0.54. The moderate confidence reveals that ubiquitylation sites are not easily discriminated from non-ubiquitylation sites. Furthermore, the property MEIH800102 plays an important role in predicting ubiquitylation sites. This is consistent with its MED value (28.48) in Table <tblr tid="T1">1</tblr>, where MEIH800102 is an informative property ranked 3.</p>
            <p>The second rule states that if the mean volume of residues buried in protein interiors <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> (property HARY940101) is larger than 135.2 and the NNEIG index <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> (property CORJ870101) is larger than 49.70762, then the residue is a non-ubiquitylation site with a confidence value of 0.90. This rule covers 49 samples in the training dataset, 4 of which are misclassified.</p>
            <p>The third rule states that if the composition of amino acids in nuclear proteins (percent) <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> (property CEDJ970105) is larger than 6.805556, then the residue is a non-ubiquitylation site with a confidence value of 0.85. This rule covers 18 samples in the training dataset, 2 of which are misclassified.</p>
            <p>The fourth rule states that if the linker propensity from helical (annotated by DSSP) dataset <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> (property GEOR030108) is less than or equal to 0.931333, then the residue is a non-ubiquitylation site with a confidence value of 0.75. This rule covers 10 samples in the training dataset, 2 of which are misclassified.</p>
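            <p>As an illustration only, the five rules of Table 2 can be written as a small Python function. The thresholds and confidence values are copied from Table 2. The order in which the rules are tried (table order, first match wins) is an assumption of this sketch and may differ from how C5.0 combines its rules. The helper laplace_confidence is likewise hypothetical: it checks the observation that the confidence values in Table 2 appear to match the Laplace ratio (n - m + 1)/(n + 2) for a rule covering n samples with m misclassified.</p>

```python
from typing import Dict, Tuple


def classify_site(features: Dict[str, float]) -> Tuple[str, float]:
    """Return (label, confidence) for one lysine-centred peptide (w = 21).

    `features` maps AAindex property IDs to the value computed for the peptide.
    Rule order is an assumption: rules are tried in the order of Table 2.
    """
    # Rule 1: average reduced distance for side chain at most 0.95381
    if 0.95381 >= features["MEIH800102"]:
        return "N", 0.96
    # Rule 2: buried-residue volume and NNEIG index both above threshold
    if features["HARY940101"] > 135.2 and features["CORJ870101"] > 49.70762:
        return "N", 0.90
    # Rule 3: composition of amino acids in nuclear proteins above threshold
    if features["CEDJ970105"] > 6.805556:
        return "N", 0.85
    # Rule 4: linker propensity (helical dataset) at most 0.931333
    if 0.931333 >= features["GEOR030108"]:
        return "N", 0.75
    # Rule 5: MEIH800102 above 0.95381 (the only rule predicting "Y")
    return "Y", 0.54


def laplace_confidence(covered: int, misclassified: int) -> float:
    """Laplace ratio (n - m + 1) / (n + 2) for a rule covering n samples."""
    return (covered - misclassified + 1) / (covered + 2)


if __name__ == "__main__":
    # Feature values below are made up for demonstration.
    peptide = {
        "MEIH800102": 1.02,
        "HARY940101": 120.0,
        "CORJ870101": 50.1,
        "CEDJ970105": 5.5,
        "GEOR030108": 1.01,
    }
    print(classify_site(peptide))
    # (covered, misclassified) pairs copied from Table 2
    for covered, wrong in [(23, 0), (49, 4), (18, 2), (10, 2), (279, 128)]:
        print(covered, wrong, round(laplace_confidence(covered, wrong), 2))
```

            <p>Because rules 1 and 5 split on the same threshold of MEIH800102, every peptide matches at least one rule in this sketch, so no default class is needed.</p>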
         </sec>
         <sec>
            <st>
               <p>Prediction system UbiPred</p>
            </st>
            <p>The 31 informative physicochemical properties (shown in Table <tblr tid="T1">1</tblr>) with <it>w </it>= 21, <it>C </it>= 4, and &#947; = 0.5 were used to implement the prediction system UbiPred for identifying ubiquitylation sites. The system flow of the prediction server UbiPred is shown in Fig. <figr fid="F6">6</figr>. The input to UbiPred is a protein sequence. UbiPred automatically encodes each peptide of size <it>w </it>= 21 centered on a lysine residue using the 31 informative physicochemical properties. Each lysine residue is then annotated with its predicted class (ubiquitylation site or not) and a prediction score.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>The system flow of the prediction server UbiPred</p>
               </caption>
               <text>
                  <p>The system flow of the prediction server UbiPred.</p>
               </text>
               <graphic file="1471-2105-9-310-6"/>
            </fig>
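            <p>The first step of this flow, extracting a window of size w around every lysine, can be sketched as follows. The function names, the padding character "X" for windows that run past either end of the sequence, and the toy property scale are all assumptions of this sketch: the text does not state how UbiPred handles terminal lysines, and real scales would come from AAindex.</p>

```python
from typing import Dict, List, Tuple


def lysine_windows(sequence: str, w: int = 21, pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in `sequence`.

    Windows that extend past a terminus are filled with `pad` (an assumption).
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd so that K is central")
    half = w // 2
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # index i in `sequence` is index i + half in `padded`
            windows.append((i + 1, padded[i:i + w]))
    return windows


def mean_property(window: str, scale: Dict[str, float]) -> float:
    """Average one property over the residues of a window, skipping padding."""
    values = [scale[res] for res in window if res in scale]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    toy_scale = {"M": 1.0, "K": 2.0, "T": 3.0, "A": 4.0}  # made-up values
    for position, window in lysine_windows("MKTAYIAKQR", w=5):
        print(position, window, mean_property(window, toy_scale))
```

            <p>Each window would then be turned into a feature vector with one such value per selected property before being passed to the SVM.</p>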
            <p>For comparison with UbiPred, the LOOCV performances of SVMs using three other kinds of features (all physicochemical properties, amino acid identity, and evolutionary information) were also evaluated using their corresponding best parameter settings obtained from the previous learning results, as shown in Table <tblr tid="T3">3</tblr>.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>The LOOCV performances of the SVM with various kinds of features: </p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Feature</p>
                     </c>
                     <c ca="center">
                        <p>Window size <it>w</it></p>
                     </c>
                     <c ca="center">
                        <p>C</p>
                     </c>
                     <c ca="right">
                        <p>
                           <it>&#947;</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>ACC (%)</p>
                     </c>
                     <c ca="center">
                        <p>SEN (%)</p>
                     </c>
                     <c ca="center">
                        <p>SPE (%)</p>
                     </c>
                     <c ca="center">
                        <p>MCC</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>Informative physicochemical properties (UbiPred)</p>
                     </c>
                     <c ca="center">
                        <p>21</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>2<sup>-1</sup></p>
                     </c>
                     <c ca="center">
                        <p>84.44</p>
                     </c>
                     <c ca="center">
                        <p>83.44</p>
                     </c>
                     <c ca="center">
                        <p>85.43</p>
                     </c>
                     <c ca="center">
                        <p>0.69</p>
                     </c>
                     <c ca="center">
                        <p>0.85</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>All physicochemical properties</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2<sup>-4</sup></p>
                     </c>
                     <c ca="center">
                        <p>72.19</p>
                     </c>
                     <c ca="center">
                        <p>70.86</p>
                     </c>
                     <c ca="center">
                        <p>73.51</p>
                     </c>
                     <c ca="center">
                        <p>0.44</p>
                     </c>
                     <c ca="center">
                        <p>0.74</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>Amino acid identity</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>2<sup>-2</sup></p>
                     </c>
                     <c ca="center">
                        <p>65.67</p>
                     </c>
                     <c ca="center">
                        <p>57.33</p>
                     </c>
                     <c ca="center">
                        <p>74.00</p>
                     </c>
                     <c ca="center">
                        <p>0.32</p>
                     </c>
                     <c ca="center">
                        <p>0.70</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>Evolutionary information</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2<sup>-7</sup></p>
                     </c>
                     <c ca="center">
                        <p>66.33</p>
                     </c>
                     <c ca="center">
                        <p>72.00</p>
                     </c>
                     <c ca="center">
                        <p>60.67</p>
                     </c>
                     <c ca="center">
                        <p>0.33</p>
                     </c>
                     <c ca="center">
                        <p>0.71</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>informative physicochemical properties (UbiPred), amino acid identity, evolutionary information, and all physicochemical properties.</p>
               </tblfn>
            </tbl>
            <p>Four measurements were used to evaluate prediction performance: sensitivity (SEN), specificity (SPE), accuracy (ACC), and the Matthews correlation coefficient (MCC), defined as follows: SEN = TP/(TP + FN), SPE = TN/(TN + FP), ACC = (TP + TN)/(TP + FP + TN + FN), and MCC = ((TP &#215; TN) - (FN &#215; FP))/&#8730;((TP + FN)(TN + FP)(TP + FP)(TN + FN)), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.</p>
            <p>UbiPred achieves a prediction accuracy of 84.44%, higher than the SVMs using all physicochemical properties (72.19%), amino acid identity (65.67%) and evolutionary information (66.33%). The SEN, SPE and MCC of UbiPred are 83.44%, 85.43% and 0.69, respectively. To compare the robustness of UbiPred with that of the other methods, nonparametric ROC curves were constructed using the decision value of the SVM as the tuning parameter. The ROC curves are shown in Fig. <figr fid="F7">7</figr>, and the area under the ROC curve (AUC) was calculated for each method. UbiPred (AUC = 0.85) outperforms the SVM-based methods using all physicochemical properties (0.74), amino acid identity (0.70) and evolutionary information (0.71).</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Performance comparison of SVM with various features, informative physicochemical properties (UbiPred), amino acid identity, evolutionary information, and all physicochemical properties, in terms of receiver operating characteristic curves</p>
               </caption>
               <text>
                  <p>Performance comparison of SVM with various features, informative physicochemical properties (UbiPred), amino acid identity, evolutionary information, and all physicochemical properties, in terms of receiver operating characteristic curves.</p>
               </text>
               <graphic file="1471-2105-9-310-7"/>
            </fig>
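            <p>For readers who want to reproduce an AUC from SVM decision values, one simple route is the rank-based identity: the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher decision value, with ties counted as one half. The sketch below uses made-up decision values. It is not the authors' implementation.</p>

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of decision values."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one sample")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.3]   # hypothetical decision values
    negatives = [0.7, 0.2]
    print(roc_auc(positives, negatives))
```

            <p>The quadratic loop is adequate for a few hundred samples such as the 302 used here. A sort-based version would be preferable for much larger datasets.</p>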
            <p>Sequence redundancy may result in overestimation of prediction performance. To address this issue, six sequence identity thresholds (90%, 80%,..., 40%) were applied with CD-HIT <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> to construct six additional datasets from the dataset of <it>w </it>= 21. The numbers of positive and negative samples for each threshold are shown in Table <tblr tid="T4">4</tblr>. At the strictest threshold of 40%, only 36 samples are removed as redundant, and the resulting dataset consists of 121 positive samples and 145 negative samples. When LOOCV was applied to these datasets, the SVM with the 31 mined informative physicochemical properties and the mined SVM parameters achieved accuracies above 79% on all of them (shown in Table <tblr tid="T4">4</tblr>). These results indicate that UbiPred remains effective when redundant sequences are removed.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>The LOOCV performances of the SVM with 31 informative physicochemical properties on datasets of various sequence identity thresholds.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>Sequence identity threshold</p>
                     </c>
                     <c ca="center">
                        <p>Accuracy (%)</p>
                     </c>
                     <c ca="center">
                        <p>Number of positive samples</p>
                     </c>
                     <c ca="center">
                        <p>Number of negative samples</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>100%</p>
                     </c>
                     <c ca="center">
                        <p>84.44</p>
                     </c>
                     <c ca="center">
                        <p>151</p>
                     </c>
                     <c ca="center">
                        <p>151</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>90%</p>
                     </c>
                     <c ca="center">
                        <p>82.71</p>
                     </c>
                     <c ca="center">
                        <p>145</p>
                     </c>
                     <c ca="center">
                        <p>150</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>80%</p>
                     </c>
                     <c ca="center">
                        <p>81.72</p>
                     </c>
                     <c ca="center">
                        <p>141</p>
                     </c>
                     <c ca="center">
                        <p>149</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>70%</p>
                     </c>
                     <c ca="center">
                        <p>80.63</p>
                     </c>
                     <c ca="center">
                        <p>136</p>
                     </c>
                     <c ca="center">
                        <p>148</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>60%</p>
                     </c>
                     <c ca="center">
                        <p>81.23</p>
                     </c>
                     <c ca="center">
                        <p>131</p>
                     </c>
                     <c ca="center">
                        <p>146</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>50%</p>
                     </c>
                     <c ca="center">
                        <p>80.80</p>
                     </c>
                     <c ca="center">
                        <p>130</p>
                     </c>
                     <c ca="center">
                        <p>146</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>40%</p>
                     </c>
                     <c ca="center">
                        <p>79.70</p>
                     </c>
                     <c ca="center">
                        <p>121</p>
                     </c>
                     <c ca="center">
                        <p>145</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
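            <p>The sample counts in Table 4 can be cross-checked with a few lines of arithmetic. The sketch below copies the table values and computes, for each threshold, how many of the original 302 samples were removed. The variable and function names are illustrative only.</p>

```python
from typing import Dict, Tuple

# threshold (%) -> (LOOCV accuracy %, positive samples, negative samples), from Table 4
TABLE4: Dict[int, Tuple[float, int, int]] = {
    100: (84.44, 151, 151),
    90: (82.71, 145, 150),
    80: (81.72, 141, 149),
    70: (80.63, 136, 148),
    60: (81.23, 131, 146),
    50: (80.80, 130, 146),
    40: (79.70, 121, 145),
}


def removed_samples(threshold: int) -> int:
    """Number of samples removed relative to the unfiltered (100%) dataset."""
    _, base_pos, base_neg = TABLE4[100]
    _, pos, neg = TABLE4[threshold]
    return (base_pos + base_neg) - (pos + neg)


if __name__ == "__main__":
    for threshold in sorted(TABLE4, reverse=True):
        accuracy, pos, neg = TABLE4[threshold]
        print(threshold, pos + neg, removed_samples(threshold), accuracy)
```

            <p>At the 40% threshold this gives 302 - 266 = 36 removed samples, in agreement with the text, and the accuracy drops by less than five percentage points over the whole range.</p>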
         </sec>
         <sec>
            <st>
               <p>Screening promising ubiquitylation sites</p>
            </st>
            <p>Recently, a new experimental method was proposed that yields a 2.4-fold increase in the number of identified ubiquitylation sites compared with previous methods <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. This implies that there may still be many undiscovered ubiquitylation sites. To identify promising ubiquitylation sites among putative non-ubiquitylation sites, a scoring method was designed that normalizes the range of the SVM decision values obtained from the training dataset of <it>w </it>= 21 into the range [0, 1] of prediction scores. The default threshold value 0 used by the SVM classifier for discriminating ubiquitylation sites from non-ubiquitylation sites is mapped to a prediction score of 0.5. A site with a prediction score close to 1 is highly likely to be a ubiquitylation site. If a high prediction score threshold of 0.85 instead of 0.5 were adopted when classifying the peptides in the training datasets for all window sizes, there would be no false positives.</p>
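            <p>The exact normalization is not given in the text. One mapping that satisfies the stated constraints (decision value 0 maps to 0.5, and the training range maps onto [0, 1]) is a piecewise-linear rescaling, sketched below as an assumption. The parameters d_min and d_max stand for the smallest and largest decision values seen on the training dataset, and the numbers in the usage example are invented.</p>

```python
from typing import List


def prediction_score(decision_value: float, d_min: float, d_max: float) -> float:
    """Map an SVM decision value to [0, 1] so that 0 maps to 0.5.

    Piecewise-linear mapping (an assumption): the negative side is scaled by
    d_min and the positive side by d_max; values outside the range are clipped.
    """
    if not (d_max > 0 > d_min):
        raise ValueError("expected d_min below 0 and d_max above 0")
    if decision_value >= 0:
        score = 0.5 + 0.5 * decision_value / d_max
    else:
        score = 0.5 - 0.5 * decision_value / d_min
    return min(1.0, max(0.0, score))


def promising_sites(scores: List[float], threshold: float = 0.85) -> List[int]:
    """Indices of sites whose prediction score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    decision_values = [-1.0, 0.0, 1.5, 2.7]   # hypothetical SVM outputs
    scores = [prediction_score(d, d_min=-2.0, d_max=3.0) for d in decision_values]
    print(scores)
    print(promising_sites(scores))
```

            <p>Scaling each side separately keeps the SVM decision boundary at exactly 0.5 even when the training decision values are not symmetric around zero.</p>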
            <p>UbiPred was applied to score the 3424 putative non-ubiquitylation sites of an independent dataset that are not included in the training dataset of <it>w </it>= 21, as shown in Fig. <figr fid="F8">8</figr>. The screening result is shown in Fig. <figr fid="F9">9</figr> as a histogram of prediction scores. There are 1218 putative non-ubiquitylation sites with scores larger than 0.5. The 23 peptides with scores larger than 0.85 are the most promising ubiquitylation sites and are listed in Table <tblr tid="T5">5</tblr>. Detailed information can be found on the UbiPred website <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. The sequence logo of the 23 peptides shown in Fig. <figr fid="F10">10</figr> has low information content, similar to the sequence logo of the 151 positive samples in the training dataset.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>The schema for illustrating the training data (302 samples) and the independent dataset (3424 putative non-ubiquitylation sites) using <it>w </it>= 21 as an example</p>
               </caption>
               <text>
                  <p>The schema for illustrating the training data (302 samples) and the independent dataset (3424 putative non-ubiquitylation sites) using <it>w </it>= 21 as an example.</p>
               </text>
               <graphic file="1471-2105-9-310-8"/>
            </fig>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Histogram result of UbiPred using prediction scores from evaluating 3424 putative non-ubiquitylation sites in an independent dataset</p>
               </caption>
               <text>
                  <p><b>Histogram result of UbiPred using prediction scores from evaluating 3424 putative non-ubiquitylation sites in an independent dataset</b>. A site with a score close to 1 is highly likely to be a ubiquitylation site.</p>
               </text>
               <graphic file="1471-2105-9-310-9"/>
            </fig>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>List of 23 promising ubiquitylation sites identified from an independent dataset of 3424 putative non-ubiquitylation sites.</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>Accession number</p>
                     </c>
                     <c ca="left">
                        <p>Position</p>
                     </c>
                     <c ca="left">
                        <p>Score</p>
                     </c>
                     <c ca="left">
                        <p>Accession number</p>
                     </c>
                     <c ca="left">
                        <p>Position</p>
                     </c>
                     <c ca="left">
                        <p>Score</p>
                     </c>
                     <c ca="left">
                        <p>Accession number</p>
                     </c>
                     <c ca="left">
                        <p>Position</p>
                     </c>
                     <c ca="left">
                        <p>Score</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P19358</p>
                     </c>
                     <c ca="left">
                        <p>114</p>
                     </c>
                     <c ca="left">
                        <p>0.99</p>
                     </c>
                     <c ca="left">
                        <p>P39976</p>
                     </c>
                     <c ca="left">
                        <p>323</p>
                     </c>
                     <c ca="left">
                        <p>0.90</p>
                     </c>
                     <c ca="left">
                        <p>P38080</p>
                     </c>
                     <c ca="left">
                        <p>809</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Q9Y6K9</p>
                     </c>
                     <c ca="left">
                        <p>35</p>
                     </c>
                     <c ca="left">
                        <p>0.96</p>
                     </c>
                     <c ca="left">
                        <p>P38261</p>
                     </c>
                     <c ca="left">
                        <p>147</p>
                     </c>
                     <c ca="left">
                        <p>0.89</p>
                     </c>
                     <c ca="left">
                        <p>P10592</p>
                     </c>
                     <c ca="left">
                        <p>54</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P25694</p>
                     </c>
                     <c ca="left">
                        <p>6</p>
                     </c>
                     <c ca="left">
                        <p>0.96</p>
                     </c>
                     <c ca="left">
                        <p>P25360</p>
                     </c>
                     <c ca="left">
                        <p>846</p>
                     </c>
                     <c ca="left">
                        <p>0.89</p>
                     </c>
                     <c ca="left">
                        <p>P38080</p>
                     </c>
                     <c ca="left">
                        <p>792</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P40087</p>
                     </c>
                     <c ca="left">
                        <p>325</p>
                     </c>
                     <c ca="left">
                        <p>0.95</p>
                     </c>
                     <c ca="left">
                        <p>P09936</p>
                     </c>
                     <c ca="left">
                        <p>195</p>
                     </c>
                     <c ca="left">
                        <p>0.88</p>
                     </c>
                     <c ca="left">
                        <p>P12866</p>
                     </c>
                     <c ca="left">
                        <p>129</p>
                     </c>
                     <c ca="left">
                        <p>0.86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Q08412</p>
                     </c>
                     <c ca="left">
                        <p>232</p>
                     </c>
                     <c ca="left">
                        <p>0.93</p>
                     </c>
                     <c ca="left">
                        <p>P10591</p>
                     </c>
                     <c ca="left">
                        <p>54</p>
                     </c>
                     <c ca="left">
                        <p>0.88</p>
                     </c>
                     <c ca="left">
                        <p>Q05911</p>
                     </c>
                     <c ca="left">
                        <p>460</p>
                     </c>
                     <c ca="left">
                        <p>0.86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P04629</p>
                     </c>
                     <c ca="left">
                        <p>609</p>
                     </c>
                     <c ca="left">
                        <p>0.91</p>
                     </c>
                     <c ca="left">
                        <p>Q06408</p>
                     </c>
                     <c ca="left">
                        <p>156</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                     <c ca="left">
                        <p>P40087</p>
                     </c>
                     <c ca="left">
                        <p>410</p>
                     </c>
                     <c ca="left">
                        <p>0.86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P16603</p>
                     </c>
                     <c ca="left">
                        <p>165</p>
                     </c>
                     <c ca="left">
                        <p>0.91</p>
                     </c>
                     <c ca="left">
                        <p>P37303</p>
                     </c>
                     <c ca="left">
                        <p>283</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                     <c ca="left">
                        <p>P38075</p>
                     </c>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p>0.86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P31539</p>
                     </c>
                     <c ca="left">
                        <p>626</p>
                     </c>
                     <c ca="left">
                        <p>0.91</p>
                     </c>
                     <c ca="left">
                        <p>P32467</p>
                     </c>
                     <c ca="left">
                        <p>38</p>
                     </c>
                     <c ca="left">
                        <p>0.87</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>The sequence logo of the 23 peptides of promising ubiquitylation sites with <it>w </it>= 21</p>
               </caption>
               <text>
                  <p>The sequence logo of the 23 peptides of promising ubiquitylation sites with <it>w </it>= 21. (a) Information content and (b) Frequency plot.</p>
               </text>
               <graphic file="1471-2105-9-310-10"/>
            </fig>
            <p>To further validate the 23 peptides as ubiquitylation sites, the five prediction rules obtained from C5.0 (shown in Table <tblr tid="T2">2</tblr>) were applied to them. All 23 promising peptides are classified as ubiquitylation sites. For example, the average value of property MEIH800102 for the 23 peptides is 1.001, which is larger than the threshold of 0.95 and close to the average (1.007) of the 151 positive samples in the training dataset. Note that the smallest and largest index values of MEIH800102 for the 20 amino acids are 0.73 and 1.23, respectively. The prediction system UbiPred assigns a prediction score to each candidate site, so that the most promising ubiquitylation sites can be selected for experimental verification or future research.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Ubiquitylation plays many important regulatory roles in the physiology of eukaryotic cells. Many experimental studies aim to identify ubiquitylated proteins and their ubiquitylation sites, and computational prediction of promising ubiquitylation sites can reduce the experimental effort required. In this study, the combinations of three kinds of features (amino acid identity, evolutionary information, and all physicochemical properties) and three classifiers (support vector machine, <it>k</it>-nearest neighbor, and Na&#239;veBayes) were evaluated for predicting ubiquitylation sites. The ubiquitylation dataset consists of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in the UbiProt database. Results show that the best prediction method is the combination of an SVM classifier and all physicochemical properties.</p>
         <p>It is well recognized that irrelevant information interferes with classifiers. This study proposes an algorithm, IPMA, to identify a small set of informative physicochemical properties in order to improve the prediction performance and to better understand the underlying mechanism of ubiquitylation. The derived 31 informative physicochemical properties improve the prediction accuracy from 72.19% to 84.44%, and the properties were ranked in terms of their individual prediction effectiveness. A decision tree method, C5.0, was also applied to derive rule-based knowledge and analyze the 31 informative physicochemical properties. Five concise rules provide biologists with a human-interpretable way of distinguishing ubiquitylation sites from non-ubiquitylation sites.</p>
         <p>Finally, the system UbiPred for predicting ubiquitylation sites was built using the 31 informative physicochemical properties. The web server of UbiPred has been implemented and is available online <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. The prediction scores of UbiPred can be used to identify promising ubiquitylation sites for experimental verification. In this study, 23 promising ubiquitylation sites with prediction scores larger than 0.85 were identified from an independent dataset of 3424 putative non-ubiquitylation sites and were also validated by the five concise rules obtained from the training dataset.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Establishment of datasets</p>
            </st>
            <p>To evaluate the two proposed methods IPMA and UbiPred, a positive dataset UBIDATA consisting of 157 ubiquitylation sites from 105 proteins was established by extracting annotated proteins from the UbiProt database <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The ubiquitylation sites were mapped to the corresponding 105 protein sequences retrieved from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, and the 3676 lysine residues with no annotation of ubiquitylation were regarded as putative non-ubiquitylation sites. A sliding window centered on the residue to be predicted is used to capture the sequence environment of that residue. A positive sample is a sequence of size <it>w </it>whose central lysine residue is a ubiquitylation site; if the central lysine is not a ubiquitylation site, the sequence is regarded as a negative sample. When several samples had the same sequence and the same annotation, only one was kept. All inconsistent samples, which have the same sequence but different annotations, were discarded. Ten positive datasets were constructed from UBIDATA using various values of <it>w</it>: 149 samples for <it>w </it>= 11, 150 samples each for <it>w </it>= 13 and 15, and 151 samples each for <it>w </it>= 17, 19,..., 29. Because duplicate and inconsistent samples are discarded, different values of <it>w </it>yield datasets of different sizes.</p>
            <p>For training an SVM classifier, both positive and negative samples are necessary. Datasets of post-translational modifications, including phosphorylation and ubiquitylation sites, are unbalanced in that the number of positive samples is much smaller than that of negative samples. The negative samples for training the SVM classifier were selected randomly from the 3676 putative non-ubiquitylation sites. In this study, the number of negative samples is the same as that of positive samples in the dataset. For example, there are 151 negative samples in the dataset of <it>w </it>= 21. The remaining samples (e.g., 3424 samples with no annotation of ubiquitylation for <it>w </it>= 21) form an independent dataset to be scored for identifying promising ubiquitylation sites (see Fig. <figr fid="F8">8</figr>). Notably, since the value of <it>C </it>for tuning the error penalty (see the next section) is determined subsequently according to the measured performance of the SVM, it is not obligatory to select a matched number of negative peptides for training the SVM classifier. The datasets for the various window sizes can be downloaded from the web server of UbiPred <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
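            <p>The window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name <it>extract_windows</it>, the 1-based site positions, and the padding symbol 'X' for windows that run past a terminus are all assumptions made for the example.</p>

```python
def extract_windows(sequence, positive_sites, w, pad="X"):
    """Return (positives, negatives) as sets of length-w peptides centred on lysine (K).

    positive_sites holds 1-based positions of annotated ubiquitylation sites.
    The pad symbol for positions beyond a terminus is an assumption of this sketch.
    """
    half = w // 2  # w is assumed to be odd
    padded = pad * half + sequence + pad * half
    positives, negatives = set(), set()
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        peptide = padded[i:i + w]  # window centred on residue i (0-based)
        if (i + 1) in positive_sites:
            positives.add(peptide)
        else:
            negatives.add(peptide)
    # Sets already remove duplicates; peptides seen with both labels are
    # inconsistent and are dropped from both classes.
    inconsistent = positives.intersection(negatives)
    return positives - inconsistent, negatives - inconsistent
```

            <p>Using sets keeps one copy of each duplicated peptide, and removing the intersection mirrors the discarding of inconsistent samples, which is why the dataset size can change with <it>w</it>.</p>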
         </sec>
         <sec>
            <st>
               <p>Assessment of features and classifiers</p>
            </st>
            <p>Support vector machine (SVM) is a very popular and powerful method to deal with classification, prediction, and regression problems. To cope with the over-fitting problem arising from a small training dataset, SVM aims to find a linear separation hyperplane which maximizes the distance between two classes to create a classifier. Given training vectors <b>x</b><sub><it>i </it></sub>&#8712; <it>R</it><sup><it>n </it></sup>and their class values <it>y</it><sub><it>i </it></sub>&#8712; {-1, 1}, <it>i </it>= 1,..., <it>N</it>, SVM solves the problem of minimizing <inline-formula><m:math name="1471-2105-9-310-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mn>1</m:mn><m:mn>2</m:mn></m:mfrac><m:msup><m:mstyle mathvariant="bold" mathsize="normal"><m:mi>w</m:mi></m:mstyle><m:mtext>T</m:mtext></m:msup><m:mstyle mathvariant="bold" mathsize="normal"><m:mi>w</m:mi></m:mstyle><m:mo>+</m:mo><m:mi>C</m:mi><m:mstyle displaystyle="true"><m:munderover><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>N</m:mi></m:munderover><m:mrow><m:msub><m:mi>&#958;</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqaIYaGmaaGccqWH3bWDdaahaaWcbeqaaiabbsfaubaakiabhEha3jabgUcaRiabdoeadnaaqahabaGaeqOVdG3aaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@3EA3@</m:annotation></m:semantics></m:math></inline-formula>, subject to <it>y</it><sub><it>i </it></sub>(<b>w</b><sup>T </sup><b>x</b><sub><it>i </it></sub>+ <it>b</it>) &#8805; 1 - <it>&#958;</it><sub><it>i </it></sub>and <it>&#958;</it><sub><it>i </it></sub>&#8805; 0, where <b>w </b>is the normal vector of the hyperplane and <it>&#958;</it><sub><it>i </it></sub>are slack variables that allow misclassifications. The cost parameter <it>C </it>(> 0) controls the trade-off between the margin and the training error. A larger value of <it>C </it>leads to a higher error penalty. The kernel function of SVM transforms samples to a high-dimensional space to make linear separation easier. The commonly used radial basis kernel function is applied to non-linearly transform the feature space, defined as <it>K</it>(<it>x</it><sub><it>i</it></sub>, <it>x</it><sub><it>j</it></sub>) = exp(-<it>&#947;</it>||<it>x</it><sub><it>i </it></sub>- <it>x</it><sub><it>j</it></sub>||<sup>2</sup>), where <it>&#947; </it>> 0 is the kernel parameter that determines how the samples are transformed to the high-dimensional space. These two parameters (<it>C </it>and &#947;) must be tuned to obtain satisfactory prediction results. In this study, the SVM package used is LIBSVM version 2.84 <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
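            <p>As a small worked illustration of the radial basis kernel, the sketch below evaluates it for two feature vectors. It assumes the squared Euclidean distance, which is the form used by the radial basis kernel of LIBSVM; the function name <it>rbf_kernel</it> is chosen for this example only.</p>

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel exp(-gamma * ||x_i - x_j||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)
```

            <p>Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart; a larger &#947; makes this decay faster.</p>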
            <p>Two extensively used classifiers, the <it>k</it>-nearest neighbor classifier (IBk) and the Na&#239;veBayes classifier that are included in the machine learning tool WEKA <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, are also utilized to evaluate the promising prediction features. To obtain the best performance, five versions of the IBk classifier with <it>k </it>= 1, 3,..., 9 are evaluated for identifying the best value of <it>k</it>. For the Na&#239;veBayes classifier, in addition to normal distribution, a distribution obtained from a kernel density estimator is used to model numeric attributes <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <p>Informative features will lead to better performances of classifiers. Numerous features can be extracted from peptide sequences <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. This study assesses three kinds of features including amino acid identity, evolutionary information, and physicochemical property. The feature representations used for the above-mentioned classifiers are described below.</p>
            <p>The conventional feature representation of amino acid identity uses 20 binary bits to represent an amino acid <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B13">13</abbr></abbrgrp>. For example, the amino acid A is represented by '00000000000000000001' and R is represented by '00000000000000000010'. To handle windows that extend beyond the N-terminus or C-terminus, one additional bit is appended to indicate this situation. A vector of size (20+1)<it>w </it>bits is used for representing a sample.</p>
            <p>Evolutionary information has been successfully used in many studies <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. To prepare evolutionary information for each protein sequence, the corresponding position-specific scoring matrix (PSSM) is obtained by running PSI-BLAST <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> against the non-redundant SWISS-PROT database with three iterations and default parameter values. The matrix has 20*<it>L </it>elements, where <it>L </it>is the length of a peptide. For each residue, there are 20 values indicating the probabilities of occurrence of the 20 amino acids. With a window of size <it>w</it>, there are 20*<it>w </it>elements to represent a peptide <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. One additional bit per position is used to handle windows that extend beyond a terminus, as for amino acid identity <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Therefore, a vector of size (20+1)<it>w </it>is used for representing a sample.</p>
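            <p>The binary encoding can be sketched as below. The text fixes only the codes of A and R, so the ordering of the remaining amino acids, the placement of the extra terminal bit at the end of each 21-bit group, and the use of 'X' to mark positions outside the protein are assumptions of this example.</p>

```python
# Assumed ordering; only the codes of A and R are given in the text.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def encode_residue(residue):
    """Return 21 bits: 20 identity bits followed by one out-of-terminus bit."""
    bits = ["0"] * 21
    if residue in AMINO_ACIDS:
        # A sets the right-most identity bit, R the next one to its left, and so on.
        bits[19 - AMINO_ACIDS.index(residue)] = "1"
    else:
        bits[20] = "1"  # position lies beyond the N-terminus or C-terminus
    return "".join(bits)


def encode_peptide(peptide):
    """Concatenate the per-residue codes into a (20+1)*w bit string."""
    return "".join(encode_residue(residue) for residue in peptide)
```

            <p>For a window of size <it>w </it>= 21 this gives 441 bits per sample, so the dimensionality grows linearly with the window size.</p>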
            <p>Physicochemical property is the most intuitive feature for biochemical reactions and is extensively applied in bioinformatics studies. The amino acid indices (AAindex) database collects many published indices representing physicochemical properties of amino acids. For each physicochemical property, there is a set of 20 numerical values for amino acids. Currently, 544 physicochemical properties can be retrieved from the AAindex database of version 9.0 <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. After removing physicochemical properties having the value 'NA' in the amino acid indices, 531 physicochemical properties are obtained for the following studies. In contrast to the residue-based encoding methods of amino acid identity and evolutionary information, a vector of 531 mean values is used to represent a sample for various window sizes <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B16">16</abbr></abbrgrp>. The method of encoding the input vector from peptide sequences consists of two steps. First, a vector of 531 index values is determined for each amino acid of the peptide. For a peptide of size <it>w</it>, there are <it>w </it>531-dimensional vectors. Notably, the number of amino acids for the peptide with a terminal spanning window would be smaller than <it>w</it>. The second step is to construct a vector of 531 mean values obtained by averaging these 531-dimensional vectors <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B16">16</abbr></abbrgrp>. If <it>m </it>out of 531 informative physicochemical properties are selected by IPMA and are used in SVM, a vector of <it>m </it>mean values is used to represent a sample.</p>
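            <p>The two-step mean encoding can be sketched as follows. The property tables in the usage below are hypothetical toy values, not AAindex entries, and the padding symbol 'X' for terminal-spanning windows is an assumption of this example.</p>

```python
def mean_property_vector(peptide, property_tables, pad="X"):
    """Average each physicochemical property over the real residues of a peptide.

    property_tables is a list of dicts, one per property, mapping an amino acid
    to its index value. Padding positions are skipped, so a terminal-spanning
    window is averaged over fewer than w residues.
    """
    residues = [residue for residue in peptide if residue != pad]
    return [
        sum(table[residue] for residue in residues) / len(residues)
        for table in property_tables
    ]


# Hypothetical toy values for two properties, for illustration only.
toy_tables = [{"A": 1.0, "K": 3.0}, {"A": 0.0, "K": 2.0}]
toy_vector = mean_property_vector("AKX", toy_tables)
```

            <p>The length of the resulting vector equals the number of properties and does not depend on <it>w</it>, in contrast to the residue-based encodings.</p>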
            <p>To find the best features for the SVM-based method, the control parameters <it>C </it>and &#947; of SVM and the associated window size <it>w </it>&#8712; {11, 13,..., 29} should be tuned for each kind of feature. The grid search method is applied to tune the parameters <it>C </it>and &#947; &#8712; {2<sup>-7</sup>, 2<sup>-6</sup>,..., 2<sup>8</sup>}, so that a total of 256 (= 16*16) grid points are evaluated. The prediction accuracy of 10-fold cross-validation (10-CV) is used to determine the best features and classifier.</p>
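            <p>The grid search over <it>C </it>and &#947; can be sketched as below. The callable <it>evaluate</it> is a hypothetical placeholder for whatever returns the 10-CV accuracy of an SVM trained with a given pair of parameters; it is not part of LIBSVM.</p>

```python
from itertools import product


def grid_search(evaluate, exponents=range(-7, 9)):
    """Try every (C, gamma) pair on the 16 x 16 grid {2^-7, ..., 2^8}.

    evaluate(c, gamma) is assumed to return a cross-validation accuracy.
    Returns (best_accuracy, best_c, best_gamma).
    """
    values = [2.0 ** exponent for exponent in exponents]
    best = None
    for c, gamma in product(values, values):
        accuracy = evaluate(c, gamma)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best
```

            <p>In the study this search would be repeated for each window size and each kind of feature; the sketch covers a single such setting.</p>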
         </sec>
         <sec>
            <st>
               <p>Informative physicochemical property mining algorithm</p>
            </st>
            <p>An informative physicochemical property mining algorithm (IPMA) is proposed to select a small set of <it>m </it>informative physicochemical properties from a large set of <it>n </it>= 531 physicochemical properties and to determine the values of <it>C </it>and &#947; of the SVM simultaneously. The IPMA is based on an inheritable bi-objective genetic algorithm (GA) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, which is an efficient method for solving the bi-objective 0/1 combinatorial optimization problem C(<it>n</it>, <it>m</it>). In using the IPMA, minimizing the number <it>m </it>of properties (features) and maximizing the prediction accuracy are the two objectives to be achieved. The high performance of the inheritable bi-objective GA arises mainly from an intelligent evolutionary algorithm <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, which can efficiently solve large-scale parameter optimization problems by using a divide-and-conquer strategy and orthogonal array crossover with a systematic reasoning method instead of the traditional generate-and-go approach in the crossover operation.</p>
            <p>The encoded GA-chromosome X consists of <it>n </it>= 531 bits for selecting physicochemical properties (1 for inclusion and 0 for exclusion) and two 4-bit GA-genes for tuning parameters <it>C </it>and &#947; of SVM. The two 4-bit GA-genes map the 16 values of <it>C </it>and &#947; into {2<sup>-7</sup>, 2<sup>-6</sup>,..., 2<sup>8</sup>}. IPMA can simultaneously obtain a set of solutions X<sub>r </sub>to C(<it>n</it>, <it>r</it>) where <it>r </it>= <it>r</it><sub>start</sub>, <it>r</it><sub>start </sub>+1,..., <it>r</it><sub>end </sub>in a single run. The best among all X<sub>r </sub>according to the fitness function <it>f</it>(X) is the desirable solution X<sub><it>m </it></sub>where <it>f</it>(X) is the overall accuracy of 10-CV. By decoding X<sub><it>m</it></sub>, <it>m </it>informative physicochemical properties and the SVM classifier can be obtained at the same time.</p>
            <p>The algorithm IPMA with the given values of <it>r</it><sub>start </sub>and <it>r</it><sub>end </sub>is described below. In this study, the parameters of IPMA were set empirically to <it>N</it><sub>pop </sub>= 50, <it>P</it><sub>c </sub>= 0.8, <it>P</it><sub>m </sub>= 0.05, <it>r</it><sub>start </sub>= 5, and <it>r</it><sub>end </sub>= 45.</p>
            <p>Step 1) (Initiation) Randomly generate an initial population of <it>N</it><sub>pop </sub>individuals. All the <it>n </it>binary genes have <it>r </it>1's and <it>n-r </it>0's where <it>r </it>= <it>r</it><sub>start</sub>.</p>
            <p>Step 2) (Evaluation) Evaluate the fitness values of <it>f</it>(X) for all individuals.</p>
            <p>Step 3) (Selection) Use the traditional tournament selection that selects the winner from two randomly selected individuals to form a mating pool.</p>
            <p>Step 4) (Crossover) Select <it>P</it><sub>c</sub>&#183;<it>N</it><sub>pop </sub>parents from the mating pool to perform orthogonal array crossover <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> on the selected pairs of parents where <it>P</it><sub>c </sub>is the crossover probability.</p>
            <p>Step 5) (Mutation) Apply a bit-inverse mutation operator with a mutation probability <it>P</it><sub>m </sub>to the population by keeping the <it>n </it>binary parameters in an individual having <it>r </it>1's. To prevent the best fitness value from deteriorating, mutation is not applied to the best individual in the population (<it>I</it><sub>best</sub>).</p>
            <p>Step 6) (Termination test) If <it>I</it><sub>best </sub>has not improved for 10 consecutive generations, output <it>I</it><sub>best </sub>as X<sub>r</sub>. Otherwise, go to Step 2).</p>
            <p>Step 7) (Inheritance) If <it>r </it>&lt;<it>r</it><sub>end</sub>, randomly change one bit in the binary genes for each individual from 0 to 1; increase the number <it>r </it>by one, and go to Step 2). Otherwise, stop the algorithm.</p>
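            <p>The control flow of Steps 1 to 7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the orthogonal array crossover is replaced by a simple recombination that keeps the bits shared by both parents, mutation is applied per individual as a swap of one 1 and one 0, the two 4-bit genes for <it>C </it>and &#947; are omitted, and the function <it>fitness</it> is a placeholder for the 10-CV accuracy. All function names are hypothetical.</p>

```python
import random


def random_mask(n, r, rng):
    """Step 1: a chromosome of n bits with exactly r ones."""
    ones = set(rng.sample(range(n), r))
    return [1 if i in ones else 0 for i in range(n)]


def swap_mutation(mask, rng):
    """Step 5: flip one 1 to 0 and one 0 to 1, so the count of ones stays r."""
    ones = [i for i, bit in enumerate(mask) if bit == 1]
    zeros = [i for i, bit in enumerate(mask) if bit == 0]
    child = list(mask)
    if ones and zeros:
        child[rng.choice(ones)] = 0
        child[rng.choice(zeros)] = 1
    return child


def crossover(a, b, r, rng):
    """Stand-in for orthogonal array crossover: keep shared ones, fill up to r."""
    shared = [i for i in range(len(a)) if a[i] == 1 and b[i] == 1]
    differing = [i for i in range(len(a)) if a[i] != b[i]]
    chosen = set(shared + rng.sample(differing, r - len(shared)))
    return [1 if i in chosen else 0 for i in range(len(a))]


def add_one_bit(mask, rng):
    """Step 7: change one randomly chosen 0 to 1."""
    zeros = [i for i, bit in enumerate(mask) if bit == 0]
    child = list(mask)
    child[rng.choice(zeros)] = 1
    return child


def ipma_sketch(fitness, n, r_start, r_end, n_pop=50, p_c=0.8, p_m=0.05,
                patience=10, seed=0):
    """Return {r: best mask with r ones} for r = r_start, ..., r_end."""
    rng = random.Random(seed)
    population = [random_mask(n, r_start, rng) for _ in range(n_pop)]
    solutions = {}
    for r in range(r_start, r_end + 1):
        best, stale = max(population, key=fitness), 0  # Step 2
        while patience > stale:  # Step 6: stop after `patience` stale generations
            # Step 3: binary tournament selection into a mating pool
            pool = [max(rng.sample(population, 2), key=fitness)
                    for _ in range(n_pop)]
            # Step 4: recombine roughly p_c * n_pop individuals, copy the rest
            children = [crossover(p, rng.choice(pool), r, rng)
                        if p_c > rng.random() else list(p) for p in pool]
            # Step 5: mutate, then restore the best individual unchanged
            children = [swap_mutation(c, rng) if p_m > rng.random() else c
                        for c in children]
            children[0] = list(best)
            population = children
            candidate = max(population, key=fitness)
            if fitness(candidate) > fitness(best):
                best, stale = candidate, 0
            else:
                stale += 1
        solutions[r] = best
        if r_end > r:  # Step 7: inherit the population with one more feature
            population = [add_one_bit(ind, rng) for ind in population]
    return solutions
```

            <p>Because the population for <it>r </it>+ 1 is inherited from the one evolved for <it>r</it>, a single run yields one candidate solution for every feature count between <it>r</it><sub>start </sub>and <it>r</it><sub>end</sub>.</p>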
         </sec>
         <sec>
            <st>
               <p>Rule-based knowledge acquirement</p>
            </st>
            <p>Decision tree methods are useful for acquiring interpretable rule-based knowledge as well as for classifying ubiquitylation sites. In this study, the decision tree method C5.0, an improved version of C4.5 <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> with rather high prediction accuracy, is applied to construct decision tree classifiers and derive interpretable rules. In C5.0, the information gain is used to rank features, and a decision tree is constructed by iteratively appending nodes with high ranks. The decision tree method can therefore serve as a feature selection tool by using the ranks of features. However, the selected feature set is constructed by considering only the individual classification effect of each feature, without accounting for correlation among relevant features.</p>
            <p>To avoid over-fitting, a pruning process is applied to reduce the tree size by replacing a subtree with a leaf node. The confidence threshold used for pruning trees is set to 25%. The final decision tree can directly generate if-then rules, where one leaf node corresponds to one rule. The samples in the leaf node are the samples covered by this rule. The majority class in the leaf node determines the class label, and the samples of the minority class in the leaf node are regarded as misclassified. To derive simpler rule-based knowledge, the option '-r' of C5.0 is applied to generate short rules for intuitive interpretation.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>CWT designed the system, implemented programs, developed the web server, carried out the analysis, and participated in manuscript preparation. SYH supervised the whole project and participated in manuscript preparation. All authors have read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers NSC 96-2628-E-009-141-MY3 and NSC 96-2627-B-009-002.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Ubiquitin and ubiquitin-like proteins in protein regulation</p>
            </title>
            <aug>
               <au>
                  <snm>Herrmann</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lerman</snm>
                  <fnm>LO</fnm>
               </au>
               <au>
                  <snm>Lerman</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Circ Res</source>
            <pubdate>2007</pubdate>
            <volume>100</volume>
            <issue>9</issue>
            <fpage>1276</fpage>
            <lpage>1291</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1161/01.RES.0000264500.11888.f0</pubid>
                  <pubid idtype="pmpid" link="fulltext">17495234</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Ubiquitin and ubiquitin-like proteins as multifunctional signals</p>
            </title>
            <aug>
               <au>
                  <snm>Welchman</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Gordon</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Mayer</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>Nat Rev Mol Cell Biol</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>8</issue>
            <fpage>599</fpage>
            <lpage>609</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrm1700</pubid>
                  <pubid idtype="pmpid" link="fulltext">16064136</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Methods for the purification of ubiquitinated proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Tomlinson</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Palaniyappan</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Tooth</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Layfield</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <issue>7</issue>
            <fpage>1016</fpage>
            <lpage>1022</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200601008</pubid>
                  <pubid idtype="pmpid" link="fulltext">17351889</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Tryptic digestion of ubiquitin standards reveals an improved strategy for identifying ubiquitinated proteins by mass spectrometry</p>
            </title>
            <aug>
               <au>
                  <snm>Denis</snm>
                  <fnm>NJ</fnm>
               </au>
               <au>
                  <snm>Vasilescu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lambert</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Figeys</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <issue>6</issue>
            <fpage>868</fpage>
            <lpage>874</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200600410</pubid>
                  <pubid idtype="pmpid" link="fulltext">17370265</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A subset of membrane-associated proteins is ubiquitinated in response to mutations in the endoplasmic reticulum degradation machinery</p>
            </title>
            <aug>
               <au>
                  <snm>Hitchcock</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Auld</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
               <au>
                  <snm>Silver</snm>
                  <fnm>PA</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>22</issue>
            <fpage>12735</fpage>
            <lpage>12740</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">240687</pubid>
                  <pubid idtype="pmpid" link="fulltext">14557538</pubid>
                  <pubid idtype="doi">10.1073/pnas.2135500100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A proteomics approach to identify the ubiquitinated proteins in mouse heart</p>
            </title>
            <aug>
               <au>
                  <snm>Jeon</snm>
                  <fnm>HB</fnm>
               </au>
               <au>
                  <snm>Choi</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Yoon</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Hwang</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Chang</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>EK</fnm>
               </au>
               <au>
                  <snm>Choi</snm>
                  <fnm>HW</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>ZY</fnm>
               </au>
               <au>
                  <snm>Yoo</snm>
                  <fnm>YJ</fnm>
               </au>
            </aug>
            <source>Biochem Biophys Res Commun</source>
            <pubdate>2007</pubdate>
            <volume>357</volume>
            <issue>3</issue>
            <fpage>731</fpage>
            <lpage>736</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.bbrc.2007.04.015</pubid>
                  <pubid idtype="pmpid" link="fulltext">17451654</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Proteomic identification of ubiquitinated proteins from human cells expressing His-tagged ubiquitin</p>
            </title>
            <aug>
               <au>
                  <snm>Kirkpatrick</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Weldon</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Tsaprailis</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Liebler</snm>
                  <fnm>DC</fnm>
               </au>
               <au>
                  <snm>Gandolfi</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2005</pubdate>
            <volume>5</volume>
            <issue>8</issue>
            <fpage>2104</fpage>
            <lpage>2111</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200401089</pubid>
                  <pubid idtype="pmpid" link="fulltext">15852347</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Large-scale analysis of the human ubiquitin-related proteome</p>
            </title>
            <aug>
               <au>
                  <snm>Matsumoto</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hatakeyama</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Oyamada</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Oda</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nishimura</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Nakayama</snm>
                  <fnm>KI</fnm>
               </au>
            </aug>
            <source>Proteomics</source>
            <pubdate>2005</pubdate>
            <volume>5</volume>
            <issue>16</issue>
            <fpage>4145</fpage>
            <lpage>4151</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/pmic.200401280</pubid>
                  <pubid idtype="pmpid" link="fulltext">16196087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A proteomics approach to understanding protein ubiquitination</p>
            </title>
            <aug>
               <au>
                  <snm>Peng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Elias</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Thoreen</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Cheng</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Marsischky</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Roelofs</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Finley</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2003</pubdate>
            <volume>21</volume>
            <issue>8</issue>
            <fpage>921</fpage>
            <lpage>926</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt849</pubid>
                  <pubid idtype="pmpid" link="fulltext">12872131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Proteomic insights into ubiquitin and ubiquitin-like proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Denison</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kirkpatrick</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Gygi</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Curr Opin Chem Biol</source>
            <pubdate>2005</pubdate>
            <volume>9</volume>
            <issue>1</issue>
            <fpage>69</fpage>
            <lpage>75</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cbpa.2004.10.010</pubid>
                  <pubid idtype="pmpid" link="fulltext">15701456</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>AutoMotif server: prediction of single residue post-translational modifications in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Plewczynski</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tkacz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wyrwicz</snm>
                  <fnm>LS</fnm>
               </au>
               <au>
                  <snm>Rychlewski</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>10</issue>
            <fpage>2525</fpage>
            <lpage>2527</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti333</pubid>
                  <pubid idtype="pmpid" link="fulltext">15728119</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties</p>
            </title>
            <aug>
               <au>
                  <snm>Tung</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Ho</snm>
                  <fnm>SY</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>8</issue>
            <fpage>942</fpage>
            <lpage>949</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm061</pubid>
                  <pubid idtype="pmpid" link="fulltext">17384427</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>NBA-Palm: prediction of palmitoylation site implemented in Naive Bayes algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Xue</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Jin</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sun</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Yao</snm>
                  <fnm>X</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>458</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1624852</pubid>
                  <pubid idtype="pmpid" link="fulltext">17044919</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-458</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Improving the accuracy of transmembrane protein topology prediction using evolutionary information</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>5</issue>
            <fpage>538</fpage>
            <lpage>544</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl677</pubid>
                  <pubid idtype="pmpid" link="fulltext">17237066</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>A neural network method for prediction of beta-turn types in proteins using evolutionary information</p>
            </title>
            <aug>
               <au>
                  <snm>Kaur</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Raghava</snm>
                  <fnm>GP</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>16</issue>
            <fpage>2751</fpage>
            <lpage>2758</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth322</pubid>
                  <pubid idtype="pmpid" link="fulltext">15145798</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>ProLoc: Prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>WL</fnm>
               </au>
               <au>
                  <snm>Tung</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>HL</fnm>
               </au>
               <au>
                  <snm>Hwang</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Ho</snm>
                  <fnm>SY</fnm>
               </au>
            </aug>
            <source>Biosystems</source>
            <pubdate>2007</pubdate>
            <volume>90</volume>
            <issue>2</issue>
            <fpage>573</fpage>
            <lpage>581</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.biosystems.2007.01.001</pubid>
                  <pubid idtype="pmpid" link="fulltext">17291684</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>UbiProt: a database of ubiquitylated proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Chernorudskiy</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Garcia</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Eremin</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Shorina</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Kondratieva</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Gainullin</snm>
                  <fnm>MR</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>126</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1855352</pubid>
                  <pubid idtype="pmpid" link="fulltext">17442109</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-126</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Inheritable genetic algorithm for biobjective 0/1 combinatorial optimization problems and its applications</p>
            </title>
            <aug>
               <au>
                  <snm>Ho</snm>
                  <fnm>SY</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>MH</fnm>
               </au>
            </aug>
            <source>IEEE Trans Syst Man Cybern B Cybern</source>
            <pubdate>2004</pubdate>
            <volume>34</volume>
            <issue>1</issue>
            <fpage>609</fpage>
            <lpage>620</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1109/TSMCB.2003.817090</pubid>
                  <pubid idtype="pmpid">15369097</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>C4.5: programs for machine learning</p>
            </title>
            <aug>
               <au>
                  <snm>Quinlan</snm>
                  <fnm>JR</fnm>
               </au>
            </aug>
            <publisher>San Mateo, CA: Morgan Kaufmann</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B20">
            <title>
               <p>UbiPred: a web server for prediction of ubiquitylation sites</p>
            </title>
            <url>http://iclab.life.nctu.edu.tw/ubipred</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>WebLogo: a sequence logo generator</p>
            </title>
            <aug>
               <au>
                  <snm>Crooks</snm>
                  <fnm>GE</fnm>
               </au>
               <au>
                  <snm>Hon</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <issue>6</issue>
            <fpage>1188</fpage>
            <lpage>1190</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">419797</pubid>
                  <pubid idtype="pmpid" link="fulltext">15173120</pubid>
                  <pubid idtype="doi">10.1101/gr.849004</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Orthogonal fractional factorial designs</p>
            </title>
            <aug>
               <au>
                  <snm>Dey</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <publisher>New York: Wiley</publisher>
            <pubdate>1985</pubdate>
         </bibl>
         <bibl id="B23">
            <title>
               <p>On the optimality of orthogonal experimental design</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>Q</fnm>
               </au>
            </aug>
            <source>Acta Math Appl Sinica</source>
            <pubdate>1978</pubdate>
            <volume>1</volume>
            <fpage>283</fpage>
            <lpage>299</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Empirical studies of hydrophobicity. 1. Effect of protein size on the hydrophobic behavior of amino acids</p>
            </title>
            <aug>
               <au>
                  <snm>Meirovitch</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Rackovsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Scheraga</snm>
                  <fnm>HA</fnm>
               </au>
            </aug>
            <source>Macromolecules</source>
            <pubdate>1980</pubdate>
            <volume>13</volume>
            <fpage>1398</fpage>
            <lpage>1405</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/ma60078a013</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Volume changes on protein folding</p>
            </title>
            <aug>
               <au>
                  <snm>Harpaz</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Structure</source>
            <pubdate>1994</pubdate>
            <volume>2</volume>
            <issue>7</issue>
            <fpage>641</fpage>
            <lpage>649</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0969-2126(00)00065-4</pubid>
                  <pubid idtype="pmpid">7922041</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Cornette</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Cease</snm>
                  <fnm>KB</fnm>
               </au>
               <au>
                  <snm>Margalit</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Spouge</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Berzofsky</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>DeLisi</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1987</pubdate>
            <volume>195</volume>
            <issue>3</issue>
            <fpage>659</fpage>
            <lpage>685</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(87)90189-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">3656427</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Relation between amino acid composition and cellular location of proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Cedano</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Aloy</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Perez-Pons</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Querol</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1997</pubdate>
            <volume>266</volume>
            <issue>3</issue>
            <fpage>594</fpage>
            <lpage>600</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1996.0804</pubid>
                  <pubid idtype="pmpid" link="fulltext">9067612</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>An analysis of protein domain linkers: their classification and role in protein folding</p>
            </title>
            <aug>
               <au>
                  <snm>George</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Protein Eng</source>
            <pubdate>2002</pubdate>
            <volume>15</volume>
            <issue>11</issue>
            <fpage>871</fpage>
            <lpage>879</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/protein/15.11.871</pubid>
                  <pubid idtype="pmpid" link="fulltext">12538906</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Godzik</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>13</issue>
            <fpage>1658</fpage>
            <lpage>1659</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl158</pubid>
                  <pubid idtype="pmpid" link="fulltext">16731699</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>UniProt Knowledgebase (Swiss-Prot and TrEMBL)</p>
            </title>
            <url>http://www.expasy.org/sprot/</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>LIBSVM: a library for support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Chang</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Data Mining: Practical machine learning tools and techniques</p>
            </title>
            <aug>
               <au>
                  <snm>Witten</snm>
                  <fnm>IH</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <publisher>San Francisco: Morgan Kaufmann</publisher>
            <edition>2</edition>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <issue>17</issue>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>AAindex: amino acid index database, progress report 2008</p>
            </title>
            <aug>
               <au>
                  <snm>Kawashima</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pokarowski</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Pokarowska</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kolinski</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Katayama</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Kanehisa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2008</pubdate>
            <volume>36</volume>
            <issue>Database issue</issue>
            <fpage>D202</fpage>
            <lpage>D205</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2238890</pubid>
                  <pubid idtype="pmpid" link="fulltext">17998252</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Intelligent evolutionary algorithms for large parameter optimization problems</p>
            </title>
            <aug>
               <au>
                  <snm>Ho</snm>
                  <fnm>SY</fnm>
               </au>
               <au>
                  <snm>Shu</snm>
                  <fnm>LS</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>IEEE Trans Evol Comput</source>
            <pubdate>2004</pubdate>
            <volume>8</volume>
            <issue>6</issue>
            <fpage>522</fpage>
            <lpage>541</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1109/TEVC.2004.835176</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
