<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-110</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>IsoSVM &#8211; Distinguishing isoforms and paralogs on the protein level</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Spitzer</snm>
               <fnm>Michael</fnm>
               <insr iid="I1"/>
               <email>michael.spitzer@uni-muenster.de</email>
            </au>
            <au id="A2">
               <snm>Lorkowski</snm>
               <fnm>Stefan</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>stefan.lorkowski@uni-muenster.de</email>
            </au>
            <au id="A3">
               <snm>Cullen</snm>
               <fnm>Paul</fnm>
               <insr iid="I2"/>
               <email>cullen@uni-muenster.de</email>
            </au>
            <au id="A4">
               <snm>Sczyrba</snm>
               <fnm>Alexander</fnm>
               <insr iid="I4"/>
               <email>asczyrba@techfak.uni-bielefeld.de</email>
            </au>
            <au id="A5" ca="yes">
               <snm>Fuellen</snm>
               <fnm>Georg</fnm>
               <insr iid="I1"/>
               <insr iid="I5"/>
               <email>fuellen@uni-muenster.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Division of Bioinformatics, Biology Department, Schlossplatz 4, 48149 M&#252;nster, Germany</p>
            </ins>
            <ins id="I2">
               <p>Leibniz Institute of Arteriosclerosis Research, Domagkstr. 3, 48149 M&#252;nster, Germany</p>
            </ins>
            <ins id="I3">
               <p>Institute of Biochemistry, Wilhelm-Klemm-Str. 2, 48149 M&#252;nster, Germany</p>
            </ins>
            <ins id="I4">
               <p>Faculty of Technology, Research Group in Practical Computer Science, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany</p>
            </ins>
            <ins id="I5">
               <p>Department of Medicine, AG Bioinformatics, Domagkstr. 3, 48149 M&#252;nster, Germany</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>110</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/110</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16519805</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-110</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>18</day>
               <month>7</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>06</day>
               <month>3</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>06</day>
               <month>3</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Spitzer et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known if two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowing the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful since paralogs and orthologs feature substitutions with respect to each other, as opposed to isoforms, which do not.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input of the SVM classifier, it is possible to automatically identify isoforms with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application we use IsoSVM to estimate that a set of <it>Xenopus laevis </it>EST clusters consists of approximately 81% cases where sequences are each other's paralogs and 19% cases where sequences are each other's isoforms. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Typical eukaryotic genes are composed of several relatively short exons that are interrupted by long introns. The primary transcripts of most eukaryotic genes are composed of introns and exons separated by canonical splice sites. These mRNA precursors are shortened by a process called RNA splicing, in which the intron sequences are removed, yielding the mature transcript consisting of exons only <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. However, cells can splice the primary transcript in different ways and thereby generate different polypeptides from the same gene (reviewed in <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>). This process is called alternative splicing. The different polypeptides are termed alternatively spliced gene products, splice variants or protein isoforms <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>To generate correctly spliced, mature mRNAs, the exons must be identified and joined together precisely and efficiently by a complex process that requires the coordinated action of five small nuclear RNAs (termed U1, U2 and U4 to U6) and more than 60 polypeptides <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. According to <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, five common modes of alternative splicing are known: (i) exon skipping or inclusion, (ii) alternative 3' splice sites, (iii) alternative 5' splice sites, (iv) mutually exclusive exons, and (v) intron retention which corresponds to no splicing. In complex pre-mRNAs, more than one of these modes of alternative splicing can apply to different regions of the transcript, and extra mRNA isoforms can be generated through the use of alternative promoters or polyadenylation sites <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>Alternative splicing is a frequent process in eukaryotes. It is estimated that up to 60 percent of human genes are subject to alternative splicing <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Thus, alternative splicing is probably an important source of protein diversity in higher eukaryotes. For example, the fruit fly <it>Drosophila melanogaster </it>contains fewer genes than <it>Caenorhabditis elegans </it>while exhibiting significantly higher protein diversity <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Furthermore, alternative splicing of primary transcripts is often tissue- or stage-specific (cf. the expression of different alternatively spliced transcripts during different stages of the development of an organism <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>), and is thus an important regulatory mechanism.</p>
         <p>For a protein in an organism, other proteins can be found that are homologous, i.e. that are similar due to common evolutionary ancestry. Following Fitch <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, there can be orthologs, which are homologs due to a speciation event, and paralogs, which are homologs due to a duplication event. Even if genomic information on intron/exon-structure is not available, isoforms can usually be visually distinguished from homologs based on protein sequence alone, since only the latter feature substitutions with respect to each other (cf. Figure <figr fid="F1">1</figr>). For the remainder of this paper, without loss of generality, we will consider paralogs only. Comparing a protein with an isoform of its paralog, we still find a predominance of substitutions, and we consider these two proteins to be paralogs.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Visualization of a part of an alignment of (A) two paralogous sequences (the human ABCB4 and ABCB1 protein) and (B) two isoforms (the human ABCB4 protein and its isoform <it>c</it>), representing an ideal case</p>
            </caption>
            <text>
               <p><b>Visualization of a part of an alignment of (A) </b>two paralogous sequences (the human ABCB4 and ABCB1 protein) and <b>(B) </b>two isoforms (the human ABCB4 protein and its isoform <it>c</it>), representing an ideal case. Positions with matches between the two sequences are indicated by "|", mismatches by "#" and amino acids vs. gap characters by ":". The values of the three features (cf. <b><it>Methods</it></b>, section <it>Features</it>) for the <it>full-length </it>sequences compared in panel (A) are (i) <it>sequence similarity </it>75.76%, (ii) <it>inverse CBIN count </it>0.0027, (iii) <it>fraction of consecutive matches and mismatches </it>0.7111. For the <it>full-length </it>sequences compared in panel (B) we have (i) sequence similarity 96.33%, (ii) inverse CBIN count 0.3333, (iii) fraction of consecutive matches and mismatches 0.9969.</p>
            </text>
            <graphic file="1471-2105-7-110-1"/>
         </fig>
         <p>Available databases of proteins and their isoforms consider only a small number of protein families and species (see e.g. <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>). We wanted to identify isoforms without knowledge of genomic information and independently of specific protein families or species, in a fashion well suited for high-throughput genomics and proteomics.</p>
         <p>Visual inspection of large datasets such as complete proteomes (meaning the totality of all proteins expressed in an organism) would be time-consuming and prone to misclassifications. To enable automation, a set of three different features was derived based on the pairwise alignment of the two protein sequences to be compared. These features take into account such parameters as the distribution of substitutions and sequence similarity. The three features are <it>overall sequence similarity</it>, the <it>number of consecutive blocks of identities or non-identities </it>(CBINs) and the <it>overall number of consecutive matches (and mismatches)</it>, see also Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr>, and <b><it>Methods</it></b>, section <it>Features</it>.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Features displayed by the samples in the canonical training dataset</p>
            </caption>
            <text>
               <p><b>Features displayed by the samples in the canonical training dataset. </b>Panels <b>(A) </b>to <b>(C) </b>illustrate combinations of two of the three features. Panel <b>(D) </b>illustrates all three features at the same time. Samples arising from the comparison of paralogous sequences are shown in blue, whereas isoforms are shown in red. An <it>inverse CBIN count </it>of <it>1</it>/<it>n </it>arises if <it>n </it>CBINs are featured by a given sample. Though the samples of both classes separate well in general, some samples of one class "overlap" into the other class.</p>
            </text>
            <graphic file="1471-2105-7-110-2"/>
         </fig>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Illustration of the different cases of consecutive blocks of identities or non-identities (CBINs)</p>
            </caption>
            <text>
               <p><b>Illustration of the different cases of consecutive blocks of identities or non-identities (CBINs). (A) </b>CBIN of matches, <b>(B) </b>CBIN of gaps (counted as mismatches), <b>(C) </b>CBIN of mismatches, <b>(D) </b>example of a comparison of two sequences with an alignment length of 32. Matches are denoted by "|", mismatches by "#" and amino acids aligned to gaps by ":". The example alignment of length 32 features eight CBINs. The values of the three features are: (i) sequence similarity 0.594, (ii) inverse CBIN count 0.125, (iii) fraction of consecutive matches and mismatches 0.75.</p>
            </text>
            <graphic file="1471-2105-7-110-3"/>
         </fig>
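         <p>The three features can be sketched in a few lines of Python. This is a minimal illustration rather than the IsoSVM implementation: the exact definitions are given in Methods, and the function name, the minimum block length <it>min_run</it> and the value returned when no block is found are assumptions made only for this example. The input is a marker string in the notation of Figure 3 ("|" for a match, "#" for a mismatch, ":" for an amino acid aligned to a gap).</p>

```python
from itertools import groupby


def alignment_features(markers, min_run=2):
    """Compute three alignment features from a marker string.

    markers uses "|" for a match, "#" for a mismatch and ":" for a
    residue aligned to a gap (gaps are counted as mismatches).
    min_run is an ASSUMED minimum block length for a CBIN; it is not
    taken from the paper.
    """
    if not markers:
        raise ValueError("empty alignment")
    # True for a match, False for a mismatch or a gap position.
    states = [ch == "|" for ch in markers]
    length = len(states)
    similarity = sum(states) / length
    # Lengths of maximal runs of identical state (match vs. non-match).
    run_lengths = [len(list(group)) for _, group in groupby(states)]
    blocks = [n for n in run_lengths if n >= min_run]
    # ASSUMPTION: report 0.0 when the alignment contains no block at all.
    inverse_cbin = 1.0 / len(blocks) if blocks else 0.0
    # Fraction of alignment positions that lie inside a counted block.
    consecutive_fraction = sum(blocks) / length
    return similarity, inverse_cbin, consecutive_fraction


# Toy marker string (hypothetical, not taken from any figure).
print(alignment_features("||||##|:::||"))
```

         <p>Under these assumptions, two sequences that differ only by one long insertion yield few, long blocks (a large inverse CBIN count), whereas scattered substitutions yield many short blocks (a small inverse CBIN count).</p>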
         <p>To automate the classification, we chose supervised learning with a Support Vector Machine (SVM) <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. SVMs are gaining popularity in bioinformatics <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp> and are often superior to neural networks and Bayesian learning <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. SVM classifiers distinguish two classes of input data by calculating a separating hyperplane (decision surface) in a vector space <it>V </it>that is endowed with a dot product. The dot product is used as a measure of similarity. Data samples from the <it>input space </it>are mapped to the vector space <it>V </it>(usually of higher dimensionality than the input space), making it easier to find a separating hyperplane. The position and margin of the hyperplane are optimized in <it>V</it>, maximizing the distance of the hyperplane to the nearest instances of both classes. The kernel function is evaluated in input space but yields the same value as the dot product of the mapped samples in <it>V</it>. Thus, similarity in <it>V </it>can be measured without mapping the data explicitly. Without a kernel function, the dot products would have to be computed in <it>V </it>itself, which can be very time-consuming, depending on the structure of <it>V</it>. For an in-depth description of the properties and theory of SVMs, please see <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. We used the Support Vector Machine implementation SVMLight <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. In this paper, we introduce a highly accurate SVM-based method to distinguish between isoforms and paralogs on the protein level (that is, without the need for genomic information). Our software is freely available on the Web (see <b>Conclusions</b>).</p>
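         <p>The role of the kernel function can be illustrated with a small numerical check. The sketch below uses a homogeneous quadratic kernel purely as an example; this section does not state which kernel IsoSVM uses, and all function names are hypothetical. The two input vectors are the feature values quoted in the legend of Figure 1, with sequence similarity rescaled to a fraction (an assumption made for this example).</p>

```python
import math
from itertools import product


def dot(u, v):
    """Ordinary dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, y):
    """Kernel evaluated in input space: k(x, y) = (x . y) ** 2."""
    return dot(x, y) ** 2


def explicit_map(x):
    """Explicit map into V: all pairwise products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


# Feature vectors (similarity, inverse CBIN count, consecutive fraction).
paralog_pair = [0.7576, 0.0027, 0.7111]
isoform_pair = [0.9633, 0.3333, 0.9969]

in_input_space = quadratic_kernel(paralog_pair, isoform_pair)
in_space_v = dot(explicit_map(paralog_pair), explicit_map(isoform_pair))

# Both routes give the same similarity value (up to rounding).
print(math.isclose(in_input_space, in_space_v, rel_tol=1e-9))
```

         <p>For three input features, this explicit map already has nine components; for kernels whose space <it>V </it>has very high dimensionality, evaluating the kernel in input space avoids constructing the mapped vectors altogether.</p>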
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Importance of maximizing accuracy in distinguishing isoforms and paralogs</p>
            </st>
            <p>Why does isoform detection require such a high degree of accuracy? Why do we want to use an SVM even though this approach is usually employed when the dimensionality of the input space is (much) larger than three? The reason is that even small gains in accuracy matter at scale. For example, when performing 2,000 sequence comparisons, a 0.2% improvement in accuracy results in 4 fewer misclassifications. Such numbers are typical, for example, in applications of our automated phylogeny pipeline RiPE <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. When a large protein family is analyzed with RiPE, even a few misclassifications make a difference, since paralogs misidentified as isoforms (false positives) are deleted from the dataset. This may result in the loss of key members of the protein family, compromising the interpretation of the evolution of sequence, domain structure and function. (In this specific application, isoforms misidentified as paralogs (false negatives) do not pose a major problem.)</p>
         </sec>
         <sec>
            <st>
               <p>Performance statistics of different classifiers based on three features</p>
            </st>
            <p>We investigated three different classifiers designed to distinguish isoforms and paralogs. We calculated the <it>mean accuracy </it>and <it>standard error of the mean </it>for an SVM, a radial basis function (RBF) network <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and a linear classifier. Classification was based on three features, and samples were derived from protein data taken from GenBank <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (cf. <b><it>Methods</it></b>, section <it>Assessing performance of classifiers based on three features by jackknife resampling</it>). The SVM classifier showed better accuracy and a smaller standard error of the mean than the two other classifiers. In detail, the SVM classifier shows a mean accuracy of 99.55% and a standard error of 0.008. In contrast, the classifier based on the RBF network shows a mean accuracy of 99.33% and a standard error of 0.011, while for the linear classifier a mean accuracy of 99.42% and a standard error of 0.011 were observed. Mean accuracy, mean precision and true positive/true negative (TP/TN) and false positive/false negative (FP/FN) numbers for the three classifiers are given in Table <tblr tid="T1">1</tblr> and illustrated in Figure <figr fid="F4">4</figr>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Mean accuracy and standard error of the mean of various classifiers, using three features derived from the alignment of the sequences to be compared. 100-fold jackknife resampling was employed. "&#177; " denotes the standard error of the mean.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>SVM classifier</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Accuracy</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Precision</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Negatives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Negatives</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>99.55% &#177; 0.008</p>
                     </c>
                     <c ca="center">
                        <p>99.31% &#177; 0.015</p>
                     </c>
                     <c ca="center">
                        <p>1897.1 &#177; 0.21</p>
                     </c>
                     <c ca="center">
                        <p>1887.9 &#177; 0.28</p>
                     </c>
                     <c ca="center">
                        <p>13.1 &#177; 0.28</p>
                     </c>
                     <c ca="center">
                        <p>3.9 &#177; 0.21</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>RBF network classifier</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Accuracy</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Precision</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Negatives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Negatives</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>99.33% &#177; 0.011</p>
                     </c>
                     <c ca="center">
                        <p>98.91% &#177; 0.019</p>
                     </c>
                     <c ca="center">
                        <p>1896.5 &#177; 0.22</p>
                     </c>
                     <c ca="center">
                        <p>1880.1 &#177; 0.38</p>
                     </c>
                     <c ca="center">
                        <p>20.9 &#177; 0.38</p>
                     </c>
                     <c ca="center">
                        <p>4.6 &#177; 0.22</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>3-feature linear classifier</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Accuracy</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Precision</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>True Negatives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Positives</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>False Negatives</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>99.42% &#177; 0.011</p>
                     </c>
                     <c ca="center">
                        <p>99.22% &#177; 0.020</p>
                     </c>
                     <c ca="center">
                        <p>1893.8 &#177; 0.35</p>
                     </c>
                     <c ca="center">
                        <p>1886.0 &#177; 0.39</p>
                     </c>
                     <c ca="center">
                        <p>15.0 &#177; 0.39</p>
                     </c>
                     <c ca="center">
                        <p>7.2 &#177; 0.35</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
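            <p>For readers who want to recompute summary statistics of this kind from their own resampling runs, the following sketch computes a mean and its standard error in the conventional way (sample standard deviation divided by the square root of the number of runs). The procedure actually used is described in Methods and may differ in detail; the function name and the input values are hypothetical and are not results from this study.</p>

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of values."""
    n = len(values)
    if 2 > n:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    # Conventional SEM: sample standard deviation over sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# HYPOTHETICAL per-run accuracies in percent, for illustration only.
toy_accuracies = [99.5, 99.6, 99.4, 99.7, 99.55]
mean, sem = mean_and_sem(toy_accuracies)
print(f"{mean:.2f}% +/- {sem:.3f}")
```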
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Accuracy of classifiers measured by jackknife resampling, employing all three features</p>
               </caption>
               <text>
                  <p><b>Accuracy of classifiers measured by jackknife resampling, employing all three features. </b>Performance of the SVM classifier is compared to classifiers based on an RBF network as well as a linear classifier. Mean accuracy and standard error of the mean were assessed by 100-fold jackknife resampling using 7604 samples resulting from a visual inspection process of protein sequences taken from GenBank.</p>
               </text>
               <graphic file="1471-2105-7-110-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Performance of different classifiers using a canonical training/testing dataset</p>
            </st>
            <p>In the following, we report results that are not supported by resampling but derived from a specific ("canonical") training and testing dataset (cf. <b><it>Methods</it></b>, section <it>Canonical training and testing dataset</it>). In this way, we were able to explore a wide variety of classifiers on a large dataset (3802 samples) in reasonable time.</p>
            <p>The SVM classifier distinguishes isoforms and paralogs of the canonical testing dataset with an accuracy of 99.63% and a precision of 99.37% (cf. Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr>). All three sequence-based features used by the SVM (cf. Figure <figr fid="F2">2</figr>) contributed to accuracy; results based on any combination of two features only were inferior, as shown in Table <tblr tid="T3">3</tblr>.</p>
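            <p>The accuracy and precision quoted for the canonical testing dataset follow from the counts of true and false positives and negatives listed in Table 2, taking isoforms as the positive class. The short sketch below uses the standard definitions of both measures, which we assume match those in Methods; the function names are chosen for this example only.</p>

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives (isoforms) that are correct."""
    return tp / (tp + fp)


# Counts for the full-length (canonical) testing dataset in Table 2.
tp, tn, fp, fn = 1899, 1889, 12, 2

print(f"accuracy  = {100 * accuracy(tp, tn, fp, fn):.2f}%")  # 99.63%
print(f"precision = {100 * precision(tp, fp):.2f}%")         # 99.37%
```

            <p>Because false positives (paralogs misidentified as isoforms) are the costly error in the application described above, precision is reported alongside accuracy.</p>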
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Performance of the SVM classifier (accuracy/precision) on four testing scenarios.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>Full-length-sequence (canonical testing dataset)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>True Positives</p>
                     </c>
                     <c ca="center">
                        <p>True Negatives</p>
                     </c>
                     <c ca="center">
                        <p>False Positives</p>
                     </c>
                     <c ca="center">
                        <p>False Negatives</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>99.63%</p>
                     </c>
                     <c ca="center">
                        <p>99.37%</p>
                     </c>
                     <c ca="center">
                        <p>1899</p>
                     </c>
                     <c ca="center">
                        <p>1889</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>Selected <it>Xenopus </it>EST data</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>True Positives</p>
                     </c>
                     <c ca="center">
                        <p>True Negatives</p>
                     </c>
                     <c ca="center">
                        <p>False Positives</p>
                     </c>
                     <c ca="center">
                        <p>False Negatives</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>97.93%</p>
                     </c>
                     <c ca="center">
                        <p>99.23%</p>
                     </c>
                     <c ca="center">
                        <p>129</p>
                     </c>
                     <c ca="center">
                        <p>155</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>Homologous-regions-only</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>True Positives</p>
                     </c>
                     <c ca="center">
                        <p>True Negatives</p>
                     </c>
                     <c ca="center">
                        <p>False Positives</p>
                     </c>
                     <c ca="center">
                        <p>False Negatives</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>98.98%</p>
                     </c>
                     <c ca="center">
                        <p>97.57%</p>
                     </c>
                     <c ca="center">
                        <p>2529</p>
                     </c>
                     <c ca="center">
                        <p>5455</p>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="center">
                        <p>
                           <b>ABC protein homologous-regions-only</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Accuracy</p>
                     </c>
                     <c ca="center">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>True Positives</p>
                     </c>
                     <c ca="center">
                        <p>True Negatives</p>
                     </c>
                     <c ca="center">
                        <p>False Positives</p>
                     </c>
                     <c ca="center">
                        <p>False Negatives</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>95.65%</p>
                     </c>
                     <c ca="center">
                        <p>110</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Performance comparison of the three-feature SVM classifier to linear classifiers, an RBF network classifier and other SVM classifiers, using canonical training and testing datasets.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Accuracy</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Feature(s)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Canonical testing dataset</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Homologous-regions-only testing dataset</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>3-feature SVM classifier</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p><b>Sequence similarity, inverse CBIN count, match/mismatch fraction </b>(cf. Table 2)</p>
                     </c>
                     <c ca="center">
                        <p>99.63%</p>
                     </c>
                     <c ca="center">
                        <p>98.98%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>2-feature SVM classifiers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Match/mismatch fraction, sequence similarity</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>97.50%</p>
                     </c>
                     <c ca="center">
                        <p>96.68%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Inverse CBIN count, sequence similarity</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.32%</p>
                     </c>
                     <c ca="center">
                        <p>98.97%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Match/mismatch fraction, inverse CBIN count</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.42%</p>
                     </c>
                     <c ca="center">
                        <p>98.91%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>RBF Network classifier</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Sequence similarity, inverse CBIN count, match/mismatch fraction</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.32%</p>
                     </c>
                     <c ca="center">
                        <p>98.79%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>3-feature linear classifier</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Sequence similarity, inverse CBIN count, match/mismatch fraction</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.42%</p>
                     </c>
                     <c ca="center">
                        <p>98.80%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>2-feature linear classifiers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Match/mismatch fraction, sequence similarity</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.03%</p>
                     </c>
                     <c ca="center">
                        <p>98.75%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Inverse CBIN count, sequence similarity</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.32%</p>
                     </c>
                     <c ca="center">
                        <p>98.67%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Match/mismatch fraction, inverse CBIN count</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.37%</p>
                     </c>
                     <c ca="center">
                        <p>98.77%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>1-feature linear classifiers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Sequence similarity</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>82.22%</p>
                     </c>
                     <c ca="center">
                        <p>82.02%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Match/mismatch fraction</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>98.05%</p>
                     </c>
                     <c ca="center">
                        <p>98.62%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Inverse CBIN count</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>99.37%</p>
                     </c>
                     <c ca="center">
                        <p>98.75%</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>A linear classifier calculated using all three features of the samples in the canonical training dataset classified the canonical testing dataset with an accuracy of 99.42%. Linear classifiers trained on each possible combination of only two features gave slightly inferior results (99.03% to 99.37%) compared to the linear classifier based on all three features. Not surprisingly, the best-performing classifier based on two features does not use the weakest feature, <it>sequence similarity</it>. Classifiers based on <it>sequence similarity </it>alone are weak in distinguishing between isoforms and paralogs and perform much worse than any other tested classifier; a linear classifier derived by line-sweeping using the feature <it>sequence similarity </it>alone results in an accuracy of approximately 82%. Linear classifiers based on either of the other two features alone perform surprisingly well, however (98.05% and 99.37%; cf. Table <tblr tid="T3">3</tblr>).</p>
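            <p>The single-feature line-sweeping idea can be sketched as follows. This is only an illustrative sketch of one plausible reading, not the implementation used in this study: the function name sweep_threshold, the decision direction (a value at or above the cut-off means isoform) and the toy values and labels are all assumptions made up for the example. The sketch tries a cut-off between every pair of neighbouring feature values and keeps the one with the highest accuracy on the labelled samples.</p>

```python
from typing import List, Tuple


def sweep_threshold(values: List[float], is_isoform: List[bool]) -> Tuple[float, float]:
    """Return (threshold, accuracy) of the best rule 'value >= threshold means isoform'."""
    ordered = sorted(set(values))
    # Candidate cut-offs: one below all points, the midpoints between
    # neighbouring values, and one above all points.
    candidates = [ordered[0] - 1.0]
    candidates += [(low + high) / 2.0 for low, high in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    best_threshold, best_accuracy = candidates[0], -1.0
    for threshold in candidates:
        # Count samples whose predicted class equals the known label.
        correct = sum((v >= threshold) == label for v, label in zip(values, is_isoform))
        accuracy = correct / len(values)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Invented toy data (hypothetical feature values, not taken from the paper).
toy_values = [0.50, 0.33, 0.25, 0.04, 0.02, 0.01]
toy_labels = [True, True, True, False, False, False]
threshold, accuracy = sweep_threshold(toy_values, toy_labels)
print(f"threshold={threshold:.3f} accuracy={accuracy:.2%}")
```

            <p>On real data the two classes overlap, so the best cut-off misclassifies some samples; the accuracy of such a one-dimensional rule is what the 1-feature rows of Table 3 report.</p>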
            <p>Finally, the radial basis function (RBF) network classifier <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> (cf. <b><it>Methods</it>, </b>section <it>Training of the radial basis function network</it>) applied to the canonical testing dataset using all three features results in an accuracy of 99.32%.</p>
         </sec>
         <sec>
            <st>
               <p>Application of the SVM classifier to EST data</p>
            </st>
            <p>As a first real-life application, we used IsoSVM to search for isoforms within the CAP3-derived contigs of 722 <it>Xenopus laevis </it>EST clusters <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. <it>Xenopus laevis</it>, as an allotetraploid species, has undergone a genome-wide duplication; therefore, many genes are represented by two paralogs. Isoforms of <it>X</it>. <it>laevis </it>proteins have not yet been studied in any systematic way. Sequencing the <it>X. laevis </it>genome is made difficult by its sheer size, and the available genomic sequence data are too sparse to study the intron-exon structures of most genes. Contigs were derived from 350,468 <it>Xenopus </it>ESTs downloaded from GenBank. After cleanup of the EST data (high-quality sequence clipping, vector and repeat masking), sequences were clustered using an enhanced suffix array based approach <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> implemented in the tool Vmatch <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Clustering resulted in 25,971 clusters, which were assembled into 31,353 contigs using CAP3 <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Table <tblr tid="T4">4</tblr> summarizes the results of the clustering process.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Summary of <it>Xenopus </it>EST cleanup and clustering.</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Total number of ESTs and cDNAs</p>
                     </c>
                     <c ca="right">
                        <p>350,468</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of good sequences</p>
                     </c>
                     <c ca="right">
                        <p>317,242</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Average trimmed EST length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>536</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of clusters</p>
                     </c>
                     <c ca="right">
                        <p>25,971</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of singletons</p>
                     </c>
                     <c ca="right">
                        <p>40,877</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of CAP3 contigs</p>
                     </c>
                     <c ca="right">
                        <p>31,353</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of CAP3 singletons</p>
                     </c>
                     <c ca="right">
                        <p>4,801</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Average CAP3 contig length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>1,045</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Max. cluster size (no. of ESTs)</p>
                     </c>
                     <c ca="right">
                        <p>6,332</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Average cluster size (no. of ESTs)</p>
                     </c>
                     <c ca="right">
                        <p>10.6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Cluster size (no. of ESTs):</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>No. of clusters</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>4,097 &#8211; 8,192</p>
                     </c>
                     <c ca="right">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>2,049 &#8211; 4,096</p>
                     </c>
                     <c ca="right">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>1,025 &#8211; 2,048</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>513 &#8211; 1,024</p>
                     </c>
                     <c ca="right">
                        <p>15</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>257 &#8211; 512</p>
                     </c>
                     <c ca="right">
                        <p>35</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>129 &#8211; 256</p>
                     </c>
                     <c ca="right">
                        <p>116</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>65 &#8211; 128</p>
                     </c>
                     <c ca="right">
                        <p>414</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>33 &#8211; 64</p>
                     </c>
                     <c ca="right">
                        <p>973</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>17 &#8211; 32</p>
                     </c>
                     <c ca="right">
                        <p>1,755</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>9 &#8211; 16</p>
                     </c>
                     <c ca="right">
                        <p>2,974</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>5 &#8211; 8</p>
                     </c>
                     <c ca="right">
                        <p>4,571</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>3 &#8211; 4</p>
                     </c>
                     <c ca="right">
                        <p>6,444</p>
                     </c>
                  </r>
                  <r>
                     <c indent="2" ca="left">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>8,670</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>To assess whether the splitting of clusters by CAP3 into several contigs was caused by grouping isoforms into the same cluster, or whether the splitting was due to paralogs, we extracted 722 clusters that have multiple contigs (2,243 contigs total), and for which each contig has a full-length protein match in the protein NR database <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Most of the 722 clusters consist of only two contigs; only a fraction feature three or more. Treating each contig consensus as a sequence, IsoSVM compared 5,459 sequence pairs within clusters. Of these samples, 348 were left out because they represent contigs with almost no overlap, i.e. sequence pairs of low (&lt;1%) similarity. Of the remaining 5,111 samples, 986 (19.3%) were classified as isoforms and 4,125 (80.7%) as paralogs. To assess the accuracy of this analysis, 290 randomly chosen samples were reviewed manually (cf. Table <tblr tid="T2">2</tblr>); an accuracy of 97.93% and a precision of 99.23% were found. (In a few cases, early EST sequencing termination events produce a block of amino acids aligned with gaps at the end of the two sequences compared; such cases are classified as isoforms, and they were counted as such.)</p>
         </sec>
         <sec>
            <st>
               <p>Application of the SVM classifier to an automated phylogeny pipeline</p>
            </st>
            <p>As a second application, the classifier was incorporated into a pipeline for automatic generation of protein phylogenies called RiPE <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>, with the aim to further reduce the redundancy of the RiPE-retrieved protein data by recognizing and deleting sequences that are isoforms. Isoforms are usually considered irrelevant data in phylogenetic tree inference and analysis. RiPE data are generated by homology search (PSIBLAST, <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>), retrieving hits with putative homology to a search profile and assembling HSP-based homologous-regions-only data as described in <b><it>Methods</it>, </b>section <it>Homologous regions only</it>. The pipeline already features a redundancy minimization stage, sorting out hits that are similar to other hits (95% identity or more). The IsoSVM classifier was incorporated, enabling the detection and deletion of isoforms, thus decreasing dataset size and redundancy while simultaneously increasing computational speed and legibility of the phylogenetic tree. We first tested the ability of our classifier to deal with homologous-regions-only data (using the testing dataset described in <b><it>Methods</it></b>, section <it>Homologous regions only</it>), noting an accuracy of 98.98% and a precision of 97.57% (cf. Table <tblr tid="T2">2</tblr>). Training on homologous-regions-only data did not improve classifier performance (data not shown).</p>
            <p>Following our interest in ABC (<b>A</b>TP-<b>b</b>inding <b>c</b>assette) proteins, which are found in a wide variety of species and are of major biomedical importance, a dataset of 1,349 ABC protein hits was then retrieved by RiPE from 20 model proteomes (12 eukaryotes, 6 bacteria and 2 archaea) using 48 known human ABC proteins <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> as search profile. 115 hits were identified as isoforms of another hit by the SVM classifier. As a further check, all 115 putative isoforms were inspected visually, the automatic classification (isoform or paralog) was checked, and a precision of 95.65% was found. The accuracy of the classifier was not calculated in this case since RiPE reports only samples classified as positives (i.e. isoforms). While the precision reported is based on the number of false positives (i.e. sequences representing paralogous sequences being reported as isoforms), assessment of accuracy would require the visual inspection of tens of thousands of samples of (putative) paralogs, i.e. putative false negatives. Removal of isoforms resulted in a reduction of dataset size by about 8%, rendering the eukaryotic parts of the tree much more legible.</p>
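            <p>The difference between the two measures can be made concrete with a short calculation. The sketch below uses the standard definitions of precision and accuracy together with the counts listed in Table 2; the function and variable names are illustrative choices and are not part of IsoSVM or RiPE. Precision needs only the inspected positives, whereas accuracy also needs the true and false negatives, which were not determined for the ABC protein data.</p>

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of samples reported as isoforms that really are isoforms."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# ABC protein data: only the 115 reported positives were inspected,
# so precision can be computed but accuracy cannot.
abc_precision = precision(tp=110, fp=5)

# Homologous-regions-only testing dataset: all four counts are known.
hro_precision = precision(tp=2529, fp=63)
hro_accuracy = accuracy(tp=2529, tn=5455, fp=63, fn=19)

print(f"{abc_precision:.2%} {hro_precision:.2%} {hro_accuracy:.2%}")
# prints: 95.65% 97.57% 98.98%
```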
         </sec>
         <sec>
            <st>
               <p>Limitations of the classifier</p>
            </st>
            <p>Despite showing reliable performance, the SVM classifier is not perfect. It may mistakenly classify a small portion of highly similar paralogs as isoforms, since they feature long stretches of identical amino acid sequence. Further, sequences that are fragments of other sequences will be classified as isoforms.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The SVM classifier, trained using visually classified cases of isoform and paralog relationships, proved to be reliable in all tests, exhibiting an accuracy of over 97% and a precision of over 95%. We are thus able to distinguish isoforms and paralogs in a satisfactory way, no matter whether full-length, homologous-regions-only or EST cluster sequences are handled. In particular, for species such as <it>Xenopus laevis</it>, for which few detailed analyses of the evolution of genes and proteins exist, the analysis of paralogs and isoforms can improve statistical models of sequence evolution, e.g. regarding the likelihood of gene duplication and alternative splicing. Overall, the IsoSVM tool should be useful for researchers in several fields of genomic research and EST analysis as a reliable method of automatic isoform identification. Our software is freely available at the IsoSVM Website <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, under an open source license.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>To automatically determine if one protein sequence is an isoform of another, we first derive three features, characterizing the degree and pattern of matches and mismatches in a pairwise alignment of the two sequences as detailed in the paragraphs below. The three features depend on the length of the alignment of the two sequences and <b>c</b>onsecutive <b>b</b>locks of <b>i</b>dentities or <b>n</b>on-identities (CBINs).</p>
         <sec>
            <st>
               <p>Prerequisites</p>
            </st>
            <sec>
               <st>
                  <p>Length of the alignment (<it>l</it>)</p>
               </st>
               <p>The length of the alignment of two protein sequences <it>a </it>and <it>b </it>is used in two of the features described below to normalize their values to a range from 0 to 1. This was done to avoid numerical problems that may affect classification performance, and to prevent features of large absolute magnitude from numerically dominating smaller ones during training of the SVM (cf. <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>).</p>
            </sec>
            <sec>
               <st>
                  <p>Consecutive blocks of identities or non-identities (CBIN)</p>
               </st>
               <p>A CBIN is a block in which the alignment features consecutive matches or mismatches (cf. Figure <figr fid="F3">3</figr>). Few large CBINs are characteristic of comparisons of isoforms, whereas many short CBINs are typically found in comparisons of paralogs (cf. Figure <figr fid="F1">1</figr>, illustrating the comparison of two isoforms and two paralogs).</p>
               <p>There are two possible cases of a CBIN. First, if sequence <it>a </it>features a subsequence of length <it>c </it>starting at position <it>i </it>(with <it>c </it>between 1 and <it>l-i</it>) that is a maximum run of exact matches (that cannot be extended any further) to its aligned counterpart of sequence <it>b</it>, then this block of consecutive matches is a CBIN of length <it>c</it>. Second, if sequence <it>a </it>features a subsequence of length <it>c </it>starting at position <it>i </it>(with <it>c </it>between 1 and <it>l-i</it>) that is a maximum run of mismatches to its aligned counterpart of sequence <it>b</it>, then this block of consecutive mismatches is a CBIN of length <it>c</it>. Formally, for internal CBINs that are not located at the beginning or at the end of the alignment, we have</p>
               <p><it>a</it><sub><it>k </it></sub>= <it>b</it><sub><it>k </it></sub>for all <it>k</it>, <it>k </it>= <it>i</it>,...,<it>i</it>+<it>c </it>&#160;&#160;&#160; <b>and </b>&#160;&#160;&#160; <it>a</it><sub><it>i</it>-1 </sub>&#8800; <it>b</it><sub><it>i</it>-1 </sub>&#160;&#160;&#160; <b>and </b>&#160;&#160;&#160; <it>a</it><sub><it>i</it>+<it>c</it>+1 </sub>&#8800; <it>b</it><sub><it>i</it>+<it>c</it>+1 </sub></p>
               <p>or</p>
               <p><it>a</it><sub><it>k </it></sub>&#8800; <it>b</it><sub><it>k </it></sub>for all <it>k</it>, <it>k </it>= <it>i</it>,...,<it>i</it>+<it>c </it>&#160;&#160;&#160; <b>and </b>&#160;&#160;&#160; <it>a</it><sub><it>i</it>-1 </sub>= <it>b</it><sub><it>i</it>-1 </sub>&#160;&#160;&#160; <b>and </b>&#160;&#160;&#160; <it>a</it><sub><it>i</it>+<it>c</it>+1 </sub>= <it>b</it><sub><it>i</it>+<it>c</it>+1 </sub>&#160;&#160;&#160; (1)</p>
               <p>where <it>i </it>is the start coordinate and <it>i+c </it>the end coordinate of the maximum block of matches or mismatches. For CBINs that are not internal, the definition can be generalized in an obvious way. Amino acids aligned with gaps are considered mismatches.</p>
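               <p>The CBIN definition can be illustrated by a minimal Python sketch. It is not part of the IsoSVM software; the function names, the gap character "-" and the 0-based coordinates are assumptions made for this illustration only. Each alignment column is labeled as match or mismatch (gap-aligned residues counting as mismatches), and maximal runs of equal labels are reported as CBINs.</p>

```python
from itertools import groupby

GAP = "-"  # assumed gap symbol of the alignment


def is_match(x, y):
    """A column is a match only if both residues are identical and not gaps."""
    return x == y and x != GAP


def cbin_blocks(a, b):
    """Return the CBINs of two aligned sequences.

    Each CBIN is a tuple (is_match, start, length) with a 0-based start.
    Residues aligned with gaps are treated as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    states = [is_match(x, y) for x, y in zip(a, b)]
    blocks = []
    start = 0
    # groupby yields maximal runs of identical labels, i.e. runs that
    # cannot be extended any further in either direction.
    for state, run in groupby(states):
        length = len(list(run))
        blocks.append((state, start, length))
        start += length
    return blocks


if __name__ == "__main__":
    for block in cbin_blocks("MKTAYIAK--QR", "MKTGYIAKLLQR"):
        print(block)
```

               <p>For the two toy sequences in the sketch, five CBINs are found: three blocks of matches separated by a single substitution and by a two-residue gap.</p>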
            </sec>
         </sec>
         <sec>
            <st>
               <p>Features</p>
            </st>
            <sec>
               <st>
                  <p>Sequence similarity</p>
               </st>
               <p>Sequence similarity is the overall number of matches in the alignment of the sequences <it>a </it>and <it>b</it>, divided by its length <it>l</it>:</p>
               <p>
                  <m:math name="1471-2105-7-110-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>Feature&#160;</m:mtext>
                           <m:mn>1</m:mn>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mrow>
                                       <m:mo>{</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:mo>&#8230;</m:mo>
                                       <m:mo>,</m:mo>
                                       <m:mi>l</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>a</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:msub>
                                          <m:mi>b</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>}</m:mo>
                                    </m:mrow>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mi>l</m:mi>
                           </m:mfrac>
                           <m:mo>,</m:mo>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>2</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGgbGrcqqGLbqzcqqGHbqycqqG0baDcqqG1bqDcqqGYbGCcqqGLbqzcqqGGaaicqaIXaqmcqGH9aqpdaWcaaqaamaaemaabaGaei4EaSNaemyAaKMaeiilaWIaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeSOjGSKaeiilaWIaemiBaWMaeiiFaWNaemyyae2aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGIbGydaWgaaWcbaGaemyAaKgabeaakiabc2ha9bGaay5bSlaawIa7aaqaaiabdYgaSbaacqGGSaalcaWLjaGaaCzcamaabmaabaGaeGOmaidacaGLOaGaayzkaaaaaa@56F3@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>where |<it>M</it>| denotes the number of elements in a set <it>M</it>.</p>
            </sec>
            <sec>
               <st>
                  <p>Inverse CBIN count</p>
               </st>
               <p>As the second feature we use the reciprocal of the number of CBINs <it>n </it>in the pair of aligned sequences:</p>
               <p>
                  <m:math name="1471-2105-7-110-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>Feature&#160;2</m:mtext>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mi>n</m:mi>
                           </m:mfrac>
                           <m:mo>.</m:mo>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>3</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGgbGrcqqGLbqzcqqGHbqycqqG0baDcqqG1bqDcqqGYbGCcqqGLbqzcqqGGaaicqqGYaGmcqGH9aqpdaWcaaqaaiabigdaXaqaaiabd6gaUbaacqGGUaGlcaWLjaGaaCzcamaabmaabaGaeG4mamdacaGLOaGaayzkaaaaaa@3FB7@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
            </sec>
            <sec>
               <st>
                  <p>Fraction of consecutive matches and mismatches</p>
               </st>
               <p>This feature describes the overall number of consecutive matches and mismatches (not counting the match or mismatch at the first position of a CBIN). In other words, it is the sum of the lengths <it>c</it><sub><it>j </it></sub>minus one, of all <it>n </it>CBINs (with <it>j = 1..n</it>), divided by <it>l</it>:</p>
               <p>
                  <m:math name="1471-2105-7-110-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>Feature&#160;</m:mtext>
                           <m:mn>3</m:mn>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>n</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>c</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mi>l</m:mi>
                           </m:mfrac>
                           <m:mo>.</m:mo>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>4</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGgbGrcqqGLbqzcqqGHbqycqqG0baDcqqG1bqDcqqGYbGCcqqGLbqzcqqGGaaicqaIZaWmcqGH9aqpdaWcaaqaamaaqahabaGaeiikaGIaem4yam2aaSbaaSqaaiabdQgaQbqabaGccqGHsislcqaIXaqmcqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaGcbaGaemiBaWgaaiabc6caUiaaxMaacaWLjaWaaeWaaeaacqaI0aanaiaawIcacaGLPaaaaaa@4C43@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>The feature <it>fraction of consecutive matches and mismatches </it>is abbreviated as <it>match-mismatch fraction </it>in all figures and tables. In the following we describe how the training and testing datasets were generated, the learning pipeline and the validation of classifier performance.</p>
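               <p>As a worked illustration of Equations (2) to (4), the following Python sketch computes the three features for one pairwise alignment. It is a sketch under stated assumptions rather than the IsoSVM implementation: the function name and the gap symbol "-" are hypothetical, and gap-aligned residues are counted as mismatches as described above.</p>

```python
from itertools import groupby


def isoform_features(a, b, gap="-"):
    """Return (Feature 1, Feature 2, Feature 3) for aligned sequences a and b.

    Feature 1: number of matches divided by the alignment length l.
    Feature 2: reciprocal of the number n of CBINs.
    Feature 3: sum over all CBINs of (c_j - 1), divided by l.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty aligned sequences of equal length")
    l = len(a)
    # True for a match, False for a mismatch or a gap-aligned residue.
    states = [x == y and x != gap for x, y in zip(a, b)]
    # Lengths c_j of the maximal runs of matches or mismatches (the CBINs).
    lengths = [len(list(run)) for _, run in groupby(states)]
    n = len(lengths)
    similarity = sum(states) / l
    inverse_cbin_count = 1 / n
    match_mismatch_fraction = sum(c - 1 for c in lengths) / l
    return similarity, inverse_cbin_count, match_mismatch_fraction


if __name__ == "__main__":
    print(isoform_features("MKTAYIAK--QR", "MKTGYIAKLLQR"))
```

               <p>Because the CBINs partition the alignment, the lengths <it>c</it><sub><it>j </it></sub>sum to <it>l</it>, so Feature 3 equals (<it>l </it>- <it>n</it>)/<it>l</it>. For the toy alignment in the sketch (<it>l </it>= 12, nine matches, <it>n </it>= 5) the features are 0.75, 0.2 and 7/12.</p>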
            </sec>
         </sec>
         <sec>
            <st>
               <p>Generation of the training and testing datasets</p>
            </st>
            <sec>
               <st>
                  <p>Sequence retrieval, homology search and visual classification</p>
               </st>
               <p>The NCBI non-redundant (NR) database <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> was used as the source for retrieving protein sequences and was downloaded from the NCBI FTP server on March 8, 2004. The NR database was then searched for sequences annotated as "isoform" or "splice variant". 13,061 sequences featuring at least one of the two keywords were found and retrieved from the NR database, establishing a set of unrelated sequences that are from any species for which isoforms can be expected to exist. From this set, 250 sequences were randomly selected to give rise to the canonical training and testing datasets, as follows (for a complete list of taxa included in this set please consult the supplementary material [see <supplr sid="S1">Additional file 1</supplr>]).</p>
               <suppl id="S1">
                  <title>
                     <p>Additional File 1</p>
                  </title>
                  <text>
                     <p>Supplementary Material. Threshold levels and kernel parameters used; species affiliation of BLAST query sequences; illustration of the line-sweeping procedure.</p>
                  </text>
                  <file name="1471-2105-7-110-S1.doc">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <p>For all 250 sequences a BLAST search <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> was performed, again on the NR database, using each sequence as the query sequence. BLAST standard parameters and an E-value threshold of 10<sup>-90</sup> were used to ensure that no unrelated hits were retrieved. For 176 of the 250 query sequences, hits corresponding to putative homologous sequences or isoforms were found. All sequences corresponding to hits from the same species as the query were retrieved from the NR database in full length. Sequences were then aligned using the program <it>fftnsi </it>of the MAFFT package <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> using default values (PAM200 log-odds matrix <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, gap open penalty 2.4, gap extension penalty 0.06). The resulting multiple alignment gives rise to pairwise alignments of all pairs of sequences. We obtained each pairwise alignment from a multiple alignment to improve the quality of the pairwise alignment (see e.g. <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>). Finally, each pair of sequences was assigned to one out of two possible classes (+1,-1) based on visual inspection (cf. Figure <figr fid="F5">5</figr>). A value of +1 indicates isoforms and a value of -1 paralogs. A few cases where no clear decision was possible and sequence pairs of low similarity (&lt;1%) were not included in the data. More specifically, two sequences are classified as isoforms if their alignment displays the following evidence:</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Visual inspection process</p>
                  </caption>
                  <text>
                     <p><b>Visual inspection process. </b>Matches in the alignments are colored in blue and mismatches in red. Amino acids aligned to gaps are indicated in green. Panels <b>(A) </b>to <b>(D) </b>illustrate alignments of two protein sequences classified as isoforms (panels <b>(A) </b>and <b>(B)</b>) or as paralogs (panels <b>(C) </b>and <b>(D)</b>). The sequences shown in panel <b>(A) </b>feature a shared subsequence (a putative constitutive exon), marked in blue. The upper sequence features an additional exon at the beginning (marked in green) that is missing in the lower sequence. In contrast, a putative exon at the end (also shown in green) is found in the lower sequence only. Comparison of the two putative isoforms shown in panel <b>(B) </b>reveals two constitutive exons in the middle and towards the end of the alignment, colored in blue (the only mismatch is interpreted as a sequencing error, or a polymorphism). These are separated by a stretch of amino acids aligned to gaps, interpreted as an exon skipped in the lower sequence. At the beginning of the alignment, the upper sequence features a long stretch of amino acids aligned to gaps and a few mismatches; two mutually exclusive exons are a plausible interpretation, since the lower sequence (starting with G and not with M) is incomplete and its first exon is probably much longer. At the end of the alignment both sequences feature a stretch of mismatches and gaps (colored in red), interpreted as mutually exclusive exons (indicated by a black frame). The sequences compared in panel <b>(C) </b>give rise to a sample of the paralog class. In general, the alignment features many mismatches, interpreted as substitutions, and six stretches of amino acids aligned to gaps (putative deletions). Panel <b>(D) </b>illustrates another putative paralog. Besides a shared stretch (featuring numerous substitutions) in the middle of the alignment, the upper sequence features putative deletions, or missing exons. 
It may thus be a case of an isoform of a paralog.</p>
                  </text>
                  <graphic file="1471-2105-7-110-5"/>
               </fig>
               <p>1. We observe large blocks of (almost) identical sequence with no (or few) mismatches that can be interpreted as common exons, except for a few sequencing errors or polymorphisms.</p>
               <p>2. Additionally, we observe either one or both of the following:</p>
               <p>i. We observe one or more sequence blocks that do not match (interspersed with a few random matches) which can be interpreted as mutually exclusive exons of similar size that are spuriously aligned and which are embedded in blocks of (almost) identical sequence.</p>
               <p>ii. We observe one or more sequence blocks that align to gap characters which can be interpreted as surplus amino acids that arise if mutually exclusive exons of different length are spuriously aligned, or if exon(s) are missing in one of the sequences, or if an exon has an alternative splice site such that it is observed in a short and in a long version, and which are again embedded in blocks of (almost) identical sequence.</p>
               <p>In contrast, two sequences are classified as paralogs if there is a large sequence block that displays sufficient similarity to allow assumption of common evolutionary origin, interspersed with a sufficiently large number of mismatches that must be interpreted as substitutions and that cannot be interpreted as sequencing errors, etc. Paralogs may feature deletions that give rise to observations similar to the ones in (i) and (ii) which are however embedded in blocks of sufficient similarity with many mismatches.</p>
            </sec>
            <sec>
               <st>
                  <p>Canonical training and testing dataset</p>
               </st>
               <p>The dataset resulting from visual inspection featured 3,802 samples of the isoform class and 8,757 of the paralog class. Initial training with many more paralogs than isoforms gave inferior testing results (data not shown). Therefore, to prevent one class from outweighing the other during SVM training, the larger class was truncated to 3,802 samples. One half of the dataset, consisting of 1,901 isoform and 1,901 paralog samples, was designated the canonical training dataset, and the other half the canonical testing dataset. As can be seen from Figure <figr fid="F2">2</figr>, the two classes separate quite well, although close inspection reveals that the boundary between them is in fact quite complex.</p>
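               <p>The balancing and splitting step can be sketched as follows. This is an illustration only: the text does not state how the samples to be kept or the two halves were selected, so the random selection with a fixed seed, as well as the function and parameter names, are assumptions of this sketch.</p>

```python
import random


def balance_and_split(isoforms, paralogs, seed=0):
    """Truncate the larger class, then split each class into two halves.

    Returns (train, test), each a list of (sample, label) pairs with
    label +1 for isoforms and -1 for paralogs.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    size = min(len(isoforms), len(paralogs))
    half = size // 2
    train, test = [], []
    for label, samples in ((+1, isoforms), (-1, paralogs)):
        # Random subset of `size` samples (assumed; truncation rule not given).
        chosen = rng.sample(list(samples), size)
        train += [(s, label) for s in chosen[:half]]
        test += [(s, label) for s in chosen[half:]]
    return train, test


if __name__ == "__main__":
    # Placeholder integers stand in for the feature vectors.
    train, test = balance_and_split(range(3802), range(8757))
    print(len(train), len(test))
```

               <p>With the class sizes given above, both halves contain 3,802 samples, 1,901 of each class.</p>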
            </sec>
            <sec>
               <st>
                  <p>Homologous regions only</p>
               </st>
               <p>Another testing dataset was generated directly from the database search reports obtained before. They were converted into FASTA-formatted alignments of merged HSPs (partial hits called <it>high-scoring segment pairs</it>) using MVIEW <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. These merged HSPs can be viewed as the concatenation of the homologous regions of the full hit sequences. Some of the queries contained internal repeats that do not give rise to a single concatenation; these sequences were left out. By automatically transferring the visual classification of the corresponding full-length-sequence-based samples above to the merged HSP data, a set of 8,066 classified samples was obtained (5,518 samples of the paralog and 2,548 samples of the isoform class).</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Training of the SVM</p>
            </st>
            <p>Training an SVM classifier requires the choice of a kernel function; here the radial basis function (RBF) kernel was used. For SVMs with RBF kernels, two parameters, <it>C </it>and <it>g</it>, need to be determined. <it>C </it>is the penalty for training errors and is part of the soft margin concept of SVMs, which allows a number of (misclassified) training samples to be located within the margin. Thus, a certain amount of noise is tolerated in the training data. The parameter <it>g </it>describes the width of the Gaussian bells of the radial basis function of the RBF kernel</p>
            <p>
               <m:math name="1471-2105-7-110-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>K</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:msub>
                           <m:mi>x</m:mi>
                           <m:mi>i</m:mi>
                        </m:msub>
                        <m:mo>,</m:mo>
                        <m:msub>
                           <m:mi>x</m:mi>
                           <m:mi>j</m:mi>
                        </m:msub>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mi>exp</m:mi>
                        <m:mo>&#8289;</m:mo>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mrow>
                              <m:mo>&#8722;</m:mo>
                              <m:mi>g</m:mi>
                              <m:msup>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>&#8214;</m:mo>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>x</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msub>
                                          <m:mo>&#8722;</m:mo>
                                          <m:msub>
                                             <m:mi>x</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:msub>
                                       </m:mrow>
                                       <m:mo>&#8214;</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                                 <m:mn>2</m:mn>
                              </m:msup>
                           </m:mrow>
                           <m:mo>)</m:mo>
                        </m:mrow>
                        <m:mo>,</m:mo>
                        <m:mi>g</m:mi>
                        <m:mo>></m:mo>
                        <m:mn>0</m:mn>
                        <m:mo>,</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>5</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGlbWscqGGOaakcqWG4baEdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeyypa0JagiyzauMaeiiEaGNaeiiCaa3aaeWaaeaacqGHsislcqWGNbWzdaqbdaqaaiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeyOeI0IaemiEaG3aaSbaaSqaaiabdQgaQbqabaaakiaawMa7caGLkWoadaahaaWcbeqaaiabikdaYaaaaOGaayjkaiaawMcaaiabcYcaSiabdEgaNjabg6da+iabicdaWiabcYcaSiaaxMaacaWLjaWaaeWaaeaacqaI1aqnaiaawIcacaGLPaaaaaa@539F@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>x</it><sub><it>i</it></sub>, <it>x</it><sub><it>j </it></sub>denote feature vectors of training samples. We scanned for best parameter values in a specific range using a so-called grid-search.</p>
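            <p>Equation (5) can be written out directly. The following Python sketch evaluates the kernel for two feature vectors; the function name is hypothetical and the example vectors are made-up values, not data from this study.</p>

```python
import math


def rbf_kernel(x_i, x_j, g):
    """RBF kernel of Equation (5): exp(-g * ||x_i - x_j||^2), with g > 0."""
    if not g > 0:
        raise ValueError("g must be positive")
    squared_distance = sum((p - q) ** 2 for p, q in zip(x_i, x_j))
    return math.exp(-g * squared_distance)


if __name__ == "__main__":
    # Two made-up three-feature vectors, for illustration only.
    print(rbf_kernel((0.9, 0.2, 0.8), (0.4, 0.01, 0.5), g=6.25))
```

            <p>The kernel equals 1 for identical vectors and decays towards 0 as their distance grows; larger values of <it>g </it>make the decay faster, that is, the Gaussian bells narrower.</p>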
            <p>The grid-search was carried out for parameter <it>C </it>ranging from 10<sup>-5 </sup>to 10<sup>15 </sup>and for parameter <it>g </it>ranging from 10<sup>-15</sup> to 10<sup>3</sup>, following <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Both parameters were scanned using 10 steps per axis on a logarithmic scale, resulting in a total of 100 grid points. The grid-search was based on a cross-validation procedure intended to prevent overfitting of the classifier on the canonical training dataset, again following <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. We split the training dataset into <it>n </it>= 4 subsets (cf. Figure <figr fid="F6">6</figr>, each subset is denoted by an encircled number). For each point of the grid evaluated by the grid-search, each subset is held out in turn: <it>n</it>-1 of the <it>n </it>subsets are used to train a classifier using the kernel parameters <it>C </it>and <it>g </it>corresponding to the point in the grid. The resulting classifier is then tested on the one remaining subset of the training dataset, and accuracy is recorded. The overall accuracy of the SVM classifier trained at a specific point of the grid is then the mean over all <it>n </it>accuracies. The maximum accuracy was identified and the corresponding kernel parameters <it>C </it>and <it>g </it>were noted. New parameter ranges (10<sup>-1 </sup>to 10<sup>3 </sup>for <it>C</it>, 10<sup>-2 </sup>to 10<sup>3 </sup>for <it>g</it>) were then used to run a second grid-search with higher resolution in the area in which maximum accuracy was found. Inside this new grid, the point of maximum mean accuracy (99.58%) was chosen and its corresponding kernel parameters (<it>C </it>= 12.5; <it>g </it>= 6.25) were noted. Final training was then carried out on the entire canonical training dataset, resulting in a final SVM classifier. To assess its performance, true positive/true negative (TP/TN) and false positive/false negative (FP/FN) ratios were tallied and accuracy</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>SVM training process</p>
               </caption>
               <text>
                  <p><b>SVM training process. </b>The complete dataset generated by visual inspection was split into two parts, yielding a canonical training dataset of 3,802 samples and a canonical testing dataset of 3,802 samples, each consisting of an equal number of isoform and paralog instances. The canonical training dataset was again split into four subsets (denoted by numbers in circles) and submitted to the grid-search procedure. The resulting classifier was then tested on the canonical testing dataset.</p>
               </text>
               <graphic file="1471-2105-7-110-6"/>
            </fig>
            <p>
               <m:math name="1471-2105-7-110-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mfrac>
                           <m:mrow>
                              <m:mi>T</m:mi>
                              <m:mi>P</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>T</m:mi>
                              <m:mi>N</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>T</m:mi>
                              <m:mi>P</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>T</m:mi>
                              <m:mi>N</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>F</m:mi>
                              <m:mi>P</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>F</m:mi>
                              <m:mi>N</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                        </m:mfrac>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>6</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdsfaujabdcfaqjabgUcaRiabdsfaujabd6eaobqaaiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdsfaujabd6eaojabgUcaRiabdAeagjabdcfaqjabgUcaRiabdAeagjabd6eaojabcMcaPaaacaWLjaGaaCzcamaabmaabaGaeGOnaydacaGLOaGaayzkaaaaaa@4395@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>and precision (cf. <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>)</p>
            <p>
               <m:math name="1471-2105-7-110-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mfrac>
                           <m:mrow>
                              <m:mi>T</m:mi>
                              <m:mi>P</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>T</m:mi>
                              <m:mi>P</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>F</m:mi>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                        </m:mfrac>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>7</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdsfaujabdcfaqbqaaiabcIcaOiabdsfaujabdcfaqjabgUcaRiabdAeagjabdcfaqjabcMcaPaaacaWLjaGaaCzcamaabmaabaGaeG4naCdacaGLOaGaayzkaaaaaa@3A0B@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>were calculated.</p>
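            <p>The grid-search with four-fold cross-validation and the measures of Equations (6) and (7) can be sketched in Python as follows. This is not the code used in this study: the training routine is passed in as a hypothetical callback <it>train_fn</it>, the placement of the 10 grid values per axis (evenly spaced exponents, end points included) and the interleaved assignment of samples to subsets are assumptions, and all names are chosen for illustration.</p>

```python
def log_grid(low_exp, high_exp, steps):
    """Return `steps` values evenly spaced on a log10 scale, end points included."""
    width = (high_exp - low_exp) / (steps - 1)
    return [10 ** (low_exp + k * width) for k in range(steps)]


def fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx); subset f holds every n_folds-th sample."""
    for f in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, test


def accuracy(tp, tn, fp, fn):
    """Equation (6)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Equation (7)."""
    return tp / (tp + fp)


def grid_search(samples, labels, train_fn, n_folds=4):
    """Return (best mean accuracy, C, g) over the coarse 10 x 10 grid.

    train_fn(X, y, C, g) is a hypothetical callback that must return a
    function predict(x) giving the predicted label (+1 or -1).
    """
    best = None
    for C in log_grid(-5, 15, 10):
        for g in log_grid(-15, 3, 10):
            scores = []
            for train, test in fold_indices(len(samples), n_folds):
                predict = train_fn([samples[i] for i in train],
                                   [labels[i] for i in train], C, g)
                hits = sum(predict(samples[i]) == labels[i] for i in test)
                scores.append(hits / len(test))
            mean = sum(scores) / n_folds
            if best is None or mean > best[0]:
                best = (mean, C, g)
    return best
```

            <p>A second, finer search would call the same loop with the narrower exponent ranges given above. Any SVM library can be wrapped as <it>train_fn</it>; the sketch deliberately leaves that choice open.</p>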
         </sec>
         <sec>
            <st>
               <p>Training of the radial basis function network</p>
            </st>
            <p>To compare the performance of the SVM classifier to another machine learning technique, a neural network classifier (more precisely a radial basis function (RBF) network <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>) was trained on the canonical training dataset. The implementation of RBF networks with adaptive centers by <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> was used with default values (<it>number of centers </it>3; <it>regularization </it>10<sup>-4</sup>;<it> iterations for optimization </it>10).</p>
         </sec>
         <sec>
            <st>
               <p>Assessing performance of classifiers based on three features by jackknife resampling</p>
            </st>
            <p>To estimate the mean accuracy of a classifier and the standard error of that mean, the classifier was trained and tested on random splits of the canonical samples derived from GenBank, using a 100-fold jackknife resampling process <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. More specifically, the canonical training and testing datasets described above were concatenated, yielding a dataset of 7,604 samples with 3,802 samples of each class. For each jackknife run, 1,901 samples of each class were chosen randomly from this dataset for training, while the remaining samples were used for testing. The mean accuracy and the standard error of the mean (&#963;/<m:math name="1471-2105-7-110-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mi>N</m:mi></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabd6eaobWcbeaaaaa@2DEC@</m:annotation></m:semantics></m:math>, where &#963; denotes the standard deviation and <it>N </it>the number of jackknife resamplings) were calculated.</p>
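            <p>The resampling scheme described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: all function names, the train_and_score callback (which stands in for training a classifier and returning its accuracy on the test split) and the use of the sample standard deviation for &#963; are assumptions made for the example.</p>
            <p>
```python
import math
import random
import statistics


def jackknife_split(pos, neg, n_train_per_class, rng):
    """Draw n_train_per_class samples of each class for training; the rest are used for testing."""
    train, test = [], []
    for samples in (pos, neg):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        train.extend(shuffled[:n_train_per_class])
        test.extend(shuffled[n_train_per_class:])
    return train, test


def mean_and_sem(values):
    """Return the mean and the standard error of the mean, sigma / sqrt(N)."""
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    return statistics.mean(values), sigma / math.sqrt(len(values))


def run_jackknife(pos, neg, n_train_per_class, n_runs, train_and_score, seed=0):
    """Repeat the random split n_runs times and summarise the resulting accuracies."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        train, test = jackknife_split(pos, neg, n_train_per_class, rng)
        # train_and_score is a hypothetical callback: it trains a classifier
        # on `train` and returns its accuracy on `test`.
        accuracies.append(train_and_score(train, test))
    return mean_and_sem(accuracies)
```
            </p>
            <p>With the numbers given above, pos and neg would each hold 3,802 samples, n_train_per_class would be 1,901 and n_runs would be 100.</p>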
            <p>For each jackknife run, an SVM, an RBF network and a linear classifier were trained using all three features of the corresponding training dataset. The SVM classifier was trained with the kernel parameters derived by the grid-search on the canonical training dataset (<it>C </it>= 12.5; <it>g </it>= 6.25). The RBF network was trained using default parameter values (<it>number of centers </it>3; <it>regularization </it>10<sup>-4</sup>; <it>iterations for optimization </it>10). For the linear classifier, threshold calculation by line-sweeping (cf. supplemental Figure S1 [see <supplr sid="S1">Additional file 1</supplr>]) cannot be carried out exhaustively in feasible time when three features are used, since the search space is cubic. Therefore, we estimated lower and upper bounds and searched for the optimum thresholds within these bounds. To be precise, based on visual inspection (cf. Figure <figr fid="F2">2</figr>), only the following feature ranges were searched by line-sweeping on the training datasets:</p>
            <p>1. Sequence similarity: 0.01...0.05</p>
            <p>2. Inverse CBIN count: 0.01...0.03</p>
            <p>3. Fraction of consecutive matches and mismatches: 0.90...0.94</p>
            <p>Although this line-sweeping is not exhaustive, the best combination of thresholds found in the reduced search space should represent the optimum; these thresholds are 0.01832 for sequence similarity, 0.01613 for the inverse CBIN count and 0.92827 for the fraction of consecutive matches and mismatches.</p>
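            <p>A bounded threshold search of this kind can be sketched as follows. The sketch is illustrative only. The decision rule in predict (a sample is called positive if every feature reaches its threshold), the evenly spaced grid and all function names are assumptions, since the actual line-sweeping procedure is described in the supplemental material and not reproduced here. The cost grows with the third power of the number of grid steps, which is why restricting the ranges matters.</p>
            <p>
```python
import itertools


def linspace(lo, hi, steps):
    """Return `steps` evenly spaced values from lo to hi inclusive (steps >= 2)."""
    width = (hi - lo) / (steps - 1)
    return [lo + i * width for i in range(steps)]


def predict(sample, thresholds):
    """Hypothetical decision rule: positive if every feature reaches its threshold."""
    return all(value >= t for value, t in zip(sample, thresholds))


def sweep(samples, labels, bounds, steps):
    """Try every threshold combination inside `bounds`; return (best_accuracy, best_thresholds)."""
    grids = [linspace(lo, hi, steps) for lo, hi in bounds]
    best = (-1.0, None)
    for thresholds in itertools.product(*grids):
        correct = sum(predict(s, thresholds) == y for s, y in zip(samples, labels))
        accuracy = correct / len(samples)
        if accuracy > best[0]:
            best = (accuracy, thresholds)
    return best


# Feature ranges as listed above; the four samples are toy data, not study data.
bounds = [(0.01, 0.05), (0.01, 0.03), (0.90, 0.94)]
samples = [(0.040, 0.020, 0.930), (0.045, 0.025, 0.935),
           (0.015, 0.012, 0.905), (0.012, 0.011, 0.908)]
labels = [True, True, False, False]
best_accuracy, best_thresholds = sweep(samples, labels, bounds, steps=5)
print(best_accuracy, best_thresholds)
```
            </p>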
            <p>Accuracy, precision, the true positive/true negative (TP/TN) ratios and the false positive/false negative (FP/FN) ratios were averaged over all jackknife runs, and the standard error of the mean was calculated for each of these measures (cf. Table <tblr tid="T1">1</tblr> and Figure <figr fid="F4">4</figr>).</p>
         </sec>
         <sec>
            <st>
               <p>Classifiers based on fewer features, thresholds and parameters; measuring performance</p>
            </st>
            <p>The performance of the classifiers based on three features was compared to that of classifiers based on a reduced set of only one or two features, using the canonical training and testing datasets only. In contrast to the studies using resampling, all linear classifiers were derived by exhaustive line-sweeping, that is, by an exhaustive search for the best combination of thresholds, or for the best single threshold in the case of one feature. The thresholds for the linear classifiers are listed in the supplementary data, Tables S1 and S2 [see <supplr sid="S1">Additional file 1</supplr>]. The kernel parameters (cf. <b>Methods</b>, section <it>Training of the SVM</it>) for SVM classifiers based on canonical training datasets are listed in Table S3 of the supplementary data [see <supplr sid="S1">Additional file 1</supplr>]. The performance (in terms of accuracy) of all classifiers was measured on the canonical testing datasets and the homologous-regions-only datasets and is given in Table <tblr tid="T3">3</tblr>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>MS drafted the manuscript. AS, PC and SL participated in data analysis and helped in drafting the manuscript, together with GF who supervised the research. AS provided all data related to the <it>Xenopus </it>ESTs. MS implemented and tested the IsoSVM tool and carried out all classifier training and testing procedures. GF and AS helped with testing and application of the tool.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We would like to thank the Interdisciplinary Center for Clinical Research, M&#252;nster, for partial funding of this work, Karl Grosse-Vogelsang, Integrated Functional Genomics, M&#252;nster, for maintaining and providing access to a 16-node x86-cluster, enabling the calculation of countless grid-searches in acceptable time, and Martin Eisenacher, Integrated Functional Genomics, M&#252;nster, for advice on statistics and linear classifiers.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <aug>
               <au>
                  <snm>Alberts</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Raff</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Walter</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Molecular Biology of the Cell</source>
            <publisher>Garland Publishing, New York</publisher>
            <edition>4</edition>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Alternative splicing: increasing diversity in the proteomic world</p>
            </title>
            <aug>
               <au>
                  <snm>Graveley</snm>
                  <fnm>BR</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>2</issue>
            <fpage>100</fpage>
            <lpage>107</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11173120</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Listening to silence and understanding nonsense: exonic mutations that affect splicing</p>
            </title>
            <aug>
               <au>
                  <snm>Cartegni</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Chew</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Krainer</snm>
                  <fnm>AR</fnm>
               </au>
            </aug>
            <source>Nature Reviews Genetics</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>285</fpage>
            <lpage>298</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11967553</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Alternative RNA splicing in the nervous system</p>
            </title>
            <aug>
               <au>
                  <snm>Grabowski</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Black</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Prog Neurobiol</source>
            <pubdate>2001</pubdate>
            <volume>65</volume>
            <issue>3</issue>
            <fpage>289</fpage>
            <lpage>308</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11473790</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Distinguishing homologous from analogous proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Fitch</snm>
                  <fnm>WM</fnm>
               </au>
            </aug>
            <source>Syst Zool</source>
            <pubdate>1970</pubdate>
            <volume>19</volume>
            <issue>2</issue>
            <fpage>99</fpage>
            <lpage>113</lpage>
            <xrefbib>
               <pubid idtype="pmpid">5449325</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>ASAP: The Alternative Splicing Annotation Project</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Atanelov</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Modrek</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Xing</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>101</fpage>
            <lpage>105</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165476</pubid>
                  <pubid idtype="pmpid" link="fulltext">12519958</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>EASED: Extended Alternatively Spliced EST Database</p>
            </title>
            <aug>
               <au>
                  <snm>Pospisil</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Herrmann</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bortfeldt</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Reich</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D70</fpage>
            <lpage>D74</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308870</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681361</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>ASD: the Alternative Splicing Database</p>
            </title>
            <aug>
               <au>
                  <snm>Thanaraj</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Stamm</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Riethoven</snm>
                  <fnm>JJM</fnm>
               </au>
               <au>
                  <snm>Le Texier</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Muilu</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D64</fpage>
            <lpage>D69</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308764</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681360</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A training algorithm for optimal margin classifiers</p>
            </title>
            <aug>
               <au>
                  <snm>Boser</snm>
                  <fnm>BE</fnm>
               </au>
               <au>
                  <snm>Guyon</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>5th Annual ACM Workshop COLT</source>
            <pubdate>1992</pubdate>
            <fpage>144</fpage>
            <lpage>152</lpage>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Support vector networks</p>
            </title>
            <aug>
               <au>
                  <snm>Cortes</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>1995</pubdate>
            <volume>20</volume>
            <fpage>273</fpage>
            <lpage>297</lpage>
         </bibl>
         <bibl id="B11">
            <aug>
               <au>
                  <snm>Sch&#246;lkopf</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Smola</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Learning with Kernels</source>
            <publisher>MIT Press, Cambridge, MA</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>An Introduction to Kernel-based Learning Algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>M&#252;ller</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Mika</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>R&#228;tsch</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tsuda</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Sch&#246;lkopf</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Neural Networks</source>
            <pubdate>2001</pubdate>
            <volume>12</volume>
            <issue>2</issue>
            <fpage>181</fpage>
            <lpage>201</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Support vector machine applications in bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Byvatov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Appl Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>2</volume>
            <issue>2</issue>
            <fpage>67</fpage>
            <lpage>77</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15130823</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Sequence information for the splicing of human pre-mRNA identified by support vector machine classification</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>XH</fnm>
               </au>
               <au>
                  <snm>Heller</snm>
                  <fnm>KA</fnm>
               </au>
               <au>
                  <snm>Hefter</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Leslie</snm>
                  <fnm>CS</fnm>
               </au>
               <au>
                  <snm>Chasin</snm>
                  <fnm>LA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>12</issue>
            <fpage>2637</fpage>
            <lpage>2650</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403805</pubid>
                  <pubid idtype="pmpid" link="fulltext">14656968</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Mismatch string kernels for discriminative protein classification</p>
            </title>
            <aug>
               <au>
                  <snm>Leslie</snm>
                  <fnm>CS</fnm>
               </au>
               <au>
                  <snm>Eskin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Weston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Noble</snm>
                  <fnm>WS</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>4</issue>
            <fpage>467</fpage>
            <lpage>476</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14990442</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Accurate identification of alternatively spliced exons using support vector machine</p>
            </title>
            <aug>
               <au>
                  <snm>Dror</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Sorek</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>7</issue>
            <fpage>897</fpage>
            <lpage>901</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15531599</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Making Large-Scale SVM Learning Practical</p>
            </title>
            <aug>
               <au>
                  <snm>Joachims</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Advances in Kernel Methods &#8211; Support Vector Learning</source>
            <publisher>MIT Press</publisher>
            <editor>Sch&#246;lkopf B, Burges C, Smola A</editor>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B18">
            <title>
               <p>BLASTing proteomes, yielding phylogenies</p>
            </title>
            <aug>
               <au>
                  <snm>Fuellen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Spitzer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cullen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lorkowski</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>In Silico Biol</source>
            <pubdate>2003</pubdate>
            <volume>3</volume>
            <issue>3</issue>
            <fpage>313</fpage>
            <lpage>319</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12954093</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Correspondence of function and phylogeny of ABC proteins based on an automated analysis of 20 model protein data sets</p>
            </title>
            <aug>
               <au>
                  <snm>Fuellen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Spitzer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cullen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lorkowski</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>61</volume>
            <issue>4</issue>
            <fpage>888</fpage>
            <lpage>899</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16254912</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Fast learning in networks of locally-tuned processing units</p>
            </title>
            <aug>
               <au>
                  <snm>Moody</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Darken</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Neural Computation</source>
            <pubdate>1989</pubdate>
            <volume>1</volume>
            <issue>2</issue>
            <fpage>281</fpage>
            <lpage>294</lpage>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Database resources of the National Center for Biotechnology Information</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Barrett</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Benson</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>SH</fnm>
               </au>
               <au>
                  <snm>Canese</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>DiCuccio</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Edgar</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Federhen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Helmberg</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kenton</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Khovayko</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Maglott</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Ostell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Pontius</snm>
                  <fnm>JU</fnm>
               </au>
               <au>
                  <snm>Pruitt</snm>
                  <fnm>KD</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schriml</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Sequeira</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Sherry</snm>
                  <fnm>ST</fnm>
               </au>
               <au>
                  <snm>Sirotkin</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Starchenko</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Suzek</snm>
                  <fnm>TO</fnm>
               </au>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Yaschenko</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>D39</fpage>
            <lpage>D45</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">540016</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608222</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>XenDB: full length cDNA prediction and cross species mapping in Xenopus laevis</p>
            </title>
            <aug>
               <au>
                  <snm>Sczyrba</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Beckstette</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Brivanlou</snm>
                  <fnm>AH</fnm>
               </au>
               <au>
                  <snm>Giegerich</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Altmann</snm>
                  <fnm>CR</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>123</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1261260</pubid>
                  <pubid idtype="pmpid" link="fulltext">16162280</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Replacing Suffix Trees with Enhanced Suffix Arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Abouelhoda</snm>
                  <fnm>MI</fnm>
               </au>
               <au>
                  <snm>Kurtz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ohlebusch</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Journal of Discrete Algorithms</source>
            <pubdate>2004</pubdate>
            <volume>2</volume>
            <fpage>53</fpage>
            <lpage>86</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Vmatch</p>
            </title>
            <url>http://www.vmatch.de</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>CAP3: A DNA sequence assembly program</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Madan</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <issue>9</issue>
            <fpage>868</fpage>
            <lpage>877</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310812</pubid>
                  <pubid idtype="pmpid" link="fulltext">10508846</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Sch&#228;ffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>The human ATP-binding cassette (ABC) transporter superfamily</p>
            </title>
            <aug>
               <au>
                  <snm>Dean</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rzhetsky</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Allikmets</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <issue>7</issue>
            <fpage>1156</fpage>
            <lpage>1166</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11435397</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>IsoSVM</p>
            </title>
            <url>http://www.uni-muenster.de/Bioinformatics/services/isosvm/</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>A practical guide to support vector classification</p>
            </title>
            <aug>
               <au>
                  <snm>Hsu</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Chang</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <url>http://www.csie.ntu.edu.tw/~cjlin/</url>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Neural Network FAQ</p>
            </title>
            <aug>
               <au>
                  <snm>Sarle</snm>
                  <fnm>WS</fnm>
               </au>
            </aug>
            <source>Periodic posting to the Usenet newsgroup comp.ai.neural-nets</source>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B31">
            <title>
               <p>MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform</p>
            </title>
            <aug>
               <au>
                  <snm>Katoh</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Misawa</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kuma</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Miyata</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Nucl Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>3059</fpage>
            <lpage>3066</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">135756</pubid>
                  <pubid idtype="pmpid" link="fulltext">12136088</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>The rapid generation of mutation data matrices from protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1992</pubdate>
            <volume>8</volume>
            <issue>3</issue>
            <fpage>275</fpage>
            <lpage>282</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1633570</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>A Gentle Guide to Multiple Alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Fuellen</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Complexity International</source>
            <pubdate>1997</pubdate>
            <volume>4</volume>
            <url>http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>MView: a web-compatible database search or multiple alignment viewer</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>NP</fnm>
               </au>
               <au>
                  <snm>Leroy</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1998</pubdate>
            <volume>14</volume>
            <issue>4</issue>
            <fpage>380</fpage>
            <lpage>381</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9632837</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Qian</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Luscombe</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>15</issue>
            <fpage>1917</fpage>
            <lpage>1926</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14555624</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Soft Margins for AdaBoost</p>
            </title>
            <aug>
               <au>
                  <snm>R&#228;tsch</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Onoda</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>M&#252;ller</snm>
                  <fnm>KR</fnm>
               </au>
            </aug>
            <source>Mach Learn</source>
            <pubdate>2001</pubdate>
            <volume>42</volume>
            <issue>3</issue>
            <fpage>287</fpage>
            <lpage>320</lpage>
         </bibl>
         <bibl id="B37">
            <title>
               <p>A leisurely look at the bootstrap, the jackknife, and cross-validation</p>
            </title>
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Gong</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>The American Statistician</source>
            <pubdate>1983</pubdate>
            <volume>37</volume>
            <fpage>36</fpage>
            <lpage>48</lpage>
         </bibl>
      </refgrp>
   </bm>
</art>
