<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-410</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Features generated for computational splice-site prediction correspond to functional elements</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Dogan</snm>
               <mnm>Islamaj</mnm>
               <fnm>Rezarta</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>rezarta@cs.umd.edu</email>
            </au>
            <au id="A2">
               <snm>Getoor</snm>
               <fnm>Lise</fnm>
               <insr iid="I1"/>
               <email>getoor@cs.umd.edu</email>
            </au>
            <au id="A3">
               <snm>Wilbur</snm>
               <fnm>W John</fnm>
               <insr iid="I2"/>
               <email>wilbur@mail.nih.gov</email>
            </au>
            <au id="A4">
               <snm>Mount</snm>
               <mi>M</mi>
               <fnm>Stephen</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <email>smount@umd.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Computer Science Department, University of Maryland, College Park, MD 20742, USA</p>
            </ins>
            <ins id="I2">
               <p>National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA</p>
            </ins>
            <ins id="I3">
               <p>Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA</p>
            </ins>
            <ins id="I4">
               <p>Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>410</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/410</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17958908</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-410</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>19</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>24</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>24</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Dogan et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as the identification of tissue-specific regulatory elements, polyadenylation sites, and promoters.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The analysis of genome sequences to discover the location and structure of genes is an increasingly important task. However, a complete and accurate description of the gene structure on the basis of sequence alone remains a difficult problem <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. In eukaryotic organisms, sequences known as <it>introns </it>are removed from precursors to mRNA in the complex process of splicing. The boundaries between introns and exons are called <it>splice sites </it>and the identification of these positions poses a particular challenge. The adjacent nucleotides on intron boundaries comprise two different consensus sequences for the 5' (donor) site and 3' (acceptor) site. Position-specific scoring matrices that reflect the contribution of each base at each position can be compiled from thousands of annotated splice sites. Any given sequence can then be evaluated on the degree of agreement with the consensus matrix. However, similar sequences within introns and exons that fit the scoring matrices are observed at a very high frequency, and information at the 5' splice site, branch site, and 3' splice site is insufficient to accurately predict splicing outcomes <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. These facts suggest that other factors must also play a role and help the complex of RNA and proteins identify real splice sites <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>In many cases, the discrimination between splice sites and other sequences can be optimized using machine-learning methods. A machine-learning algorithm uses a set of known examples (the training set) and a set of characteristics or <it>features </it>describing the training set to construct a model of the data. The learned model is evaluated by testing its accuracy on a held-out test set. Different machine-learning algorithms, such as Markov models or neural networks, have been used to improve splice-site prediction <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. GeneSplicer, described by Pertea et al. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, and MaxEnt, described by Yeo and Burge <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, are examples of machine-learning algorithms applied to splice-site prediction. GeneSplicer uses Markov modelling techniques, in addition to Maximal Dependency Decomposition analysis, and MaxEnt uses a maximum entropy approach to rank and select "constraints" (features) for splice-site prediction.</p>
         <p>An important input to any machine-learning algorithm is the choice of features describing the dataset. A challenge is how to determine the best set of features for the prediction task at hand. This is especially true for sequence data. One solution is to use automated feature-selection techniques that identify useful or informative features from a large collection of features.</p>
         <p>Feature-selection techniques have been used extensively in machine-learning problems, and they have been receiving more attention in the computational biology community. For example, Liu and Wong used feature-selection methods in their prediction of translation-initiation sites <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Degroeve et al. <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp> used feature subset selection, combined with support vector machines, to predict splice sites. Zhang et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> employed a recursive feature-selection technique, based on support vector machines, to identify sequence information that distinguishes real exons from pseudo exons.</p>
         <p>In earlier work, we developed a feature-generation algorithm (FGA) for sequence classification <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The algorithm used the four nucleotides of the DNA alphabet, {A, C, G and T}, and their positions in the sequence to construct descriptive features. FGA started with these basic features and built more-complex features in an iterative fashion. These features were: groups of consecutive nucleotides, groups of not-necessarily-adjacent nucleotides, and nucleotides or groups of nucleotides associated with particular positions or a range of relative positions in the sequence. Because the feature space explored was very large, FGA iteratively reduced the size of the feature set by eliminating features according to various feature-selection methods. Then, the final set of features that we obtained became input for the learning algorithm.</p>
         <p>The learning algorithm that we used (C-Modified Least Squares, CMLS) is a max-margin classifier similar to support vector machines (Zhang and Oles, <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>). Relative to support vector machines, the CMLS algorithm exhibits a faster convergence, resulting in shorter training times. Our generated features, in combination with the CMLS classifier, resulted in two very effective splice-site prediction models for acceptor <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and donor sites. We illustrate the performance of the FGA model for acceptor and donor splice-site prediction in Figure <figr fid="F1">1A</figr> and Figure <figr fid="F1">1B</figr>. Here we also include a comparison with the performances of GeneSplicer and MaxEnt splice-site prediction models. The FGA classifier has been made generally available as a webserver <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Receiver Operating Characteristic (ROC) Curve Analysis for FGA, GeneSplicer and MaxEnt for Acceptor (A) and Donor (B) Splice-Site Prediction</p>
            </caption>
            <text>
               <p><b>Receiver Operating Characteristic (ROC) Curve Analysis for FGA, GeneSplicer and MaxEnt for Acceptor (A) and Donor (B) Splice-Site Prediction</b>. The true positive rate (TP/(TP+FN)) is plotted versus the false positive rate (FP/(FP+TN)). We show the sensitivity values ranging from 50% to 95%. When the score threshold for each method is adjusted such that 5% of the true sites are missed (sensitivity is 95%), for acceptor splice-site prediction, MaxEnt accepts 10.48% of the false sites, GeneSplicer 5.80% and FGA only 2.49%, and, for donor splice-site prediction, MaxEnt accepts 6.61% of the false sites, GeneSplicer 6.40% and FGA only 3.30%. These results are computed on the human dataset of the GeneSplicer team, which contains 1,115 pre-mRNA sequences.</p>
            </text>
            <graphic file="1471-2105-8-410-1"/>
         </fig>
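         <p>The comparison in Figure 1 fixes the sensitivity and reports the fraction of false sites that score above the resulting threshold. A minimal sketch of that calculation is given here, assuming that each method yields one numeric score per candidate site and that higher scores indicate true sites. The function name and its arguments are hypothetical and are not part of FGA, GeneSplicer or MaxEnt.</p>
         <p>
```python
import math


def false_positive_rate_at_sensitivity(pos_scores, neg_scores, sensitivity=0.95):
    """Return (threshold, false positive rate) at a fixed sensitivity.

    Hypothetical helper: pos_scores are scores of true sites, neg_scores are
    scores of false sites; both lists are assumed to be non-empty.
    """
    ranked = sorted(pos_scores, reverse=True)
    # Number of true sites that must score at or above the threshold.
    # The small offset guards against floating-point round-up in the product.
    needed = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]
    # False positives are false sites accepted at this threshold: FP / (FP + TN).
    fp = sum(1 for score in neg_scores if score >= threshold)
    tn = len(neg_scores) - fp
    return threshold, fp / (fp + tn)
```
         </p>
         <p>Calling the function once per method with the same sensitivity gives values that can be compared directly, in the manner of the percentages quoted in the legend of Figure 1.</p>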
         <p>In this paper, we explore the knowledge-discovery power of our algorithm by taking a closer look at the generated features. We present examples of the observed feature groups and describe our efforts to detect biological signals that may be important for the splicing process. We find that the features generated for computational splice-site prediction include known functional elements, and we present evidence that these features provide previously unknown information about some aspects of these splicing signals.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Sequences and splice-site neighborhood</p>
            </st>
            <p>For these experiments we considered canonical splice sites. We explored a splice-site neighborhood of 80 nucleotides upstream and 80 nucleotides downstream of the consensus AG or GT dinucleotides, with a total sequence length of 162 nucleotides. The sequence alphabet was composed of four different nucleotides A, C, G and T, and their individual positions were measured relative to the annotated splice site.</p>
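            <p>The neighborhood described above can be sketched as a simple window extraction: 80 nucleotides upstream, the two consensus nucleotides, and 80 nucleotides downstream, for 162 nucleotides in total. The following sketch assumes a 0-based index for the first nucleotide of the AG or GT dinucleotide; the function and parameter names are hypothetical.</p>
            <p>
```python
FLANK = 80  # nucleotides kept on each side of the consensus dinucleotide


def splice_site_window(sequence, dinucleotide_start, kind, flank=FLANK):
    """Return the flank + 2 + flank window around a canonical site, or None.

    Hypothetical helper: dinucleotide_start is the 0-based index of the first
    nucleotide of AG (kind == "acceptor") or GT (kind == "donor").
    """
    consensus = {"acceptor": "AG", "donor": "GT"}[kind]
    seq = sequence.upper()
    start = dinucleotide_start - flank
    end = dinucleotide_start + 2 + flank
    site = seq[dinucleotide_start:dinucleotide_start + 2]
    # Keep only canonical sites whose full neighborhood lies inside the sequence.
    if start >= 0 and len(seq) >= end and site == consensus:
        return seq[start:end]
    return None
```
            </p>
            <p>With the default flank of 80, every returned window has length 162, which matches the total sequence length stated above.</p>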
         </sec>
         <sec>
            <st>
               <p>Description of generated feature sets</p>
            </st>
            <p>Here we summarize the specific steps used to generate the composite feature sets used in our analysis. These features are significantly more complex than the features previously considered in the literature. The algorithm, FGA, is described formally in the Methods section and in <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. To generate a composite feature set we need to specify an initial set of features, an appropriate construction method, and a fast feature-selection method. To prepare the initial sets of features, we started with the position-specific <it>k</it>-mer sets for <it>k </it>from <it>3 </it>to <it>6</it>. The numbers of potential features for these feature sets are, respectively, <it>10,240, 40,960, 163,840, and 655,360</it>. For each of these sets, the <it>Information Gain </it>feature-selection method was used to select the top scoring <it>5,000 </it>features. These sets constituted our initial feature sets for the construction algorithm. As described in Methods, the feature-construction method expanded each of these sets by adding one position-specific nucleotide in an unconstrained position. After the construction step, we again used Information Gain to evaluate each of the features in the constructed set. Then we evaluated each feature according to a logistic scheme, taking into account the distance between the newly added nucleotide and the original feature, preferring features for which the distance was smaller. After the feature selection step, the top scoring <it>5,000 </it>features were selected. These sets constituted the input sets for the next iteration. We ran the algorithm and generated features with at most 10 conjunct nucleotides in different positions in the composite feature sets. For each set of features we built a separate splice-site prediction model using the CMLS <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> classification algorithm. Table <tblr tid="T1">1</tblr> summarizes the splice-site prediction performance for each of these feature sets. Some of these sets performed better than others, but in our analysis we explored all the sets for the purpose of knowledge discovery.</p>
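            <p>Two parts of this procedure lend themselves to a short illustration. The feature counts quoted above each equal 4 to the power <it>k </it>multiplied by 160, which is consistent with 160 possible start positions per <it>k</it>-mer; that reading is our inference from the numbers. The sketch also ranks binary features by the textbook definition of Information Gain, which may differ in detail from the implementation described in Methods. All function names and the count layout are hypothetical.</p>
            <p>
```python
import math


def potential_kmer_features(k, start_positions=160):
    """Count position-specific k-mers: 4**k patterns at each start position."""
    return 4 ** k * start_positions


def entropy(pos, neg):
    """Binary class entropy, in bits, from positive and negative counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present_pos, present_neg, absent_pos, absent_neg):
    """Class entropy minus the entropy remaining once the feature is known."""
    n_present = present_pos + present_neg
    n_absent = absent_pos + absent_neg
    total = n_present + n_absent
    prior = entropy(present_pos + absent_pos, present_neg + absent_neg)
    conditional = 0.0
    if n_present > 0:
        conditional += n_present / total * entropy(present_pos, present_neg)
    if n_absent > 0:
        conditional += n_absent / total * entropy(absent_pos, absent_neg)
    return prior - conditional


def top_features(counts, limit=5000):
    """Rank features by Information Gain and keep the best `limit`.

    counts maps a feature name to the hypothetical 4-tuple
    (present_pos, present_neg, absent_pos, absent_neg).
    """
    ranked = sorted(counts, key=lambda name: information_gain(*counts[name]),
                    reverse=True)
    return ranked[:limit]
```
            </p>
            <p>For <it>k </it>from 3 to 6, the first function returns 10,240, 40,960, 163,840 and 655,360, the counts given in the text.</p>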
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Individual classification performances of FGA-generated feature sets for 3' (A-KmerX) and 5' (D-KmerX) splice sites.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>A-3mer</p>
                     </c>
                     <c ca="left">
                        <p>86.46</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer</p>
                     </c>
                     <c ca="left">
                        <p>84.92</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer1</p>
                     </c>
                     <c ca="left">
                        <p>84.16</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer1</p>
                     </c>
                     <c ca="left">
                        <p>77.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer2</p>
                     </c>
                     <c ca="left">
                        <p>77.01</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer2</p>
                     </c>
                     <c ca="left">
                        <p>69.10</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer3</p>
                     </c>
                     <c ca="left">
                        <p>69.42</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer3</p>
                     </c>
                     <c ca="left">
                        <p>63.11</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer4</p>
                     </c>
                     <c ca="left">
                        <p>63.30</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer4</p>
                     </c>
                     <c ca="left">
                        <p>56.66</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer5</p>
                     </c>
                     <c ca="left">
                        <p>56.84</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer5</p>
                     </c>
                     <c ca="left">
                        <p>49.23</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer6</p>
                     </c>
                     <c ca="left">
                        <p>49.50</p>
                     </c>
                     <c ca="left">
                        <p>A-4mer6</p>
                     </c>
                     <c ca="left">
                        <p>41.02</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-3mer7</p>
                     </c>
                     <c ca="left">
                        <p>41.22</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer</p>
                     </c>
                     <c ca="left">
                        <p>80.60</p>
                     </c>
                     <c ca="left">
                        <p>A-6mer</p>
                     </c>
                     <c ca="left">
                        <p>68.64</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer1</p>
                     </c>
                     <c ca="left">
                        <p>69.20</p>
                     </c>
                     <c ca="left">
                        <p>A-6mer1</p>
                     </c>
                     <c ca="left">
                        <p>61.72</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer2</p>
                     </c>
                     <c ca="left">
                        <p>62.74</p>
                     </c>
                     <c ca="left">
                        <p>A-6mer2</p>
                     </c>
                     <c ca="left">
                        <p>54.65</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer3</p>
                     </c>
                     <c ca="left">
                        <p>56.25</p>
                     </c>
                     <c ca="left">
                        <p>A-6mer3</p>
                     </c>
                     <c ca="left">
                        <p>47.19</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer4</p>
                     </c>
                     <c ca="left">
                        <p>49.08</p>
                     </c>
                     <c ca="left">
                        <p>A-6mer4</p>
                     </c>
                     <c ca="left">
                        <p>39.62</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A-5mer5</p>
                     </c>
                     <c ca="left">
                        <p>40.51</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer</p>
                     </c>
                     <c ca="left">
                        <p>86.79</p>
                     </c>
                     <c ca="left">
                        <p>D-4mer</p>
                     </c>
                     <c ca="left">
                        <p>85.21</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer1</p>
                     </c>
                     <c ca="left">
                        <p>83.45</p>
                     </c>
                     <c ca="left">
                        <p>D-4mer1</p>
                     </c>
                     <c ca="left">
                        <p>81.14</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer2</p>
                     </c>
                     <c ca="left">
                        <p>80.31</p>
                     </c>
                     <c ca="left">
                        <p>D-4mer2</p>
                     </c>
                     <c ca="left">
                        <p>70.47</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer3</p>
                     </c>
                     <c ca="left">
                        <p>70.08</p>
                     </c>
                     <c ca="left">
                        <p>D-4mer3</p>
                     </c>
                     <c ca="left">
                        <p>55.38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer4</p>
                     </c>
                     <c ca="left">
                        <p>56.06</p>
                     </c>
                     <c ca="left">
                        <p>D-4mer4</p>
                     </c>
                     <c ca="left">
                        <p>44.77</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-3mer5</p>
                     </c>
                     <c ca="left">
                        <p>42.97</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-5mer</p>
                     </c>
                     <c ca="left">
                        <p>83.64</p>
                     </c>
                     <c ca="left">
                        <p>D-6mer</p>
                     </c>
                     <c ca="left">
                        <p>75.03</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-5mer1</p>
                     </c>
                     <c ca="left">
                        <p>77.20</p>
                     </c>
                     <c ca="left">
                        <p>D-6mer1</p>
                     </c>
                     <c ca="left">
                        <p>66.68</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-5mer2</p>
                     </c>
                     <c ca="left">
                        <p>57.42</p>
                     </c>
                     <c ca="left">
                        <p>D-6mer2</p>
                     </c>
                     <c ca="left">
                        <p>43.31</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>D-5mer3</p>
                     </c>
                     <c ca="left">
                        <p>38.09</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>FGA-generated feature sets for splice sites and their individual performances at splice-site prediction. Each value reported is an average precision (positive predictive value, TP/(TP+FP)) over 11 values of recall (sensitivity, TP/(TP+FN)), 0%, 10%, 20% ... and 100%, and is the result of a three-fold cross validation. All the features in these feature sets extend along the whole splice-site neighbourhood [-80, 80] that we study.</p>
               </tblfn>
            </tbl>
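            <p>The 11-point average precision reported in Table 1 can be illustrated with a short sketch. It assumes the common interpolated form, in which the precision at each recall level is the highest precision observed at that recall or any higher recall; the exact interpolation used for Table 1 is not stated here, so this is an approximation of the measure rather than the original evaluation code. The function name is hypothetical.</p>
            <p>
```python
def eleven_point_average_precision(scores, labels):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0.

    scores: classifier scores, higher meaning more likely a true site.
    labels: 1 for a true site, 0 for a false site.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []  # (recall, precision) after each ranked example
    tp = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / rank))
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Highest precision reached at this recall level or beyond.
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)
```
            </p>
            <p>A classifier that ranks every true site above every false site scores 1.0 under this definition; multiplying by 100 gives values on the scale used in Table 1.</p>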
            <p>In what follows, we use the shorthand notation <it>S</it>-<it>kMERn</it>[<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>] to describe the composite feature subsets that we studied. In this notation, <it>S </it>&#8712; {<it>A</it>, <it>D</it>} stands for acceptor (<it>A</it>) or donor (<it>D</it>) splice sites, <it>kMER </it>stands for the number of consecutive position-specific nucleotide features in the initial set, <it>n </it>is the number of additional conjuncts, and [<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>] denotes the interval from position <it>p</it><sub>1 </sub>to position <it>p</it><sub>2 </sub>in the sequence. For example, <it>A</it>-3<it>mer</it>3[20,40] is a subset of acceptor splice-site features. These features were generated from the initial set of position-specific 3-mer features and were obtained after three FGA iterations, each time adding a new nucleotide in an unconstrained position within the specified interval. The sequence positions associated with each of the features in this subset were from the coding region 20 to 40 nucleotides downstream of the acceptor splice site.</p>
            <p>Continuing with our definitions, we say that two composite features match if they share the same nucleotide pattern, starting at different positions. For example, let 4<it>mer</it>[1,10] = {<it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>c</it><sub>3 </sub><it>t</it><sub>4</sub>, <it>a</it><sub>6 </sub><it>g</it><sub>7 </sub><it>c</it><sub>8 </sub><it>t</it><sub>9</sub>} be the subset of composite 4-mer features from the interval [1,10], where <it>a</it><sub>1 </sub>denotes nucleotide <it>a </it>at the first sequence position. In this case, the features <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>c</it><sub>3 </sub><it>t</it><sub>4 </sub>and <it>a</it><sub>6 </sub><it>g</it><sub>7 </sub><it>c</it><sub>8 </sub><it>t</it><sub>9 </sub>are two <it>matching composite features</it>. A composite feature subset may contain several matching features that differ only in the starting position within the specified interval. We represent a set of such occurrences with an <it>interval-feature pattern</it>, e.g. <it>a</it><sub><it>i </it></sub><it>g</it><sub><it>i</it>+1 </sub><it>c</it><sub><it>i</it>+2 </sub><it>t</it><sub><it>i</it>+3</sub>. An interval-feature pattern is the nucleotide pattern shared among the matching composite features, and the number of interval occurrences of a feature pattern is the number of matching composite features it represents. We use the notation <it>S</it>-<it>kMERn</it>[<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>]* to denote the set of all interval-feature patterns for the subset <it>S</it>-<it>kMERn</it>[<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>]. For the above example, given the set of features 4<it>mer</it>[1,10] = {<it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>c</it><sub>3 </sub><it>t</it><sub>4</sub>, <it>a</it><sub>6 </sub><it>g</it><sub>7 </sub><it>c</it><sub>8 </sub><it>t</it><sub>9</sub>}, the set of interval-feature patterns is 4<it>mer</it>[1,10]* = {<it>a</it><sub><it>i </it></sub><it>g</it><sub><it>i</it>+1 </sub><it>c</it><sub><it>i</it>+2 </sub><it>t</it><sub><it>i</it>+3</sub>}. The number of occurrences for the pattern <it>a</it><sub><it>i </it></sub><it>g</it><sub><it>i</it>+1 </sub><it>c</it><sub><it>i</it>+2 </sub><it>t</it><sub><it>i</it>+3 </sub>in the given feature set is two.</p>
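            <p>The definitions above can be sketched in a few lines of code. The sketch assumes that a composite feature is stored as a collection of (position, nucleotide) pairs and that positions are consecutive integers; this representation and the function names are hypothetical and are chosen only for illustration.</p>
            <p>
```python
from collections import Counter


def interval_pattern(feature):
    """Re-express a composite feature relative to its first position.

    feature: iterable of (position, nucleotide) pairs, e.g. a1 g2 c3 t4 is
    ((1, "a"), (2, "g"), (3, "c"), (4, "t")).
    Returns a tuple of (offset, nucleotide) pairs, i.e. a_i g_i+1 c_i+2 t_i+3.
    """
    ordered = sorted(feature)
    first = ordered[0][0]
    return tuple((position - first, nucleotide) for position, nucleotide in ordered)


def interval_feature_patterns(features):
    """Map each interval-feature pattern to its number of interval occurrences."""
    return Counter(interval_pattern(feature) for feature in features)


# The worked example from the text: 4mer[1,10] holds two matching features.
example = [
    ((1, "a"), (2, "g"), (3, "c"), (4, "t")),
    ((6, "a"), (7, "g"), (8, "c"), (9, "t")),
]
patterns = interval_feature_patterns(example)
```
            </p>
            <p>For this example, <it>patterns </it>holds a single interval-feature pattern with two occurrences, as in the text. Two features match exactly when they map to the same key.</p>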
            <p>In our analysis, features were ranked according to the weight assigned to them by the classification algorithm. We used the WebLogo program <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> to draw frequency plots. We plotted histograms and used basic k-means clustering algorithms and edit-distance measures to cluster the features into groups. Here we list some of our findings and illustrate them with our features.</p>
         </sec>
         <sec>
            <st>
               <p>Knowledge discovery: generated features capture biological signals</p>
            </st>
            <p>What kinds of biological signals do these generated features capture? Examples of positive signals that we might expect to find in a typical pre-mRNA include the branch site, the pyrimidine-rich region close to the acceptor splice site, splice-site consensus signals themselves, and exonic splicing enhancers. In addition, it is likely that sequence elements associated with the coding sequence were present among our features. However, we found that FGA performed quite well (the 11-point average precisions for acceptor and donor splice sites were, respectively, 83.33% and 64.52%) on the recognition of splice sites flanked by non-coding exons (data not shown).</p>
         </sec>
         <sec>
            <st>
               <p>The Branch Site interval</p>
            </st>
            <p>The mammalian branch-site signal is difficult to describe because it is degenerate and shows very low levels of purifying selection <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. In order to investigate the branch-point signal, we examined composite features of 6 nucleotides that start in the interval from 40 to 20 nucleotides upstream from the acceptor splice site (and therefore extend from -40 to -15). Our current feature set for this purpose was <it>A</it>-3<it>mer</it>3[-40,-20]. The subset contained 346 selected features.</p>
            <p>Table <tblr tid="T2">2</tblr> shows the top-scoring 20 features in their exact position with respect to the annotated acceptor site, which is found 15 nucleotides downstream of the interval shown. Each feature is listed, ranked by the weight assigned by the CMLS classification algorithm. A large number of positional features in this feature set captured the branch-point signal. In fact, of the 30 features that had weights above 0.1 in this set, all but 5 contained either CTRA or at least five pyrimidines. In absolute numbers, 97 individual features of this set matched the branch-point consensus TNCTRAC <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and 158 features were pyrimidine-rich. The rest of the features were assigned negative weights. The negatively weighted features mostly comprised a G-rich signal. Of those, 44 features matched the pattern AGG and the others were A-rich (see supplemental data, additional file <supplr sid="S1">1</supplr>).</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>FGA identified features that contribute to </b>Table <tblr tid="T2">2</tblr>. <b>(Table2_features_A3mer3.txt)</b>. A text file with the complete list of features associated with the branch-point interval [-40,-20], from the feature set A-3mer3. The features are ranked according to the absolute value of their assigned weight. The top-scoring 20 features of this list are shown in Table <tblr tid="T2">2</tblr>.</p>
               </text>
               <file name="1471-2105-8-410-S1.txt">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Top scoring features in branch site interval</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>FGA A-3mer3 [-40,-20] features</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Weight</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------------ctgacc-------</p>
                     </c>
                     <c ca="center">
                        <p>0.1800</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-----------ctgacc--------</p>
                     </c>
                     <c ca="center">
                        <p>0.1678</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----------------ctgacc---</p>
                     </c>
                     <c ca="center">
                        <p>0.1488</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----------ctgacc---------</p>
                     </c>
                     <c ca="center">
                        <p>0.1453</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------cctgac------</p>
                     </c>
                     <c ca="center">
                        <p>0.1417</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---------------cctgac----</p>
                     </c>
                     <c ca="center">
                        <p>0.1382</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----------------tgaccc---</p>
                     </c>
                     <c ca="center">
                        <p>0.1371</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--------ctgacc-----------</p>
                     </c>
                     <c ca="center">
                        <p>0.1370</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-----------------cctgac--</p>
                     </c>
                     <c ca="center">
                        <p>0.1368</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------ctgacc-------------</p>
                     </c>
                     <c ca="center">
                        <p>0.1359</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--------------ctgacc-----</p>
                     </c>
                     <c ca="center">
                        <p>0.1358</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------------tctctc</p>
                     </c>
                     <c ca="center">
                        <p>0.1303</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------------------ccttct-</p>
                     </c>
                     <c ca="center">
                        <p>0.1283</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------------cttttc</p>
                     </c>
                     <c ca="center">
                        <p>0.1281</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------------------cttttt-</p>
                     </c>
                     <c ca="center">
                        <p>0.1281</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------ctcacc------</p>
                     </c>
                     <c ca="center">
                        <p>0.1254</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-----------ctcacc--------</p>
                     </c>
                     <c ca="center">
                        <p>0.1219</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---------------ctgact----</p>
                     </c>
                     <c ca="center">
                        <p>0.1206</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-----------cctgac--------</p>
                     </c>
                     <c ca="center">
                        <p>0.1202</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------------tccctc</p>
                     </c>
                     <c ca="center">
                        <p>0.1200</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The 20 top-scoring <it>A </it>-3<it>mer</it>3 [-40,-20] features (i.e. composite features that start in the interval between -40 and -20, derived using FGA from a seed of trimers) are all related to either the branch-site consensus or the pyrimidine tract.</p>
               </tblfn>
            </tbl>
            <p>Table <tblr tid="T3">3</tblr> illustrates a subset of <it>A-</it>3<it>mer</it>3[-40,-20]* interval-feature patterns. Each listed pattern represents at least five matching composite features, differing only in the starting position in this interval. The number of interval occurrences is also given and an average weight is computed for each interval-feature pattern from the individual CMLS weights assigned to the distinct matching composite features during training. We grouped these patterns into three categories: 1) nine interval-feature patterns matching the branch-site consensus, 2) two pyrimidine-rich interval-feature patterns, and 3) two negatively weighted purine-rich interval-feature patterns.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Identified interval-feature patterns in the branch-point interval</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="center">
                        <p>
                           <b>A-3mer3 [-40,-20]*</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Interval occurrences</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Average Weight</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Total occurrences</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Total Range</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--cctgac--</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0.096</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>[-34,-16]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---ctgacc-</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0.131</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>[-33,-16]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---ctgact-</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>0.082</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>[-32,-16]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-ccctga---</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.083</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>[-32,-19]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--gctgac--</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.083</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>[-34,-18]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--tctgac--</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.083</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>[-32,-18]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----tgaccc</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0.089</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>[-32,-16]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--actgac--</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0.059</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>[-33,-13]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---ctgatg-</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>[-36, 18]</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-cccctc---</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.065</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>[-35, 0]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---cctctc-</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0.049</p>
                     </c>
                     <c ca="center">
                        <p>22</p>
                     </c>
                     <c ca="center">
                        <p>[-36, 0]</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--gggagg--</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>-0.041</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>[-34, 14]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>--aaaaaa--</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>-0.028</p>
                     </c>
                     <c ca="center">
                        <p>84</p>
                     </c>
                     <c ca="center">
                        <p>[-50, 80]</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The first column shows the interval-feature patterns in the branch-point interval [-40,-20]. The second column shows the number of individual occurrences of each pattern at different positions within the specified interval. The average assigned weight is given in the third column. For comparison, the fourth column gives the total number of occurrences of the pattern in the complete neighbourhood ([-82, 80]), and the last column shows the narrowest range that contains all of these occurrences.</p>
               </tblfn>
            </tbl>
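            <p>As a rough illustration of how the occurrence counts and average weights in Table 3 could be derived from position-specific features, the following Python sketch parses dash-padded rows in the layout of Table 2. It assumes that the leftmost column of a row corresponds to position -40; the helper names <it>parse_row</it> and <it>summarize</it> are hypothetical. Only five rows from Table 2 are used, so the resulting numbers are not expected to reproduce Table 3, which is based on the full feature set.</p>

```python
from collections import defaultdict

def parse_row(row, left=-40):
    """Return (start_position, kmer) for a dash-padded row laid out as in Table 2.

    Assumption: the first character of the row sits at position `left`.
    """
    kmer = row.strip("-")
    leading_dashes = len(row) - len(row.lstrip("-"))
    return left + leading_dashes, kmer

def summarize(rows_with_weights, left=-40):
    """Map each pattern to (interval occurrences, average weight)."""
    weights = defaultdict(list)
    for row, weight in rows_with_weights:
        _, kmer = parse_row(row, left)
        weights[kmer].append(weight)
    return {kmer: (len(w), sum(w) / len(w)) for kmer, w in weights.items()}

# A few rows copied from Table 2 (not the full 346-feature set).
table2_rows = [
    ("------------ctgacc-------", 0.1800),
    ("-----------ctgacc--------", 0.1678),
    ("-------------cctgac------", 0.1417),
    ("---------------cctgac----", 0.1382),
    ("-------------------tctctc", 0.1303),
]
summary = summarize(table2_rows)
for kmer, (count, mean_weight) in summary.items():
    print(kmer, count, round(mean_weight, 4))
```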
            <p>Table <tblr tid="T4">4</tblr> lists all the position-specific occurrences of GCTGAC in the [-80, -1] interval. These features matched the branch-site consensus and they were assigned positive weights by the classification algorithm. The distribution of scores for this one hexamer suggests a preferred location for the branch site A at -30 to -20. Many independent observations with related features (<it>e.g</it>. CTAAC) indicated a similar region. For example, in Figure <figr fid="F2">2</figr>, we present a comparison of four tetramer features present in the <it>A</it>-3<it>mer</it>1[-60,-5] set. It is apparent from the distribution of these features that positions -27 through -16 are preferred for the branch site A. This observation agrees well with experimental results <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Individual position-specific GCTGAC features</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Features in exact position wrt AG consensus</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Weight</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-----------gctgac---------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.114</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----------------gctgac----------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.114</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---------------gctgac-----------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.105</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>----------gctgac----------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.082</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------------gctgac--------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.077</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>------gctgac--------------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.074</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>---------gctgac-----------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-------------gctgac-------------------AG</p>
                     </c>
                     <c ca="center">
                        <p>0.062</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>A summary of position-specific GCTGAC features and their respective weights assigned by the CMLS classifier from the <it>A </it>-3<it>mer</it>3 [-80,-1] feature set.</p>
               </tblfn>
            </tbl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Weight distribution comparison for pairs of tetramers CTGA, CTAA and TTTT, CCTT</p>
               </caption>
               <text>
                  <p><b>Weight distribution comparison for pairs of tetramers CTGA, CTAA and TTTT, CCTT</b>. The distribution of CMLS weights for four tetramers from <it>A</it>-3<it>mer</it>1 [-60,-5] is shown graphically. Note that the distributions of scores for CTGA and CTAA are similar and sharply focused around the peak that would place the branch A at position -24. Note that the distributions of TTTT and CCTT correspond to the well-known pyrimidine tract with the additional information that C is preferred to T at positions -15 through -11, where a peak of scores for CCTT coincides with a group of negative values for TTTT. There are no occurrences of these four tetramers in this feature set upstream of the region shown.</p>
               </text>
               <graphic file="1471-2105-8-410-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>The acceptor splice-site (pyrimidine-tract) interval</p>
               </caption>
               <text>
                  <p><b>The acceptor splice-site (pyrimidine-tract) interval</b>. Frequency plot sequence logos for the positively and negatively weighted features in the pyrimidine-tract interval, <it>A</it>-5<it>mer</it>1 [-20,-1], (Figure 3a and Figure 3b), compared with frequency distribution of the training acceptor and non-acceptor sequences in the same interval (Figure 3c and Figure 3d). The positive features frequency plot corresponds to the acceptor splice-site consensus, which is also illustrated with the true acceptor sequences frequency plot. The negative features frequency plot reveals an AG-rich element.</p>
               </text>
               <graphic file="1471-2105-8-410-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>The acceptor splice-site (pyrimidine-tract) interval</p>
            </st>
            <p>Also shown in Figure <figr fid="F2">2</figr> is the distribution of TTTT and CCTT in this interval. Note that this distribution is broader than the distribution of branch-site tetramers. In addition, there is a region (-16 to -12) where the scores assigned to TTTT become negative and tetramers containing C have maximal scores. Similar peaks are observed for CTTT, TCTT, TTCT and TTTC (see supplemental data, additional file <supplr sid="S2">2</supplr>).</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Features that contribute to </b>Figure <figr fid="F2">2</figr><b> and other features that show similar behaviour (tetramers-of-figure2.txt). A text file with the complete list of selected features from feature set A-3mer1 [-60,-5].</b></p>
               </text>
               <file name="1471-2105-8-410-S2.txt">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>In order to further investigate the characteristics of the upstream region close to the acceptor splice site, we also examined the feature set <it>A</it>-5<it>mer</it>[-20,-1]. There were more than 2,000 selected features in this subset. We note that a large number of features were selected in this set, indicating stronger potential signals close to the splice site. Based on the weight assigned by the CMLS algorithm, we divided these features into two groups: positively weighted features and negatively weighted ones. In Figure <figr fid="F3">3</figr>, we used the WebLogo program to draw a frequency plot of the two groups of features. The annotated acceptor site is shown in the figure with the consensus dinucleotide AG.</p>
            <p>One interpretation from these plots is that the generated features are capturing the pyrimidine tract, and that they are scanning along the sequence for the exact AG dinucleotide consensus where the true acceptor site is located. The difference between the two frequency plots for positively and negatively weighted features is striking. Figure <figr fid="F3">3a</figr> shows that the presence of the CT-rich feature is very important in this interval and Figure <figr fid="F3">3b</figr> shows that the presence of an AG-rich element is an indicator of a non-splice sequence. The frequency plot for the positively weighted features (Fig. <figr fid="F3">3a</figr>) is very similar to the acceptor splice-site consensus itself. However, our features do not simply reflect the nucleotide frequencies seen at true sites. Figures <figr fid="F3">3c</figr> and <figr fid="F3">3d</figr> show the frequency distribution of the true acceptor sequences and non-acceptor sequences in the training dataset. The frequency distribution of the non-acceptor sequences in our dataset in the pyrimidine-tract interval (Fig. <figr fid="F3">3d</figr>) is different from that of the negatively weighted features in the <it>A</it>-5<it>mer</it>[-20,-1] feature set (Fig. <figr fid="F3">3b</figr>). In other words, our features were better than frequency data alone at discriminating true splice sites. To illustrate this difference, we used the frequency distribution matrices of these data to discriminate the true splice sites, achieving an 11ptAvg precision of 40.1%. On the other hand, when we trained a CMLS classifier on the FGA feature set, it achieved an 11ptAvg precision of 80.6% for the same task.</p>
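            <p>For readers unfamiliar with the 11ptAvg measure, the sketch below computes an 11-point average precision from classifier scores. It assumes the standard interpolated definition from information retrieval (the mean of the best precision attainable at recall levels 0.0, 0.1, ..., 1.0); the function name and the toy inputs are illustrative only and are not taken from our evaluation code.</p>

```python
def eleven_point_average_precision(scores, labels):
    """Interpolated 11-point average precision for binary labels (1 = true site)."""
    # Rank examples from the highest to the lowest classifier score.
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Recall and precision after each position in the ranking.
    points = []
    hits = 0
    for rank, label in enumerate(ranked, start=1):
        hits += label
        points.append((hits / total_pos, hits / rank))

    # Interpolated precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)

# Toy example: both true sites are ranked above both non-sites.
perfect = eleven_point_average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Toy example: the single true site is ranked below a non-site.
worse = eleven_point_average_precision([0.9, 0.1], [0, 1])
```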
            <p>Exploring the pyrimidine-tract interval further, we selected another feature set, which was characterized by composite positional features containing 7 nucleotides in different positions, <it>A</it>-6<it>mer</it>1[-20,-1]. We made a list of the features, and we identified clusters of similar features, using the <it>k</it>-means clustering algorithm with the edit-distance similarity measure. Figure <figr fid="F4">4</figr> shows example features from each group.</p>
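            <p>The edit-distance measure used for clustering can be illustrated with a short Python sketch. It assumes the usual unit-cost Levenshtein distance (insertion, deletion and substitution each cost 1); the exact cost scheme and the way the distance was combined with <it>k</it>-means are not specified here, so only the distance itself is shown.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit-cost insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

# Two branch-site hexamers from Table 2 differ by one substitution.
d_sub = edit_distance("ctgacc", "ctcacc")
# A one-nucleotide shift of the same motif costs one deletion plus one insertion.
d_shift = edit_distance("cctgac", "ctgacc")
```

            <p>Under this measure, features that are shifted copies of one motif stay close to each other, which is one plausible reason why such features fall into the same cluster.</p>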
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Clusters of negative features of the pyrimidine-tract interval</p>
               </caption>
               <text>
                  <p><b>Clusters of negative features of the pyrimidine-tract interval</b>. Examples of the individual features for two clusters of features and the assigned CMLS weight for each feature from the feature set <it>A</it>-6<it>mer</it>1 [-20,-1]. The presence of the AG dinucleotide upstream of the annotated 3' splice site, in the pyrimidine-tract interval, is not preferred. All these features have negative weights.</p>
               </text>
               <graphic file="1471-2105-8-410-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>GGG motifs near the 5' splice site</p>
            </st>
            <p>In order to investigate the characteristics of introns near the 5' splice site, we explored the intron downstream of the 5' splice site, using a number of parameters. In each case, GGG and GGGG motifs were common. For example, the <it>D</it>-3<it>mer</it>[6,64] set included 54 positively ranked occurrences of GGG and 4 negatively ranked occurrences. A plot of scores versus position for GGG and GGGG is provided in Figure <figr fid="F5">5A</figr> and Figure <figr fid="F5">5B</figr>, showing that this motif scores positively in the intron downstream of 5' splice sites but negatively in the flanking exon. GGG likewise dominates <it>D</it>-3<it>mer</it>[-80,-40]. A number of papers have reported a role for GGG and GGGG motifs in splicing <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. Recognition of these motifs has been attributed to the U1 snRNP <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and hnRNP H <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>G-rich features in the donor-site interval</p>
               </caption>
               <text>
                  <p><b>G-rich features in the donor-site interval</b>. Weighted histogram for all the GGG (A) and GGGG (B) features in the donor-site interval selected from the feature set <it>D</it>-3<it>mer</it>1 [-30,-45] (A) and <it>D</it>-4<it>mer</it>1 [-30,-45] (B). These features are not preferred upstream of the donor site, but they are favored in the downstream region.</p>
               </text>
               <graphic file="1471-2105-8-410-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>The donor splice-site interval</p>
            </st>
            <p>Next, we investigate the characteristics of the donor splice site. Sample clusters, similar to those created for the acceptor site, are shown in Figure <figr fid="F6">6</figr>. The first two sequence logos, Figure <figr fid="F6">6a</figr> and Figure <figr fid="F6">6b</figr>, show the frequency plot of the positively and negatively weighted groups of features for the set <it>D</it>-6<it>mer</it>[-10,10]. The donor splice-site consensus sequence is MAGGTRAGT (where M is A or C and R is A or G). The next two plots, Figure <figr fid="F6">6c</figr> and Figure <figr fid="F6">6d</figr>, show the frequency plot for the same interval based on the true donor and non-donor sequences in the training dataset. Once again, the sequence logo of the positively weighted features resembles the logo of the nucleotide frequency of the positive data, but important differences are apparent, especially at positions on the periphery of the region shown.</p>
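            <p>The consensus notation above can be made concrete with a minimal sketch that expands the two ambiguity codes used in the text (M and R) into a regular expression. The function names are hypothetical and chosen for illustration. Exact matching against the consensus is much stricter than the weighted features discussed here, so this only illustrates the notation, not the prediction method.</p>

```python
import re

# IUPAC nucleotide codes needed for the consensus quoted in the text.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]",  # M is A or C
    "R": "[AG]",  # R is A or G
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC_CLASSES[code] for code in consensus))


def matches_consensus(sequence, consensus="MAGGTRAGT"):
    """Return True when the whole sequence agrees with the consensus."""
    return consensus_to_regex(consensus).fullmatch(sequence) is not None
```

            <p>Under these assumptions, MAGGTRAGT expands to the pattern [AC]AGGT[AG]AGT, the same form given in the legend of Figure 6.</p>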
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>The donor splice-site interval</p>
               </caption>
               <text>
                  <p><b>The donor splice-site interval</b>. Frequency plot sequence logos for the positively and negatively weighted features in the donor-site interval, <it>D</it>-6<it>mer</it>[-10,10] (Figure 6a and Figure 6b), compared with frequency distribution of the training donor and non-donor sequences in the same interval (Figure 6c and Figure 6d). The positively weighted features capture the donor-site consensus ([A|C]AGGT [A|G]AGT).</p>
               </text>
               <graphic file="1471-2105-8-410-6"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Exonic Splicing Enhancers (ESEs) and Exonic Splicing Silencers (ESSs)</p>
            </st>
            <p>We also compared our generated features to published work on Exonic Splicing Enhancers (ESEs) and Exonic Splicing Silencers (ESSs). ESEs and ESSs are short oligonucleotide sequences located in the exonic region that affect splicing. The presence of ESE sequences in the exonic region results in the enhancement of the recognition of the nearby splice sites. The presence of the ESS sequences, on the other hand, suppresses nearby splicing events. These regulatory signals have been studied experimentally (reviewed in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>) and computational methods have been built to find them <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>.</p>
            <p>We considered the set of distinct hexamers in the flanking exon interval, for both acceptor and donor sites, by computing interval features of the region of the sequence downstream from the annotated splice site for acceptor and upstream for donor. We divided this set of interval features into positively and negatively weighted sets. We compared these sets of hexamers (see supplemental data, additional file <supplr sid="S3">3</supplr>) with a list of experimentally identified ESEs and ESSs of mammalian and viral RNA <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. There are 61 experimentally determined ESE sequences listed by Zheng <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, including some that are identical but have different sources. The set of hexamers identified by our method overlapped 54 ESE sequences, covering 641 of 738 nucleotides, a coverage of 87%. Twenty-eight of these sequences were perfectly identified, with the hexamers covering all of their nucleotides. The ESS sequences were not recognized as well as the ESE ones. We provide these results as supplemental data (see supplemental data, additional file <supplr sid="S4">4</supplr>).</p>
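            <p>The coverage figure can be illustrated with a minimal sketch. It assumes that a nucleotide counts as covered when it lies inside at least one exact occurrence of a hexamer from the set; this is one reading of the comparison, not necessarily the exact procedure used. The function names and the toy sequences are hypothetical.</p>

```python
def covered_positions(sequence, hexamers, k=6):
    """Return the set of positions lying inside an exact match to any hexamer."""
    covered = set()
    for start in range(len(sequence) - k + 1):
        if sequence[start:start + k] in hexamers:
            covered.update(range(start, start + k))
    return covered


def coverage_fraction(sequences, hexamers, k=6):
    """Fraction of all nucleotides in the sequences that are covered."""
    total = sum(len(seq) for seq in sequences)
    hit = sum(len(covered_positions(seq, hexamers, k)) for seq in sequences)
    return hit / total if total else 0.0


# Toy data (hypothetical): the first sequence is fully covered, the second not at all.
toy_hexamers = {"GAAGAA", "AGAAGA"}
toy_sequences = ["GAAGAAGA", "TTTTTTTT"]
toy_coverage = coverage_fraction(toy_sequences, toy_hexamers)  # 8 of 16 nucleotides

# The figure reported in the text: 641 covered nucleotides out of 738.
reported_coverage = 641 / 738
```

            <p>With the reported counts, 641/738 rounds to the 87% stated above; a sequence is "perfectly identified" in this sketch when every one of its positions is covered.</p>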
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>FGA identified hexamers in acceptor splice-site prediction and donor splice-site prediction (FGA-hexamers.txt)</b>. A text file with the complete list of hexamers that our method indicates are likely to be ESEs or ESSs.</p>
               </text>
               <file name="1471-2105-8-410-S3.txt">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p><b>FGA-generated features produce a significant overlap with experimentally identified ESE sequences table in </b><abbrgrp><abbr bid="B22">22</abbr></abbrgrp><b> (ESE-ESS-overlap-sequences.xls)</b>. The first worksheet in the Excel file contains the table of experimentally identified ESE sequences in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and the overlap with the FGA identified hexamers from feature sets A-6mer [0,80] and D-6mer [2,82]. For each comparison an exact match is required. We compared the positively weighted hexamer sets against the ESE sequences, and the negatively weighted hexamer sets against ESS sequences. The second worksheet contains the overlap of the ESE sequences with the FGA identified hexamers that are not included in RescueESE, AstESR or ChPESE sets.</p>
               </text>
               <file name="1471-2105-8-410-S4.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Rescue-ESE <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, Fas-ESS <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> and ESR <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> are computational methods that are specifically tailored to identifying exonic signals that impact a splicing event. Rescue-ESE identified candidate exonic splicing enhancers in vertebrate exons based on their statistical features. This method identified a set of 238 hexamers, which we refer to as RescueESE. Fas-ESS started with a set of experimentally identified exonic splicing silencer sequences of length 10. It computationally derived a set of 176 hexamers which we refer to as FasESS. ESR identified exonic splicing regulator sequences based on conservation of synonymous nucleotides. This set contains 285 hexamers, which were not necessarily divided into enhancer and silencer categories. We refer to this set as AstESR. An additional method (Zhang and Chasin, <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>) compared <it>bona fide </it>exons with pseudo-exons in order to identify putative ESEs (PESEs) and putative ESSs (PESSs). The PESE set contains 2060 octamers and the PESS set contains 1018 octamers. There were 1701 unique hexamers in the PESE set, which we refer to as ChPESE, and there were 924 unique hexamers in the PESS set, which we refer to as ChPESS.</p>
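            <p>To compare octamer sets with hexamer sets, the octamers have to be reduced to hexamers. A minimal sketch of one plausible reduction follows: collect every distinct hexamer that occurs as a substring of some octamer. Whether the ChPESE and ChPESS sets were derived in exactly this way is an assumption, and the function name and toy octamers are hypothetical.</p>

```python
def unique_kmers(sequences, k=6):
    """Collect every distinct k-mer occurring as a substring of the sequences."""
    kmers = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmers.add(seq[start:start + k])
    return kmers


# Toy octamers (hypothetical); each octamer contributes three hexamers,
# and hexamers shared between octamers are counted once.
toy_octamers = ["ACGTACGT", "CGTACGTA"]
toy_hexamer_set = unique_kmers(toy_octamers)
```

            <p>Because overlapping octamers share hexamers, the resulting set can be much smaller than three times the number of octamers.</p>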
            <p>In order to be able to compare the FGA-generated features with the ESE hexamers identified by these methods, we looked at the different FGA sets of features that contained six consecutive position-specific nucleotides and were associated with the exonic regions. We looked at the feature sets generated for both acceptor and donor splice sites. We selected the features that belonged to the sequence interval 80 nucleotides downstream of annotated acceptor splice sites and 80 nucleotides upstream of annotated donor sites (bearing in mind that these intervals can contain some contribution from the adjacent intron that lies beyond the exon). Because FGA features were position-specific, for each set we computed the interval-feature patterns, thus obtaining a list of hexamers found in the exonic regions. We divided the features into positively weighted and negatively weighted sets denoted as <it>S</it>-<it>kMERn</it>[<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>]+ and <it>S</it>-<it>kMERn</it>[<it>p</it><sub>1</sub>, <it>p</it><sub>2</sub>]-, where <it>S </it>&#8712; {<it>A</it>, <it>D</it>} stands for acceptor and donor features respectively.</p>
            <p>We computed the overlap between each FGA-generated set of hexamers and each of the four published sets of exonic regulatory sequences (see supplemental data, additional file <supplr sid="S5">5</supplr>). We present the overlap for each pair of sets and the corresponding p-values in Table <tblr tid="T5">5</tblr>. The p-value shows the probability that a randomly selected set of hexamers, containing as many hexamer features as found by the FGA algorithm, has an overlap equal to or greater than the value given in the <it>Overlap </it>column in Table <tblr tid="T5">5</tblr>; this probability is calculated from the hypergeometric distribution. In Table <tblr tid="T5">5</tblr>, we have highlighted all the p-values less than 0.01 or greater than 0.99, which indicate significantly more or significantly less overlap, respectively, than expected by chance. All of these other sets have significant overlaps with our features, but the most significant are with ChPESE and ChPESS sets, perhaps because they were generated using methods similar to ours.</p>
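            <p>The upper-tail hypergeometric probability described above can be sketched as follows. The sketch assumes that the population is the set of all 4096 possible hexamers; the text does not state the population size, so values computed this way need not reproduce Table 5 exactly. The function and parameter names are hypothetical.</p>

```python
from math import comb


def overlap_p_value(population, published, drawn, overlap):
    """Probability that a random set of `drawn` hexamers shares at least
    `overlap` members with a fixed set of `published` hexamers, when both
    are taken from `population` possible hexamers (hypergeometric upper tail).
    """
    upper = min(published, drawn)
    favourable = sum(
        comb(published, i) * comb(population - published, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)


# Assumed population: all 4**6 possible hexamers.
ALL_HEXAMERS = 4 ** 6

# Example inputs taken from the first data row of Table 5:
# an FGA set of 313 hexamers sharing 34 with the 285 AstESR hexamers.
p_example = overlap_p_value(ALL_HEXAMERS, 285, 313, 34)
```

            <p>A value close to 1 from this function means that almost every random set would overlap at least as much as observed, which is why values above 0.99 are also highlighted in Table 5.</p>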
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p><b>FGA-generated features produce significant overlap with computationally identified lists of exonic splicing regulator signals </b><abbrgrp><abbr bid="B23">23</abbr><abbr bid="B26">26</abbr></abbrgrp><b> (candidate-ese-esr-overlap.txt)</b>. A text file with the list of FGA features overlapping with RescueESE and AstESR exonic splicing regulator lists.</p>
               </text>
               <file name="1471-2105-8-410-S5.txt">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>FGA-generated feature sets show significant overlap with ESE and ESS regulator signal sets.</p>
               </caption>
               <tblbdy cols="12">
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <it>FGAset</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>size</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>AstESR (285)</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>RescueESE (238)</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>ChPESE (1701)</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>FasESS (176)</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>ChPESS (924)</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>Overlap, P-value</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>Overlap, P-value</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>Overlap, P-value</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>Overlap, P-value</it>
                           </b>
                        </p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>
                              <it>Overlap, P-value</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-3mer3 [1,80]</p>
                     </c>
                     <c ca="center">
                        <p>313</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00514</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>0.09415</p>
                     </c>
                     <c ca="center">
                        <p>175</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2.09e-06</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0.877</p>
                     </c>
                     <c ca="center">
                        <p>73</p>
                     </c>
                     <c ca="center">
                        <p>0.5407</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-3mer3 [1,80]+</p>
                     </c>
                     <c ca="center">
                        <p>177</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00003</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00007</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>130</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.42e-18</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-3mer3 [1,80]-</p>
                     </c>
                     <c ca="center">
                        <p>136</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0.92089</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>43</p>
                     </c>
                     <c ca="center">
                        <p>0.9939</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0.129</p>
                     </c>
                     <c ca="center">
                        <p>59</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3.19e-08</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-4mer2 [1,80]</p>
                     </c>
                     <c ca="center">
                        <p>317</p>
                     </c>
                     <c ca="center">
                        <p>35</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00347</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>0.04319</p>
                     </c>
                     <c ca="center">
                        <p>177</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.96e-06</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0.887</p>
                     </c>
                     <c ca="center">
                        <p>72</p>
                     </c>
                     <c ca="center">
                        <p>0.6423</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-4mer2 [1,80]+</p>
                     </c>
                     <c ca="center">
                        <p>179</p>
                     </c>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00001</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00003</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>129</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2.74e-17</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-4mer2 [1,80]-</p>
                     </c>
                     <c ca="center">
                        <p>138</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0.92714</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.99999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>46</p>
                     </c>
                     <c ca="center">
                        <p>0.9819</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0.137</p>
                     </c>
                     <c ca="center">
                        <p>57</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>4.22e-07</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-5mer1 [1,80]</p>
                     </c>
                     <c ca="center">
                        <p>342</p>
                     </c>
                     <c ca="center">
                        <p>35</p>
                     </c>
                     <c ca="center">
                        <p>0.01147</p>
                     </c>
                     <c ca="center">
                        <p>27</p>
                     </c>
                     <c ca="center">
                        <p>0.05920</p>
                     </c>
                     <c ca="center">
                        <p>278</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.06e-08</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>0.812</p>
                     </c>
                     <c ca="center">
                        <p>70</p>
                     </c>
                     <c ca="center">
                        <p>0.9300</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-5mer1 [1,80]+</p>
                     </c>
                     <c ca="center">
                        <p>187</p>
                     </c>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00003</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00006</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>134</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.40e-17</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-5mer1 [1,80]-</p>
                     </c>
                     <c ca="center">
                        <p>155</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>0.96496</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.99915</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>59</p>
                     </c>
                     <c ca="center">
                        <p>0.8352</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0.221</p>
                     </c>
                     <c ca="center">
                        <p>54</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.000257</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-6mer [1,80]</p>
                     </c>
                     <c ca="center">
                        <p>465</p>
                     </c>
                     <c ca="center">
                        <p>54</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00006</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>27</p>
                     </c>
                     <c ca="center">
                        <p>0.53401</p>
                     </c>
                     <c ca="center">
                        <p>278</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.06e-08</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>0.799</p>
                     </c>
                     <c ca="center">
                        <p>91</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.9993</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-6mer [1,80]+</p>
                     </c>
                     <c ca="center">
                        <p>263</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00001</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.00899</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>165</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>6.61e-13</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0.943</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>A-6mer [1,80]-</p>
                     </c>
                     <c ca="center">
                        <p>202</p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>0.32994</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.99984</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>76</p>
                     </c>
                     <c ca="center">
                        <p>0.8907</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0.368</p>
                     </c>
                     <c ca="center">
                        <p>64</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.001374</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D-5mer1 [-80,-1]</p>
                     </c>
                     <c ca="center">
                        <p>64</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>0.01195</p>
                     </c>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.32e-23</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>60</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>5.59e-19</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0.941</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.9999</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D-5mer1 [-80,-1]+</p>
                     </c>
                     <c ca="center">
                        <p>56</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>0.01403</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2.47e-23</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>52</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>4.27e-16</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.9995</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D-6mer [-80,-1]</p>
                     </c>
                     <c ca="center">
                        <p>1052</p>
                     </c>
                     <c ca="center">
                        <p>126</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.44e-12</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>112</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.81e-13</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>613</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3.73e-37</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>183</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.9999</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D-6mer [-80,-1]+</p>
                     </c>
                     <c ca="center">
                        <p>701</p>
                     </c>
                     <c ca="center">
                        <p>93</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>2.28e-11</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>109</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>6.16e-28</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>482</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.02e-57</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>*</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>D-6mer [-80,-1]-</p>
                     </c>
                     <c ca="center">
                        <p>271</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>0.42504</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.99999</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>90</p>
                     </c>
                     <c ca="center">
                        <p>0.9985</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>0.022</p>
                     </c>
                     <c ca="center">
                        <p>106</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.54e-10</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>* p-value is very close to 1.</p>
                  <p>Number of features shared between the FGA-generated hexamer sets and the exon regulator hexamer sets, with the p-value giving the probability of observing this overlap or a greater one by chance. Highly statistically significant probabilities are highlighted in bold. The set <it>D </it>-3<it>mer</it>3 [-80,-1] did not contain position-specific hexamers, and the set <it>D </it>-4<it>mer</it>2 [-80,-1] contained only 3 position-specific hexamers, two of which overlapped with the RescueESE set.</p>
               </tblfn>
            </tbl>
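            <p>The probability of "this overlap or a greater overlap by chance" can be illustrated with a hypergeometric tail. The sketch below is only an assumption about how such a p-value may be computed; this section does not state which test was used, and the function name <it>overlap_p_value</it> and its arguments are hypothetical.</p>
            <p>
```python
from math import comb


def overlap_p_value(population, set_a, set_b, shared):
    """P(overlap >= shared) when set_b hexamers are drawn at random,
    without replacement, from `population` hexamers of which set_a
    belong to the first set (hypergeometric upper tail)."""
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, x) * comb(population - set_a, set_b - x)
        for x in range(shared, upper + 1)
    )
    return tail / comb(population, set_b)
```
            </p>
            <p>Under this assumption, the population would be the 4<sup>6</sup> = 4096 possible hexamers, and a tail probability close to 1 indicates an overlap no larger than expected by chance.</p>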
            <p>To address possible positional preferences <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> of ESE elements, we examined the distribution of short motifs corresponding to ESEs among our features. We observed a clear preference for exon sequences, but did not find a strong preference for a particular interval or position. For example, the GAAG tetramer is weighted positively throughout the exonic region, as illustrated in Figure <figr fid="F7">7A</figr> and Figure <figr fid="F7">7B</figr>. This signal was found at almost every position in the 80-nucleotide region, and the weights of the respective features were very similar, so we cannot specify a region or interval of preference. The one exception was the immediate neighborhood of the donor site (position -4), which reflects the splice-site consensus rather than an exonic splicing enhancer signal. In contrast, GAAG was a negatively weighted feature in the intronic region.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>The weight distribution of the ESE motif GAAG in the acceptor (A) and donor (B) splice-site neighborhood</p>
               </caption>
               <text>
                  <p><b>The weight distribution of the ESE motif GAAG in the acceptor (A) and donor (B) splice-site neighborhood</b>. The x-axis shows the acceptor splice-site neighborhood interval. The consensus dinucleotide AG location is marked with the red bars (positions -1 and -2) in Figure 7A. The consensus dinucleotide GT location is marked with the red bars (positions 1 and 2) in Figure 7B. For every occurrence of the feature GAAG in the set <it>A-</it>4<it>mer </it>[-80,80], we draw a bar corresponding in height to its CMLS-assigned weight. This feature has a negative weight when it is positioned in the intron region, but a positive weight downstream of the splice site. For the donor site, we notice its exceptionally high weight at position -4, which may reflect the donor-site consensus signal.</p>
               </text>
               <graphic file="1471-2105-8-410-7"/>
            </fig>
            <p>We next asked whether the hexamers present in our set but not in the others have predictive value. As described above, many experimentally determined exonic enhancers (as reviewed by Zheng <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>) overlapped our features. While this was true of the other sets as well, even when those previously described motifs were excluded, our features still accounted for some observations (see supplemental data, additional file <supplr sid="S4">4</supplr>). Interestingly, many of these were examples of A/C-rich motifs: CACACA, GCCCAA, TCAACA, CATTCA and CCTACA. Such A/C-rich elements have been described before <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> but have not been extensively characterized.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We previously showed that our FGA algorithm could be used to build accurate sequence classifiers <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Here we have shown that the features generated by our algorithm for the purpose of discriminating between true and false splice sites correspond to functional splicing signals. Generated features included known features such as the branch-site consensus, acceptor splice-site consensus, pyrimidine tracts, coding potential and exon splicing regulator signals. The ability of FGA to accurately extract the branch-site signal (Tables <tblr tid="T2">2</tblr>, <tblr tid="T3">3</tblr>, <tblr tid="T4">4</tblr>) is especially noteworthy in view of the elusive nature of this signal <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Furthermore, the generated features provided information about the preferred location and sequence of these features, as illustrated by the distribution of branch-site and pyrimidine-tract features. However, we note that because FGA does not produce features to capture particular events such as AG di-nucleotide exclusion zones <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, it was not able to extract contingent signals such as distant branch sites coupled to them.</p>
         <p>In addition, novel aspects of splicing signals could also be inferred from this method. We point to two examples. One is the co-occurrence of a peak of CCTT scores with a group of negative CMLS weights for TTTT at position -11 in the acceptor region. We believe that this may be a real, and previously unappreciated, aspect of the pyrimidine-tract signal. This signal is recognized by the large subunit of U2AF (and by PUF60; <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>). We note that in vitro selection experiments <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> found a marked preference for a CC dinucleotide in the case of U2AF but not PTB or Sxl. Thus, although U2AF will bind oligoU, other proteins will do so as well, and these are generally splicing repressors. Our observed features were consistent with the possibility that positions -12 and -11 may be an especially important region for discriminating between positive factors and negative factors that bind to similar sequence elements. This subtlety was revealed by our features despite the fact that it was not apparent from raw nucleotide-frequency data (Fig. <figr fid="F3">3</figr>). In a second example, even though our ESE hexamer features showed a statistically significant overlap with those obtained by other computational methods (Table <tblr tid="T5">5</tblr>), there were examples obtained by our method but not by the others, including a number of ESE motifs that corresponded to experimentally determined exonic splicing enhancers.</p>
         <p>Finally, this method can be easily applied to other species and to similar classification problems for the discovery of species-specific regulatory elements. We have made our features available online <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Dataset</p>
            </st>
            <p>We used a dataset of 4,000 human RefSeq pre-mRNA sequences to generate features and train our classifiers. A splice-site sequence in our training data is a subsequence consisting of 80 nucleotides upstream from the annotated splice site and 80 nucleotides downstream [80+AG/GT+80]. We counted the borders of all the introns within protein-coding regions whose acceptor and donor sites followed the AG and GT dinucleotide consensus. To construct negative examples for the training datasets, we selected random AG-pair or GT-pair locations that were not annotated splice sites and collected the subsequences as we did for the true sites. Our acceptor-site training set consisted of 20,996 positive instances and 200,000 negative instances. Our donor-site training set consisted of 20,761 positive instances and 200,000 negative instances. We did not remove the sequences found within the regions identified by RepeatMasker. When we ran RepeatMasker on our training sets of sequences, we marked those sequences in which at least 20% of the nucleotides were "masked" and the masking included the splice-site location. These constituted 36 of our positive and 67,571 of our negative instances. Our experiments showed that FGA performance was not affected by the repeated elements: the changes in the results when we excluded the repeated sequences from our training data were not significant. Therefore, all training was performed on the original training sequences, using a three-fold cross-validation scheme.</p>
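            <p>The window layout [80+AG/GT+80] and the collection of candidate negative sites can be illustrated with a minimal Python sketch. The function names <it>extract_window</it> and <it>candidate_sites</it>, the 0-based indexing and the handling of sequence ends are assumptions made for illustration; they are not taken from the original pipeline.</p>
            <p>
```python
FLANK = 80  # nucleotides kept on each side of the consensus dinucleotide


def extract_window(seq, site, flank=FLANK):
    """Return flank + dinucleotide + flank around 0-based index `site`.

    `site` is the index of the first base of the AG or GT dinucleotide.
    Returns None when the window would run off either end of the sequence.
    """
    start = site - flank
    end = site + 2 + flank
    if start >= 0 and len(seq) >= end:
        return seq[start:end]
    return None


def candidate_sites(seq, dinucleotide, annotated, flank=FLANK):
    """List indices of `dinucleotide` that are not annotated splice sites
    and that leave room for a full window (negative-example candidates)."""
    hits = []
    pos = seq.find(dinucleotide)
    while pos != -1:
        if pos not in annotated and extract_window(seq, pos, flank) is not None:
            hits.append(pos)
        pos = seq.find(dinucleotide, pos + 1)
    return hits
```
            </p>
            <p>With the default flank of 80, each extracted window is 162 nucleotides long. Negative examples could then be drawn at random from the candidate list, for instance with <it>random.sample</it>.</p>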
         </sec>
         <sec>
            <st>
               <p>Splice-site prediction model and performance evaluation</p>
            </st>
            <p>Our feature generation algorithm <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> uses the pre-mRNA sequence properties to construct and select useful features for splice-site prediction. Feature generation starts with an initial feature set. Then, the algorithm iteratively calls a feature-construction method to expand the current feature set, and it calls a feature-selection method to identify the useful features for the prediction task. After a specified number of iterations the algorithm produces an output feature set. The final set of features is then used as input to the learning algorithm for the sequence classification task.</p>
            <p>We used these features with a least-squares classifier algorithm, CMLS <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. When compared to AdaBoost, Support Vector Machines, Na&#239;ve Bayes and Maximum Entropy, this was the classifier that consistently gave the best performance. CMLS is a linear classifier with a performance similar to that of linear support vector machines, but with much faster convergence and therefore a shorter training time. When the classifier is trained, each of the input features is assigned a weight. These weights define the hyperplane, the decision boundary that optimizes the performance. Each given sequence is then assigned a score by adding the weights of the features that are present in the sequence.</p>
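            <p>The scoring step of a linear classifier of this kind can be sketched as follows. This is an illustration of the weighted-sum rule only, not the CMLS training procedure; the function name <it>score_sequence</it>, the feature labels and the weight values are hypothetical.</p>
            <p>
```python
def score_sequence(present_features, weights):
    """Sum the learned weights of the features present in one sequence.

    Features without a learned weight contribute nothing to the score.
    """
    return sum(weights.get(f, 0.0) for f in present_features)


# Hypothetical weights, for illustration only.
example_weights = {"GAAG@10": 0.5, "TTTT@-11": -0.25}
example_score = score_sequence(["GAAG@10", "TTTT@-11", "CCTT@-12"], example_weights)
```
            </p>
            <p>A sequence is then called a splice site when its score exceeds a chosen threshold.</p>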
            <p>We evaluated the performance of our model using 11-point average precision (11ptAvg Precision) and Receiver Operating Characteristic (ROC) analysis. For any sensitivity ratio, TP/(TP+FN), we calculate the precision at the threshold which achieves that ratio. Precision, TP/(TP+FP), measures the proportion of the sequences scoring above the threshold that are true splice sites. The 11ptAvg is the average of the precisions estimated at the sensitivity values 0%, 10%, 20%, ..., 100%. We also draw the ROC curve, which plots sensitivity (on the y-axis) against the false positive rate (on the x-axis). The false positive rate, FP/(TN+FP), is the value we wish to minimize, and the ROC graph shows the tradeoff between sensitivity and false positive rate.</p>
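            <p>The 11-point average precision can be sketched in Python as below. The sketch assumes the common interpolated variant, in which the precision at a sensitivity level is the best precision reached at that level or any higher one, and it ignores ties between scores; the text above does not specify these details, and the function name is hypothetical.</p>
            <p>
```python
def eleven_point_average_precision(scores, labels):
    """Interpolated 11-point average precision.

    scores: classifier scores, higher means more likely a true site.
    labels: 1 for true splice sites, 0 for decoys.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []  # (sensitivity, precision) after each ranked example
    tp = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / rank))
    precisions = []
    for step in range(11):
        level = step / 10
        # Interpolation: best precision at any sensitivity >= this level.
        reachable = [p for r, p in points if r >= level]
        precisions.append(max(reachable) if reachable else 0.0)
    return sum(precisions) / 11
```
            </p>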
         </sec>
         <sec>
            <st>
               <p>Feature types and construction procedures</p>
            </st>
            <p>The composite features we generated for splice-site prediction capture compositional and positional properties of sequences. In our general FGA technique, we distinguished different types and we defined a construction algorithm for each type. In the experiments described in this paper, we used positional composite features, which we define as follows:</p>
            <p><it>Position-specific nucleotides </it>are basic features that represent the nucleotides at each of the positions <it>i </it>in the sequence. These features capture the nucleotide-position preference in the sequence; therefore they are very commonly used in DNA sequence-classification analysis. As an example, assume our feature set is F = { <it>a</it><sub>1</sub>, <it>c</it><sub>1</sub>, ..., <it>g</it><sub><it>n</it></sub>, <it>t</it><sub><it>n</it></sub>}, where <it>a</it><sub>1 </sub>denotes nucleotide <it>a </it>at the first sequence position. Our sequences have a length <it>n </it>of <it>162 </it>nucleotides; therefore our position-specific set of single nucleotides contains <it>648 </it>features. We use this initial feature set to construct complex position-specific features.</p>
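            <p>As a minimal sketch of this initial feature set, the Python fragment below enumerates all nucleotide-position pairs and encodes a sequence as the set of pairs it contains. The tuple representation and the function names are assumptions made for illustration.</p>
            <p>
```python
ALPHABET = "acgt"


def initial_feature_set(n):
    """All (nucleotide, position) pairs for positions 1..n."""
    return [(base, pos) for pos in range(1, n + 1) for base in ALPHABET]


def encode(seq):
    """Set of position-specific single-nucleotide features present in `seq`."""
    return {(base, pos) for pos, base in enumerate(seq.lower(), start=1)}
```
            </p>
            <p>For sequences of length 162, <it>initial_feature_set</it> returns 162 &#215; 4 = 648 features, matching the count given above.</p>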
         </sec>
         <sec>
            <st>
               <p>Position-specific k-mer features</p>
            </st>
            <p>The <it>position-specific k-mer </it>features capture the correlations among <it>k </it>adjacent nucleotides and their respective positions. At each position <it>i </it>in the sequence these features represent the substring appearing at positions <it>i</it>, <it>i </it>+ 1, ..., <it>i </it>+ <it>k </it>- 1.</p>
            <sec>
               <st>
                  <p>Construction Method</p>
               </st>
               <p>Given an initial set of position-specific k-mer features, this construction method expands them to a set of position-specific (<it>k </it>+ 1)-mers by appending another nucleotide to each position-specific k-mer. For example, if our initial set is F<sub><it>initial </it></sub>= {<it>a</it><sub>1 </sub><it>g</it><sub>2</sub>}, we can extend it to the set F<sub><it>constructed </it></sub>= {<it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>a</it><sub>3</sub>, <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>c</it><sub>3</sub>, <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>g</it><sub>3</sub>, <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub><it>t</it><sub>3</sub>}.</p>
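               <p>The extension step can be sketched in Python as follows. Each position-specific k-mer is represented here as a (start position, string) pair; this representation and the function name <it>extend_kmers</it> are assumptions made for illustration.</p>
               <p>
```python
def extend_kmers(features, n):
    """Extend position-specific k-mers to (k + 1)-mers.

    features: list of (start, kmer) pairs with 1-based start positions.
    n: sequence length; extensions that would pass position n are dropped.
    """
    extended = []
    for start, kmer in features:
        # The appended nucleotide lands on position start + len(kmer).
        if n >= start + len(kmer):
            for base in "acgt":
                extended.append((start, kmer + base))
    return extended
```
               </p>
               <p>Applied to the single feature (1, "ag") this returns the four 3-mers starting at position 1, as in the example above.</p>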
            </sec>
         </sec>
         <sec>
            <st>
               <p>Composite positional features</p>
            </st>
            <p>The <it>composite positional </it>features consist of a conjunction of <it>k </it>nucleotides in <it>k </it>different positions co-occurring in the sequence. In the simplest case, this type of feature set consists of position-specific single nucleotides. While the position-specific k-mers capture only the correlations among nearby positions, the composite positional features are intended to capture the correlations between different nucleotides in non-consecutive positions in the sequence. We construct these complex features from conjunctions of position-specific features. The dimensionality of this kind of feature is inherently high. If the number of conjuncts is <it>k</it>, we have a total of <inline-formula><m:math name="1471-2105-8-410-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>n</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mi>k</m:mi></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaalmaabmaajugqbeaafaqabeGabaaabaGaemOBa4gabaGaem4AaSgaaaGccaGLOaGaayzkaaaaaa@31CA@</m:annotation></m:semantics></m:math></inline-formula> &#215; 4<sup><it>k </it></sup>such features for a sequence of length <it>n</it>.</p>
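            <p>The size of this feature space can be checked with a short Python sketch; the function name <it>composite_feature_count</it> is hypothetical.</p>
            <p>
```python
from math import comb


def composite_feature_count(n, k):
    """Number of conjunctions of k position-specific nucleotides
    over a sequence of length n: C(n, k) * 4**k."""
    return comb(n, k) * 4 ** k
```
            </p>
            <p>For sequences of length 162, this gives 648 features for <it>k </it>= 1 and 208,656 for <it>k </it>= 2, which is why pruning during generation is needed.</p>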
            <sec>
               <st>
                  <p>Construction Method</p>
               </st>
               <p>Given the set of <it>k</it>-conjuncts, this construction method selects from the set of basic features to add another position-specific nucleotide in an unconstrained position. In this manner we construct the set of (<it>k </it>+ 1)-conjuncts. Now, if our initial set is F<sub><it>initial </it></sub>= {<it>a</it><sub>1 </sub><it>g</it><sub>2</sub>}, we can extend it to the level <it>2 </it>set of position-specific base combinations F<sub><it>constructed </it></sub>= {<it>a</it><sub>1 </sub><it>g</it><sub>2 </sub>^ <it>a</it><sub>3</sub>, <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub>^ <it>c</it><sub>3</sub>, ..., <it>a</it><sub>1 </sub><it>g</it><sub>2 </sub>^ <it>t</it><sub><it>n</it></sub>}. Incrementally, in this manner we can construct higher levels.</p>
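               <p>A minimal sketch of this construction step is given below. Each conjunct is represented as a frozenset of (nucleotide, position) pairs; this representation and the function name <it>extend_conjuncts</it> are assumptions made for illustration, and no feature selection is applied between levels.</p>
               <p>
```python
def extend_conjuncts(conjuncts, n):
    """Extend k-conjuncts to (k + 1)-conjuncts.

    conjuncts: iterable of frozensets of (nucleotide, position) pairs.
    n: sequence length; positions run from 1 to n.
    """
    extended = set()
    for conj in conjuncts:
        used = {pos for _, pos in conj}
        for pos in range(1, n + 1):
            if pos in used:
                continue  # a position can carry only one nucleotide
            for base in "acgt":
                extended.add(conj | {(base, pos)})
    return extended
```
               </p>
               <p>Starting from a single conjunct that fixes two positions, one call yields 4 &#215; (<it>n </it>- 2) new conjuncts, and repeated calls construct the higher levels.</p>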
            </sec>
         </sec>
         <sec>
            <st>
               <p>Feature Selection</p>
            </st>
            <p>Feature-selection methods reduce the set of features by keeping only the useful features for the task at hand. The problem of selecting useful features has been the focus of extensive research and many approaches have been proposed <abbrgrp><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>.</p>
            <p>We considered several approaches for initial pruning of features of different types during the generation stage. In our data, we found that the Information Gain feature-selection method performed best for selecting composite positional features and we calculated the value for each of the features according to the following formula:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-410-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>I</m:mi>
                           <m:mi>G</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>f</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mo>&#8722;</m:mo>
                           <m:mi>H</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mi>C</m:mi>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>+</m:mo>
                           <m:mi>p</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mi>f</m:mi>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mi>H</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mi>C</m:mi>
                                    <m:mo>/</m:mo>
                                    <m:mi>f</m:mi>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>+</m:mo>
                           <m:mi>p</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mover accent="true">
                                    <m:mi>f</m:mi>
                                    <m:mo stretchy="true">&#175;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mi>H</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mi>C</m:mi>
                                    <m:mo>/</m:mo>
                                    <m:mrow>
                                       <m:mover accent="true">
                                          <m:mi>f</m:mi>
                                          <m:mo stretchy="true">&#175;</m:mo>
                                       </m:mover>
                                    </m:mrow>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>,</m:mo>
                        </m:mrow>
 