<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-5-64</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Feature selection for splice site prediction: A new method using EDA-based feature ranking</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Saeys</snm>
               <fnm>Yvan</fnm>
               <insr iid="I1"/>
               <email>yvan.saeys@psb.ugent.be</email>
            </au>
            <au id="A2">
               <snm>Degroeve</snm>
               <fnm>Sven</fnm>
               <insr iid="I1"/>
               <email>sven.degroeve@psb.ugent.be</email>
            </au>
            <au id="A3">
               <snm>Aeyels</snm>
               <fnm>Dirk</fnm>
               <insr iid="I2"/>
               <email>dirk.aeyels@ugent.be</email>
            </au>
            <au id="A4">
               <snm>Rouz&#233;</snm>
               <fnm>Pierre</fnm>
               <insr iid="I3"/>
               <email>pierre.rouze@psb.ugent.be</email>
            </au>
            <au id="A5" ca="yes">
               <snm>Van de Peer</snm>
               <fnm>Yves</fnm>
               <insr iid="I1"/>
               <email>yves.vandepeer@psb.ugent.be</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Plant Systems Biology, Ghent University, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, B-9052 Ghent, Belgium</p>
            </ins>
            <ins id="I2">
               <p>SYSTeMS Research Group, Ghent University, Technologiepark 9, B-9052 Ghent, Belgium</p>
            </ins>
            <ins id="I3">
               <p>Laboratoire associ&#233; de l'INRA (France), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2004</pubdate>
         <volume>5</volume>
         <issue>1</issue>
         <fpage>64</fpage>
         <url>http://www.biomedcentral.com/1471-2105/5/64</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/1471-2105-5-64</pubid>
               <pubid idtype="pmpid">15154966</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>16</day>
               <month>12</month>
               <year>2003</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>21</day>
               <month>5</month>
               <year>2004</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>21</day>
               <month>5</month>
               <year>2004</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2004</year>
         <collab>Saeys et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain equally good or even better solutions using a restricted subset of features, and faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The DNA sequences of most genes code for messenger RNA (mRNA), which in turn encodes proteins. Whereas in prokaryotes the mRNA is a mere copy of a fragment of the DNA, in eukaryotes the RNA copy of the DNA (primary transcript or pre-mRNA) contains non-coding segments (introns) that must be precisely spliced out to produce the mRNA. The boundaries of such introns are referred to as splice sites. The splice site at the upstream end of the intron is called the donor site; the one at the downstream end is termed the acceptor site.</p>
         <p>In recent years, large datasets containing the sequences of several eukaryotic genomes have become available. Such datasets allow us to use supervised machine learning techniques to automate the process of splice site prediction. The identification of these sites constitutes the major subtask in gene prediction and is of key importance in determining the exact structure of genes in genomic sequences. An extensive overview of splice site recognition can be found in <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, while a more general overview and a comparison of gene and splice site prediction are given in <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> and <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. More recent work on splice site prediction for the human genome includes methods based on maximum entropy modelling <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and support vector machines <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>To increase the probability of including relevant information, machine learning methods are typically provided with many features describing the data. In most cases however, not all of these features will be relevant to the classification task, often decreasing the classification performance of the learning algorithm. Therefore, there is a need to incorporate techniques that search for a "minimal" set of features with "best" classification performance. These techniques are often referred to as feature subset selection (FSS) or dimensionality reduction techniques.</p>
         <p>Genetic algorithms (GA) have been applied successfully to the identification of relevant feature subsets in small scale (less than 100 features) domains <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. During the last decade estimation of distribution algorithms (EDA) emerged as a new form of evolutionary computation <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. In previous work <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, the use of EDAs for selecting a constrained subset of features was shown to yield a considerable speed-up in time with respect to the traditional wrapper methods for feature selection.</p>
         <p>In this paper we elaborate further on these ideas and demonstrate how an EDA can be used to provide a dynamical view of the feature selection process. This offers new possibilities for identifying how many and which features are minimally needed before classification performance drops drastically, and provides more insight into the biological problem of splicing. This is demonstrated by the detection of a new, biologically motivated feature that we refer to as AG-scanning.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Splice site datasets</p>
            </st>
            <p>We constructed a dataset of splice sites for <it>Arabidopsis thaliana</it>. This was done as follows. We obtained mRNAs from the public EMBL database and aligned them to the BAC sequences that were used during the assembly of the <it>Arabidopsis </it>chromosomes. Afterwards the dataset was cleaned, by removing redundant genes, which resulted in a dataset of 1495 genes. From these genes, only the introns with canonical splice sites (GT for donor and AG for acceptor) were retained and used as positive instances. Negative instances were defined as GT or AG dinucleotides in the interval between 300 nucleotides upstream of the donor of the first intron and 300 nucleotides downstream of the acceptor of the last intron in that gene and that are not annotated as a splice site. More details on the construction of the datasets can be found in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
            <sec>
               <st>
                  <p>Feature extraction</p>
               </st>
               <p>Splice site prediction can be divided into two subtasks: prediction of donor sites and prediction of acceptor sites. Each of these subtasks can be formally stated as a two-class classification task: {donor site, non-donor site} and {acceptor site, non-acceptor site}. The features describing the positive and negative instances were extracted from a local context around the splice site. In our experiments we used a window of 50 nucleotide positions to the left (upstream of the splice site) and 50 positions to the right (downstream of the splice site). Features were then extracted from this local context, resulting in three datasets with growing complexity.</p>
               <p>Dataset 1 is the simplest dataset, containing only position-dependent nucleotide information. This results in a dataset described by 100 features (50 positions to the left, 50 to the right). These features were converted into a binary format using a sparse vector encoding (one bit per possible nucleotide at each position), yielding 400 binary features.</p>
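               <p>The sparse vector encoding described above can be sketched as follows. This is only an illustration: the column order in <it>NUCLEOTIDES</it>, the function name <it>sparse_encode</it>, the handling of unknown symbols and the example context are assumptions, not details taken from the original implementation.</p>
               <p>
```python
NUCLEOTIDES = "ACGT"  # assumed column order; the paper does not specify one


def sparse_encode(context):
    """Map a nucleotide string to a flat list of 0/1 features, four per position."""
    features = []
    for base in context.upper():
        # One indicator per nucleotide; an unknown symbol (e.g. N) gives all zeros.
        features.extend(1 if base == nuc else 0 for nuc in NUCLEOTIDES)
    return features


# Hypothetical 100-nt context: 50 positions upstream and 50 downstream of the site.
context = "ACGT" * 25
encoded = sparse_encode(context)
print(len(encoded))  # 400
print(encoded[:8])   # [1, 0, 0, 0, 0, 1, 0, 0]
```
               </p>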
               <p>Dataset 2 adds to these position-dependent features also a number of position-independent features, representing the occurrence of trimers (words of length three) in the flanking sequence. An example of such a feature is the occurrence of the word "ATC" in the upstream part of the splice site. This yields another 128 binary features, summing up to 528 binary features for the second dataset version.</p>
               <p>Dataset 3 adds another layer of position-dependent information: the position-dependent dimers. This results in an additional set of 1568 features (49 &#215; 16 &#215; 2), summing up to 2096 features for the third dataset. It should be noted that adding position-dependent dimers already captures dependencies between adjacent nucleotides at the feature level. This allows us to still use linear classification models, yet take into account nucleotide dependencies. Another advantage of incorporating the dependencies at the feature level is the ease of visualising and interpreting feature dependencies using feature selection, as will be shown further. Note that these features only model dependencies between pairs of adjacent bases, not between non-neighbouring bases.</p>
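               <p>The feature counts of the three datasets can be checked with a few lines of arithmetic. The sketch below assumes that the 128 trimer features correspond to the 64 possible trimers counted once for the upstream and once for the downstream flank; the variable names are illustrative only.</p>
               <p>
```python
WINDOW = 50    # positions on each side of the splice site
ALPHABET = 4   # A, C, G, T

position_nucleotides = 2 * WINDOW * ALPHABET         # dataset 1
trimers = 2 * ALPHABET ** 3                          # assumed: 64 trimers per flank
position_dimers = 2 * (WINDOW - 1) * ALPHABET ** 2   # 49 adjacent pairs per side

dataset1 = position_nucleotides
dataset2 = dataset1 + trimers
dataset3 = dataset2 + position_dimers
print(dataset1, dataset2, dataset3)  # 400 528 2096
```
               </p>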
               <p>For each of the three datasets, different training and test sets were compiled as follows. Each dataset was split into a training and a test set, each containing 3,000 positive and 18,000 negative instances. This class imbalance was chosen because it gives a more realistic view of real sequences, where the number of pseudo sites also exceeds the number of real sites. This process of splitting was replicated five times, resulting in five pairs of training and test sets, allowing us to perform a 10-fold cross-validation (5 &#215; 2). The results described further are all averaged over these 10 folds.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Estimation of Distribution Algorithms</p>
            </st>
            <p>Standard GAs have been criticized in the literature for a number of aspects: the large number of parameters that have to be tuned, the difficult prediction of the movements of the populations in the search space and the fact that there is no mechanism for capturing the relations among the variables of the problem <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. EDAs try to overcome these difficulties by providing a more statistical analysis of the selected individuals, thereby explicitly modelling the relationships among the variables. Instead of using the traditional crossover and mutation operators as in GAs, the further exploration of the search space is guided by the probabilistic modelling of promising solutions. The main scheme of the EDA approach is shown in Figure <figr fid="F1">1</figr>. In a first step, the initial population is generated. From this population a subset of promising individuals is selected. This is done by calculating an evaluation measure (often called the fitness) for each individual and afterwards selecting a number of individuals (mostly the best half of the population). In the case of feature selection, each individual is a binary feature vector, each bit representing the presence (1) or absence (0) of a particular feature. The evaluation can then be calculated as the classification performance of a machine learning method when using only the features having a 1 in the binary vector. This will be discussed in more detail in the following sections.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Schematic overview of the EDA algorithm</p>
               </caption>
               <text>
                  <p><b>Schematic overview of the EDA algorithm. </b>The EDA starts by generating an initial population P0. Then, an iterative procedure runs until the termination criteria are met.</p>
               </text>
               <graphic file="1471-2105-5-64-1"/>
            </fig>
            <p>An iterative procedure repeating steps 2, 3 and 4 (see Figure <figr fid="F1">1</figr>) is then carried out until a termination criterion is met. Such a criterion can either be quantitative, like a fixed number of iterations, or qualitative, like a lower limit on the evaluation measure that has to be reached. In each iteration, a number of individuals is selected from the population and from these a probability distribution of the encoded variables is estimated. Afterwards, the estimated probability distribution is used to generate the next population. This is done by sampling the probability distribution, i.e. generating individuals according to this distribution.</p>
            <p>The actual estimation of the underlying probability distribution represents the core of the EDA paradigm, and can be considered an optimization problem on its own. Depending on the domain (discrete or continuous), different estimation algorithms with varying complexity (modelling univariate, bivariate or multivariate dependencies) were designed <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. In the most complex case of multivariate dependencies, Bayesian Networks are frequently used. A greedy search algorithm is then used to find a suitable (and often constrained) network that is likely to generate the selected individuals.</p>
            <p>The use of EDAs for feature subset selection was pioneered in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and the use of EDAs for FSS in large scale domains was reported to yield good results <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B13">13</abbr></abbrgrp>.</p>
            <sec>
               <st>
                  <p>The Univariate Marginal Distribution Algorithm</p>
               </st>
               <p>As an example of an EDA, we will consider here the Univariate Marginal Distribution Algorithm (UMDA, <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>). The UMDA is a simple estimation algorithm, based on the assumption that all variables are independent. For each iteration <it>l</it> the probability model <it>p</it><sub><it>l</it></sub>(<it>x</it>) that is induced from a selected number of individuals (step 3 in Figure <figr fid="F1">1</figr>) is estimated as <graphic file="1471-2105-5-64-i1.gif"/></p>
               <p>Here each <it>p</it><sub><it>l</it></sub>(<it>x</it><sub><it>i</it></sub>) (the relative frequency) is estimated from the selected set (Se) of individuals of the previous generation <graphic file="1471-2105-5-64-i2.gif"/>. A new individual is then generated by sampling a value from the distribution <it>p</it><sub><it>l</it></sub>(<it>x</it><sub><it>i</it></sub>) for each variable <it>x</it><sub><it>i</it></sub>.</p>
               <p>It has to be pointed out that the EDA-UMDA approach is very similar to the compact GA <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> or to a GA with uniform crossover. Although these algorithms assume independence between variables, it has been shown that they are fast and robust for feature selection <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B10">10</abbr></abbrgrp>.</p>
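               <p>A minimal sketch of the UMDA loop described in this section is given below, assuming binary individuals and selection of the best half of the population. The function names, the toy fitness function (which stands in for the classification performance of a model) and the parameter values are assumptions made for illustration; they do not reproduce the original C++ implementation.</p>
               <p>
```python
import random


def umda_estimate(selected):
    """Relative frequency of a 1 at each position among the selected individuals."""
    count = len(selected)
    return [sum(column) / count for column in zip(*selected)]


def umda_sample(probs, rng):
    """Draw one individual by sampling every bit independently."""
    return [1 if p > rng.random() else 0 for p in probs]


def umda(fitness, n_bits, pop_size, iterations, seed=0):
    rng = random.Random(seed)
    # Initial population: every bit is set with probability 0.5.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    probs = [0.5] * n_bits
    for _ in range(iterations):
        # Keep the best half of the population.
        selected = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        # Estimate the univariate model and sample the next population from it.
        probs = umda_estimate(selected)
        population = [umda_sample(probs, rng) for _ in range(pop_size)]
    return probs, population


# Toy fitness (an assumption): reward bits 0-2, penalise every other selected bit.
def toy_fitness(bits):
    return sum(bits[:3]) - 0.5 * sum(bits[3:])


probs, _ = umda(toy_fitness, n_bits=8, pop_size=100, iterations=20)
print([round(p, 2) for p in probs])
```
               </p>
               <p>In a feature selection setting, <it>toy_fitness</it> would be replaced by a function that trains and evaluates a classifier on the features whose bit is set to 1.</p>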
            </sec>
         </sec>
         <sec>
            <st>
               <p>Classification methods</p>
            </st>
            <p>Two classification methods were used in our experiments: the Naive Bayes classifier and the Support Vector Machine. These methods are known to perform well in high dimensional spaces. They are supervised classification methods that induce a decision function from the instances in a training set that can then be used to classify a new instance. The Support Vector Machine (SVM, <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>) is a data-driven method for solving two-class classification tasks. In our experiments we used a linear SVM. The Naive Bayes method (NBM, <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>) follows the Bayes optimal decision rule, combining it with the assumption that the probability of the features given the class is the product of the probabilities of the individual features. It is known that the NBM can achieve considerably better results when FSS is applied <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, yet also the SVM can benefit from feature selection, although it already performs an implicit feature weighting based on the maximisation of the margin <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Feature subset selection methods</p>
            </st>
            <p>Techniques for FSS are traditionally divided into two classes: <it>filter </it>approaches and <it>wrapper </it>approaches <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. In the case of filter methods a feature relevance score is calculated, and low-scoring features are removed, providing a mechanism that is independent of the classification method to be used. In the wrapper approach various subsets of features are generated and evaluated, typically using greedy (iterative forward or backward methods) or heuristic search methods (GA, EDA). This approach is used with a specific classification algorithm, as the outcome of the evaluation is used during the search. Additionally one can distinguish a third class of FSS methods where the feature selection mechanism is built into the model <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
            <p>In general, the use of wrapper methods is preferred, as this approach is better able to deal with datasets where many correlations between features exist. On the other hand, wrapper techniques are computationally very demanding, because for each feature subset a classification model has to be trained and evaluated. The technique we describe here is an EDA-based heuristic wrapper approach, that scales better to larger feature sets than the traditional wrapper methods, as will be shown further in this paper. Traditionally, FSS techniques based on GAs or EDAs use the single best subset of features as the result of the search. Here we elaborate further on these ideas and show how the EDA can be used to derive a more dynamic view of the feature selection process.</p>
            <sec>
               <st>
                  <p>Feature ranking using EDA-UMDA</p>
               </st>
               <p>The most common usage of GAs/EDAs in feature selection is to search for a subset of features, representing the "best" solution, i.e. one that maximises the classification performance of the classification model on the training (using cross-validation) or holdout set. This is done by evolving a population, where at the end of the iterative process the best scoring individual is regarded as "the solution".</p>
               <p>It should be noted that such a single best subset of features provides a rather static view of the whole elimination process. When using FSS to gain more insight into the underlying processes, the human expert does not know the context of the specific subset. Questions about how many and which features can still be eliminated before the classification performance drops drastically remain unanswered by a static analysis, although the answers would provide interesting information.</p>
               <p>Feature ranking is a first step towards a dynamical analysis of the feature elimination process. The result of a feature ranking is an ordering of the features, sorted from the least relevant to the most relevant. Starting from the full/empty feature set, features can then be removed/added and the classification performance for each subset can be calculated, providing a dynamic view.</p>
               <p>The limitations of the traditional, static approach can be overcome by not restricting the outcome of an EDA to the single best individual of the population: the distribution estimated from the population can instead be used as a whole, which generalises better than a single solution. To derive a feature ranking from a probability distribution, some sort of importance or relevance score for each feature needs to be calculated. Evidently, a feature <it>i </it>having a higher value for <graphic file="1471-2105-5-64-i3.gif"/> can be considered more important than a feature <it>j </it>with a lower value for <graphic file="1471-2105-5-64-i4.gif"/>. The generalized probabilities <graphic file="1471-2105-5-64-i3.gif"/> can thus be considered as feature relevance scores, and a list of features sorted by these probabilities returns a feature ranking. The general algorithm to calculate such a ranking (EDA-R) consists of the steps presented below.</p>
            </sec>
            <sec>
               <st>
                  <p>Algorithm EDA-R</p>
               </st>
               <p>1. Select <it>S </it>individuals from the final population <it>D</it><sub><it>final</it></sub></p>
               <p>2. Construct the probability model <it>P </it>from <graphic file="1471-2105-5-64-i5.gif"/>, <it>j </it>= 1... <it>S</it>, using an EDA (UMDA, BMDA, BOA/EBNA)</p>
               <p>3. For each variable (feature) <it>X</it><sub><it>i</it></sub>, calculate the probability <graphic file="1471-2105-5-64-i6.gif"/></p>
               <p>4. Sort the features <it>X</it><sub>1 </sub>,..., <it>X</it><sub><it>n </it></sub>by their probabilities <graphic file="1471-2105-5-64-i6.gif"/></p>
               <p>5. List the array of sorted features</p>
               <p>The most important step in this algorithm is the extraction of the probabilities <graphic file="1471-2105-5-64-i3.gif"/> from the model. For models with univariate dependencies like the UMDA, the extraction of these probabilities is trivial, as they can be directly inferred from the model. For higher order EDAs the probabilities <graphic file="1471-2105-5-64-i3.gif"/> need to be calculated in a forward manner, as they may involve conditional probabilities.</p>
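               <p>For the univariate case, the steps of EDA-R listed above can be sketched as follows. The function name <it>eda_r_ranking</it>, the use of <it>sum</it> as a stand-in fitness and the four-individual population are hypothetical and serve only to make the example self-contained.</p>
               <p>
```python
def eda_r_ranking(final_population, fitness, n_selected):
    """Rank features by relative frequency among the selected individuals (UMDA case)."""
    # Step 1: select S individuals from the final population.
    selected = sorted(final_population, key=fitness, reverse=True)[:n_selected]
    # Steps 2-3: under a univariate model the probability of feature i being
    # present is the fraction of selected individuals that contain it.
    probs = [sum(column) / len(selected) for column in zip(*selected)]
    # Step 4: sort feature indices from least to most relevant.
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    # Step 5: list the sorted features together with their scores.
    return [(i, probs[i]) for i in order]


# Hypothetical final population of four individuals over three features.
population = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
ranking = eda_r_ranking(population, fitness=sum, n_selected=4)
print(ranking)  # [(1, 0.25), (0, 0.75), (2, 0.75)]
```
               </p>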
               <p>The feature ranking can then be used afterwards to iteratively discard features. It should be noted that this ranking is specific to the classification model that was used during the search. The number of classification models that have to be trained can be easily calculated. For an EDA with a population size <it>P</it>, running for <it>I </it>iterations, the number of model evaluations is <it>P</it>(<it>I</it>+1).</p>
            </sec>
            <sec>
               <st>
                  <p>Other techniques</p>
               </st>
               <p>We compared EDA-based feature ranking to two other selection strategies. The first of these is a traditional sequential wrapper approach, known as sequential backward elimination (SBE). SBE starts with the full feature set and iteratively discards features. At iteration <it>l </it>the feature set consists of <it>n</it><sub><it>l </it></sub>features and <it>n</it><sub><it>l </it></sub>models have to be trained, leaving out each feature once in each model. At iteration <it>l+1 </it>the feature set for the model with the best predictive performance is then chosen as the new feature subset. For a feature set of size <it>n </it>the number of classification models to be trained and evaluated when a complete view of the selection process is required is <graphic file="1471-2105-5-64-i7.gif"/>. One could also use a sequential forward selection procedure, but in general correlated features are better discovered using a backward approach.</p>
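               <p>The SBE procedure can be sketched as below, with a counter that records how many models are evaluated. Summing <it>n</it><sub><it>l </it></sub>over all iterations gives <it>n</it>(<it>n</it>+1)/2 evaluations for a complete elimination, which is the count this sketch reproduces. The <it>evaluate</it> callback and the toy weights are assumptions that stand in for training and scoring a real classifier.</p>
               <p>
```python
def sequential_backward_elimination(features, evaluate):
    """Return the elimination order and the number of models evaluated."""
    remaining = list(features)
    eliminated = []
    evaluations = 0
    while remaining:
        best_score, best_feature = None, None
        for feature in remaining:
            # Train and score a model with this one feature left out.
            candidate = [f for f in remaining if f != feature]
            score = evaluate(candidate)
            evaluations += 1
            if best_score is None or score > best_score:
                best_score, best_feature = score, feature
        # Discard the feature whose removal hurts performance the least.
        remaining.remove(best_feature)
        eliminated.append(best_feature)
    return eliminated, evaluations


# Toy evaluation (an assumption): the score of a subset is the sum of fixed weights.
weights = {"a": 0.1, "b": 0.7, "c": 0.4, "d": 0.2}
order, count = sequential_backward_elimination(
    weights, lambda subset: sum(weights[f] for f in subset)
)
print(order)  # ['a', 'd', 'c', 'b']
print(count)  # 10
```
               </p>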
               <p>The second method is an advanced filter method, described by Koller and Sahami <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, further referred to as KS. This filter method is based on Markov blankets and is able to discover feature interactions, a property that not all filter methods share. During the first step a correlation matrix is calculated, requiring O(<it>n</it><sup>2</sup>(<it>m </it>+ log <it>n</it>)) operations, <it>m </it>being the number of instances in the training set. During the second step, the actual feature selection is done. The parameter <it>k </it>in the algorithm represents a small, fixed number of conditioning features, typically set to 0, 1 or 2. We used <it>k </it>= 1 in all our experiments, requiring an additional O(2<it>n</it><sup>2</sup><it>m</it>) operations for a complete view of the selection process.</p>
            </sec>
            <sec>
               <st>
                  <p>Selection criterion</p>
               </st>
               <p>The determination of the classification performance for a specific subset of features greatly influences the feature selection mechanism. In our experiments we used the F-measure as a measure of classification performance, because it is better able to deal with imbalanced datasets than the traditional accuracy measure <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
               <p>
                  <graphic file="1471-2105-5-64-i8.gif"/>
               </p>
               <p>TP and TN represent the number of true positives and true negatives, FP and FN the number of false positives/negatives.</p>
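               <p>The contrast between accuracy and the F-measure on imbalanced data can be illustrated with a short sketch. It assumes the balanced F-measure, i.e. the harmonic mean of precision and recall; the confusion-matrix counts are hypothetical and merely use the class sizes of one test set.</p>
               <p>
```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical classifier that labels all 21,000 test instances as negative.
print(round(accuracy(tp=0, tn=18000, fp=0, fn=3000), 3))  # 0.857
print(f_measure(tp=0, fp=0, fn=3000))                     # 0.0

# Hypothetical classifier with precision and recall both equal to 0.8.
print(round(f_measure(tp=2400, fp=600, fn=600), 3))       # 0.8
```
               </p>
               <p>A classifier that never predicts a splice site still reaches an accuracy of about 0.86 on such data, whereas its F-measure is 0.</p>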
            </sec>
         </sec>
         <sec>
            <st>
               <p>Implementation</p>
            </st>
            <p>The methods for feature selection were all implemented in C++, using the SVM<sup><it>light </it></sup>implementation for SVMs <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Both SBE and EDA are suitable candidates for parallelization, providing a linear gain in speed of the selection process. For parallelization, we made use of the MPI libraries, available at <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. All experiments ran on a cluster of 5 dual-processor (1.2 GHz) Linux machines running RedHat Linux 7.2. The source code is available from the authors upon request.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>All results were averaged using 10-fold cross-validation. Using the EDA-approach, the internal evaluation of a feature subset was calculated on a 5-fold cross-validation of the training set. For the different datasets, the C-parameter of the SVM was tuned on the full feature set: C = 0.05 for datasets 1 and 2, C = 0.005 for dataset 3. These values were determined experimentally using a cross-validation procedure. For the EDA-approach, the population size was set to 500 individuals, and the number of iterations was set to 20. At each iteration in the EDA, the probability model was estimated using the best half of the population. For the largest dataset (2096 features) the SBE approach turned out to be infeasible, due to the large number of models that need to be evaluated.</p>
         <p>Figure <figr fid="F2">2</figr> compares the results for the three feature selection methods on the three datasets when NBM is used as classifier. The x-axis represents the number of features eliminated so far, while the y-axis measures the classification performance (F-measure). Several conclusions can be drawn from these results. A general observation is that many features can be eliminated before the classification performance drops drastically. This illustrates the fact that the datasets contain many irrelevant or correlated features, as removing these features does not harm the classification performance. Furthermore, it can be noted that better results can be obtained using the more complex datasets (adding position-independent trimers and position-dependent dimers), demonstrating the usefulness of including such features. A second observation is that the wrapper methods (EDA-R and SBE) consistently perform better than the filter method (KS), and that EDA-R achieves better results than SBE.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Comparison of feature selection techniques</p>
            </caption>
            <text>
               <p><b>Comparison of feature selection techniques. </b>For each of the three datasets, the different feature selection techniques are compared with NBM used as a classifier. The x-axis denotes the number of features that have been eliminated so far, while the y-axis shows the classification performance (F-measure).</p>
            </text>
            <graphic file="1471-2105-5-64-2"/>
         </fig>
         <p>In addition to the comparison of classification performance, an important aspect to consider is the running time. In this respect KS, being a filter approach, is the fastest algorithm for the datasets we used. To compare EDA-R and SBE, the formulas given earlier can be used to calculate the number of model evaluations that are needed. For the dataset containing 528 features, an SBE approach eliminating one feature at a time requires 139,656 model evaluations, while the EDA-R method needs only 10,500 model evaluations, a reduction by approximately one order of magnitude. For the largest dataset (2096 features) the EDA-R method achieves good results and again needs only 10,500 model evaluations, while the SBE approach would need 2,197,656 model evaluations, a reduction by more than two orders of magnitude.</p>
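         <p>These counts can be reproduced with a short calculation. The sketch uses the <it>P</it>(<it>I</it>+1) expression for the EDA and assumes that SBE trains <it>n</it>(<it>n</it>+1)/2 models when one feature is removed at a time, which matches the figures quoted above; the function names are illustrative.</p>
         <p>
```python
def eda_evaluations(population_size, iterations):
    # P(I + 1): the initial population plus one new population per iteration.
    return population_size * (iterations + 1)


def sbe_evaluations(n_features):
    # n + (n - 1) + ... + 1 models when one feature is removed at a time.
    return n_features * (n_features + 1) // 2


eda = eda_evaluations(500, 20)
for n in (528, 2096):
    sbe = sbe_evaluations(n)
    print(n, sbe, eda, round(sbe / eda, 1))
# 528 139656 10500 13.3
# 2096 2197656 10500 209.3
```
         </p>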
         <p>Clearly, the EDA-R method scales better to datasets with many features. Another advantage of EDA-R is the fact that the number of model evaluations needed is not directly dependent on the number of features. It is only indirectly dependent through the classification algorithm that is used in the EDA-process.</p>
         <p>As both NBM and SVM scale well in the number of features used, the use of EDA-R with these models provides advantages over SBE and KS, both being quadratic in the number of features. As a consequence, the use of SBE and KS will turn out to be infeasible as the number of features gets larger, while the use of EDA-R will still produce results.</p>
         <p>As we already mentioned in the introduction, a key advantage of applying FSS methods is the extraction of knowledge from complex datasets. Using the different datasets mentioned earlier, we now discuss the advantages of the EDA-R approach to gain more insight into the classification of acceptor and donor splice sites.</p>
         <sec>
            <st>
               <p>Acceptor prediction</p>
            </st>
            <p>An important advantage of the EDA-R method, compared to the sequential backward wrapper and the filter method, is the fact that the relative frequencies of the features in the final distribution can be used as an importance measure, or feature weight. As a result, several gradations of the importance of features can be distinguished and visualised, which cannot be done using only a feature ranking.</p>
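            <p>As a minimal sketch of this idea, assume that the final distribution is represented by a population of binary inclusion masks (1 = feature selected). The weight of a feature is then its relative frequency over that population. The names <it>feature_weights</it> and <it>rank_features</it> and the toy population are hypothetical and serve only as illustration.</p>

```python
from typing import List, Sequence


def feature_weights(population: Sequence[Sequence[int]]) -> List[float]:
    """Relative frequency with which each feature is selected in the population."""
    if not population:
        raise ValueError("population must contain at least one individual")
    n_individuals = len(population)
    # zip(*population) iterates over features (columns) instead of individuals.
    return [sum(column) / n_individuals for column in zip(*population)]


def rank_features(weights: Sequence[float]) -> List[int]:
    """Feature indices ordered from highest to lowest weight."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)


# Toy population of four individuals over four features (made up).
population = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]
weights = feature_weights(population)
ranking = rank_features(weights)
print(weights, ranking)
```

            <p>In this toy case the weights are 1.0, 0.25, 0.75 and 0.0. A plain ranking would only report the order 0, 2, 1, 3, whereas the weights also show how far apart the features are.</p>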
            <p>To visualise the results of the EDA-R feature selection method, the feature weights can be color coded using a so-called heat map. On a heat map, the interval [0,1] is mapped to a color gradient running from blue (0) through green (0.5) to red (1). The results of such a color coding of the features for acceptor prediction are shown in Figure <figr fid="F3">3</figr>. In this figure, the features for dataset 1 (400 features, Figure <figr fid="F3">3A</figr>), dataset 2 (528 features, Figure <figr fid="F3">3B</figr>) and dataset 3 (2096 features, Figure <figr fid="F3">3C</figr>) are shown when EDA-R is combined with the linear SVM. Figure <figr fid="F3">3A</figr> represents the dataset containing only position-dependent nucleotides. For each of the four nucleotides (shown as four rows), each column represents a position in the local context of the acceptor site (the upstream part is shown on the left and the downstream part on the right). Figure <figr fid="F3">3B</figr> shows the features of the second dataset, which also contains the position-invariant trimers. For each part of the context (upstream, downstream), the trimers are grouped according to their composition: the first four columns represent trimers with a bias towards the respective nucleotides A, T, C and G, and the last two columns represent the remaining trimers. In Figure <figr fid="F3">3C</figr>, the position-dependent dimers are included, where each row again represents a specific dimer. The color gradients for each of the three datasets clearly reveal some insightful patterns.</p>
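            <p>The color coding can be sketched as a piecewise linear interpolation between blue, green and red. The exact gradient used for the figures is not specified here, so the linear interpolation in RGB space and the function name <it>heat_color</it> are assumptions made for illustration.</p>

```python
from typing import Tuple


def heat_color(weight: float) -> Tuple[int, int, int]:
    """Map a weight in [0, 1] to (R, G, B): blue at 0, green at 0.5, red at 1."""
    if weight > 1.0 or 0.0 > weight:
        raise ValueError("weight must lie in the interval [0, 1]")
    if weight >= 0.5:
        t = (weight - 0.5) / 0.5  # 0 at green, 1 at red
        return (round(255 * t), round(255 * (1 - t)), 0)
    t = weight / 0.5  # 0 at blue, 1 at green
    return (0, round(255 * t), round(255 * (1 - t)))


for w in (0.0, 0.5, 1.0):
    print(w, heat_color(w))
```

            <p>With this mapping a weight of 0 gives pure blue (0, 0, 255), 0.5 gives pure green (0, 255, 0) and 1 gives pure red (255, 0, 0).</p>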
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Visualization of EDA-R feature weights for acceptor prediction</p>
               </caption>
               <text>
                  <p><b>Visualization of EDA-R feature weights for acceptor prediction. </b>For each of the three datasets, the color coded feature weights as a result of the EDA-R feature selection in combination with a linear SVM are shown. (A) The simplest dataset (only position dependent nucleotides, 400 binary features). (B) The extended dataset (position dependent nucleotides + position invariant 3-mers, 528 binary features). (C) The most complex dataset (also including position dependent dinucleotides, 2096 binary features).</p>
               </text>
               <graphic file="1471-2105-5-64-3"/>
            </fig>
            <p>For example, the bases flanking the acceptor site turn out to be of key importance in distinguishing true sites from pseudo sites. These features represent the consensus around the acceptor site. Also note the importance of the dimer features in the immediate neighbourhood of the splice site, capturing local dependencies.</p>
            <p>The existence of a poly-pyrimidine (nucleotides C and T) stretch in the upstream part (about 20 nucleotides) of the acceptor also appears to be a strong feature. Further, it can be noticed that in this pyrimidine stretch, the nucleotide T is of higher importance than the C. This fits with the current knowledge on spliceosomal splicing first documented in yeast and mammals <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Even if T-rich sequences are reported to be spread all over the introns in plants <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, our observation indicates that poly-pyrimidine tracts do play a specific role in acceptor recognition in plants as well. Another feature related to the poly-pyrimidine tract is the importance of TG-dinucleotides upstream of the acceptor (Figure <figr fid="F3">3C</figr>). A position-frequency plot of this dinucleotide is shown in Figure <figr fid="F4">4</figr>, from which we can conclude that the TG is more abundant in true acceptors.</p>
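            <p>A position-frequency profile of this kind can be computed as sketched below, assuming that the upstream contexts are aligned strings of equal length. The function name <it>dinucleotide_percentage</it> is hypothetical and the toy sequences are made up; they are not taken from the datasets used in this study.</p>

```python
from typing import List, Sequence


def dinucleotide_percentage(contexts: Sequence[str], dinucleotide: str) -> List[float]:
    """Percentage of sequences that carry `dinucleotide` at each start position."""
    if not contexts:
        raise ValueError("need at least one sequence")
    length = len(contexts[0])
    if any(len(seq) != length for seq in contexts):
        raise ValueError("all sequences must have the same length")
    k = len(dinucleotide)
    percentages = []
    for pos in range(length - k + 1):
        hits = sum(seq[pos:pos + k] == dinucleotide for seq in contexts)
        percentages.append(100.0 * hits / len(contexts))
    return percentages


# Toy aligned upstream contexts (illustrative only).
real = ["TTTGC", "CTTGT", "TTTGT"]
pseudo = ["ACGTA", "TGACC", "GGATC"]
real_profile = dinucleotide_percentage(real, "TG")
pseudo_profile = dinucleotide_percentage(pseudo, "TG")
print(real_profile)
print(pseudo_profile)
```

            <p>Plotting the two profiles against position yields a comparison of real and pseudo sites of the kind described here.</p>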
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>TG percentage upstream of the acceptor site</p>
               </caption>
               <text>
                  <p><b>TG percentage upstream of the acceptor site. </b>For both real and pseudo acceptor sites, the percentage of TG dinucleotides is shown as a function of the position upstream of the site. The closer to the acceptor, the more abundant this dinucleotide is in real acceptor sites.</p>
               </text>
               <graphic file="1471-2105-5-64-4"/>
            </fig>
            <p>The fact that the acceptor site is a boundary between a non-coding region (intron) and a coding region (exon) is also reflected in the features that are selected. A three-base periodicity in the features, especially for the bases G, T and C, can be observed downstream of the acceptor site, as expected for coding regions. Furthermore, some position invariant features are of great importance, shown by the fact that the periodic pattern becomes less apparent if position invariant features are considered. This illustrates the importance of the position invariant features in capturing codon bias.</p>
         </sec>
         <sec>
            <st>
               <p>AG-scanning feature</p>
            </st>
            <p>In the largest data set (Figure <figr fid="F3">3C</figr>), the dinucleotide "AG" appears as a very strong feature in the region up to about 25 positions upstream of the acceptor site. Naturally, in the local context of true acceptors, this dinucleotide should not appear in this region, because it is known that the acceptor site is usually the first "AG" following the branch point <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Selection against AG dinucleotides in the upstream part of true acceptors is shown in Figure <figr fid="F5">5</figr>, where the positional frequencies of the dinucleotide AG are compared for the true and pseudo acceptors. The prominence of this feature in this region points to the fact that in <it>Arabidopsis </it>the branch point should be at least about 25 positions upstream of the acceptor site, which fits with the &#177; 30 nt distance of branch points to acceptors, previously reported for plants <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
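            <p>The AG-scanning idea can be illustrated with a simple check for AG dinucleotides in a window upstream of a candidate acceptor. This is only a sketch: in the datasets the information is carried by position-dependent dinucleotide features, whereas the helper functions below (<it>upstream_ag_distances</it>, <it>passes_ag_scanning</it>) are hypothetical, and the window of 25 positions is taken from the observation above. The example sequences are made up.</p>

```python
from typing import List


def upstream_ag_distances(sequence: str, acceptor_index: int, window: int = 25) -> List[int]:
    """Distances (in nt) from the acceptor AG to each AG starting in the upstream window.

    `acceptor_index` is the index of the A of the candidate acceptor AG.
    """
    start = max(0, acceptor_index - window)
    upstream = sequence[start:acceptor_index]
    distances = []
    for i in range(len(upstream) - 1):
        if upstream[i:i + 2] == "AG":
            distances.append(acceptor_index - (start + i))
    return distances


def passes_ag_scanning(sequence: str, acceptor_index: int, window: int = 25) -> bool:
    """True if no AG occurs in the upstream window of the candidate acceptor."""
    return not upstream_ag_distances(sequence, acceptor_index, window)


# Toy sequences; the candidate acceptor AG starts at index 22 in both.
clean = "TTCTAACTTTTCTTTTCCTTTCAGGTACC"
with_upstream_ag = "TTCAGACTTTTCTTTTCCTTTCAGGTACC"
print(passes_ag_scanning(clean, 22))
print(upstream_ag_distances(with_upstream_ag, 22))
```

            <p>For the first toy sequence no upstream AG is found, while the second contains an AG 19 positions upstream of the candidate site, which under the AG-scanning view makes it a less plausible acceptor.</p>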
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>AG percentage upstream of the acceptor site</p>
               </caption>
               <text>
                  <p><b>AG percentage upstream of the acceptor site. </b>For both real and pseudo acceptor sites, the percentage of AG dinucleotides is shown as a function of the position upstream of the site. The closer to the acceptor, the more this dinucleotide is selected against in real acceptor sites.</p>
               </text>
               <graphic file="1471-2105-5-64-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Donor prediction</p>
            </st>
            <p>A similar analysis was done for donor sites. The results for the most complex dataset (2096 features) are shown in Figure <figr fid="F6">6</figr>. Analogous to acceptor prediction, the strongest features are the ones that represent the consensus sequence around the donor site, both for the position dependent nucleotides and dinucleotides. Also in this case, some of the position invariant features are highly relevant for classification, for example the T-rich trimers TTA and TTT in the downstream part of the context, capturing the T-richness of introns in <it>Arabidopsis</it>.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Visualization of EDA-R feature weights for donor prediction</p>
               </caption>
               <text>
                  <p><b>Visualization of EDA-R feature weights for donor prediction. </b>For the most complex of the three datasets (2096 binary features), the color coded feature weights, resulting from the combination of EDA-R with a linear SVM, are shown. The interpretation is similar to Figure <figr fid="F3">3</figr>.</p>
               </text>
               <graphic file="1471-2105-5-64-6"/>
            </fig>
            <p>Another pattern that can be clearly observed is the importance of G immediately downstream of the donor site. A position-frequency plot of the G-percentage in the downstream part of the donor site is shown in Figure <figr fid="F7">7</figr>. From this figure we can learn that the nucleotide G is significantly under-represented in the case of real donor sites, compared to pseudo sites, except at position +3, where a G is over-represented as part of the consensus sequence.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>G percentage downstream of the donor site</p>
               </caption>
               <text>
                  <p><b>G percentage downstream of the donor site. </b>For both real and pseudo donor sites, the percentage of G nucleotides is shown as a function of the position downstream of the site. The closer to the donor, the less G is tolerated. The only exception occurs at position +3, where a G is clearly over-represented, as part of the donor consensus sequence.</p>
               </text>
               <graphic file="1471-2105-5-64-7"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>The results discussed in this paper show that feature subset selection using EDA-based ranking provides a robust framework for feature selection in splice site prediction. We presented a method that is easy to implement, can be easily parallelized, and is scalable to larger feature sets. This was achieved at no expense of efficiency. The method can be used for any other optimisation problem where the feature set is sufficiently large, such as gene selection in microarray datasets.</p>
         <p>An important advantage of our method (EDA-R) is the derivation of feature weights, which is shown to be useful to extract knowledge from complex data. The most prominent example of this was the detection of a new, biologically motivated feature for acceptor prediction, which we termed AG-scanning. Because the knowledge on splicing mechanisms in plants is still limited <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, new findings such as those discussed here could lead both to advances in gene prediction and to biologically relevant insights into the mechanisms behind transcription.</p>
         <p>Future research on splice site prediction will focus on larger feature sets, including additional information such as structural information to achieve better results. Other future directions we would like to explore are the combination of EDAs with other classification systems, and the development of more complex features that capture other nucleotide dependencies at the feature level.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>YS designed the EDA-R procedure and ran the experiments. SD prepared the datasets in this study. DA helped in the mathematical part of the research and PR and YVdP provided the biological interpretation and supervised the research. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>New Methods for Splice Site recognition</p>
            </title>
            <aug>
               <au>
                  <snm>Sonnenburg</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Diploma thesis, Humboldt-Universit&#228;t zu Berlin</source>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Current methods of gene prediction, their strengths and weaknesses</p>
            </title>
            <aug>
               <au>
                  <snm>Math&#233;</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sagot</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Schiex</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rouz&#233;</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>4103</fpage>
            <lpage>4117</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/gkf543</pubid>
                  <pubid idtype="pmpid" link="fulltext">12364589</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Computational prediction of eukaryotic protein-coding genes</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>MQ</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>698</fpage>
            <lpage>709</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg890</pubid>
                  <pubid idtype="pmpid" link="fulltext">12209144</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Maximum entropy modelling of short sequence motifs with applications to RNA splicing signals</p>
            </title>
            <aug>
               <au>
                  <snm>Yeo</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Burge</snm>
                  <fnm>CB</fnm>
               </au>
            </aug>
            <source>In Proceedings of RECOMB 2003</source>
            <pubdate>2003</pubdate>
            <fpage>322</fpage>
            <lpage>331</lpage>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Sequence Information for the Splicing of Human pre-mRNA Identified by Support Vector Machine Classification</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Heller</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hefter</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Leslie</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Chasin</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>2637</fpage>
            <lpage>2650</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.1679003</pubid>
                  <pubid idtype="pmpid" link="fulltext">14656968</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Comparison of algorithms that select features for pattern classifiers</p>
            </title>
            <aug>
               <au>
                  <snm>Kudo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sklansky</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Pattern Recogn</source>
            <pubdate>2000</pubdate>
            <volume>33</volume>
            <fpage>25</fpage>
            <lpage>41</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0031-3203(99)00041-2</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>On automatic feature selection</p>
            </title>
            <aug>
               <au>
                  <snm>Siedlecki</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Sklansky</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Int J Pattern Recogn</source>
            <pubdate>1988</pubdate>
            <volume>2</volume>
            <fpage>197</fpage>
            <lpage>220</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Robust feature selection algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Vafaie</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>De Jong</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>In Proceedings of the Fifth International Conference on Tools with Artificial Intelligence</source>
            <pubdate>1993</pubdate>
            <fpage>356</fpage>
            <lpage>363</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>From recombination of genes to the estimation of distributions. Binary parameters</p>
            </title>
            <aug>
               <au>
                  <snm>M&#252;hlenbein</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Paass</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>In Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature, PPSN IV</source>
            <pubdate>1996</pubdate>
            <fpage>178</fpage>
            <lpage>187</lpage>
         </bibl>
         <bibl id="B10">
            <aug>
               <au>
                  <snm>Larra&#241;aga</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lozano</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation</source>
            <publisher>Kluwer Academic Publishers</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Combinatorial Optimization by Learning and Simulation of Bayesian Networks</p>
            </title>
            <aug>
               <au>
                  <snm>Larra&#241;aga</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Etxebarria</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lozano</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Pe&#241;a</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00)</source>
            <publisher>Morgan Kaufmann Publishers</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Feature subset selection by Bayesian networks based optimization</p>
            </title>
            <aug>
               <au>
                  <snm>Inza</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Larra&#241;aga</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Etxebarria</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sierra</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Artif Intell</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>143</fpage>
            <lpage>164</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Fast feature selection using a simple Estimation of Distribution Algorithm: A case study on splice site prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Saeys</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Degroeve</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Aeyels</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Rouz&#233;</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>Suppl 2</issue>
            <fpage>II179</fpage>
            <lpage>II188</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14534188</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Feature subset selection for splice site prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Degroeve</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>De Baets</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Rouz&#233;</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 2</issue>
            <fpage>S75</fpage>
            <lpage>S83</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12385987</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>The equation for response to selection and its use for prediction</p>
            </title>
            <aug>
               <au>
                  <snm>M&#252;hlenbein</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Evol Comput</source>
            <pubdate>1997</pubdate>
            <volume>5</volume>
            <fpage>303</fpage>
            <lpage>346</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10021762</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>The compact genetic algorithm</p>
            </title>
            <aug>
               <au>
                  <snm>Harik</snm>
                  <fnm>GR</fnm>
               </au>
               <au>
                  <snm>Lobo</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Goldberg</snm>
                  <fnm>DE</fnm>
               </au>
            </aug>
            <source>In Proceedings of the International Conference on Evolutionary Computation</source>
            <pubdate>1998</pubdate>
            <fpage>523</fpage>
            <lpage>528</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Feature subset selection by estimation of distribution algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Cant&#250;-Paz</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>In Proceedings of the Genetic and Evolutionary Computation Conference</source>
            <pubdate>2002</pubdate>
            <fpage>754</fpage>
            <lpage>761</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>A training algorithm for optimal margin classifiers</p>
            </title>
            <aug>
               <au>
                  <snm>Boser</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Guyon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>In Proceedings of COLT</source>
            <pubdate>1992</pubdate>
            <fpage>144</fpage>
            <lpage>152</lpage>
         </bibl>
         <bibl id="B19">
            <aug>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>The nature of statistical learning theory. Springer-Verlag</source>
            <pubdate>1995</pubdate>
         </bibl>
         <bibl id="B20">
            <aug>
               <au>
                  <snm>Duda</snm>
                  <fnm>RO</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>PE</fnm>
               </au>
            </aug>
            <source>Pattern Classification and scene analysis</source>
            <publisher>New York, NY, Wiley</publisher>
            <pubdate>1973</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Induction of selective Bayesian classifiers</p>
            </title>
            <aug>
               <au>
                  <snm>Langley</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sage</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence</source>
            <pubdate>1994</pubdate>
            <fpage>399</fpage>
            <lpage>406</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Gene Selection for Cancer Classification using Support Vector Machines</p>
            </title>
            <aug>
               <au>
                  <snm>Guyon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Weston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Barnhill</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>Mach Learn</source>
            <pubdate>2002</pubdate>
            <volume>46</volume>
            <fpage>389</fpage>
            <lpage>422</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1012487302797</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Wrappers for feature subset selection</p>
            </title>
            <aug>
               <au>
                  <snm>Kohavi</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>John</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Artif Intell</source>
            <pubdate>1997</pubdate>
            <volume>97</volume>
            <fpage>273</fpage>
            <lpage>324</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0004-3702(97)00043-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Toward optimal feature selection</p>
            </title>
            <aug>
               <au>
                  <snm>Koller</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Sahami</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>In Proceedings of the 13th International Conference on Machine Learning</source>
            <pubdate>1996</pubdate>
            <fpage>284</fpage>
            <lpage>292</lpage>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Feature selection on hierarchy of web documents</p>
            </title>
            <aug>
               <au>
                  <snm>Mladenic</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Grobelnik</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Decis Support Syst</source>
            <pubdate>2003</pubdate>
            <volume>35</volume>
            <fpage>45</fpage>
            <lpage>87</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0167-9236(02)00097-0</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Making large-scale support vector machine learning practical</p>
            </title>
            <aug>
               <au>
                  <snm>Joachims</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Advances in Kernel Methods: Support Vector Machines</source>
            <publisher>Cambridge, MA: MIT Press</publisher>
            <editor>Schoelkopf B, Burges C</editor>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B27">
            <title>
               <p>MPI libraries</p>
            </title>
            <url>http://www-unix.mcs.anl.gov/mpi/mpich</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Allosteric cascade of spliceosome activation</p>
            </title>
            <aug>
               <au>
                  <snm>Brow</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Annu Rev Genet</source>
            <pubdate>2002</pubdate>
            <volume>36</volume>
            <fpage>333</fpage>
            <lpage>360</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1146/annurev.genet.36.043002.091635</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Pre-mRNA splicing in higher plants</p>
            </title>
            <aug>
               <au>
                  <snm>Lorkovic</snm>
                  <fnm>ZJ</fnm>
               </au>
               <au>
                  <snm>Wieczorek</snm>
                  <fnm>KDA</fnm>
               </au>
               <au>
                  <snm>Lambermon</snm>
                  <fnm>MH</fnm>
               </au>
               <au>
                  <snm>Filipowicz</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Trends Plant Sci</source>
            <pubdate>2000</pubdate>
            <volume>5</volume>
            <fpage>160</fpage>
            <lpage>167</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S1360-1385(00)01595-8</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Scanning and competition between AGs are involved in 3' splice site selection in mammalian introns</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>CWJ</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>TT</fnm>
               </au>
               <au>
                  <snm>Nadal-Ginard</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Mol Cell Biol</source>
            <pubdate>1993</pubdate>
            <volume>13</volume>
            <fpage>4939</fpage>
            <lpage>4952</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8336728</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
