<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-3</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Gene selection and classification of microarray data using random forest</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>D&#237;az-Uriarte</snm>
               <fnm>Ram&#243;n</fnm>
               <insr iid="I1"/>
               <email>rdiaz@ligarto.org</email>
            </au>
            <au id="A2">
               <snm>Alvarez de Andr&#233;s</snm>
               <mnm/>
               <fnm>Sara</fnm>
               <insr iid="I2"/>
               <email>salvarez@cnio.es</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fern&#225;ndez Almagro 3, Madrid, 28029, Spain</p>
            </ins>
            <ins id="I2">
               <p>Cytogenetics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fern&#225;ndez Almagro 3, Madrid, 28029, Spain</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>3</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/3</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16398926</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-3</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>08</day>
               <month>7</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>06</day>
               <month>1</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>06</day>
               <month>1</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>D&#237;az-Uriarte and Alvarez de Andr&#233;s; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Selection of relevant genes for sample classification (e.g., to differentiate between patients with and without cancer) is a common task in most gene expression studies (e.g., <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>). When facing gene selection problems, biomedical researchers often show interest in one of the following objectives:</p>
         <p>1. To identify relevant genes for subsequent research; this involves obtaining a (probably large) set of genes that are related to the outcome of interest, and this set should include genes even if they perform similar functions and are highly correlated.</p>
         <p>2. To identify small sets of genes that could be used for diagnostic purposes in clinical practice; this involves obtaining the smallest possible set of genes that can still achieve good predictive performance (thus, "redundant" genes should not be selected).</p>
         <p>We will focus here on the second objective. Most gene selection approaches in class prediction problems combine ranking genes (e.g., using an <it>F</it>-ratio or a Wilcoxon statistic) with a specific classifier (e.g., discriminant analysis, nearest neighbor). Selecting an optimal number of features to use for classification is a complicated task, although some preliminary guidelines, based on simulation studies by <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, are available. Frequently an arbitrary decision as to the number of genes to retain is made (e.g., keep the 50 best ranked genes and use them with a linear discriminant analysis as in <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B7">7</abbr></abbrgrp>; keep the best 150 genes as in <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>). This approach, although it can be appropriate when the only objective is to classify samples, is not the most appropriate if the objective is to obtain the smallest possible sets of genes that will allow good predictive performance. Another common approach, with many variants (e.g., <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>), is to repeatedly apply the same classifier over progressively smaller sets of genes (where we exclude genes based either on the ranking statistic or on the effect of the elimination of a gene on error rate) until a satisfactory solution is achieved (often the smallest error rate over all sets of genes tried). A potential problem of this second approach, if the elimination is based on univariate rankings, is that the ranking of a gene is computed in isolation from all other genes, or at most in combinations of pairs of genes <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, and without any direct relation to the classification algorithm that will later be used to obtain the class predictions.
Finally, the problem of gene selection is generally regarded as much more problematic in multi-class situations (where there are three or more classes to be differentiated), as evidenced by recent papers in this area (e.g., <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B8">8</abbr></abbrgrp>). Therefore, classification algorithms that directly provide measures of variable importance (related to the relevance of the variable in the classification) are of great interest for gene selection, especially if the classification algorithm itself presents features that make it well suited for the types of problems frequently faced with microarray data. Random forest is one such algorithm.</p>
         <p>Random forest is an algorithm for classification developed by Leo Breiman <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> that uses an ensemble of classification trees <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. Each of the classification trees is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging (bootstrap aggregation), a successful approach for combining unstable learners <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>, and random variable selection for tree building. Each tree is unpruned (grown fully), so as to obtain low-bias trees; at the same time, bagging and random variable selection result in low correlation of the individual trees. The algorithm yields an ensemble that can achieve both low bias and low variance (from averaging over a large ensemble of low-bias, high-variance but low correlation trees).</p>
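         <p>The two sources of randomness described above (a bootstrap sample per tree and a random subset of candidate variables per split) can be illustrated with a minimal sketch. It does not build any trees and is not the implementation used in this paper; the function names are hypothetical, and the sample and gene counts are taken from the leukemia row of Table 1 purely for illustration.</p>
         <p><![CDATA[
```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement).

    Returns the in-bag indices used to grow a tree and the
    out-of-bag indices that the tree never sees.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def candidate_variables(n_variables, mtry, rng):
    """Pick, without replacement, the variables examined at one split."""
    return rng.sample(range(n_variables), mtry)


# Illustrative sizes only (assumption: leukemia data set, Table 1).
n_samples, n_genes, mtry = 38, 3051, 55
rng = random.Random(0)

in_bag, out_of_bag = bootstrap_split(n_samples, rng)
subset = candidate_variables(n_genes, mtry, rng)

print("in-bag samples:", len(in_bag))
print("out-of-bag samples:", len(out_of_bag))
print("variables tried at this split:", len(subset))
```
]]></p>
         <p>In a full forest these two steps would be repeated for every tree and every split, respectively; on average roughly one third of the samples are out-of-bag for any given tree.</p>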
         <p>Random forest has excellent performance in classification tasks, comparable to support vector machines. Although random forest is not widely used in the microarray literature (but see <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>), it has several characteristics that make it ideal for these data sets:</p>
         <p>a) Can be used when there are many more variables than observations.</p>
         <p>b) Can be used both for two-class problems and for multi-class problems (more than two classes).</p>
         <p>c) Has good predictive performance even when most predictive variables are noise, and therefore it does not require a pre-selection of genes (i.e., "shows strong robustness with respect to large feature sets", <it>sensu </it><abbrgrp><abbr bid="B4">4</abbr></abbrgrp>).</p>
         <p>d) Does not overfit as more trees are added to the forest.</p>
         <p>e) Can handle a mixture of categorical and continuous predictors.</p>
         <p>f) Incorporates interactions among predictor variables.</p>
         <p>g) The output is invariant to monotone transformations of the predictors.</p>
         <p>h) There are high quality and free implementations: the original Fortran code from L. Breiman and A. Cutler, and an R package from A. Liaw and M. Wiener <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>.</p>
         <p>i) Returns measures of variable (gene) importance.</p>
         <p>j) There is little need to fine-tune parameters to achieve excellent performance. The most important parameter to choose is <it>mtry</it>, the number of input variables tried at each split, but it has been reported that the default value is often a good choice <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. In addition, the user needs to decide how many trees to grow for each forest (<it>ntree</it>) as well as the minimum size of the terminal nodes (<it>nodesize</it>). These three parameters will be thoroughly examined in this paper.</p>
         <p>Given these promising features, it is important to understand the performance of random forest compared to alternative state-of-the-art prediction methods with microarray data, as well as the effects of changes in the parameters of random forest. In this paper we present, as necessary background for the main topic of the paper (gene selection), the first thorough examination of these issues, including evaluating the effects of <it>mtry</it>, <it>ntree </it>and <it>nodesize </it>on error rate using nine real microarray data sets and simulated data.</p>
         <p>The main question addressed in this paper is gene selection using random forest. A few authors have previously used variable selection with random forest. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> and <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> use filtering approaches and, thus, do not take advantage of the measures of variable importance returned by random forest as part of the algorithm. Svetnik, Liaw, Tong and Wang <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> propose a method that is somewhat similar to our approach. The main difference is that <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> first find the "best" dimension (<it>p</it>) of the model, and then choose the <it>p </it>most important variables. This is a sound strategy when the objective is to build accurate predictors, without any regard for model interpretability. However, it might not be the most appropriate strategy for our purposes, as it shifts the emphasis away from selection of specific genes, and in genomic studies the identity of the selected genes is relevant (e.g., to understand molecular pathways or to find targets for drug development).</p>
         <p>The last issue addressed in this paper is the multiplicity (or lack of uniqueness or lack of stability) problem. Variable selection with microarray data can lead to many solutions that are equally good from the point of view of prediction rates, but that share few common genes. This multiplicity problem has been emphasized by <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> and <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> and recent examples are shown in <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> and <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Although multiplicity of results is not a problem when the only objective of our method is prediction, it casts serious doubts on the biological interpretability of the results <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. Unfortunately most "methods papers" in bioinformatics do not evaluate the stability of the results obtained, leading to a false sense of trust in the biological interpretability of the output obtained. Our paper presents a thorough and critical evaluation of the stability of the lists of selected genes with the proposed (and two competing) methods.</p>
         <p>In this paper we present the first comprehensive evaluation of random forest for classification problems with microarray data, including an assessment of the effects of changes in its parameters, and we show it to be an excellent performer even in multi-class problems, without any need to fine-tune parameters or pre-select relevant genes. We then propose a new method for gene selection in classification problems (for both two-class and multi-class problems) that uses random forest; the main advantage of this method is that it returns very small sets of genes that retain a high predictive accuracy, and it is competitive with existing methods of gene selection.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Evaluation of performance and comparisons with alternative approaches</p>
            </st>
            <p>We have used both simulated and real microarray data sets to evaluate the variable selection procedure. For the real data sets, the original reference paper and main features are shown in Table <tblr tid="T1">1</tblr> and further details are provided in the supplementary material [see <supplr sid="S1">Additional file 1</supplr>]. To evaluate if the proposed procedure can recover the signal in the data and can eliminate redundant genes, we need to use simulated data, so that we know exactly which genes are relevant. Details on the simulated data are provided in the methods and in the supplementary material [see <supplr sid="S1">Additional file 1</supplr>].</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Main characteristics of the microarray data sets used</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Dataset</p>
                     </c>
                     <c ca="center">
                        <p>Original ref.</p>
                     </c>
                     <c ca="center">
                        <p>Genes</p>
                     </c>
                     <c ca="center">
                        <p>Patients</p>
                     </c>
                     <c ca="center">
                        <p>Classes</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>[44]</p>
                     </c>
                     <c ca="center">
                        <p>3051</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>[9]</p>
                     </c>
                     <c ca="center">
                        <p>4869</p>
                     </c>
                     <c ca="center">
                        <p>78</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>[9]</p>
                     </c>
                     <c ca="center">
                        <p>4869</p>
                     </c>
                     <c ca="center">
                        <p>96</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>[61]</p>
                     </c>
                     <c ca="center">
                        <p>5244</p>
                     </c>
                     <c ca="center">
                        <p>61</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocarcinoma</p>
                     </c>
                     <c ca="center">
                        <p>[62]</p>
                     </c>
                     <c ca="center">
                        <p>9868</p>
                     </c>
                     <c ca="center">
                        <p>76</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>[63]</p>
                     </c>
                     <c ca="center">
                        <p>5597</p>
                     </c>
                     <c ca="center">
                        <p>42</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>[64]</p>
                     </c>
                     <c ca="center">
                        <p>2000</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>[65]</p>
                     </c>
                     <c ca="center">
                        <p>4026</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>[66]</p>
                     </c>
                     <c ca="center">
                        <p>6033</p>
                     </c>
                     <c ca="center">
                        <p>102</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>[67]</p>
                     </c>
                     <c ca="center">
                        <p>2308</p>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p>A PDF file with additional results, showing error rates and stability for simulated data under various parameters, as well as error rates and stabilities for the real microarray data with other parameters, and further details on the data sets, simulations, and alternative methods.</p>
               </text>
               <file name="1471-2105-7-3-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>We have compared the predictive performance of the variable selection approach with: a) random forest without any variable selection (using <m:math name="1471-2105-7-3-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow><m:mi>mtry</m:mi><m:mo>=</m:mo><m:msqrt><m:mtext>number of genes</m:mtext></m:msqrt></m:mrow></m:math>, <it>ntree </it>= 5000, <it>nodesize </it>= 1); b) three other methods that have shown good performance in reviews of classification methods with microarray data <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp> but that do not include any variable selection; c) three methods that carry out variable selection. For the three methods that do not carry out variable selection, <b>Diagonal Linear Discriminant Analysis (DLDA)</b>, <b>K nearest neighbor (KNN)</b>, and <b>Support Vector Machines (SVM) </b>with linear kernel, we have used, based on <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, the 200 genes with the largest <it>F</it>-ratio of between to within groups sums of squares. For <b>KNN</b>, the number of neighbors (<it>K</it>) was chosen by cross-validation as in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The methods that incorporate variable selection are two different versions of <b>Shrunken centroids (SC) </b><abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, <b>SC.l </b>and <b>SC.s</b>, as well as <b>Nearest neighbor + variable selection (NN.vs)</b>; further details are provided in the methods and in the supplementary material [see <supplr sid="S1">Additional file 1</supplr>].</p>
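            <p>As a small worked example of the setting mtry equal to the square root of the number of genes, the sketch below computes that value for each data set in Table 1. Rounding the square root down to an integer is an assumption made here for illustration; the helper name is hypothetical and the gene counts are copied from Table 1.</p>
            <p><![CDATA[
```python
import math

# Number of genes per data set, copied from Table 1.
GENES = {
    "Leukemia": 3051,
    "Breast": 4869,
    "NCI 60": 5244,
    "Adenocarcinoma": 9868,
    "Brain": 5597,
    "Colon": 2000,
    "Lymphoma": 4026,
    "Prostate": 6033,
    "Srbct": 2308,
}


def default_mtry(n_genes):
    """Square root of the number of genes, rounded down (assumed rounding)."""
    return math.isqrt(n_genes)


mtry_by_dataset = {name: default_mtry(n) for name, n in GENES.items()}

for name, mtry in mtry_by_dataset.items():
    print(f"{name}: {GENES[name]} genes, mtry = {mtry}")
```
]]></p>
            <p>Even for the largest data set, fewer than 100 genes are therefore candidates at any single split, which keeps each split cheap while the forest as a whole still samples widely across the genes.</p>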
         </sec>
         <sec>
            <st>
               <p>Estimation of error rates</p>
            </st>
            <p>To estimate the prediction error rate of all methods we have used the .632+ bootstrap method <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>. The .632+ bootstrap method uses a weighted average of the resubstitution error (the error when a classifier is applied to the training data) and the error on samples not used to train the predictor (the "leave-one-out" bootstrap error); this average is weighted by a quantity that reflects the amount of overfitting. It must be emphasized that the error rate used when performing variable selection is not what we report as the prediction error rate in Tables <tblr tid="T2">2</tblr> or <tblr tid="T3">3</tblr>. To calculate the prediction error rate as reported, for example, in Table <tblr tid="T2">2</tblr>, the .632+ bootstrap method is applied to the complete procedure, and thus the samples used to compute the leave-one-out bootstrap error used in the .632+ method are samples that are not used when fitting the random forest, or carrying out variable selection. The .632+ bootstrap method was also used when evaluating the competing methods.</p>
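            <p>The weighted average just described can be sketched as follows. This is one common formulation of the .632+ estimator, not necessarily the exact computation used for the tables in this paper: the no-information error rate (called gamma below) is taken as an input, and the handling of edge cases is an assumption. The function name and the example values are hypothetical.</p>
            <p><![CDATA[
```python
def err_632_plus(resub_err, loo_boot_err, gamma):
    """Sketch of the .632+ bootstrap error estimate.

    resub_err    -- resubstitution error (classifier applied to its training data)
    loo_boot_err -- leave-one-out bootstrap error (samples not used for training)
    gamma        -- no-information error rate (assumed to be supplied by the caller)
    """
    # Do not let the leave-one-out bootstrap error exceed the no-information rate.
    loo = min(loo_boot_err, gamma)

    # Relative overfitting rate: 0 means no overfitting, 1 means the
    # classifier does no better than the no-information rate on unseen samples.
    if loo > resub_err and gamma > resub_err:
        overfit = (loo - resub_err) / (gamma - resub_err)
    else:
        overfit = 0.0

    # The weight grows from 0.632 (no overfitting) to 1 (complete overfitting).
    weight = 0.632 / (1.0 - 0.368 * overfit)
    return (1.0 - weight) * resub_err + weight * loo


# Made-up inputs, for illustration only.
example = err_632_plus(resub_err=0.0, loo_boot_err=0.1, gamma=0.5)
print(round(example, 4))
```
]]></p>
            <p>With no overfitting the estimate reduces to the plain .632 weighting of the two errors; as overfitting increases, the estimate moves toward the leave-one-out bootstrap error.</p>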
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Error rates (estimated using the 0.632+ bootstrap method with 200 bootstrap samples) for the microarray data sets using different methods. The results shown for variable selection with random forest used <it>ntree </it>= 2000, <it>fraction.dropped </it>= 0.2, <it>mtryFactor </it>= 1. Note that the OOB error used for variable selection <it>is not </it>the error reported in this table; the error rate reported is obtained using bootstrap on the complete variable selection process. The column "no info" denotes the minimal error we can make if we use no information from the genes (i.e., we always bet on the most frequent class).</p>
               </caption>
               <tblbdy cols="11">
                  <r>
                     <c ca="left">
                        <p>Data set</p>
                     </c>
                     <c ca="center">
                        <p>no info</p>
                     </c>
                     <c ca="center">
                        <p>SVM</p>
                     </c>
                     <c ca="center">
                        <p>KNN</p>
                     </c>
                     <c ca="center">
                        <p>DLDA</p>
                     </c>
                     <c ca="center">
                        <p>SC.l</p>
                     </c>
                     <c ca="center">
                        <p>SC.s</p>
                     </c>
                     <c ca="center">
                        <p>NN.vs</p>
                     </c>
                     <c ca="center">
                        <p>random forest</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>random forest var.sel.</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>s.e. 0</p>
                     </c>
                     <c ca="center">
                        <p>s.e. 1</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>0.289</p>
                     </c>
                     <c ca="center">
                        <p>0.014</p>
                     </c>
                     <c ca="center">
                        <p>0.029</p>
                     </c>
                     <c ca="center">
                        <p>0.020</p>
                     </c>
                     <c ca="center">
                        <p>0.025</p>
                     </c>
                     <c ca="center">
                        <p>0.062</p>
                     </c>
                     <c ca="center">
                        <p>0.056</p>
                     </c>
                     <c ca="center">
                        <p>0.051</p>
                     </c>
                     <c ca="center">
                        <p>0.087</p>
                     </c>
                     <c ca="center">
                        <p>0.075</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.429</p>
                     </c>
                     <c ca="center">
                        <p>0.325</p>
                     </c>
                     <c ca="center">
                        <p>0.337</p>
                     </c>
                     <c ca="center">
                        <p>0.331</p>
                     </c>
                     <c ca="center">
                        <p>0.324</p>
                     </c>
                     <c ca="center">
                        <p>0.326</p>
                     </c>
                     <c ca="center">
                        <p>0.337</p>
                     </c>
                     <c ca="center">
                        <p>0.342</p>
                     </c>
                     <c ca="center">
                        <p>0.337</p>
                     </c>
                     <c ca="center">
                        <p>0.332</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.537</p>
                     </c>
                     <c ca="center">
                        <p>0.380</p>
                     </c>
                     <c ca="center">
                        <p>0.449</p>
                     </c>
                     <c ca="center">
                        <p>0.370</p>
                     </c>
                     <c ca="center">
                        <p>0.396</p>
                     </c>
                     <c ca="center">
                        <p>0.401</p>
                     </c>
                     <c ca="center">
                        <p>0.424</p>
                     </c>
                     <c ca="center">
                        <p>0.351</p>
                     </c>
                     <c ca="center">
                        <p>0.346</p>
                     </c>
                     <c ca="center">
                        <p>0.364</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>0.852</p>
                     </c>
                     <c ca="center">
                        <p>0.256</p>
                     </c>
                     <c ca="center">
                        <p>0.317</p>
                     </c>
                     <c ca="center">
                        <p>0.286</p>
                     </c>
                     <c ca="center">
                        <p>0.256</p>
                     </c>
                     <c ca="center">
                        <p>0.246</p>
                     </c>
                     <c ca="center">
                        <p>0.237</p>
                     </c>
                     <c ca="center">
                        <p>0.252</p>
                     </c>
                     <c ca="center">
                        <p>0.327</p>
                     </c>
                     <c ca="center">
                        <p>0.353</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocar.</p>
                     </c>
                     <c ca="center">
                        <p>0.158</p>
                     </c>
                     <c ca="center">
                        <p>0.203</p>
                     </c>
                     <c ca="center">
                        <p>0.174</p>
                     </c>
                     <c ca="center">
                        <p>0.194</p>
                     </c>
                     <c ca="center">
                        <p>0.177</p>
                     </c>
                     <c ca="center">
                        <p>0.179</p>
                     </c>
                     <c ca="center">
                        <p>0.181</p>
                     </c>
                     <c ca="center">
                        <p>0.125</p>
                     </c>
                     <c ca="center">
                        <p>0.185</p>
                     </c>
                     <c ca="center">
                        <p>0.207</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>0.762</p>
                     </c>
                     <c ca="center">
                        <p>0.138</p>
                     </c>
                     <c ca="center">
                        <p>0.174</p>
                     </c>
                     <c ca="center">
                        <p>0.183</p>
                     </c>
                     <c ca="center">
                        <p>0.163</p>
                     </c>
                     <c ca="center">
                        <p>0.159</p>
                     </c>
                     <c ca="center">
                        <p>0.194</p>
                     </c>
                     <c ca="center">
                        <p>0.154</p>
                     </c>
                     <c ca="center">
                        <p>0.216</p>
                     </c>
                     <c ca="center">
                        <p>0.216</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>0.355</p>
                     </c>
                     <c ca="center">
                        <p>0.147</p>
                     </c>
                     <c ca="center">
                        <p>0.152</p>
                     </c>
                     <c ca="center">
                        <p>0.137</p>
                     </c>
                     <c ca="center">
                        <p>0.123</p>
                     </c>
                     <c ca="center">
                        <p>0.122</p>
                     </c>
                     <c ca="center">
                        <p>0.158</p>
                     </c>
                     <c ca="center">
                        <p>0.127</p>
                     </c>
                     <c ca="center">
                        <p>0.159</p>
                     </c>
                     <c ca="center">
                        <p>0.177</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>0.323</p>
                     </c>
                     <c ca="center">
                        <p>0.010</p>
                     </c>
                     <c ca="center">
                        <p>0.008</p>
                     </c>
                     <c ca="center">
                        <p>0.021</p>
                     </c>
                     <c ca="center">
                        <p>0.028</p>
                     </c>
                     <c ca="center">
                        <p>0.033</p>
                     </c>
                     <c ca="center">
                        <p>0.040</p>
                     </c>
                     <c ca="center">
                        <p>0.009</p>
                     </c>
                     <c ca="center">
                        <p>0.047</p>
                     </c>
                     <c ca="center">
                        <p>0.042</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>0.490</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                     <c ca="center">
                        <p>0.100</p>
                     </c>
                     <c ca="center">
                        <p>0.149</p>
                     </c>
                     <c ca="center">
                        <p>0.088</p>
                     </c>
                     <c ca="center">
                        <p>0.089</p>
                     </c>
                     <c ca="center">
                        <p>0.081</p>
                     </c>
                     <c ca="center">
                        <p>0.077</p>
                     </c>
                     <c ca="center">
                        <p>0.061</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>0.635</p>
                     </c>
                     <c ca="center">
                        <p>0.017</p>
                     </c>
                     <c ca="center">
                        <p>0.023</p>
                     </c>
                     <c ca="center">
                        <p>0.011</p>
                     </c>
                     <c ca="center">
                        <p>0.012</p>
                     </c>
                     <c ca="center">
                        <p>0.025</p>
                     </c>
                     <c ca="center">
                        <p>0.031</p>
                     </c>
                     <c ca="center">
                        <p>0.021</p>
                     </c>
                     <c ca="center">
                        <p>0.039</p>
                     </c>
                     <c ca="center">
                        <p>0.038</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected from the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: <it>mtryFactor </it>= 1, <it>s.e. </it>= 0, <it>ntree </it>= 2000, <it>ntreeIterat </it>= 1000, <it>fraction.dropped </it>= 0.2.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Data set</p>
                     </c>
                     <c ca="center">
                        <p>Error</p>
                     </c>
                     <c ca="center">
                        <p># Genes</p>
                     </c>
                     <c ca="center">
                        <p># Genes boot.</p>
                     </c>
                     <c ca="center">
                        <p>Freq. genes</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p>
                           <b>Backwards elimination of genes from random forest</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p><it>s.e. </it>= 0</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>0.087</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>2 (2, 2)</p>
                     </c>
                     <c ca="right">
                        <p>0.38 (0.29, 0.48)<sup>1 </sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.337</p>
                     </c>
                     <c ca="right">
                        <p>14</p>
                     </c>
                     <c ca="right">
                        <p>9 (5, 23)</p>
                     </c>
                     <c ca="right">
                        <p>0.15 (0.1, 0.28)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.346</p>
                     </c>
                     <c ca="right">
                        <p>110</p>
                     </c>
                     <c ca="right">
                        <p>14 (9, 31)</p>
                     </c>
                     <c ca="right">
                        <p>0.08 (0.04, 0.13)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>0.327</p>
                     </c>
                     <c ca="right">
                        <p>230</p>
                     </c>
                     <c ca="right">
                        <p>60 (30, 94)</p>
                     </c>
                     <c ca="right">
                        <p>0.1 (0.06, 0.19)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocar.</p>
                     </c>
                     <c ca="center">
                        <p>0.185</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>3 (2, 8)</p>
                     </c>
                     <c ca="right">
                        <p>0.14 (0.12, 0.15)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>0.216</p>
                     </c>
                     <c ca="right">
                        <p>22</p>
                     </c>
                     <c ca="right">
                        <p>14 (7, 22)</p>
                     </c>
                     <c ca="right">
                        <p>0.18 (0.09, 0.25)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>0.159</p>
                     </c>
                     <c ca="right">
                        <p>14</p>
                     </c>
                     <c ca="right">
                        <p>5 (3, 12)</p>
                     </c>
                     <c ca="right">
                        <p>0.29 (0.19, 0.42)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>0.047</p>
                     </c>
                     <c ca="right">
                        <p>73</p>
                     </c>
                     <c ca="right">
                        <p>14 (4, 58)</p>
                     </c>
                     <c ca="right">
                        <p>0.26 (0.18, 0.38)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>0.061</p>
                     </c>
                     <c ca="right">
                        <p>18</p>
                     </c>
                     <c ca="right">
                        <p>5 (3, 14)</p>
                     </c>
                     <c ca="right">
                        <p>0.22 (0.17, 0.43)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>0.039</p>
                     </c>
                     <c ca="right">
                        <p>101</p>
                     </c>
                     <c ca="right">
                        <p>18 (11, 27)</p>
                     </c>
                     <c ca="right">
                        <p>0.1 (0.04, 0.29)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p><it>s.e. </it>= 1</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>0.075</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>2 (2, 2)</p>
                     </c>
                     <c ca="right">
                        <p>0.4 (0.32, 0.5)<sup>1 </sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.332</p>
                     </c>
                     <c ca="right">
                        <p>14</p>
                     </c>
                     <c ca="right">
                        <p>4 (2, 7)</p>
                     </c>
                     <c ca="right">
                        <p>0.12 (0.07, 0.17)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.364</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>7 (4, 14)</p>
                     </c>
                     <c ca="right">
                        <p>0.27 (0.22, 0.31)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>0.353</p>
                     </c>
                     <c ca="right">
                        <p>24</p>
                     </c>
                     <c ca="right">
                        <p>30 (19, 60)</p>
                     </c>
                     <c ca="right">
                        <p>0.26 (0.17, 0.38)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocar.</p>
                     </c>
                     <c ca="center">
                        <p>0.207</p>
                     </c>
                     <c ca="right">
                        <p>8</p>
                     </c>
                     <c ca="right">
                        <p>3 (2, 5)</p>
                     </c>
                     <c ca="right">
                        <p>0.06 (0.03, 0.12)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>0.216</p>
                     </c>
                     <c ca="right">
                        <p>9</p>
                     </c>
                     <c ca="right">
                        <p>14 (7, 22)</p>
                     </c>
                     <c ca="right">
                        <p>0.26 (0.14, 0.46)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>0.177</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                     <c ca="right">
                        <p>3 (2, 6)</p>
                     </c>
                     <c ca="right">
                        <p>0.36 (0.32, 0.36)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>0.042</p>
                     </c>
                     <c ca="right">
                        <p>58</p>
                     </c>
                     <c ca="right">
                        <p>12 (5, 73)</p>
                     </c>
                     <c ca="right">
                        <p>0.32 (0.24, 0.42)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>3 (2, 5)</p>
                     </c>
                     <c ca="right">
                        <p>0.9 (0.82, 0.99)<sup>1 </sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>0.038</p>
                     </c>
                     <c ca="right">
                        <p>22</p>
                     </c>
                     <c ca="right">
                        <p>18 (11, 34)</p>
                     </c>
                     <c ca="right">
                        <p>0.57 (0.4, 0.88)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p>
                           <b>Alternative approaches</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p>SC.s</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>0.062</p>
                     </c>
                     <c ca="right">
                        <p>82<sup>2 </sup></p>
                     </c>
                     <c ca="right">
                        <p>46 (14, 504)</p>
                     </c>
                     <c ca="right">
                        <p>0.48 (0.45, 0.59)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.326</p>
                     </c>
                     <c ca="right">
                        <p>31</p>
                     </c>
                     <c ca="right">
                        <p>55 (24, 296)</p>
                     </c>
                     <c ca="right">
                        <p>0.54 (0.51, 0.66)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.401</p>
                     </c>
                     <c ca="right">
                        <p>2166</p>
                     </c>
                     <c ca="right">
                        <p>4341 (2379, 4804)</p>
                     </c>
                     <c ca="right">
                        <p>0.84 (0.78, 0.88)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>0.246</p>
                     </c>
                     <c ca="right">
                        <p>5118<sup>3 </sup></p>
                     </c>
                     <c ca="right">
                        <p>4919 (3711, 5243)</p>
                     </c>
                     <c ca="right">
                        <p>0.84 (0.74, 0.92)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocar.</p>
                     </c>
                     <c ca="center">
                        <p>0.179</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>9 (0, 18)</p>
                     </c>
                     <c ca="right">
                        <p>NA (NA, NA)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>0.159</p>
                     </c>
                     <c ca="right">
                        <p>4177</p>
                     </c>
                     <c ca="right">
                        <p>1257 (295, 3483)</p>
                     </c>
                     <c ca="right">
                        <p>0.38 (0.3, 0.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>0.122</p>
                     </c>
                     <c ca="right">
                        <p>15</p>
                     </c>
                     <c ca="right">
                        <p>22 (15, 34)</p>
                     </c>
                     <c ca="right">
                        <p>0.8 (0.66, 0.87)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>0.033</p>
                     </c>
                     <c ca="right">
                        <p>2796</p>
                     </c>
                     <c ca="right">
                        <p>2718 (2030, 3269)</p>
                     </c>
                     <c ca="right">
                        <p>0.82 (0.68, 0.86)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>0.089</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>3 (2, 4)</p>
                     </c>
                     <c ca="right">
                        <p>0.72 (0.49, 0.92)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>0.025</p>
                     </c>
                     <c ca="right">
                        <p>37<sup>4 </sup></p>
                     </c>
                     <c ca="right">
                        <p>18 (12, 40)</p>
                     </c>
                     <c ca="right">
                        <p>0.45 (0.34, 0.61)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="5" ca="center">
                        <p>NN.vs</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>0.056</p>
                     </c>
                     <c ca="right">
                        <p>512</p>
                     </c>
                     <c ca="right">
                        <p>23 (4, 134)</p>
                     </c>
                     <c ca="right">
                        <p>0.17 (0.14, 0.24)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 2 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.337</p>
                     </c>
                     <c ca="right">
                        <p>88</p>
                     </c>
                     <c ca="right">
                        <p>23 (4, 110)</p>
                     </c>
                     <c ca="right">
                        <p>0.24 (0.2, 0.31)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast 3 cl.</p>
                     </c>
                     <c ca="center">
                        <p>0.424</p>
                     </c>
                     <c ca="right">
                        <p>9</p>
                     </c>
                     <c ca="right">
                        <p>45 (6, 214)</p>
                     </c>
                     <c ca="right">
                        <p>0.66 (0.61, 0.72)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NCI 60</p>
                     </c>
                     <c ca="center">
                        <p>0.237</p>
                     </c>
                     <c ca="right">
                        <p>1718</p>
                     </c>
                     <c ca="right">
                        <p>880 (360, 1718)</p>
                     </c>
                     <c ca="right">
                        <p>0.44 (0.34, 0.57)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Adenocar.</p>
                     </c>
                     <c ca="center">
                        <p>0.181</p>
                     </c>
                     <c ca="right">
                        <p>9868</p>
                     </c>
                     <c ca="right">
                        <p>73 (8, 1324)</p>
                     </c>
                     <c ca="right">
                        <p>0.13 (0.1, 0.18)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Brain</p>
                     </c>
                     <c ca="center">
                        <p>0.194</p>
                     </c>
                     <c ca="right">
                        <p>1834</p>
                     </c>
                     <c ca="right">
                        <p>158 (52, 601)</p>
                     </c>
                     <c ca="right">
                        <p>0.16 (0.12, 0.25)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>0.158</p>
                     </c>
                     <c ca="right">
                        <p>8</p>
                     </c>
                     <c ca="right">
                        <p>9 (4, 45)</p>
                     </c>
                     <c ca="right">
                        <p>0.57 (0.45, 0.72)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>0.04</p>
                     </c>
                     <c ca="right">
                        <p>15</p>
                     </c>
                     <c ca="right">
                        <p>15 (5, 39)</p>
                     </c>
                     <c ca="right">
                        <p>0.5 (0.4, 0.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Prostate</p>
                     </c>
                     <c ca="center">
                        <p>0.081</p>
                     </c>
                     <c ca="right">
                        <p>7</p>
                     </c>
                     <c ca="right">
                        <p>6 (3, 18)</p>
                     </c>
                     <c ca="right">
                        <p>0.46 (0.39, 0.78)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Srbct</p>
                     </c>
                     <c ca="center">
                        <p>0.031</p>
                     </c>
                     <c ca="right">
                        <p>11</p>
                     </c>
                     <c ca="right">
                        <p>17 (11, 33)</p>
                     </c>
                     <c ca="right">
                        <p>0.7 (0.66, 0.85)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>1 </sup>Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.</p>
                  <p><sup>2 </sup>[33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate and thus it is very difficult to obtain estimates of the error rate of their procedure.</p>
                  <p><sup>3 </sup>[31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.</p>
                  <p><sup>4 </sup>[33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. When the gene selection process is repeated 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Effects of parameters of random forest on prediction error rate</p>
            </st>
            <p>Before examining gene selection, we first evaluated the effect of changes in parameters of random forest on its classification performance. Random forest returns a measure of error rate based on the out-of-bag cases for each fitted tree, the OOB error, and this is the measure of error we will use here to assess the effects of parameters. We examined whether the OOB error rate is substantially affected by changes in <it>mtry</it>, <it>ntree</it>, and <it>nodesize</it>.</p>
            <p>Figure <figr fid="F1">1</figr> and the figure "error.vs.mtry.pdf" in <supplr sid="S2">Additional file 2</supplr> show that, for both real and simulated data, the relation of OOB error rate with <it>mtry </it>is largely independent of <it>ntree </it>(for <it>ntree </it>between 1000 and 40000) and <it>nodesize </it>(nodesizes 1 and 5). In addition, the default setting of <it>mtry </it>(<it>mtryFactor </it>= 1 in the figures) is often a good choice in terms of OOB error rate. In some cases, increasing <it>mtry </it>can lead to small decreases in error rate, and decreases in <it>mtry </it>often lead to increases in the error rate. This is especially the case with simulated data with very few relevant genes (with very few relevant genes, small <it>mtry </it>results in many trees being built that do not incorporate any of the relevant genes). Since the OOB error and the relation between OOB error and <it>mtry </it>do not change whether we use <it>nodesize </it>of 1 or 5, and because the increase in computing speed from using <it>nodesize </it>of 5 is inconsequential, all further analyses will use only the default <it>nodesize </it>= 1. These results show the robustness of random forest to changes in its parameters; nevertheless, to re-examine robustness of gene selection to these parameters, in the rest of the paper we will report results for different settings of <it>ntree </it>and <it>mtry </it>(and these results will again show the robustness of the gene selection results to changes in <it>ntree </it>and <it>mtry</it>).</p>
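            <p>The argument about small <it>mtry </it>and few relevant genes can be made concrete with a short calculation. The Python sketch below is illustrative only and is not the code used for the analyses: the function names are hypothetical, rounding <it>mtryFactor </it>times the square root of the number of genes down to an integer is an assumption, and the second function considers a single split (not a whole tree), assuming the candidate genes at a split are drawn without replacement.</p>

```python
import math


def mtry_from_factor(n_genes, mtry_factor):
    """Number of genes tried at each split for a given mtryFactor.

    mtryFactor = 0 is taken to mean a single gene, as in the Figure 1 caption.
    Rounding down and capping at n_genes are assumptions of this sketch.
    """
    if mtry_factor == 0:
        return 1
    mtry = int(math.floor(mtry_factor * math.sqrt(n_genes)))
    return max(1, min(n_genes, mtry))


def prob_split_misses_relevant(n_genes, n_relevant, mtry):
    """Probability that none of the mtry candidate genes at one split is relevant.

    Hypergeometric argument: all mtry candidates are drawn, without
    replacement, from the (n_genes - n_relevant) irrelevant genes.
    """
    return math.comb(n_genes - n_relevant, mtry) / math.comb(n_genes, mtry)


if __name__ == "__main__":
    # Hypothetical setting: 5 relevant genes plus 4000 noise variables.
    n_genes, n_relevant = 4005, 5
    for factor in (0, 0.1, 1, 13):
        mtry = mtry_from_factor(n_genes, factor)
        p_miss = prob_split_misses_relevant(n_genes, n_relevant, mtry)
        print(factor, mtry, round(p_miss, 3))
```

            <p>Under these assumptions the probability that a split sees no relevant gene falls as <it>mtry </it>grows, which is consistent with the increase in error rate observed for small <it>mtry </it>when very few genes carry signal.</p>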
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Out-of-Bag (OOB) vs <it>mtryFactor </it>for the nine microarray data sets</p>
               </caption>
               <text>
                  <p><b>Out-of-Bag (OOB) vs <it>mtryFactor </it>for the nine microarray data sets</b>. <it>mtryFactor </it>is the multiplicative factor of the default <it>mtry </it>(<m:math name="1471-2105-7-3-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mrow><m:mi>n</m:mi><m:mi>u</m:mi><m:mi>m</m:mi><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>r</m:mi><m:mo>&#8901;</m:mo><m:mi>o</m:mi><m:mi>f</m:mi><m:mo>&#8901;</m:mo><m:mi>g</m:mi><m:mi>e</m:mi><m:mi>n</m:mi><m:mi>e</m:mi><m:mi>s</m:mi></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabd6gaUjabdwha1jabd2gaTjabdkgaIjabdwgaLjabdkhaYjabgwSixlabd+gaVjabdAgaMjabgwSixlabdEgaNjabdwgaLjabd6gaUjabdwgaLjabdohaZbWcbeaaaaa@4332@</m:annotation></m:semantics></m:math>); thus, an <it>mtryFactor </it>of 3 means the number of genes tried at each split is 3 * <m:math name="1471-2105-7-3-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mrow><m:mi>n</m:mi><m:mi>u</m:mi><m:mi>m</m:mi><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>r</m:mi><m:mo>&#8901;</m:mo><m:mi>o</m:mi><m:mi>f</m:mi><m:mo>&#8901;</m:mo><m:mi>g</m:mi><m:mi>e</m:mi><m:mi>n</m:mi><m:mi>e</m:mi><m:mi>s</m:mi></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabd6gaUjabdwha1jabd2gaTjabdkgaIjabdwgaLjabdkhaYjabgwSixlabd+gaVjabdAgaMjabgwSixlabdEgaNjabdwgaLjabd6gaUjabdwgaLjabdohaZbWcbeaaaaa@4332@</m:annotation></m:semantics></m:math>; an <it>mtryFactor </it>= 0 means the number of genes tried was 1; the <it>mtryFactors </it>examined were = {0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5, 0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3, 4, 5, 6, 8, 10, 13}. Results shown for six different <it>ntree </it>= {1000, 2000, 5000, 10000, 20000, 40000}, <it>nodesize </it>= 1.</p>
               </text>
               <graphic file="1471-2105-7-3-1"/>
            </fig>
            <suppl id="S2">
               <title>
                  <p>Additional File 2</p>
               </title>
               <text>
                  <p>A PDF file with additional plots of OOB error rate vs. <it>mtry </it>for both simulated data and real data under other parameters.</p>
               </text>
               <file name="1471-2105-7-3-S2.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>The error rates of random forest (without gene selection) compared with the alternative methods, using the real microarray data, and estimated in all cases using the .632+ bootstrap method, are shown in Table <tblr tid="T2">2</tblr>. These results clearly show that random forest has a predictive performance comparable to that of the alternative methods, without any need for pre-selection of genes or tuning of its parameters.</p>
         </sec>
         <sec>
            <st>
               <p>Gene selection using random forest</p>
            </st>
            <p>Random forest returns several measures of variable importance. The most reliable measure is based on the decrease of classification accuracy when values of a variable in a node of a tree are permuted randomly <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B36">36</abbr></abbrgrp>, and this is the measure of variable importance (in its unscaled version &#8211; see <supplr sid="S1">Additional file 1</supplr>) that we will use in the rest of the paper. (In the Supplementary material [see <supplr sid="S1">Additional file 1</supplr>] we show that this measure of variable importance is not the same as a non-parametric statistic of difference between groups, such as could be obtained with a Kruskal-Wallis test). Other measures of variable importance are available, however, and future research should compare the performance of different measures of importance.</p>
            <p>To select genes we iteratively fit random forests, at each iteration building a new forest after discarding those variables (genes) with the smallest variable importances; the selected set of genes is the one that yields the smallest OOB error rate. Note that in this section we are using OOB error to choose the final set of genes, not to obtain unbiased estimates of the error rate of this rule. Because of the iterative approach, the OOB error is biased down and cannot be used to assess the overall error rate of the approach, for reasons analogous to those leading to "selection bias" <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B37">37</abbr></abbrgrp>. To assess prediction error rates we will use the bootstrap, not OOB error (see above). (Using error rates affected by selection bias to select the optimal number of genes is not necessarily a bad procedure from the point of view of selecting the final number of genes; see <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>).</p>
            <p>In our algorithm we examine all forests that result from eliminating, iteratively, a fraction, <it>fraction.dropped</it>, of the genes (the least important ones) used in the previous iteration. By default, <it>fraction.dropped </it>= 0.2 which allows for relatively fast operation, is coherent with the idea of an "aggressive variable selection" approach, and increases the resolution as the number of genes considered becomes smaller. We do not recalculate variable importances at each step as <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> mention severe overfitting resulting from recalculating variable importances. After fitting all forests, we examine the OOB error rates from all the fitted random forests. We choose the solution with the smallest number of genes whose error rate is within <it>u </it>standard errors of the minimum error rate of all forests. Setting <it>u </it>= 0 is the same as selecting the set of genes that leads to the smallest error rate. Setting <it>u </it>= 1 is similar to the common "1 s.e. rule", used in the classification trees literature <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>; this strategy can lead to solutions with fewer genes than selecting the solution with the smallest error rate, while achieving an error rate that is not different, within sampling error, from the "best solution". In this paper we will examine both the "1 s.e. rule" and the "0 s.e. rule".</p>
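            <p>The elimination schedule and the selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the rounding of the number of genes dropped, the stopping size of two genes, and the binomial standard error computed from the minimum OOB error are all assumptions, and the forest fitting itself is left out (the OOB errors are supplied as input).</p>

```python
import math


def elimination_schedule(n_genes, fraction_dropped=0.2, min_genes=2):
    """Numbers of genes examined when a fixed fraction is dropped per iteration.

    Dropping a fraction (rather than a fixed count) gives finer resolution
    as the number of remaining genes becomes small.
    """
    counts = [n_genes]
    while counts[-1] > min_genes:
        current = counts[-1]
        n_drop = max(1, round(current * fraction_dropped))  # assumed rounding
        counts.append(max(min_genes, current - n_drop))
    return counts


def select_by_se_rule(results, n_samples, u=1.0):
    """Pick the smallest gene set whose OOB error is within u s.e. of the minimum.

    results: list of (n_genes, oob_error) pairs, one per fitted forest.
    The standard error is approximated as sqrt(e * (1 - e) / n_samples),
    with e the minimum OOB error (an assumption of this sketch).
    """
    best_error = min(error for _, error in results)
    std_error = math.sqrt(best_error * (1.0 - best_error) / n_samples)
    threshold = best_error + u * std_error
    eligible = [pair for pair in results if threshold >= pair[1]]
    return min(eligible, key=lambda pair: pair[0])


if __name__ == "__main__":
    print(elimination_schedule(100))
    # Hypothetical OOB errors for the first few forests of a run.
    toy_results = [(100, 0.10), (80, 0.08), (64, 0.09), (51, 0.12), (41, 0.20)]
    print(select_by_se_rule(toy_results, n_samples=50, u=0))
    print(select_by_se_rule(toy_results, n_samples=50, u=1))
```

            <p>With <it>u </it>= 0 the sketch returns the forest with the minimum OOB error; with <it>u </it>= 1 it can return a smaller set of genes whose error is within one standard error of that minimum.</p>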
            <p>On the simulated data sets [see <supplr sid="S1">Additional file 1</supplr>, Tables <tblr tid="T3">3</tblr> and 4] backwards elimination often leads to very small sets of genes, often much smaller than the set of "true genes". The error rate of the variable selection procedure, estimated using the .632+ bootstrap method, indicates that the variable selection procedure does not lead to overfitting, and can achieve the objective of aggressively reducing the set of selected genes. In contrast, when the simplification procedure is applied to simulated data sets without signal (see Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr> in <supplr sid="S1">Additional file 1</supplr>), the number of genes selected is consistently much larger and, as should be the case, the estimated error rate using the bootstrap corresponds to that achieved by always betting on the most probable class.</p>
            <p>Results for the real data sets are shown in Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr> (see also <supplr sid="S1">Additional file 1</supplr>, Tables 5, 6, 7, for additional results using different combinations of <it>ntree </it>= {2000, 5000, 20000}, <it>mtryFactor </it>= {1, 13}, <it>se </it>= {0, 1}, <it>fraction.dropped </it>= {0.2, 0.5}). Error rates (see Table <tblr tid="T2">2</tblr>) when performing variable selection are in most cases comparable (within sampling error) to those from random forest without variable selection, and comparable also to the error rates from competing state-of-the-art prediction methods. The number of genes selected varies by data set, but generally (Table <tblr tid="T3">3</tblr>) the variable selection procedure leads to small (&lt; 50) sets of predictor genes, often much smaller than those from competing approaches (see also Table 8 in <supplr sid="S1">Additional file 1</supplr> and discussion). There are no relevant differences in error rate related to differences in <it>mtry</it>, <it>ntree </it>or whether we use the "1 s.e." or "0 s.e." rules. The use of the "1 s.e." rule, however, tends to result in smaller sets of selected genes.</p>
         </sec>
         <sec>
            <st>
               <p>Stability (uniqueness) of results</p>
            </st>
            <p>Following <abbrgrp><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>, and <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, we have evaluated the stability of the variable selection procedure using the bootstrap. This allows us to assess how often a given gene, selected when running the variable selection procedure in the original sample, is selected when running the procedure on bootstrap samples.</p>
            <p>The results here will focus on the real microarray data sets (results from the simulated data are presented in <supplr sid="S1">Additional file 1</supplr>). Table <tblr tid="T3">3</tblr> (see also <supplr sid="S1">Additional file 1</supplr>, Tables 5, 6, 7, for other combinations of <it>ntree</it>, <it>mtryFactor</it>, <it>fraction.dropped</it>, <it>se</it>) shows the variation in the number of genes selected in bootstrap samples, and the frequency with which the genes selected in the original sample appear among the genes selected from the bootstrap samples. In most cases, there is a wide range in the number of genes selected; more importantly, the genes selected in the original samples are rarely selected in more than 50% of the bootstrap samples. These results are not strongly affected by variations in <it>ntree </it>or <it>mtry</it>; using the "1 s.e." rule can lead, in some cases, to increased stability of the results.</p>
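            <p>The stability measure reported here, the frequency with which genes selected in the original sample are selected again in bootstrap samples, can be sketched as follows. This Python fragment is only an illustration: the function names are hypothetical, and the toy selector (the single gene with the largest difference in class means) stands in for the random forest variable selection procedure, which is not reproduced.</p>

```python
import random


def selection_frequencies(samples, select_genes, n_boot=200, seed=0):
    """Fraction of bootstrap samples in which each originally selected gene reappears.

    samples: list of (class_label, expression_values) pairs.
    select_genes: any function mapping such a list to a list of gene indices.
    """
    rng = random.Random(seed)
    original = list(select_genes(samples))
    counts = {gene: 0 for gene in original}
    n = len(samples)
    for _ in range(n_boot):
        # Resample subjects with replacement and rerun the whole selection.
        boot = [samples[rng.randrange(n)] for _ in range(n)]
        chosen = set(select_genes(boot))
        for gene in original:
            if gene in chosen:
                counts[gene] += 1
    return {gene: counts[gene] / n_boot for gene in original}


def top_gene_by_mean_difference(samples):
    """Toy two-class selector (hypothetical): gene with the largest gap in class means."""
    class0 = [values for label, values in samples if label == 0]
    class1 = [values for label, values in samples if label == 1]
    if not class0 or not class1:
        return []  # a bootstrap sample may miss one class entirely
    n_genes = len(samples[0][1])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    gaps = [abs(mean(class0, j) - mean(class1, j)) for j in range(n_genes)]
    return [gaps.index(max(gaps))]


if __name__ == "__main__":
    data_rng = random.Random(1)
    toy = [(k, [data_rng.gauss(k, 1.0), data_rng.gauss(0.5 * k, 1.0), data_rng.gauss(0.0, 1.0)])
           for k in (0, 1) for _ in range(15)]
    print(selection_frequencies(toy, top_gene_by_mean_difference))
```

            <p>The key design point is that the complete selection procedure is rerun inside every bootstrap sample, so the reported frequencies reflect the variability of the whole procedure and not only of the final classifier.</p>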
            <p>As a comparison, we also show in Table <tblr tid="T3">3</tblr> the stability of two alternative approaches for gene selection, the shrunken centroids method, and a filter approach combined with a Nearest Neighbor classifier (see Table 8 in <supplr sid="S1">Additional file 1</supplr> for results of SC.l). Error rates are comparable, but both alternative methods lead to much larger sets of selected genes than backwards variable selection with random forests. The alternative approaches seem to lead to somewhat more stable results in variable selection (probably a consequence of the large number of genes selected) but in practical applications this increase in stability is probably far outweighed by the very large number of selected genes.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>We have first presented an exhaustive evaluation of the performance of random forest for classification problems with microarray data, and shown it to be competitive with alternative methods, without requiring any fine-tuning of parameters or pre-selection of variables. The performance of random forest without variable selection is also equivalent to that of alternative approaches that fine-tune the variable selection process (see below).</p>
         <p>We have then examined the performance of an approach for gene selection using random forest, and compared it to alternative approaches. Our results, using both simulated and real microarray data sets, show that this method of gene selection accomplishes the proposed objectives. Our method returns very small sets of genes compared to alternative variable selection methods, while retaining predictive performance. Our method of gene selection will not return sets of genes that are highly correlated, because they are redundant. This method will be most useful under two scenarios: a) when considering the design of diagnostic tools, where having a small set of probes is often desirable; b) to help understand the results from other gene selection approaches that return many genes, so as to understand which of those genes have the largest signal to noise ratio and could be used as surrogates for complex processes involving many correlated genes. A backwards elimination method, precursor to the one used here, has already been used to predict breast tumor type based on chromosomal alterations <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>.</p>
         <p>We have also thoroughly examined the effects of changes in the parameters of random forest (specifically <it>mtry</it>, <it>ntree</it>, <it>nodesize</it>) and the variable selection algorithm (<it>se</it>, <it>fraction.dropped</it>). Changes in these parameters have in most cases negligible effects, suggesting that the default values are often good options, but we can make some general recommendations. Time of execution of the code increases approximately linearly with <it>ntree</it>. Larger <it>ntree </it>values lead to slightly more stable values of variable importances, but for the data sets examined, <it>ntree </it>= 2000 or <it>ntree </it>= 5000 seem quite adequate, with further increases having negligible effects. The change in <it>nodesize </it>from 1 to 5 has negligible effects, and thus its default setting of 1 is appropriate. For the backwards elimination algorithm, the parameter <it>fraction.dropped </it>can be adjusted to modify the resolution of the number of variables selected; smaller values of <it>fraction.dropped </it>lead to finer resolution in the examination of number of genes, but to slower execution of the code. Finally, the parameter <it>se </it>also has minor effects on the results of the backwards variable selection algorithm, but a value of <it>se </it>= 1 leads to slightly more stable results and smaller sets of selected genes.</p>
         <p>In contrast to other procedures (e.g., <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B8">8</abbr></abbrgrp>) our procedure does not require pre-specifying the number of genes to be used, but rather adaptively chooses the number of genes. <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> have conducted an evaluation of several gene selection algorithms, including genetic algorithms and various ranking methods; these authors show results for the Leukemia and NCI60 data sets, but the Leukemia results are not directly comparable since <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> focus on a three-class problem. They report the best results with the NCI60 data set estimated with the .632 bootstrap rule (compared to the .632+ method that we use, the .632 can be downwardly biased, especially with highly overfit rules like nearest neighbor that they use &#8211; <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>). These best error rates are 0.408 for their evolutionary algorithm with 30 genes and 0.318 for 40 top-ranked genes. Using a number of genes slightly larger than ours, these error rates are similar to ours; however, these are the best error rates achieved over a range of ranking methods and error rates, and not the result of a complete procedure that automatically determines the best number of genes and ranking scheme (such as our method provides). <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> conducted a comparative study of feature selection and multi-class classification. Although they use four-fold cross-validation instead of the bootstrap to assess error rates, their results for three data sets common to both studies (Srbct, Lymphoma, NCI60) are similar to, or worse than, ours. In contrast to our method, their approach pre-selects a set of 150 genes for prediction and their best error rates are those over a set of seven different algorithms and eight different rank selection methods, where no algorithm or gene selection method was consistently the best. Our results, with a single algorithm and gene selection method (random forest), match or outperform theirs.</p>
         <p>Recently, several approaches that adaptively select the best number of genes or features have been reported. For the Leukemia data set our method consistently returns sets of two genes, similar to <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> using an exhaustive search method, and smaller than the numbers given by <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> of 3 to 25. <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> have proposed a Bayesian model averaging (BMA) approach for gene selection; comparing the results for the two common data sets between our study and theirs, in one case (Leukemia) our procedure returns a much smaller set of genes (2 vs. 15), whereas in another (Breast, 2 class) their BMA procedure returns 8 fewer genes (14 vs. 6); in contrast to BMA, however, our procedure does not require setting a limit on the maximum number of relevant genes to be selected. <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> have developed a method for gene selection and classification, LS Bound, related to least-squares SVMs; their method uses an initial pre-filtering (they choose 1000 initial genes) and it is not clear how it could be applied to multi-class problems. The performance of their procedure with the leukemia data set is better than that reported by our method, but they use a total of 72 samples (the original 38 training plus the 34 validation of <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>) thus making these results hard to compare. With the colon data set, however, their best performing results are not better than ours with a number of features that is similar to ours. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> proposed two Bayesian classification algorithms that incorporate gene selection (though it is not clear how their algorithms can be used in multi-class problems). The results for the Leukemia data set are not comparable to ours (as they use the validation set of 34 samples), but their results for the colon data set show error rates of 0.167 to 0.242, slightly larger than ours (although these authors used random partitions with 50 training and 12 testing samples instead of the .632+ bootstrap to assess error rate), with between 8 and 15 features selected (somewhat larger than those from random forest). Finally, <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> applied both shrunken centroids and a genetic algorithm + KNN technique to the NCI60 and Srbct data sets; their results with shrunken centroids are similar to ours with that technique, but the genetic algorithm + KNN technique used larger sets of genes (155 and 72 for the NCI60 and Srbct, respectively) than variable selection with random forest using the suggested parameters. In summary, then, our proposed procedure matches or outperforms alternative approaches for gene selection in terms of error rate and number of genes selected, without any need to fine-tune parameters or preselect genes; in addition, this method is equally applicable to two-class and multi-class problems, and has software readily available. Thus, the newly proposed method is an ideal candidate for gene selection in classification problems with microarray data.</p>
         <p>A reviewer has alerted us to the paper by Jiang et al. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>, previously unknown to us. In fact, our approach is virtually the same as the one used by Jiang et al., with the exception that these authors recompute variable importances at each step (we do not do this in this paper, although the option is available in our code) and, more importantly, that their gene selection is based on both the OOB error and the prediction error when the forest trained with one data set is applied to a second, independent, data set; thus, this approach for gene selection is not feasible when we only have one data set. Jiang et al. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> also show the excellent performance of variable selection using random forest when applied to their data sets. The final issue addressed in this paper is instability or multiplicity of the selected sets of genes. From this point of view, the results are slightly disappointing. But so are the results of the competing methods. And so are the results of most methods examined so far with microarray data, as shown in <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> and <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and discussed thoroughly by <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> for classification and by <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> for the related problem of the effect of threshold choice in gene selection. However, and except for the above cited papers and <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B46">46</abbr></abbrgrp> and <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, this is an issue that still seems largely ignored in the microarray literature. As these papers and the statistical literature on variable selection (e.g., <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B47">47</abbr></abbrgrp>) discuss, the causes of the problem are small sample sizes and the extremely small ratio of samples to variables (i.e., number of arrays to number of genes). Thus, we might need to learn to live with the problem, and try to assess the stability and robustness of our results by using a variety of gene selection features, and examining whether there is a subset of features that tends to be repeatedly selected. This concern is explicitly taken into account in our results, and facilities for examining this problem are part of our R code.</p>
         <p>The multiplicity problem, however, does not need to result in large prediction errors. This and other papers <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B27">27</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr></abbrgrp> (see also above) show that very different classifiers often lead to comparable and successful error rates with a variety of microarray data sets. Thus, although improving prediction rates is important, when trying to address questions of biological mechanism or discover therapeutic targets, probably a more challenging and relevant issue is to identify sets of genes with biological relevance.</p>
         <p>Two areas of future research are using random forest for the selection of potentially large sets of genes that include correlated genes, and improving the computational efficiency of these approaches; in the present work, we have used parallelization of the "embarrassingly parallelizable" tasks using MPI with the Rmpi and Snow packages <abbrgrp><abbr bid="B50">50</abbr><abbr bid="B51">51</abbr></abbrgrp> for R. In a broader context, further work is warranted on the stability properties and biological relevance of this and other gene-selection approaches, because the multiplicity problem casts doubts on the biological interpretability of most results based on a single run of one gene-selection approach.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The proposed method can be used for variable selection fulfilling the objectives above: we can obtain very small sets of non-redundant genes while preserving predictive accuracy. These results clearly indicate that the proposed method can be profitably used with microarray data and is competitive with existing methods. Given its performance and availability, random forest and variable selection using random forest should probably become part of the "standard tool-box" of methods for the analysis of microarray data.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Simulated data sets</p>
            </st>
            <p>Data have been simulated using different numbers of classes of patients (2 to 4), number of independent dimensions (1 to 3), and number of genes per dimension (5, 20, 100). In all cases, we have set to 25 the number of subjects per class. Each independent dimension has the same relevance for discrimination of the classes. The data come from a multivariate normal distribution with variance of 1, a (within-class) correlation among genes within dimension of 0.9, and a within-class correlation of 0 between genes from different dimensions, as those are independent. The multivariate means have been set so that the unconditional prediction error rate <abbrgrp><abbr bid="B52">52</abbr></abbrgrp> of a linear discriminant analysis using one gene from each dimension is approximately 5%. To each data set we have added 2000 random normal variates (mean 0, variance 1) and 2000 random uniform [-1,1] variates. In addition, we have generated data sets for 2, 3, and 4 classes where no genes have signal (all 4000 genes are random). For the non-signal data sets we have generated four replicate data sets for each level of number of classes. Further details are provided in the supplementary material [see <supplr sid="S1">Additional file 1</supplr>].</p>
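            <p>The simulation design above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code. The function name <it>simulate_dataset</it> and the <it>shift</it> parameter (a placeholder for the class means) are hypothetical: the sketch does not calibrate the means to the 5% error rate described above, and it simply gives class <it>k </it>a mean of <it>shift </it>times <it>k </it>in every dimension. Within a dimension, a shared standard normal factor weighted by the square root of 0.9 yields unit variance and a within-class correlation of 0.9 among genes.</p>

```python
import math
import random


def simulate_dataset(n_classes=2, n_dims=1, genes_per_dim=5, n_per_class=25,
                     rho=0.9, shift=1.0, n_normal_noise=2000,
                     n_uniform_noise=2000, seed=0):
    """Return (rows, labels) for one simulated data set.

    `shift` is a hypothetical stand-in for the multivariate means; the
    paper tunes the means to reach roughly 5% LDA error, which this
    sketch does not attempt.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for k in range(n_classes):
        for _ in range(n_per_class):
            row = []
            for _ in range(n_dims):
                # One shared factor per dimension induces correlation rho
                # among the genes of that dimension; dimensions use
                # separate factors and are therefore independent.
                shared = rng.gauss(0.0, 1.0)
                for _ in range(genes_per_dim):
                    own = rng.gauss(0.0, 1.0)
                    value = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
                    row.append(shift * k + value)  # assumed mean structure
            # Pure-noise genes: N(0, 1) variates, then uniform[-1, 1] variates.
            row.extend(rng.gauss(0.0, 1.0) for _ in range(n_normal_noise))
            row.extend(rng.uniform(-1.0, 1.0) for _ in range(n_uniform_noise))
            rows.append(row)
            labels.append(k)
    return rows, labels
```

            <p>For example, <it>simulate_dataset(n_classes=3, n_dims=2, genes_per_dim=20) </it>would return 75 rows of 4040 values each: 40 signal genes followed by 4000 noise genes.</p>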
         </sec>
         <sec>
            <st>
               <p>Competing methods</p>
            </st>
            <p>We have compared the predictive performance of the variable selection approach with: a) random forest without any variable selection (using <m:math name="1471-2105-7-3-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>m</m:mi><m:mi>t</m:mi><m:mi>r</m:mi><m:mi>y</m:mi><m:mo>=</m:mo><m:msqrt><m:mrow><m:mi>n</m:mi><m:mi>u</m:mi><m:mi>m</m:mi><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>r</m:mi><m:mtext>&#8201;</m:mtext><m:mi>o</m:mi><m:mi>f</m:mi><m:mtext>&#8201;</m:mtext><m:mi>v</m:mi><m:mi>a</m:mi><m:mi>r</m:mi><m:mi>i</m:mi><m:mi>a</m:mi><m:mi>b</m:mi><m:mi>l</m:mi><m:mi>e</m:mi><m:mi>s</m:mi></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGTbqBcqWG0baDcqWGYbGCcqWG5bqEcqGH9aqpdaGcaaqaaiabd6gaUjabdwha1jabd2gaTjabdkgaIjabdwgaLjabdkhaYjaaykW7cqWGVbWBcqWGMbGzcaaMc8ocbiGae8NDayNae8xyaeMae8NCaiNaemyAaKMaemyyaeMaemOyaiMaemiBaWMaemyzauMaem4Camhaleqaaaaa@4DE7@</m:annotation></m:semantics></m:math>, <it>ntree </it>= 5000, <it>nodesize </it>= 1); b) three other methods that have shown good performance in reviews of classification methods with microarray data <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B31">31</abbr></abbrgrp> but that do not include any variable selection (i.e., they use a number of genes decided beforehand); c) three methods that carry out variable selection.</p>
            <p>The three methods that do not carry out variable selection are:</p>
            <p>&#8226; <b>Diagonal Linear Discriminant Analysis (DLDA) </b>DLDA is the maximum likelihood discriminant rule, for multivariate normal class densities, when the class densities have the same diagonal variance-covariance matrix (i.e., variables are uncorrelated, and for each variable, its variance is the same in all classes). This yields a simple linear rule, where a sample is assigned to the class <it>k </it>which minimizes <m:math name="1471-2105-7-3-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>j</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>p</m:mi></m:msubsup><m:mrow><m:msup><m:mrow><m:mo stretchy="false">(</m:mo><m:msub><m:mi>x</m:mi><m:mi>j</m:mi></m:msub><m:mo>&#8722;</m:mo><m:msub><m:mover accent="true"><m:mi>x</m:mi><m:mo>&#175;</m:mo></m:mover><m:mrow><m:mi>k</m:mi><m:mi>j</m:mi></m:mrow></m:msub><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:msup><m:mo>/</m:mo><m:msubsup><m:mover accent="true"><m:mi>&#963;</m:mi><m:mo>^</m:mo></m:mover><m:mi>j</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaaeWaqaaiabcIcaOiabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeyOeI0IafmiEaGNbaebadaWgaaWcbaGaem4AaSMaemOAaOgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaOGaei4la8Iafq4WdmNbaKaadaqhaaWcbaGaemOAaOgabaGaeGOmaidaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabdchaWbqdcqGHris5aaaa@43ED@</m:annotation></m:semantics></m:math>, where <it>p </it>is the number of variables, <it>x</it><sub><it>j </it></sub>is the value on variable (gene) <it>j </it>of the test sample, <m:math name="1471-2105-7-3-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>x</m:mi><m:mo>&#175;</m:mo></m:mover><m:mrow><m:mi>k</m:mi><m:mi>j</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWG4baEgaqeamaaBaaaleaacqWGRbWAcqWGQbGAaeqaaaaa@3127@</m:annotation></m:semantics></m:math> is the sample mean of class <it>k </it>and variable (gene) <it>j</it>, and <m:math name="1471-2105-7-3-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mover accent="true"><m:mi>&#963;</m:mi><m:mo>^</m:mo></m:mover><m:mi>j</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuaHdpWCgaqcamaaDaaaleaacqWGQbGAaeaacqaIYaGmaaaaaa@30FD@</m:annotation></m:semantics></m:math> is the (pooled) estimate of the variance of gene <it>j </it><abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. In spite of its simplicity and its somewhat unrealistic assumptions (independent multivariate normal class densities), this method has been found to work very well.</p>
            <p>&#8226; <b>K nearest neighbor (KNN) </b>KNN is a non-parametric classification method that predicts the class of a test case as the majority vote among the k nearest neighbors of the test case <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. To decide on "nearest" we use, as in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, the Euclidean distance. The number of neighbors used (k) is chosen by cross-validation as in <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>: for a given training set, the performance of the KNN for values of <it>k </it>in {1, 3, 5, ..., 21} is determined by cross-validation, and the <it>k </it>that produces the smallest error is used.</p>
            <p>&#8226; <b>Support Vector Machines (SVM) </b>SVM are becoming increasingly popular classifiers in many areas, including microarrays <abbrgrp><abbr bid="B53">53</abbr><abbr bid="B54">54</abbr><abbr bid="B55">55</abbr></abbrgrp>. SVM (with linear kernel, as used here) try to find an optimal separating hyperplane between the classes. When the classes are linearly separable, the hyperplane is located so that it has maximal margin (i.e., so that there is maximal distance between the hyperplane and the nearest point of any of the classes), which should lead to better performance on data not yet seen by the SVM. When the data are not separable, there is no separating hyperplane; in this case, we still try to maximize the margin but allow some classification errors subject to the constraint that the total error (distance from the hyperplane on the "wrong side") is less than a constant. For problems involving more than two classes there are several possible approaches; the one used here is the "one-against-one" approach, as implemented in "libsvm" <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>. Reviews and introductions to SVM can be found in <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B57">57</abbr></abbrgrp>.</p>
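            <p>As an illustration of the simplest of the three rules, the DLDA criterion described above (assign a sample to the class that minimizes the sum over genes of the squared distance to the class mean divided by the pooled variance) can be sketched as follows. This is a minimal sketch in Python, not the geSignatures implementation. The function names <it>dlda_fit </it>and <it>dlda_predict </it>are hypothetical, and the pooled variance is assumed to be the within-class sum of squares divided by the number of samples minus the number of classes. Genes with zero pooled variance are not handled.</p>

```python
def dlda_fit(X, y):
    """Estimate class means and pooled per-gene variances.

    X is a list of samples (lists of gene values); y holds class labels.
    """
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means = {}
    for k in classes:
        rows = [x for x, lab in zip(X, y) if lab == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    variances = []
    for j in range(p):
        # Within-class sum of squares for gene j, pooled over all classes.
        ss = sum((x[j] - means[lab][j]) ** 2 for x, lab in zip(X, y))
        variances.append(ss / (n - len(classes)))  # assumed pooled estimator
    return means, variances


def dlda_predict(x, means, variances):
    """Return the class k minimizing sum_j (x_j - mean_kj)^2 / var_j."""
    def score(k):
        return sum((x[j] - means[k][j]) ** 2 / variances[j]
                   for j in range(len(x)))
    return min(means, key=score)
```

            <p>Because the variance is shared across classes and no covariances are used, the rule only needs one mean per class and gene plus one variance per gene.</p>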
            <p>For each of these three methods we need to decide which of the genes will be used to build the predictor. Based on the results of <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> we have used the 200 genes with the largest <it>F</it>-ratio of between to within groups sums of squares. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> found that, for the methods they considered, 200 genes as predictors tended to perform as well as, or better than, smaller numbers (30, 40, 50 depending on data set). The three methods that include gene selection are:</p>
            <p>&#8226; <b>Shrunken centroids (SC) </b>The method of "nearest shrunken centroids" was originally described in <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. It uses "de-noised" versions of centroids to classify new observations to the nearest centroid. The "de-noising" is achieved using soft-thresholding or penalization, so that for each gene, class centroids are shrunken towards the overall centroid. This method is very similar to a DLDA with shrinkage on the centroids. The optimal amount of shrinkage can be found with cross-validation, and used to select the number of genes to retain in the final classifier. We have used two different approaches to determine the best number of features.</p>
            <p>&#160;&#160;&#160;&#160;&#160;- <b>SC.l</b>: we choose the number of genes that minimizes the cross-validated error rate and, in case of several solutions with minimal error rates, we choose the one with largest likelihood.</p>
            <p>&#160;&#160;&#160;&#160;&#160;- <b>SC.s</b>: we choose the number of genes that minimizes the cross-validated error rate and, in case of several solutions with minimal error rates, we choose the one with smallest number of genes (larger penalty).</p>
            <p>&#8226; <b>Nearest neighbor + variable selection (NN.vs) </b>We first rank all genes based on their F-ratio, and then run a Nearest Neighbor classifier (KNN with K = 1; using K = 1 is often a successful rule <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>) on all subsets of variables that result from eliminating 20% of the genes (the ones with the smallest F-ratio) used in the previous iteration. The final number of genes is the one that leads to the smallest cross-validated error rate.</p>
            <p>The ranking of the genes using the F-ratio is done without using the left-out sample. In other words, for a given data set, we first divide it into 10 samples of about the same size; then, we repeat 10 times the following:</p>
            <p>a) Exclude sample "i", the "left-out" sample.</p>
            <p>b) Using the other 9 samples, rank the genes using the F-ratio.</p>
            <p>c) Predict the values for the left-out sample at each of the pre-specified numbers of genes (subsets of genes), using the genes as given by the ranking in the previous step.</p>
            <p>At the end of the 10 iterations, we average the error rate over the 10 left-out samples, and obtain the average cross-validated error rate at each number of genes. These estimates are not affected by "selection bias" <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B37">37</abbr></abbrgrp> as the error rate is obtained from the left-out samples, but the left-out samples are not involved in the ranking of genes. (Note that using error rates affected by selection bias to select the optimal number of genes is not necessarily a bad procedure from the point of view of selecting the final number of genes; see <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>).</p>
            <p>Even if we use, as here, error rates not affected by selection bias, using that cross-validated error rate as the estimated error rate of the rule would lead to a downward-biased error rate (for reasons analogous to those leading to selection bias). Thus, we do not use these error rates in the tables, but compute the estimated prediction error rate of the rule using the .632+ bootstrap method.</p>
            <p>This type of approach, in its many variants (changing both the classifier and the ordering criterion) is popular in many microarray papers; a recent example is <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, and similar general strategies are implemented in the program Tnasas <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>.</p>
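            <p>The cross-validation scheme for NN.vs described above (steps a to c) can be sketched as follows. This is a minimal sketch in Python using only the standard library, not the code used for the paper, and all function names are hypothetical. Two details are assumptions: the F-ratio is computed as the ratio of between-group to within-group sums of squares, and each elimination step keeps the integer part of 80% of the current number of genes. The key point is that genes are ranked inside each fold, using only the nine training samples.</p>

```python
import random


def bss_wss_ratio(X, y, j):
    """Between- to within-group sums of squares for gene j."""
    values = [x[j] for x in X]
    overall = sum(values) / len(values)
    bss, wss = 0.0, 0.0
    for k in set(y):
        group = [x[j] for x, lab in zip(X, y) if lab == k]
        mean_k = sum(group) / len(group)
        bss += len(group) * (mean_k - overall) ** 2
        wss += sum((v - mean_k) ** 2 for v in group)
    if wss == 0:
        return float("inf")
    return bss / wss


def rank_genes(X, y):
    """Gene indices sorted from largest to smallest ratio."""
    return sorted(range(len(X[0])),
                  key=lambda j: bss_wss_ratio(X, y, j), reverse=True)


def gene_counts(p, drop=0.2):
    """Numbers of genes to evaluate: p, then about 80% of the previous one."""
    counts = [p]
    while counts[-1] > 1:
        counts.append(max(1, int(counts[-1] * (1 - drop))))  # assumed rounding
    return counts


def nn_predict(train_X, train_y, x, genes):
    """1-nearest-neighbor prediction (squared Euclidean distance on `genes`)."""
    def dist(i):
        return sum((train_X[i][j] - x[j]) ** 2 for j in genes)
    return train_y[min(range(len(train_X)), key=dist)]


def cv_error_by_gene_count(X, y, n_folds=10, seed=0):
    """Average cross-validated error rate at each number of genes."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    counts = gene_counts(len(X[0]))
    errors = {c: 0.0 for c in counts}
    for fold in folds:
        held = set(fold)  # a) the left-out sample
        train_X = [X[i] for i in idx if i not in held]
        train_y = [y[i] for i in idx if i not in held]
        ranking = rank_genes(train_X, train_y)  # b) rank without the fold
        for c in counts:  # c) predict the fold at each number of genes
            wrong = sum(nn_predict(train_X, train_y, X[i], ranking[:c]) != y[i]
                        for i in fold)
            errors[c] += wrong / len(fold) / n_folds
    return errors
```

            <p>The number of genes could then be chosen as <it>min(errors, key=errors.get)</it>. How ties are broken is not specified in the text; in this sketch the first minimum, which is the largest number of genes, would be returned. As noted above, the minimal cross-validated error found this way should not be reported as the error rate of the rule.</p>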
         </sec>
         <sec>
            <st>
               <p>Software and data sets</p>
            </st>
            <p>All simulations and analyses were carried out with R <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>, using packages randomForest (from A. Liaw and M. Wiener) for random forest, e1071 (E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel) for SVM, class (B. Ripley and W. Venables) for KNN, PAM <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> for shrunken centroids, and geSignatures (by R.D.-U.) for DLDA.</p>
            <p>The microarray and simulated data sets are available from the supplementary material web page <abbrgrp><abbr bid="B60">60</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>Our procedure is available both as an R package (varSelRF) and as a web-based application (GeneSrF).</p>
         <sec>
            <st>
               <p>varSelRF</p>
            </st>
            <p><b>Project name: </b>varSelRF.</p>
            <p>
               <b>Project home page: </b>
               <url>http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html</url>
            </p>
            <p><b>Operating system(s): </b>Linux and UNIX, Windows, MacOS.</p>
            <p><b>Programming language: </b>R.</p>
            <p><b>Other requirements: </b>Linux/UNIX and LAM/MPI for parallelized computations.</p>
            <p><b>License: </b>GNU GPL 2.0 or newer.</p>
            <p><b>Any restrictions to use by non-academics: </b>None.</p>
         </sec>
         <sec>
            <st>
               <p>GeneSrF</p>
            </st>
            <p><b>Project name: </b>GeneSrF.</p>
            <p>
               <b>Project home page: </b>
               <url>http://genesrf.bioinfo.cnio.es</url>
            </p>
            <p><b>Operating system(s): </b>Platform independent.</p>
            <p><b>Programming language: </b>Python and R.</p>
            <p><b>Other requirements: </b>A web browser.</p>
            <p><b>License: </b>Not applicable. Access non-restricted.</p>
            <p><b>Any restrictions to use by non-academics: </b>None.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations</p>
         </st>
         <p>&#8226; DLDA: Diagonal linear discriminant analysis.</p>
         <p>&#8226; KNN: K-nearest neighbor.</p>
         <p>&#8226; NN: nearest neighbor (like KNN with <it>K </it>= 1).</p>
         <p>&#8226; NN.vs: Nearest neighbor with variable selection.</p>
         <p>&#8226; OOB error: Out-of-bag error; error rate from samples not used in the construction of a given tree.</p>
         <p>&#8226; SC.l: Shrunken centroids with minimization of error and maximization of likelihood if ties.</p>
         <p>&#8226; SC.s: Shrunken centroids with minimization of error and minimization of features if ties.</p>
         <p>&#8226; SVM: Support vector machine.</p>
         <p>&#8226; <it>mtry</it>: Number of input variables tried at each split by random forest.</p>
         <p>&#8226; <it>mtryFactor</it>: Multiplicative factor of the default <it>mtry </it>(<m:math name="1471-2105-7-3-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mrow><m:mi>n</m:mi><m:mi>u</m:mi><m:mi>m</m:mi><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>r</m:mi><m:mo>&#8901;</m:mo><m:mi>o</m:mi><m:mi>f</m:mi><m:mo>&#8901;</m:mo><m:mi>g</m:mi><m:mi>e</m:mi><m:mi>n</m:mi><m:mi>e</m:mi><m:mi>s</m:mi></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabd6gaUjabdwha1jabd2gaTjabdkgaIjabdwgaLjabdkhaYjabgwSixlabd+gaVjabdAgaMjabgwSixlabdEgaNjabdwgaLjabd6gaUjabdwgaLjabdohaZbWcbeaaaaa@4332@</m:annotation></m:semantics></m:math>)</p>
         <p>&#8226; <it>nodesize</it>: Minimum size of the terminal nodes of the trees in a random forest.</p>
         <p>&#8226; <it>ntree</it>: Number of trees used by random forest.</p>
         <p>&#8226; <it>s.e. </it>0 and <it>s.e. </it>1: "0 s.e." (respectively "1 s.e.") rule for choosing the best solution for gene selection (how far the selected solution can be from the minimal error solution).</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>R.D.-U. developed the gene selection methodology, designed and carried out the comparative study, wrote the code, and drafted the manuscript. S.A.A. brought up the biological problem that prompted the methodological development, verified and provided discussion on the methodology, and co-authored the manuscript. Both authors read and approved the manuscript.</p>
         <suppl id="S3">
            <title>
               <p>Additional File 3</p>
            </title>
            <text>
               <p>Source code for the R package varSelRF. This is a compressed (tar.gz) file ready to be installed with the usual R installation procedure under Linux/UNIX. Additional formats are available from CRAN <abbrgrp><abbr bid="B68">68</abbr></abbrgrp>, the Comprehensive R Archive Network.</p>
            </text>
            <file name="1471-2105-7-3-S3.gz">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Most of the simulations and analyses were carried out in the Beowulf cluster of the Bioinformatics unit at CNIO, financed by the RTICCC from the FIS; J. M. Vaquerizas provided help with the administration of the cluster. A. Liaw provided discussion, unpublished manuscripts, and code. C. L&#225;zaro-Perea provided many discussions and comments on the ms. A. S&#225;nchez provided comments on the ms. I. D&#237;az showed R.D.-U. the forest, or the trees, or both. Two anonymous reviewers provided comments that have improved the ms. R.D.-U. was partially supported by the Ram&#243;n y Cajal program of the Spanish MEC (Ministry of Education and Science); S.A.A. was supported by project C.A.M. GR/SAL/0219/2004; funding was provided by project TIC2003-09331-C02-02 of the Spanish MEC.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>An extensive evaluation of recent classification tools applied to microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Song</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Computational Statistics and Data Analysis</source>
            <pubdate>2005</pubdate>
            <volume>48</volume>
            <fpage>869</fpage>
            <lpage>885</lpage>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>2394</fpage>
            <lpage>2402</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15713736</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes</p>
            </title>
            <aug>
               <au>
                  <snm>Jirapech-Umpai</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Aitken</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>148</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1181625</pubid>
                  <pubid idtype="pmpid" link="fulltext">15958165</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Optimal number of features as a function of sample size for various classification rules</p>
            </title>
            <aug>
               <au>
                  <snm>Hua</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Xiong</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Lowey</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Suh</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>ER</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>1509</fpage>
            <lpage>1515</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15572470</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Bayesian automatic relevance determination algorithms for classifying gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Tipping</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>1332</fpage>
            <lpage>1339</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12376377</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Supervised methods with genomic data: a review and cautionary view</p>
            </title>
            <aug>
               <au>
                  <snm>D&#237;az-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Data analysis and visualization in genomics and proteomics</source>
            <publisher>New York: Wiley</publisher>
            <editor>Azuaje F, Dopazo J</editor>
            <pubdate>2005</pubdate>
            <fpage>193</fpage>
            <lpage>214</lpage>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Comparison of discrimination methods for the classification of tumors using gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fridlyand</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>J Am Stat Assoc</source>
            <pubdate>2002</pubdate>
            <volume>97</volume>
            <issue>457</issue>
            <fpage>77</fpage>
            <lpage>87</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ogihara</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>2429</fpage>
            <lpage>2437</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15087314</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Gene expression profiling predicts clinical outcome of breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>van't Veer</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>van de Vijver</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>AAM</fnm>
               </au>
               <au>
                  <snm>Mao</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Peterse</snm>
                  <fnm>HL</fnm>
               </au>
               <au>
                  <snm>van der Kooy</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Marton</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Witteveen</snm>
                  <fnm>AT</fnm>
               </au>
               <au>
                  <snm>Schreiber</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Kerkhoven</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Linsley</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Bernards</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Friend</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>415</volume>
            <fpage>530</fpage>
            <lpage>536</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11823860</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>An expression profile for diagnosis of lymph node metastases from primary head and neck squamous cell carcinomas</p>
            </title>
            <aug>
               <au>
                  <snm>Roepman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Wessels</snm>
                  <fnm>LF</fnm>
               </au>
               <au>
                  <snm>Kettelarij</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kemmeren</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Miles</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Lijnzaad</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Tilanus</snm>
                  <fnm>MG</fnm>
               </au>
               <au>
                  <snm>Koole</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hordijk</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>van der Vliet</snm>
                  <fnm>PC</fnm>
               </au>
               <au>
                  <snm>Reinders</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Slootweg</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Holstege</snm>
                  <fnm>FC</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2005</pubdate>
            <volume>37</volume>
            <fpage>182</fpage>
            <lpage>186</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15640797</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>An accelerated procedure for recursive feature ranking on microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Furlanello</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Serafini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Merler</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Jurman</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Neural Netw</source>
            <pubdate>2003</pubdate>
            <volume>16</volume>
            <fpage>641</fpage>
            <lpage>648</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12850018</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>New feature subset selection procedures for classification of expression profiles</p>
            </title>
            <aug>
               <au>
                  <snm>B&#248;</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Jonassen</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>4</issue>
            <fpage>0017.1</fpage>
            <lpage>0017.11</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Random forests</p>
            </title>
            <aug>
               <au>
                  <snm>Breiman</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2001</pubdate>
            <volume>45</volume>
            <fpage>5</fpage>
            <lpage>32</lpage>
         </bibl>
         <bibl id="B14">
            <aug>
               <au>
                  <snm>Breiman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Friedman</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Olshen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Stone</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Classification and regression trees</source>
            <publisher>New York: Chapman &amp; Hall</publisher>
            <pubdate>1984</pubdate>
         </bibl>
         <bibl id="B15">
            <aug>
               <au>
                  <snm>Ripley</snm>
                  <fnm>BD</fnm>
               </au>
            </aug>
            <source>Pattern recognition and neural networks</source>
            <publisher>Cambridge: Cambridge University Press</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B16">
            <aug>
               <a