<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-328</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Diaz-Uriarte</snm>
               <fnm>Ram&#243;n</fnm>
               <insr iid="I1"/>
               <email>rdiaz02@gmail.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Statistical Computing Team, Structural Biology and Biocomputing Programme, Spanish National Cancer Center (CNIO), Melchor Fern&#225;ndez Almagro 3, Madrid, 28029, Spain</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>328</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/328</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17767709</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-328</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>22</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>03</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>03</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Diaz-Uriarte; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information of selected genes. Methodologically, such a tool would be of greater use if it incorporates state-of-the-art computational approaches and makes source code available.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, so the software can take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, and KEGG and Reactome pathways characteristic of sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from <url>http://genesrf2.bioinfo.cnio.es</url>. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>varSelRF and GeneSrF implement a validated method for gene selection, including bootstrap estimates of classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (a combination of parallelization with a web-based application), they are also of methodological interest to bioinformaticians and biostatisticians.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Patient classification and gene selection related to classification are common uses of microarray data (e.g., review and references in <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>), but statistically rigorous and user-friendly tools for gene selection in the context of class prediction are rare. Such a tool should address two major issues. First, it should provide unbiased estimates of the prediction error rate of the procedure. Most users are by now aware of "selection bias", as originally reported in <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>, but the bias caused by trying different methods and/or sets of genes, and choosing the one with the smallest cross-validated error rate <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, is still not widely recognized. In this latter case we need a nested <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> or double or full cross-validation <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> to estimate the error rate of the rule or procedure. Second, we need to assess the so-called multiplicity (or lack of uniqueness) problem: variable selection with microarray data can lead to many solutions that have similar prediction errors but share few common genes <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Choosing any one particular set of genes without being aware of the variability in solutions can lead to a false sense of certainty in the selected set.</p>
         <p>From a user's perspective, an ideal tool should also be user friendly and provide additional resources to ease the interpretation of results <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Web-based tools are an excellent platform as they do not require software installation or upgrades from the user. In addition, web-based tools can be designed to allow easy access to information such as Gene Ontology terms, the UCSC and Ensembl databases, KEGG and Reactome pathways, or PubMed references, thus enhancing the biological interpretation of results <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Moreover, web-based tools, if implemented appropriately, can harness computational resources rarely available to most individual users <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, including the increasing availability of multicore processors and easily accessible clusters made with off-the-shelf components <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Currently, the major opportunities for improved performance, as well as the ability to analyze ever larger data sets, do not lie in faster CPUs but in being able to use parallel and distributed computing to exploit multi-core servers and clusters <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. In addition to providing a benefit to the end user (decreased execution time), tools that combine parallelization with web-based programming are important methodological developments.</p>
         <p>Finally, a tool that fulfills the above requirements is of much greater relevance if it makes its source code available under an open-source license. Source code availability allows the research community to experiment with, and improve upon, the method and fix bugs, encourages reproducible research, allows claims by method developers to be verified, makes the international research community the owner of the tools needed to carry out its work and, thus, creates the conditions for swift progress upon previous work. These concerns are of particular importance in bioinformatics <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>.</p>
         <p>We have developed GeneSrF and varSelRF (a web-based application and R package, respectively) that satisfy the above requirements. The only available web-based tools with similar scope are M@CBETH <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and Prophet <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. These tools, however, do not examine the multiplicity problem, cannot benefit from multicore processors or computing clusters, and do not make source code available. M@CBETH, in addition, is restricted to two-class problems and does not focus on the gene selection problem. Prophet, in turn, does not seem to solve the biased error rate problem satisfactorily (it reports the error rate as that of the classifier with the smallest cross-validated error rate, without evaluating the error rate of the rule itself).</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>The core statistical functionality is provided by the varSelRF package for R <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. This package implements the procedure in <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> for gene selection using random forests, building upon the randomForest package <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, an R port by A. Liaw and M. Wiener of the original code by L. Breiman and A. Cutler. We use MPI <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> for parallelization via the R packages Rmpi <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> by H. Yu, and Snow <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> by L. Tierney, A. J. Rossini, Na Li and H. Sevcikova. In the web-based application, the CGI, initial data validation, and the setting-up and closing of the parallel infrastructure (booting and halting the LAM/MPI universes) are implemented in Python. Our installation runs on a cluster of 30 nodes, each with two dual-core AMD Opteron processors (see Figure <figr fid="F1">1</figr> for details).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Example output</p>
            </caption>
            <text>
               <p><b>Example output</b>. Some figures from the output of the web-based application (see [24]). a) Out-of-bag error rate vs. the number of genes in the class prediction model, for both the complete, original data set (red line) and the 200 bootstrap samples (black lines). These figures can help identify the best number of genes in the class prediction model. It seems that we can do fairly well using just 2 genes in our model. This is the conclusion we reach both with the complete, original data set and with the bootstrap samples. b) Probability of class membership of each sample, from out-of-bag samples (i.e., bootstrap runs where the sample was not included in the training group). Most samples are well classified, especially those from class ALL (their average out-of-bag probability of membership in their true class is larger than 0.75). c) Importance spectrum plots can help decide on the number of "relevant variables": we compare the variable importance plots from the original data with variable importance plots that are generated when the class labels and the predictors are independent (class labels are randomly permuted). In this case the first 30 variables have importances well above those from sets with randomly permuted class labels. d) Selection probability plots: for each of the top ranked genes from the original sample, the probability that it is included among the top ranked k genes (blue: k = 20; red: k = 100) from the (200) bootstrap samples. Thus, these plots can be a measure of our confidence in the stability of choosing a number of k ranked genes. In this case, with k = 20 only the two or three most important genes are repeatedly chosen among the best 20. If we select the first 100 genes, the 30 best ranked ones appear in at least 75% of the bootstrap samples.</p>
            </text>
            <graphic file="1471-2105-8-328-1"/>
         </fig>
         <p>The input for the web-based application is either plain text files or files that come from other tools of the Asterias suite <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. GeneSrF has been running in production use for over a year. Further documentation and examples for the web-based application are available from its on-line help, and for the R package from the standard R documentation system. A fully commented example of the output is provided in the on-line help <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Sample output is shown in Figure <figr fid="F1">1</figr>. Bug-tracking and additional tests are available from Bioinformatics.org and The Launchpad.</p>
      </sec>
      <sec>
         <st>
            <p>Benchmarks and run time</p>
         </st>
         <p>The parallelization has been implemented over bootstrap resamples. The speedups achieved by parallelizing are shown in Figure <figr fid="F2">2a</figr>, where we plot the fold increase in speed achieved by increasing the number of Rmpi slaves (concurrently executing R processes). Parallelization makes a dramatic difference in speed for all the data sets shown. Up to 20 Rmpi slaves, the increases in speed are almost linear with the number of slaves. Beyond 20 slaves, speed increases more slowly with the number of slaves: as is known from the parallelization literature <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B25">25</abbr></abbrgrp>, in addition to the number of CPUs other factors can become limiting, in our case most likely bandwidth and latency of inter-node communication, and potential bottlenecks from memory and cache in nodes made of dual-core processors <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>.</p>
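         <p>As an illustration of parallelizing over bootstrap resamples, the following minimal Python sketch distributes independent resamples across a pool of workers. It is not the varSelRF or Rmpi implementation: the function names are hypothetical, a thread pool merely stands in for the Rmpi slaves, and the per-resample task only reports which samples are out-of-bag instead of fitting a random forest.</p>

```python
import random
from concurrent.futures import ThreadPoolExecutor


def bootstrap_indices(n_samples, seed):
    """Draw one bootstrap resample: n_samples indices sampled with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def analyze_resample(n_samples, seed):
    """Hypothetical per-resample task; a real analysis would fit a model here.

    Returns the out-of-bag indices (samples not drawn into this resample).
    """
    in_bag = set(bootstrap_indices(n_samples, seed))
    return sorted(set(range(n_samples)) - in_bag)


def run_bootstrap(n_samples, n_resamples, n_workers):
    """Run independent resamples on a worker pool, one fixed seed per resample."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        tasks = pool.map(lambda seed: analyze_resample(n_samples, seed),
                         range(n_resamples))
        return list(tasks)
```

         <p>Because each resample carries its own seed and does not depend on any other resample, the results in this sketch are the same for any number of workers; only the wall time changes.</p>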
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Benchmarks and run time</p>
            </caption>
            <text>
               <p><b>Benchmarks and run time</b>. a) Fold increase in speed from parallelization. Ratios of the user wall time of execution of the R code (varSelRFBoot without previous model fit) between a run with a single Rmpi slave and runs with different numbers of Rmpi slaves (the number of simultaneously executing R processes) for five data sets (see [1] for details). The legend shows, in parentheses, the user wall time of the execution with a single Rmpi slave for each data set. In all cases (except "1", "60(2)", and "90(3)") there were four Rmpi slaves per node. The timings were obtained in an otherwise idle cluster with 30 nodes, each with two dual-core AMD Opteron 2.2 GHz CPUs and 6 GB RAM, running Debian GNU/Linux and a stock 2.6.8 kernel, with version 7.1.2 of LAM/MPI and version 2.1.4 (patched) of R. The value "60(2)" refers to a configuration with 2 slaves per node (recall that a node with two dual-core CPUs is not identical to a node with 4 CPUs), and the value "90(3)" to a configuration with 3 slaves per node. b) Scaling of user wall time. User wall time as a function of number of arrays and number of genes when executing the R function varSelRFBoot without previous model fit. Shown are three replicate runs. In each run, the arrays and genes are selected randomly from the complete original data set. Further details about the Prostate data set are given in [1]. Hardware and software as above. We used 4 Rmpi slaves per node (and, thus, a total of 120 slaves). c) User wall time of the web-based application. User wall time for complete runs (i.e., including upload of files and return of complete HTML page) for ten different data sets (see details in [1]). Under the name of each data set, the number of arrays and the number of genes are indicated. For each data set, three replicate runs were conducted. Hardware and software configuration as above, with the default settings for the web-based application (4 Rmpi slaves per node, and thus a total of 120 slaves).</p>
            </text>
            <graphic file="1471-2105-8-328-2"/>
         </fig>
         <p>The scaling of user wall time of the R code (varSelRFBoot) with number of arrays and number of genes is shown in Figure <figr fid="F2">2b</figr>, with the default parallelization scheme and with a data set that allows for exploring a range of numbers of arrays and genes. User wall time increases approximately linearly with the number of arrays and number of genes over a realistic range of arrays and genes (e.g., when we double the number of arrays from 40 to 80 the user wall time increases by a factor of slightly over 2).</p>
         <p>The run time for the web-based application for a wide range of data sets is shown in Figure <figr fid="F2">2c</figr>. These timings include the time needed to upload the files (and thus can be affected by internet connection speed) and to prepare and return to the user the final figures. Note that in most cases the complete analysis is finished within 20 minutes.</p>
         <p>Scripts for timing experiments are included with the source code (directory "Benchmarks").</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>Our procedure is explicitly targeted at selecting very small sets of genes, and has been shown <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> to have a classification error rate on par with other state-of-the-art classification procedures. Additionally, our programs allow the exploratory use of random forest for identifying large subsets of genes potentially relevant for class prediction. In contrast to other tools, such as M@CBETH <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, we are not restricted to two-class problems.</p>
         <p>To avoid underestimating the error rate of the classification procedure, we use the bootstrap (the 0.632+ approach of <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>). As in <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, we bootstrap the complete procedure, including selecting the classifier with minimal out-of-bag error rate (thus, this is a "full" or "double" bootstrap procedure, sensu <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>), and thus our estimates of error rate are not affected by selection biases. This contrasts, for instance, with Prophet <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, where the error rate reported is that of the classifier with the smallest cross-validated error rate. Based upon the bootstrap results, we also show the average out-of-bag predictions for each sample, making it easy to assess poorly predicted samples and potential outliers. There are other tools available for performing cross-validation and bootstrap of classification methods, such as the R package ipred <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> by A. Peters and T. Hothorn, the BioConductor package MCRestimate <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> by M. Ruschhaupt, U. Mansmann, P. Warnat, W. Huber and A. Benner, specifically targeted to computing misclassification error rates combining the gene selection and classification steps, or the caGEDA web application <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> that incorporates bootstrap, leave-one-out, and random resampling validation of several classifiers. Our approach, however, has been tailored to our own variable selection procedure and has been parallelized. A unique feature of GeneSrF and varSelRF is their emphasis on examining possible multiple solutions.</p>
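         <p>The text cites the 0.632+ approach without spelling out the estimator. The following minimal Python sketch shows the estimator in its commonly stated form: a weighted average of the resubstitution error and the out-of-bag bootstrap error, with a weight that grows with the relative overfitting rate. The function and argument names are hypothetical, clipping the overfitting rate to the interval from 0 to 1 is a simplifying assumption (published variants treat the boundary cases slightly differently), and this is not the code used in varSelRF.</p>

```python
def err_632_plus(resub_err, oob_err, no_info_rate):
    """Sketch of the 0.632+ error estimate.

    resub_err:    error of the rule on the data used to build it
    oob_err:      average error on samples left out of each bootstrap resample
    no_info_rate: error expected if predictors and class labels were independent
    """
    denom = no_info_rate - resub_err
    if denom > 0 and oob_err > resub_err:
        # Relative overfitting rate, clipped to 1 (assumption of this sketch).
        overfit = min((oob_err - resub_err) / denom, 1.0)
    else:
        overfit = 0.0
    # The weight goes from 0.632 (no overfitting) to 1 (maximal overfitting).
    weight = 0.632 / (1.0 - 0.368 * overfit)
    return (1.0 - weight) * resub_err + weight * oob_err
```

         <p>With no overfitting the sketch reduces to the plain 0.632 estimate; with maximal overfitting it returns the out-of-bag error itself.</p>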
         <p>Since we obtain 200 resamples in the process of bootstrapping (see above) there is little added computational cost to providing analysis of stability and multiplicity of solutions. We report the number of genes selected and the identity of the individual genes selected in the original sample and the 200 bootstrap runs, including frequencies of every gene selected in the solutions. Moreover, the biological interpretation of the results is enhanced by the access to additional information. If the input file contains gene identifiers for either human, mouse, or rat genomes (in the form of Affymetrix IDs, Clone IDs, GenBank Accession numbers, Ensembl Gene IDs, Unigene clusters, or Entrez Gene IDs), for each gene in the results, the web-based application provides a link to IDClight <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, which allows the user to obtain additional information, including mapping between gene and protein identifiers, PubMed references, Gene Ontology terms, and KEGG and Reactome pathways. The multiple solutions can be further studied by sending sets of selected genes to our tool PaLS <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> to examine PubMed references, Gene Ontology terms, KEGG pathways, or Reactome pathways that are common to a user-selected percentage of genes or lists (bootstrap solutions). A fully commented example of the output is provided in the on-line help <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>.</p>
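         <p>The frequency with which each gene appears in the bootstrap solutions can be tabulated in a few lines. The following Python sketch is only an illustration with hypothetical names and made-up gene labels; it assumes each bootstrap solution is available as a list of gene identifiers and is not the code used by GeneSrF.</p>

```python
from collections import Counter


def selection_frequencies(bootstrap_solutions):
    """Fraction of bootstrap solutions in which each gene was selected."""
    counts = Counter(gene
                     for solution in bootstrap_solutions
                     for gene in set(solution))  # count a gene once per solution
    n_runs = len(bootstrap_solutions)
    return {gene: count / n_runs for gene, count in counts.items()}


# Made-up example: four bootstrap solutions over three hypothetical genes.
example = [["geneA", "geneB"], ["geneA"], ["geneC", "geneA"], ["geneB"]]
frequencies = selection_frequencies(example)
```

         <p>Genes with frequencies close to 1 are selected in almost every resample, whereas low frequencies across all genes point to the multiplicity problem discussed above.</p>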
         <p>Finally, GeneSrF is one of the very few tools for the analysis of gene expression data that uses parallelization and, as far as we know, the only web-based tool to use parallelization for gene selection and classification. This is an important methodological novelty, as we can no longer expect that increases in single-CPU speed will allow us to analyze larger data sets in shorter time: the rate of increase in CPU speed has slowed down considerably in the last five years but, in contrast, increasing numbers of CPU cores (either in individual machines &#8211; including laptops &#8211; or via off-the-shelf computing clusters) are becoming much more affordable <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. Thus, further decreases in user wall time (time to wait for a result) and the ability to tackle more complex problems will depend on our ability to use parallel, distributed, and concurrent programming. GeneSrF therefore represents a case example of combining parallel computing with a user-friendly web-based application for the analysis of gene expression data and, by making the full source code available, allows other researchers to build upon our developments.</p>
         <p>Future work focuses on extending the software to use random forest-related techniques applicable to heterogeneous types of variables, such as the addition of categorical data <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and other clinical information. We are also exploring alternative mechanisms and languages for parallelizing and distributing computations, and we are rewriting most of the code using Pylons <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, a Python web framework, to try to simplify installation of the web-based application. Installation now involves several steps (see <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>); the most time-consuming are setting up and verifying the LAM/MPI environment, and using the correct paths in the files involved in controlling the MPI environment and executing and controlling R.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>varSelRF and GeneSrF implement a validated method for gene selection and provide bootstrap estimates of classification error rate, take advantage of computing clusters and multicore processors, and encourage careful examination of the multiplicity-of-solutions problem. Thus, they are both useful tools for applied biomedical researchers using microarray and gene expression data, and unique methodological developments in the area of web-based gene expression analysis tools.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>For GeneSrF:</p>
         <p><b>Project name: </b>GeneSrF</p>
         <p>
            <b>Project home page: </b>
            <url>http://genesrf2.bioinfo.cnio.es</url>
         </p>
         <p><b>Operating system: </b>Platform independent (web-based application)</p>
         <p><b>Programming language: </b>R, Python</p>
         <p><b>Other requirements: </b>A web browser.</p>
         <p><b>License: </b>None for usage. Web-based code: Affero GPL (open source).</p>
         <p><b>Any restrictions to use by non-academics: </b>None.</p>
         <p>For varSelRF:</p>
         <p><b>Project name: </b>varSelRF</p>
         <p>
            <b>Project home page: </b>
            <url>http://launchpad.net/varselrf</url>
         </p>
         <p><b>Operating system: </b>Linux, Unix</p>
         <p><b>Programming language: </b>R, Python</p>
         <p><b>Other requirements: </b>LAM/MPI</p>
         <p><b>License: </b>GNU GPL</p>
         <p><b>Any restrictions to use by non-academics: </b>None</p>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>CGI, Common Gateway Interface; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; LAM, Local Area Multicomputer; MPI, Message Passing Interface.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>A. Alib&#233;s and A. Ca&#241;ada for their work on IDClight and PaLS. Two anonymous reviewers for comments that improved the manuscript. Bioinformatics.org and The Launchpad for project and repository hosting. Funding provided by Fundaci&#243;n de Investigaci&#243;n M&#233;dica Mutua Madrile&#241;a and Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science (MEC). R.D.-U. is partially supported by the Ram&#243;n y Cajal programme of the Spanish MEC.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Gene selection and classification of microarray data using random forest</p>
            </title>
            <aug>
               <au>
                  <snm>D&#237;az-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Alvarez de Andr&#233;s</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue/>
            <fpage>3</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1363357</pubid>
                  <pubid idtype="pmpid" link="fulltext">16398926</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Selection bias in gene extraction on the basis of microarray gene-expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Ambroise</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>GJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <issue>10</issue>
            <fpage>6562</fpage>
            <lpage>6566</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124442</pubid>
                  <pubid idtype="pmpid" link="fulltext">11983868</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification</p>
            </title>
            <aug>
               <au>
                  <snm>Simon</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Radmacher</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Dobbin</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>McShane</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <source>Journal of the National Cancer Institute</source>
            <pubdate>2003</pubdate>
            <volume>95</volume>
            <fpage>14</fpage>
            <lpage>18</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12509396</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Bias in error estimation when using cross-validation for model selection</p>
            </title>
            <aug>
               <au>
                  <snm>Varma</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1397873</pubid>
                  <pubid idtype="pmpid" link="fulltext">16504092</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Classification in microarray experiments</p>
            </title>
            <aug>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fridlyand</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Statistical analysis of gene expression microarray data</source>
            <publisher>New York: Chapman &amp; Hall</publisher>
            <editor>Speed T</editor>
            <pubdate>2003</pubdate>
            <fpage>93</fpage>
            <lpage>158</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions</p>
            </title>
            <aug>
               <au>
                  <snm>Somorjai</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Dolenko</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Baumgartner</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>1484</fpage>
            <lpage>1491</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12912828</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Pan</snm>
                  <fnm>KH</fnm>
               </au>
               <au>
                  <snm>Lih</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>SN</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <fpage>8961</fpage>
            <lpage>8965</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1149502</pubid>
                  <pubid idtype="pmpid" link="fulltext">15951424</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Outcome signature genes in breast cancer: is there a unique set?</p>
            </title>
            <aug>
               <au>
                  <snm>Ein-Dor</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kela</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Getz</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Givol</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Domany</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>171</fpage>
            <lpage>178</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15308542</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Prediction of cancer outcome with microarrays: a multiple random validation strategy</p>
            </title>
            <aug>
               <au>
                  <snm>Michiels</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Koscielny</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hill</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Lancet</source>
            <pubdate>2005</pubdate>
            <volume>365</volume>
            <fpage>488</fpage>
            <lpage>492</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15705458</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Patel</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lyons-Weiler</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Applied Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <fpage>49</fpage>
            <lpage>62</lpage>
            <xrefbib>
               <pubid idtype="pmpid">16323966</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>IDconverter and IDClight: conversion and annotation of gene and protein IDs</p>
            </title>
            <aug>
               <au>
                  <snm>Alib&#233;s</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yankilevich</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ca&#241;ada</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Diaz-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>9</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779800</pubid>
                  <pubid idtype="pmpid" link="fulltext">17214880</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite</p>
            </title>
            <aug>
               <au>
                  <snm>Diaz-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Alib&#233;s</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Morrissey</snm>
                  <fnm>ER</fnm>
               </au>
               <au>
                  <snm>Ca&#241;ada</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rueda</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Neves</snm>
                  <fnm>ML</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <fpage>W75</fpage>
            <lpage>W80</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1933128</pubid>
                  <pubid idtype="pmpid" link="fulltext">17488846</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software</p>
            </title>
            <aug>
               <au>
                  <snm>Sutter</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Dr Dobb's Journal</source>
            <pubdate>2005</pubdate>
            <volume>30</volume>
            <issue>3</issue>
            <fpage>202</fpage>
            <lpage>210</lpage>
         </bibl>
         <bibl id="B14">
            <aug>
               <au>
                  <snm>Kontoghiorghes</snm>
                  <fnm>EJ</fnm>
               </au>
            </aug>
            <source>Handbook of Parallel Computing and Statistics</source>
            <publisher>Boca Raton, FL: Chapman &amp; Hall/CRC</publisher>
            <pubdate>2006</pubdate>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Open source software for the analysis of microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Quackenbush</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Biotechniques</source>
            <pubdate>2003</pubdate>
            <issue>Suppl</issue>
            <fpage>45</fpage>
            <lpage>51</lpage>
            <xrefbib>
               <pubid idtype="pmpid">12664684</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Supervised methods with genomic data: a review and cautionary view</p>
            </title>
            <aug>
               <au>
                  <snm>D&#237;az-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Data analysis and visualization in genomics and proteomics</source>
            <publisher>New York: Wiley</publisher>
            <editor>Azuaje F, Dopazo J</editor>
            <pubdate>2005</pubdate>
            <fpage>193</fpage>
            <lpage>214</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>M@CBETH: a microarray classification benchmarking tool</p>
            </title>
            <aug>
               <au>
                  <snm>Pochet</snm>
                  <fnm>NL</fnm>
               </au>
               <au>
                  <snm>Janssens</snm>
                  <fnm>FA</fnm>
               </au>
               <au>
                  <snm>De Smet</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Marchal</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Suykens</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>De Moor</snm>
                  <fnm>BL</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>3185</fpage>
            <lpage>3186</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15890742</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Prophet, a web-based tool for class prediction using microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Medina</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Montaner</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tarraga</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dopazo</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <inpress/>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17138587</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <aug>
               <au>
                  <cnm>R Development Core Team</cnm>
               </au>
            </aug>
            <source>R: A language and environment for statistical computing</source>
            <publisher>R Foundation for Statistical Computing, Vienna, Austria</publisher>
            <pubdate>2004</pubdate>
            <url>http://www.R-project.org</url>
            <note>[ISBN 3-900051-00-3]</note>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Classification and Regression by randomForest</p>
            </title>
            <aug>
               <au>
                  <snm>Liaw</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wiener</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>R News</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <issue>3</issue>
            <fpage>18</fpage>
            <lpage>22</lpage>
            <url>http://CRAN.R-project.org/doc/Rnews/</url>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>Pacheco</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Parallel programming with MPI</source>
            <publisher>San Francisco: Morgan Kaufmann</publisher>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B22">
            <aug>
               <au>
                  <snm>Yu</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Rmpi: Interface (Wrapper) to MPI (Message-Passing Interface)</source>
            <url>http://www.stats.uwo.ca/faculty/yu/Rmpi</url>
         </bibl>
         <bibl id="B23">
            <aug>
               <au>
                  <snm>Tierney</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rossini</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sevcikova</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>snow: Simple Network of Workstations</source>
            <url>http://cran.r-project.org/src/contrib/Descriptions/snow.html</url>
         </bibl>
         <bibl id="B24">
            <source>GeneSrF on-line commented example</source>
            <url>http://genesrf2.bioinfo.cnio.es/Examples/Leukemia/results.html</url>
         </bibl>
         <bibl id="B25">
            <aug>
               <au>
                  <snm>Foster</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Designing and building parallel programs</source>
            <publisher>Boston: Addison Wesley</publisher>
            <editor/>
            <pubdate>1995</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The Impact of Multicore on Computational Science Software</p>
            </title>
            <aug>
               <au>
                  <snm>Dongarra</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gannon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Fox</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Kennedy</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>CTWatch Quarterly</source>
            <pubdate>2007</pubdate>
            <volume>3</volume>
            <fpage>3</fpage>
            <lpage>10</lpage>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Improvements on cross-validation: the .632+ bootstrap method</p>
            </title>
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>J American Statistical Association</source>
            <pubdate>1997</pubdate>
            <volume>92</volume>
            <fpage>548</fpage>
            <lpage>560</lpage>
         </bibl>
         <bibl id="B28">
            <aug>
               <au>
                  <snm>Peters</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hothorn</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>ipred: Improved Predictors</source>
            <url>http://cran.r-project.org/src/contrib/Descriptions/ipred.html</url>
         </bibl>
         <bibl id="B29">
            <aug>
               <au>
                  <snm>Ruschhaupt</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mansmann</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Warnat</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Huber</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Benner</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>MCRestimate</source>
            <url>http://www.bioconductor.org/packages/1.9/bioc/html/MCRestimate.html</url>
         </bibl>
         <bibl id="B30">
            <source>PaLS</source>
            <url>http://pals.bioinfo.cnio.es</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution</p>
            </title>
            <aug>
               <au>
                  <snm>Strobl</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Boulesteix</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Zeileis</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hothorn</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>25</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1796903</pubid>
                  <pubid idtype="pmpid" link="fulltext">17254353</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <source>Pylons</source>
            <url>http://pylonshq.com</url>
         </bibl>
         <bibl id="B33">
            <source>Download page</source>
            <url>http://bioinformatics.org/asterias/wiki/Main/DownloadPage</url>
         </bibl>
      </refgrp>
   </bm>
</art>
