<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-37</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Glez-Pe&#241;a</snm>
               <fnm>Daniel</fnm>
               <insr iid="I1"/>
               <email>dgpena@uvigo.es</email>
            </au>
            <au id="A2">
               <snm>&#193;lvarez</snm>
               <fnm>Rodrigo</fnm>
               <insr iid="I2"/>
               <email>rodrigo.djv@gmail.com</email>
            </au>
            <au id="A3">
               <snm>D&#237;az</snm>
               <fnm>Fernando</fnm>
               <insr iid="I3"/>
               <email>fdiaz@uvigo.es</email>
            </au>
            <au ca="yes" id="A4">
               <snm>Fdez-Riverola</snm>
               <fnm>Florentino</fnm>
               <insr iid="I1"/>
               <email>riverola@uvigo.es</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Escuela Superior de Ingenier&#237;a Inform&#225;tica, University of Vigo, Edificio Polit&#233;cnico, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain</p>
            </ins>
            <ins id="I2">
               <p>Departamento de Inform&#225;tica, University of Vigo, Edificio Fundici&#243;n, Campus As Lagoas-Marcosende, 36310 Vigo, Pontevedra, Spain</p>
            </ins>
            <ins id="I3">
               <p>Escuela Universitaria de Inform&#225;tica, University of Valladolid, Plaza Santa Eulalia, 9-11, 40005 Segovia, Spain</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>1</issue>
         <fpage>37</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/37</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19178723</pubid>
               <pubid idtype="doi">10.1186/1471-2105-10-37</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>29</day>
               <month>9</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>29</day>
               <month>1</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>29</day>
               <month>1</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Glez-Pe&#241;a et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Expression profiling assays based on DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in making full use of these data is to develop algorithms that interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (<it>i</it>) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in the preliminary steps of microarray data analysis, and (<it>ii</it>) reduce the cost and complexity of machine learning techniques applied later, while still yielding interpretable models.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, <it>Fuzzy Pattern</it>) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, <it>Discriminant Fuzzy Pattern</it>) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. Based on these contributions <smcaps>GENE</smcaps>CBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. Because of the amount of data produced by this technology, gene reduction is extremely important: (<it>i</it>) it generally reduces the computational cost of machine learning techniques, (<it>ii</it>) it usually increases the accuracy of classification algorithms and (<it>iii</it>) it provides clues to researchers about genes that are important in a given context (e.g. biomarkers for certain diseases) <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>.</p>
         <p>In this domain, gene identification has previously been addressed by Fuhrman <it>et al</it>. using information theory <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Several methods have been proposed to reduce dimensionality in the microarray data domain. These include genetic algorithms <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, wrapper approaches <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, support vector machines <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp> and spectral biclustering <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Other approaches focus on redundancy reduction and feature extraction <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, as well as on identifying similar gene classes by building prototype genes <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
         <p>In addition, several packages implemented in R are available for feature selection, such as iterativeBMA <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, varSelRF <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> and R-SVM <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. iterativeBMA is a Bioconductor R package that performs multivariate feature selection for multiclass microarray data and is based on the Bayesian model averaging (BMA) approach. The varSelRF package implements a method for gene selection based on the measures of variable importance returned by the random forest algorithm, and it is also suitable for multivariate and multiclass datasets. The R-SVM method is similar to varSelRF in that it uses the relative importance of features in SVM classifiers to select relevant genes, but it is only applicable to binary classifications. Finally, we also consider the ttest function of the genefilter package (available from Bioconductor), which implements the conventional t-test method for feature selection. Table <tblr tid="T1">1</tblr> shows a comparative analysis of these R-based methods and the proposed DFP algorithm.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Comparative analysis of R-based methods for gene selection</p>
            </caption>
            <tblbdy cols="6">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>iterativeBMA </b>
                        <abbrgrp>
                           <abbr bid="B11">11</abbr>
                        </abbrgrp>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>varSelRF </b>
                        <abbrgrp>
                           <abbr bid="B12">12</abbr>
                           <abbr bid="B13">13</abbr>
                        </abbrgrp>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>R-SVM </b>
                        <abbrgrp>
                           <abbr bid="B14">14</abbr>
                        </abbrgrp>
                     </p>
                  </c>
                  <c ca="left">
                     <p><b>ttest </b>[genefilter]</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>DFP</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="6">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Method</p>
                  </c>
                  <c ca="left">
                     <p>Bayesian model averaging (BMA) approach over the underlying classification model (logistic regression)</p>
                  </c>
                  <c ca="left">
                     <p>varSelRF uses the measures of variable importance (related to the classification) provided directly by the Random Forest algorithm</p>
                  </c>
                  <c ca="left">
                     <p>R-SVM uses a contribution factor of each feature (computed from the weights of the SVM classifier)</p>
                  </c>
                  <c ca="left">
                     <p>t-test</p>
                  </c>
                  <c ca="left">
                     <p>The selected genes are based on the induced fuzzy pattern for each class</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Type of classification</p>
                  </c>
                  <c ca="left">
                     <p>Multiclass</p>
                  </c>
                  <c ca="left">
                     <p>Multiclass</p>
                  </c>
                  <c ca="left">
                     <p>Binary classifications</p>
                  </c>
                  <c ca="left">
                     <p>Binary classifications</p>
                  </c>
                  <c ca="left">
                     <p>Multiclass</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Dependence among features</p>
                  </c>
                  <c ca="left">
                     <p>Multivariate</p>
                  </c>
                  <c ca="left">
                     <p>Multivariate</p>
                  </c>
                  <c ca="left">
                     <p>Multivariate</p>
                  </c>
                  <c ca="left">
                     <p>Univariate</p>
                  </c>
                  <c ca="left">
                     <p>Univariate</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Remarks</p>
                  </c>
                  <c ca="left">
                     <p>The method facilitates biological interpretation by producing posterior probabilities of selected genes and models. BMA accounts for the uncertainty about the best set to choose by averaging over multiple models</p>
                     <p>The R package is available from Bioconductor</p>
                     <p>The method requires a limit in the maximum number of relevant genes to be selected and the final results are conditioned by an initial selection based on a univariate gene selection method</p>
                  </c>
                  <c ca="left">
                     <p>The method does not require pre-specifying the number of genes to be selected, but rather adaptively chooses the number of genes</p>
                     <p>The R package is available from CRAN and its implementation takes advantage of computing clusters and multicore processors</p>
                     <p>The varSelRF is biased to identify small sets of genes that can still achieve good predictive performance (thus, highly correlated genes will not be selected since they are considered as redundant genes)</p>
                  </c>
                  <c ca="left">
                     <p>The algorithm is based on the repeated application of the SVM classifier over progressively smaller sets of genes (where genes are excluded according to the defined contribution factor) until a satisfactory solution is achieved. The number of iterations and the number of features to be selected in each iteration are very <it>ad hoc</it></p>
                     <p>The R-SVM method is only suitable for binary classifications</p>
                  </c>
                  <c ca="left">
                     <p>The computational effort is smaller than multivariate methods</p>
                     <p>The genefilter package is available from Bioconductor</p>
                     <p>It is sensitive to outliers, which are frequent in microarray data</p>
                     <p>It requires the expression levels to be normally distributed within both classes</p>
                  </c>
                  <c ca="left">
                     <p>It does not require any assumption about the distribution of the expression levels</p>
                     <p>It accounts for the noise in the data because, as a fuzzy-based method, it deals with linguistic categories instead of raw data</p>
                     <p>The implementation is computationally efficient and available from Bioconductor</p>
                     <p>The DFP method does not take into consideration that features are influencing a biological outcome in the context of networks of interacting genes</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>In this context, there are many advantages of applying fuzzy logic to the analysis of gene expression data: (<it>i</it>) fuzzy logic inherently accounts for noise in the data because it extracts trends, not crisp values; (<it>ii</it>) in contrast to other automated decision making techniques, algorithms in fuzzy logic are cast in the same language used in day-to-day conversation, so conclusions are easily interpretable and can be extrapolated; (<it>iii</it>) fuzzy logic techniques are computationally efficient and can be scaled to include a high number of components <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
         <p>Based on these assumptions, the aim in writing DFP was to provide a simple-to-use library to perform gene selection and data reduction by the application of a supervised fuzzy pattern algorithm able to discretize and filter existing gene expression profiles.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>DFP is an extension package for the programming language and statistical environment R <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. The software has been developed to perform fuzzy analysis and gene reduction using microarray data. It employs object classes and functions that are also standard in other packages of the Bioconductor project <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The whole algorithm comprises three main steps. First, it represents each gene value in terms of one of the following linguistic labels: Low, Medium, High and their intersections LowMedium and MediumHigh. The output is a <it>fuzzy microarray descriptor </it>(FMD) for each existing sample (microarray) containing the discretized gene expression values. Second, it finds the genes that best explain each class, constructing a supervised <it>fuzzy pattern </it>(FP) for each class (pathology). Finally, starting from the previously generated fuzzy patterns, the package identifies those genes that provide substantial discernibility between existing classes, generating a unique <it>discriminant fuzzy pattern </it>(DFP).</p>
         <sec>
            <st>
               <p>Discretizing microarray data using fuzzy labels</p>
            </st>
            <p>In the first step, given a set of <it>n </it>expressed sequence tags (ESTs) or genes belonging to <it>m </it>microarrays, the discretization process determines the membership function of each gene for the linguistic labels introduced above. The package uses two types of membership function (see additional file <supplr sid="S1">1</supplr>:MembershipFunctions.pdf for more details about the mathematical background). The first is a polynomial approximation of a Gaussian membership function, which gives a smooth degree of membership for 'normal' expression levels of a gene. The second is a polynomial approximation of two sigmoidal membership functions, which makes it possible to specify asymmetric membership functions for the 'low' and 'high' expression levels (see Figure <figr fid="F1">1</figr>).</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>Definition of Gaussian membership functions implemented in the DFP package.</b> The membership functions to linguistic labels are defined in a similar way to the form that has been used by Pal and Mitra (2004) [doi:10.1109/TKDE.2003.1262181]. These authors used a polynomial function that approximates a Gaussian membership function, where its centre and amplitude depend on the mean and on the variability of the available data respectively. The original membership functions are considered symmetric, but, in our work we have considered asymmetric functions for the linguistic labels in the extremes (labels Low and High).</p>
               </text>
               <file name="1471-2105-10-37-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Shape of membership function for a specific gene and possible assigned labels given a threshold &#952; = 0.7</p>
               </caption>
               <text>
                  <p><b>Shape of membership function for a specific gene and possible assigned labels given a threshold &#952; = 0.7</b>. The centre and amplitude of each membership function depend on the mean and on the variability of the available data respectively. The Medium membership function is considered symmetric whereas the Low and High functions are asymmetric in the extremes.</p>
               </text>
               <graphic file="1471-2105-10-37-1"/>
            </fig>
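            <p>As an illustrative sketch only, and not the package's implementation, the three label shapes can be mimicked in base R with an exact Gaussian curve for Medium and logistic curves for Low and High, in place of the polynomial approximations described in additional file 1. The function name membership_sketch, the steepness constant, and the use of the sample mean and standard deviation as centre and width are all assumptions made for this example.</p>
            <p><![CDATA[
```r
# Illustrative membership functions for one gene (assumed shapes, not DFP's own).
# Medium: Gaussian centred on the mean; Low/High: mirrored logistic curves.
membership_sketch <- function(x, values, steepness = 4) {
  centre <- mean(values)   # assumed centre of the Medium label
  width  <- sd(values)     # assumed spread, taken from the data variability
  z <- (x - centre) / width
  cbind(
    Low    = 1 / (1 + exp(steepness * (z + 1))),    # close to 1 well below the mean
    Medium = exp(-z^2 / 2),                         # peaks at the mean
    High   = 1 / (1 + exp(-steepness * (z - 1)))    # close to 1 well above the mean
  )
}

expr <- c(5.1, 5.8, 6.4, 7.0, 7.9, 8.6)   # made-up expression values for one gene
mu <- membership_sketch(expr, expr)       # one row per sample, one column per label
print(round(mu, 2))
```
]]></p>
            <p>Each row of the resulting matrix holds the degrees of membership of one sample to the three labels; in this sketch the Low and High curves are mirror images of each other, whereas the package allows them to be asymmetric.</p>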
            <p>The algorithm defines a threshold value &#952;, which needs to be set in order to discretize the original data in a binary way. For particular values of &#952;, there can be zones of the gene's value domain in which none of the labels is activated (the neighborhood of the intersection of labels Medium and High in Figure <figr fid="F1">1</figr>). In that case, the value of the gene is not sufficient to assign it a linguistic label at the degree of membership fixed by &#952;.</p>
            <p>Conversely, one expression level can simultaneously activate two linguistic labels when, at the significance level given by &#952;, the assignment of the measure to either label is significant (the neighborhood of the intersection of labels Low and Medium in Figure <figr fid="F1">1</figr>).</p>
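            <p>The effect of the threshold can be sketched in base R as follows. The membership degrees are made-up numbers and assign_labels is a hypothetical helper, not a DFP function; the sketch only shows how a single value of &#952; can leave a gene with zero, one or two active labels.</p>
            <p><![CDATA[
```r
# Made-up membership degrees of one gene in four samples (rows) to three labels.
mu <- rbind(
  s1 = c(Low = 0.95, Medium = 0.20, High = 0.00),   # clearly Low
  s2 = c(Low = 0.72, Medium = 0.75, High = 0.01),   # Low and Medium both active
  s3 = c(Low = 0.02, Medium = 0.60, High = 0.55),   # no label exceeds theta
  s4 = c(Low = 0.00, Medium = 0.10, High = 0.98)    # clearly High
)

# Hypothetical helper: keep every label whose membership exceeds theta.
assign_labels <- function(mu, theta) {
  apply(mu, 1, function(row) {
    active <- names(row)[row > theta]
    if (length(active) == 0) NA_character_ else paste(active, collapse = "")
  })
}

labels <- assign_labels(mu, theta = 0.7)
print(labels)
```
]]></p>
            <p>With &#952; = 0.7, sample s2 receives the combined label LowMedium and sample s3 receives no label at all, mirroring the two situations described above.</p>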
         </sec>
         <sec>
            <st>
               <p>Assembling a supervised fuzzy pattern of representative genes</p>
            </st>
            <p>A fuzzy pattern is a higher-level concept built from a set of FMDs belonging to the same class, and it can be viewed as a prototype of them. The FP corresponding to a given class is constructed by selecting the genes with a label whose relative frequency of appearance is equal to or greater than a predefined ratio &#960; (0 &lt; &#960; &#8804; 1). Therefore, the FP captures relevant and common information about the discretized gene expression levels of the FMDs that it summarizes.</p>
            <p>The predefined ratio &#960; controls how stringent the selection of a gene as a member of the pattern is: the higher the value of &#960;, the fewer genes make up the FP. The pattern is fuzzy in the sense that the labels composing it are the linguistic labels assigned when each initial observation was transformed into a FMD. Moreover, if a specific label of a gene is very common in all the examples belonging to a given class, this feature will be selected for inclusion in the FP. Therefore, a frequency-based criterion is used for selecting a gene as part of the fuzzy pattern.</p>
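            <p>A minimal base R sketch of this frequency-based criterion is given below. The label matrix is made up, fuzzy_pattern_sketch is a hypothetical helper rather than the package's calculateFuzzyPatterns function, and counting unlabelled samples (NA) in the denominator is an assumption of the sketch.</p>
            <p><![CDATA[
```r
# Made-up discretized labels (FMDs) for five samples of one class; rows are genes.
fmd <- rbind(
  geneA = c("High", "High", "High", "High", "High"),
  geneB = c("Low", "Medium", "Low", "High", "Medium"),
  geneC = c("Low", "Low", "Low", "Low", "Medium"),
  geneD = c("Medium", NA, "Medium", "Medium", "Medium")
)

# Hypothetical helper: keep a gene when its most frequent label reaches pi_val.
fuzzy_pattern_sketch <- function(fmd, pi_val) {
  pattern <- apply(fmd, 1, function(lab) {
    freq <- table(lab) / length(lab)   # NA (no label) lowers the relative frequency
    best <- names(freq)[which.max(freq)]
    if (max(freq) >= pi_val) best else NA_character_
  })
  pattern[!is.na(pattern)]             # genes without a dominant label are dropped
}

fp <- fuzzy_pattern_sketch(fmd, pi_val = 0.75)
print(fp)
```
]]></p>
            <p>Raising pi_val to 0.9 in this example leaves only geneA in the pattern, which illustrates how a higher ratio yields a smaller FP.</p>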
         </sec>
         <sec>
            <st>
               <p>Recognizing valuable genes</p>
            </st>
            <p>The goal of gene selection is to determine a reduced set of genes that are meaningful given the existing knowledge. Here, the algorithm introduces the notion of a discriminant fuzzy pattern with regard to a collection of FPs. The DFP version of a FP only includes those genes that can serve to differentiate it from the rest of the patterns. Therefore, the computed DFP for a specific FP differs depending on which other FPs it is compared with. It is not surprising that the genes used to discern a specific class from others (by means of its DFP) will be different if the set of rival classes also changes. The pseudocode of the algorithm used to compute the final DFP containing the selected genes can be consulted in additional file <supplr sid="S2">2</supplr>:DFPpseudocode.pdf.</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Pseudo code algorithm used to compute the final DFP containing the selected genes.</b> A DFP version of a FP only includes those genes that can serve to differentiate it from the rest of the fuzzy patterns.</p>
               </text>
               <file name="1471-2105-10-37-S2.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
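            <p>The selection rule can be sketched in base R under a simplifying assumption: a gene is kept when it appears in at least two fuzzy patterns and those patterns do not all assign it the same label. The fuzzy patterns below are made up, and discriminant_sketch is a hypothetical helper, not the algorithm given in additional file 2.</p>
            <p><![CDATA[
```r
# Made-up fuzzy patterns for three classes: named vectors mapping gene to label.
fps <- list(
  healthy = c(geneA = "Low",  geneB = "Medium", geneC = "High"),
  AML     = c(geneA = "High", geneB = "Medium", geneD = "Low"),
  CLL     = c(geneA = "High", geneC = "High")
)

# Hypothetical helper: a gene is discriminant when it appears in at least two
# fuzzy patterns and those patterns do not all give it the same label.
discriminant_sketch <- function(fps) {
  genes <- unique(unlist(lapply(fps, names)))
  keep <- vapply(genes, function(g) {
    labs <- unlist(lapply(fps, function(fp) unname(fp[g])))
    labs <- labs[!is.na(labs)]         # classes whose FP lacks the gene
    length(labs) >= 2 && length(unique(labs)) >= 2
  }, logical(1))
  genes[keep]
}

dfp_genes <- discriminant_sketch(fps)
print(dfp_genes)
```
]]></p>
            <p>Only geneA is retained here: geneB and geneC carry the same label wherever they appear, and geneD belongs to a single pattern. Removing a class from the list can change the result, in line with the dependence on rival classes noted above.</p>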
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>The DFP package has been designed for performing fuzzy analysis and gene reduction from a set of microarray experiments. DFP, like any R package, is command-line driven. The functions are called by the user, possibly with arguments and options. Any session using DFP in R starts with the command</p>
         <p>library (DFP)</p>
         <p>which makes the functions of DFP available in the R environment.</p>
         <p>A quick-start example can be run using the artificial data set rmadataset, included in the package</p>
         <p>data(rmadataset)</p>
         <p>Once the data is loaded, the whole algorithm can be executed by calling its main function discriminantFuzzyPattern(rmadataset), which runs with the default parameter values, or step by step as in the following example</p>
         <p>mfs&lt;-calculateMembershipFunctions(rmadataset, skipFactor = 3)</p>
         <p>which calculates the membership functions (Low, Medium, High) for each gene. These functions can be displayed using the following command (see Figure <figr fid="F2">2</figr>)</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Membership functions belonging to the first two genes</p>
            </caption>
            <text>
               <p><b>Membership functions belonging to the first two genes</b>. Vertical lines show the expression values corresponding to each microarray sample.</p>
            </text>
            <graphic file="1471-2105-10-37-2"/>
         </fig>
         <p>plotMembershipFunctions(rmadataset, mfs, featureNames(rmadataset)[1:2])</p>
         <p>DFP can now convert gene expression values (raw data) into linguistic labels. A gene is assigned a linguistic label if its degree of membership to that label exceeds the threshold &#952; (parameter zeta). This is done by the command</p>
         <p>dvs&lt;-discretizeExpressionValues(rmadataset, mfs, zeta = 0.5, overlapping = 2)</p>
         <p>showing part of the results with the following function</p>
         <p>showDiscreteValues(dvs, featureNames(rmadataset)[1:10], c("healthy", "AML-inv"))</p>
         <p>The next step involves the generation of a fuzzy pattern that summarizes the most relevant genes of each category. A gene will belong to a FP if its assigned label is present with a frequency higher than &#960; (parameter piVal). This is done by the command</p>
         <p>fps&lt;-calculateFuzzyPatterns(rmadataset, dvs, piVal = 0.9, overlapping = 2)</p>
         <p>showing part of the results with the following function</p>
         <p>showFuzzyPatterns (fps, "healthy") [21:50]</p>
         <p>The last step calculates the discriminant fuzzy pattern by including those genes present in two or more fuzzy patterns with different assigned labels. The following command performs this operation</p>
         <p>dfps&lt;-calculateDiscriminantFuzzyPattern (rmadataset, fps)</p>
         <p>The selected genes can now be shown in both text and graphical mode (see Figure <figr fid="F3">3</figr>) using the function</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>DFP of selected genes (in rows) with their appearance frequency for each category (in columns)</p>
            </caption>
            <text>
               <p><b>DFP of selected genes (in rows) with their appearance frequency for each category (in columns)</b>. In the first table, an NA value is assigned if the frequency of appearance is lower than or equal to the piVal parameter, meaning that the gene does not belong to the FP of that category.</p>
            </text>
            <graphic file="1471-2105-10-37-3"/>
         </fig>
         <p>plotDiscriminantFuzzyPattern(dfps, overlapping = 2)</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>DFP is a new Bioconductor R package that performs gene selection and data reduction by the application of a supervised fuzzy pattern algorithm. Like other Bioconductor/R packages, DFP offers a high level of standardized documentation through its vignette and the function help pages.</p>
         <p>The implemented algorithm has also been coded and tested in <smcaps>GENE</smcaps>CBR, a multiplatform open source tool for microarray analysis <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The results obtained using publicly available data sets validate the effectiveness of the proposed algorithm <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p><b>Project name</b>: DFP</p>
         <p><b>Project home page</b>: <url>http://bioconductor.org/packages/2.3/bioc/html/DFP.html</url></p>
         <p><b>Operating systems</b>: Platform independent</p>
         <p><b>Programming language</b>: R</p>
         <p><b>Other requirements</b>: R, Bioconductor</p>
         <p><b>License</b>: GNU GPL</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>DGP and FFR programmed and tested the geneCBR application. RA and FD implemented and tested the code of the DFP package. FFR wrote the paper while DGP, RA and FD provided comments and discussion. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Gonzalo G&#243;mez for valuable discussion of early versions of the manuscript. This work is partly funded by the research projects <it>BioTools </it>(ref. 2008-INOU-2) from University of Vigo and <it>Development of computational tools for the classification and clustering of gene expression data in order to discover meaningful biological information in cancer diagnosis </it>(ref. VA100A08) from JCyL (Spain). The work of DGP is supported by a "Maria Barbeito" research contract from Xunta de Galicia.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Dimension reduction for classification with gene expression microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Dai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lieu</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rocke</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Stat Appl Genet Mol Biol</source>
            <pubdate>2007</pubdate>
            <volume>5</volume>
            <fpage>Article6</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">16646870</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The application of Shannon entropy in the identification of putative drug targets</p>
            </title>
            <aug>
               <au>
                  <snm>Fuhrman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Cunningham</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Wen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Zweiger</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Seilhamer</snm>
                  <fnm>JJ</fnm>
               </au>
            </aug>
            <source>Biosystems</source>
            <pubdate>2000</pubdate>
            <volume>55</volume>
            <fpage>5</fpage>
            <lpage>14</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0303-2647(99)00077-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">10745103</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Darden</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Weinberg</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Pedersen</snm>
                  <fnm>LG</fnm>
               </au>
            </aug>
            <source>Comb Chem High Throughput Screen</source>
            <pubdate>2001</pubdate>
            <volume>4</volume>
            <issue>8</issue>
            <fpage>727</fpage>
            <lpage>739</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11894805</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Gene selection for cancer classification using wrapper approaches</p>
            </title>
            <aug>
               <au>
                  <snm>Blanco</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Larra&#241;aga</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Inza</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Sierra</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Int J Pattern Recogn</source>
            <pubdate>2004</pubdate>
            <volume>18</volume>
            <issue>8</issue>
            <fpage>1373</fpage>
            <lpage>1390</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1142/S0218001404003800</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Gene selection for cancer classification using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Guyon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Weston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Barnhill</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Mach Learn</source>
            <pubdate>2002</pubdate>
            <volume>46</volume>
            <issue>1&#8211;3</issue>
            <fpage>389</fpage>
            <lpage>422</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1012487302797</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Gene Expression Data Analysis Using Support Vector Machines</p>
            </title>
            <aug>
               <au>
                  <snm>Chu</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Proceedings of the 2003 IEEE International Joint Conference on Neural Networks: 20&#8211;24 July 2003; Portland, Oregon</source>
            <publisher>Springer</publisher>
            <editor>Udo Seiffert, Lakhmi C Jain</editor>
            <pubdate>2003</pubdate>
            <fpage>167</fpage>
            <lpage>189</lpage>
         </bibl>
         <bibl id="B7">
            <title>
               <p>An efficient semi-unsupervised gene selection method via spectral biclustering</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>IEEE Trans Nanobioscience</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>2</issue>
            <fpage>110</fpage>
            <lpage>114</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1109/TNB.2006.875040</pubid>
                  <pubid idtype="pmpid">16805107</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Improved gene selection for classification of microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Jaeger</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sengupta</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ruzzo</snm>
                  <fnm>WL</fnm>
               </au>
            </aug>
            <source>Proceedings of the eighth Pacific Symposium on Biocomputing: 3&#8211;7 January 2003; Lihue, Hawaii</source>
            <publisher>World Scientific Publishing</publisher>
            <editor>Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE</editor>
            <pubdate>2003</pubdate>
            <fpage>53</fpage>
            <lpage>64</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Feature selection and kNN fusion in molecular classification of multiple tumor types</p>
            </title>
            <aug>
               <au>
                  <snm>Qi</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proceedings of the Mathematics and Engineering Techniques in Medicine and Biological Sciences: 24&#8211;27 June 2002; Las Vegas, Nevada, USA</source>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Improving classification of microarray data using prototype-based feature selection</p>
            </title>
            <aug>
               <au>
                  <snm>Hanczar</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Courtine</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Benis</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hennegar</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Cl&#233;ment</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Zucker</snm>
                  <fnm>J-D</fnm>
               </au>
            </aug>
            <source>ACM SIGKDD Explorations Newsletter</source>
            <pubdate>2003</pubdate>
            <volume>5</volume>
            <issue>2</issue>
            <fpage>23</fpage>
            <lpage>30</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/980972.980977</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarrays data</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>10</issue>
            <fpage>2394</fpage>
            <lpage>2402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti319</pubid>
                  <pubid idtype="pmpid" link="fulltext">15713736</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest</p>
            </title>
            <aug>
               <au>
                  <snm>Diaz-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>328</fpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/1471-2105-8-328</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Gene selection and classification of microarray data using random forest</p>
            </title>
            <aug>
               <au>
                  <snm>D&#237;az-Uriarte</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Alvarez de Andr&#233;s</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>3</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1363357</pubid>
                  <pubid idtype="pmpid" link="fulltext">16398926</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-3</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Shi</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Leung</snm>
                  <fnm>HE</fnm>
               </au>
               <au>
                  <snm>Harris</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Iglehart</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Miron</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>197</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1456993</pubid>
                  <pubid idtype="pmpid" link="fulltext">16606446</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-197</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Fuzzy logic approach to gene expression data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Woolf</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Physiol Genomics</source>
            <pubdate>2000</pubdate>
            <volume>3</volume>
            <fpage>9</fpage>
            <lpage>15</lpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>R: A language for data analysis and graphics</p>
            </title>
            <aug>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ihaka</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of Computational and Graphical Statistics</source>
            <pubdate>1996</pubdate>
            <volume>5</volume>
            <fpage>299</fpage>
            <lpage>314</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/1390807</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Bioconductor: open software development for computational biology and bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Carey</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Bates</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Bolstad</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Dettling</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ellis</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Gautier</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ge</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gentry</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hornik</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hothorn</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Huber</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Iacus</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Leisch</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Maechler</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rossini</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Sawitzki</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Smyth</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tierney</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>R80</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">545600</pubid>
                  <pubid idtype="pmpid" link="fulltext">15461798</pubid>
                  <pubid idtype="doi">10.1186/gb-2004-5-10-r80</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Open software tool for microarray analysis</p>
            </title>
            <aug>
               <au>
                  <cnm>geneCBR</cnm>
               </au>
            </aug>
            <url>http://www.genecbr.org</url>
         </bibl>
         <bibl id="B19">
            <title>
               <p><smcaps>GENE</smcaps>-CBR: a Case-Based Reasoning Tool for Cancer Diagnosis using Microarray Datasets</p>
            </title>
            <aug>
               <au>
                  <snm>D&#237;az</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Fdez-Riverola</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Corchado</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Computational Intelligence</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>3&#8211;4</issue>
            <fpage>254</fpage>
            <lpage>268</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1111/j.1467-8640.2006.00287.x</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>

