<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-67</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Evaluation of gene importance in microarray data based upon probability of selection</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Fu</snm>
               <mi>M</mi>
               <fnm>Li</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>lifu@patcar.org</email>
            </au>
            <au id="A2">
               <snm>Fu-Liu</snm>
               <mi>S</mi>
               <fnm>Casey</fnm>
               <insr iid="I1"/>
               <email>casey@patcar.org</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Pacific Tuberculosis and Cancer Research Organization, Pasadena, California, USA</p>
            </ins>
            <ins id="I2">
               <p>University of Florida, Gainesville, Florida, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>67</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/67</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15784140</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-67</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>19</day>
               <month>11</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>22</day>
               <month>3</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>22</day>
               <month>3</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Fu and Fu-Liu; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Microarray devices permit a genome-scale evaluation of gene function. This technology has catalyzed biomedical research and development in recent years. As many important diseases can be traced to the gene level, a long-standing research problem is to identify specific gene expression patterns linked to metabolic characteristics that contribute to disease development and progression. The microarray approach offers an expedited solution to this problem. However, recognizing disease-related gene expression patterns embedded in the microarray data remains a challenge. In selecting a small set of biologically significant genes for classifier design, the high data dimensionality inherent in this problem creates a substantial amount of uncertainty.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Here we present a model for probability analysis of selected genes in order to determine their importance. Our contribution is that we show how to derive the <it>P </it>value of each selected gene in multiple gene selection trials based on different combinations of data samples and how to conduct a reliability analysis accordingly. The importance of a gene is indicated by its associated <it>P </it>value: according to information theory, a smaller value implies higher information content. On the microarray data concerning the subtype classification of small round blue cell tumors, we demonstrate that the method is capable of finding the smallest set of genes (19 genes) with optimal classification performance, compared with results reported in the literature.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>In classifier design based on microarray data, the probability value derived from gene selection based on multiple combinations of data samples provides an effective mechanism for reducing the tendency to fit local data particularities.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Based on the concept of simultaneously studying the expression of a large number of genes, a DNA microarray is a chip on which numerous probes are placed for hybridization with a tissue sample. The DNA microarray has recently emerged as a powerful tool in molecular biology research, offering high throughput analysis of gene expression on a genomic scale. However, biological complexity encoded by a deluge of microarray data is being translated into all sorts of computational, statistical or mathematical problems.</p>
         <p>Driven by the growing genomic technology, molecular medicine has become a rapidly advancing field. An important research topic is to identify disease-related gene expression patterns based on microarray analysis. In one approach, genes are selected for constructing a clinically useful classifier for disease diagnosis. The genes thus selected often shed light on the fundamental molecular mechanisms of the disease <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. As addressed in several research works <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>, the problem of gene selection considered in this context is a difficult one because there are thousands of genes at hand but only a very limited number of samples are available. Mathematically, this problem is characterized by high data dimensionality. To develop a classifier, dimensionality reduction by gene selection is essential. Genes selected for constructing a classifier are believed to be important. Typically, only a small fraction of genes differentially expressed in the diseased tissue will be selected.</p>
         <p>There exist two related but different objectives for gene selection. As mentioned above, one objective is to construct a classifier or predictor for classifying, diagnosing, or predicting the type of cancer tissue according to the expression pattern of selected genes in the tissue <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The other objective is to determine whether the changes in gene expression across two conditions are significant (e.g., SAM) <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The present work is developed in the first context.</p>
         <p>Here, we report new theoretical developments and research results as an extension of our earlier work <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B8">8</abbr></abbrgrp>, presenting a new probabilistic analysis of gene selection from microarray data, which distinguishes our work from other related work.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Probability analysis of selected genes</p>
            </st>
            <p>Under very high data dimensionality, one may ask whether genes could have been selected by chance and whether the selected genes remain significant given the inherent uncertainty and data particularity. Quite often, different subsets of the data yield non-identical sets of selected genes. At the fundamental level, it is important to distinguish between the case of diverse patterns and the case of false patterns. To address the problem, we adopt an approach that takes into account both statistical significance and performance issues. The bootstrapping technique lends itself well to the first issue.</p>
            <p>Suppose we randomly draw samples from a given domain and conduct a gene selection experiment. Assume that we select one gene out of a total of <it>p </it>genes. The probability of the event that a particular gene is selected in a single trial is <it>1/p</it>. According to information theory, the smaller the probability of an event, the more informative the event is. Given a large <it>p</it>, the event seems significant, but this would be true only if we had a particular gene in mind before gene selection; otherwise, the probability should be adjusted for the presence of <it>p </it>genes, and then it becomes clear that any gene selected in a single trial is non-informative. Now suppose we conduct multiple trials and ask whether any gene repeatedly selected across trials is significant. Here we devise an analysis for this question.</p>
            <sec>
               <st>
                  <p>Theorem</p>
               </st>
               <p>In <it>r </it>multiple independent trials conducted for gene selection, select one gene out of a total of <it>p </it>genes in each trial. Given the level of significance &#945;, a gene is considered significant if it is selected <it>r </it>times in <it>r </it>trials and</p>
               <p>
                  <graphic file="1471-2105-6-67-i1.gif"/>
               </p>
            </sec>
            <sec>
               <st>
                  <p>Proof</p>
               </st>
               <p>The probability of the event that the same gene is selected <it>r </it>times in <it>r </it>trials is (1/<it>p</it>)<sup><it>r</it></sup>. Since there are <it>p </it>genes, the adjusted probability (analogous to Bonferroni's correction) is <it>p</it>(1/<it>p</it>)<sup><it>r</it></sup>. Therefore,</p>
               <p>
                  <graphic file="1471-2105-6-67-i2.gif"/>
               </p>
               <p>Equivalently,</p>
               <p>
                  <graphic file="1471-2105-6-67-i3.gif"/>
               </p>
               <p>Thus,</p>
               <p>
                  <graphic file="1471-2105-6-67-i4.gif"/>
               </p>
               <p>Note that the value of <graphic file="1471-2105-6-67-i5.gif"/> is negative. The result follows. &#8718;</p>
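               <p>As a numerical illustration of the adjusted probability <it>p</it>(1/<it>p</it>)<sup><it>r</it></sup> used in the proof, the following minimal Python sketch evaluates it for a given number of genes and trials. The function name is hypothetical and is not part of the Matlab program described in this article.</p>

```python
def adjusted_probability(p: int, r: int) -> float:
    """Return p * (1/p)**r, the Bonferroni-style adjusted probability
    that one gene is selected in all r of r independent trials."""
    if not (p >= 1 and r >= 1):
        raise ValueError("p and r must be positive integers")
    # p * (1/p)**r simplifies to p**(1 - r)
    return float(p) ** (1 - r)


if __name__ == "__main__":
    # Illustrative values only: 7129 genes, 1 to 3 trials
    for r in (1, 2, 3):
        print(r, adjusted_probability(7129, r))
```

               <p>With a single trial the adjusted probability equals 1, which agrees with the remark that any gene selected in a single trial is non-informative.</p>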
            </sec>
            <sec>
               <st>
                  <p>Corollary 1</p>
               </st>
               <p>The minimum threshold value of <it>r </it>for reaching the given level of significance is</p>
               <p>
                  <graphic file="1471-2105-6-67-i6.gif"/>
               </p>
               <p>where &#8968;&#8969; is the ceiling operator. This is because <it>r </it>must be an integer greater than or equal to the real threshold.</p>
               <p>For example, consider the leukemia data <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. There are 7129 genes. Assume &#945; = 0.05. From Eq. (1), <it>r</it><sub><it>&#952; </it></sub>= 2.</p>
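               <p>The threshold can also be computed directly. The sketch below assumes that the threshold is the smallest integer <it>r </it>for which <it>p</it>(1/<it>p</it>)<sup><it>r</it></sup> does not exceed &#945;, that is, <it>r </it>is at least 1 - ln &#945; / ln <it>p</it>. This closed form is inferred from the inequality stated in the proof and is an assumption of the sketch; the function name is hypothetical.</p>

```python
import math


def min_repeats_threshold(p: int, alpha: float = 0.05) -> int:
    """Smallest integer r with p**(1 - r) not exceeding alpha.

    Assumed closed form: r is the ceiling of 1 - ln(alpha) / ln(p).
    """
    if not (p >= 2):
        raise ValueError("p must be at least 2")
    if not (alpha > 0.0 and 1.0 > alpha):
        raise ValueError("alpha must lie strictly between 0 and 1")
    return math.ceil(1.0 - math.log(alpha) / math.log(p))


if __name__ == "__main__":
    print(min_repeats_threshold(7129))
```

               <p>For the leukemia data (<it>p </it>= 7129, &#945; = 0.05) the function returns 2, in agreement with the example above.</p>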
               <p>Consider a more general case: what is the probability of the event that a gene is selected <it>r </it>times in <it>m </it>trials? The adjusted probability becomes</p>
               <p>
                  <graphic file="1471-2105-6-67-i7.gif"/>
               </p>
               <p>where <graphic file="1471-2105-6-67-i8.gif"/> is the combinatorial function that returns the number of possibilities for choosing <it>r </it>from <it>m </it>objects. Assume a large <it>p </it>so that <graphic file="1471-2105-6-67-i9.gif"/>. Then, we have</p>
               <p>
                  <graphic file="1471-2105-6-67-i10.gif"/>
               </p>
               <p>The level of significance (&#945; in Eqs. (1) and (2)) is set to 0.05 by convention in the present work.</p>
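               <p>The general case can be sketched in a few lines of Python. The form used here, the number of ways of choosing <it>r </it>from <it>m </it>trials multiplied by <it>p</it><sup>1-<it>r</it></sup>, is an assumption based on the large-<it>p </it>approximation described above, and the function name is hypothetical. With <it>p </it>= 2308 and <it>m </it>= 10, this form gives approximately 0.02 for <it>r </it>= 2 and 2.3 &#215; 10<sup>-5</sup> for <it>r </it>= 3, which agrees with the <it>P </it>values listed in Table 1.</p>

```python
import math


def selection_p_value(p: int, m: int, r: int) -> float:
    """Approximate adjusted probability that a gene is selected
    r times in m trials, one gene out of p being chosen per trial.

    Assumed form for large p: C(m, r) * p**(1 - r).
    Requires Python 3.8 or later for math.comb.
    """
    if not (p >= 1 and m >= r and r >= 1):
        raise ValueError("need p >= 1 and m >= r >= 1")
    return math.comb(m, r) * float(p) ** (1 - r)


if __name__ == "__main__":
    # 2308 genes (SRBCT data) and 10 cross-validation cycles
    for r in (2, 3, 4):
        print(r, selection_p_value(2308, 10, r))
```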
            </sec>
         </sec>
         <sec>
            <st>
               <p>Reliability analysis of gene selection</p>
            </st>
            <p>The innovative feature of our method is the reliability analysis used to arrive at the gene expression signature. The analysis assesses the repeatability of the selected genes and determines the repeatability threshold for gene selection using <it>M</it>-fold cross-validation.</p>
            <p>In the 10-fold cross-validation approach, the data set is divided into 10 disjoint subsets of about equal size. Genes are selected on the basis of nine of these subsets, and then the remaining subset is used to estimate the predictive error of the trained classifier using only the selected genes. This process is repeated 10 times, each time leaving one set out for testing and the others for training. The cross-validation error rate is given by the average of the 10 estimates of the error rate thus obtained.</p>
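            <p>The cross-validation loop described above can be sketched as follows. This is a minimal illustration rather than the Matlab implementation used in this work: <it>fit_and_test</it> is a hypothetical callback that selects genes and trains a classifier on the training indices and returns the error rate on the held-out indices, and the shuffling seed is arbitrary.</p>

```python
import random
from typing import Callable, List


def make_folds(n_samples: int, n_folds: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and deal them into disjoint folds of about equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def cross_validation_error(
    n_samples: int,
    fit_and_test: Callable[[List[int], List[int]], float],
    n_folds: int = 10,
    seed: int = 0,
) -> float:
    """Average the held-out error rate over all folds."""
    folds = make_folds(n_samples, n_folds, seed)
    errors = []
    for k, test_idx in enumerate(folds):
        # Train on every fold except the k-th, test on the k-th
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        errors.append(fit_and_test(train_idx, test_idx))
    return sum(errors) / len(errors)
```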
            <p>In each cross-validation cycle, we conduct SVM-RFE gene ranking and selection operations, as described in the Methods section. We select a minimal set of genes by adding genes one by one from the top of the ranking and picking the set associated with the minimum error in each cross-validation cycle. There is no guarantee that the same subset of genes will be selected in each of the 10 cycles in 10-fold cross-validation. However, vital genes tend to be selected more consistently than others across cycles. The significance of a gene is correlated with the repeatability of selection according to the probabilistic analysis given earlier. We associate each selected gene with a repeatability value indicating how many times it is selected in the cross-validation experiment. The biological or clinical interpretation of "repeatability" would depend on the objective and design of the microarray experiment. We may consider the validity of a selected gene by its reliability in the sense that the more often a gene is selected, the less likely it is that chance is a factor.</p>
            <p>To select the final set of genes, we need to determine the repeatability threshold. A gene is in the final set if its repeatability reaches (i.e., is no less than) the threshold. To this end, a second 10-fold cross-validation is performed. Then we choose the repeatability threshold that is associated with the minimal cross-validation error under the given level of significance (&#945; = 0.05). Recall that a gene with a higher repeatability is associated with a smaller <it>P </it>value, as shown earlier.</p>
            <p>To extend the method from two-class to multi-class classification, we adopt the one-against-all others strategy under which genes are selected for each class one at a time and then combined. For each class, all the other classes are grouped as a single class. In this way, a multi-class gene selection problem is converted into a series of two-class problems. The program was written in Matlab <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Both Matlab and an SVM Matlab toolbox are required to run the program.</p>
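            <p>Repeatability counting and threshold selection can be sketched as follows. The function names and the <it>cv_error</it> callback (which stands for the second cross-validation and returns an error rate for a candidate gene set) are hypothetical, and the default minimum threshold of 2 is an assumption standing for the smallest repeatability that reaches the chosen level of significance.</p>

```python
from collections import Counter
from typing import Callable, Iterable, List


def repeatability(selected_per_cycle: Iterable[Iterable[str]]) -> Counter:
    """Count in how many cross-validation cycles each gene was selected."""
    counts: Counter = Counter()
    for genes in selected_per_cycle:
        counts.update(set(genes))  # a gene counts at most once per cycle
    return counts


def final_gene_set(counts: Counter, threshold: int) -> List[str]:
    """Keep genes whose repeatability is no less than the threshold."""
    return sorted(g for g, c in counts.items() if c >= threshold)


def choose_threshold(
    counts: Counter,
    cv_error: Callable[[List[str]], float],
    min_threshold: int = 2,
) -> int:
    """Pick the threshold with the lowest cross-validation error.

    Ties go to the smallest threshold. Raises ValueError when no gene
    reaches min_threshold, because there is then no candidate to test.
    """
    candidates = range(min_threshold, max(counts.values()) + 1)
    return min(candidates, key=lambda t: cv_error(final_gene_set(counts, t)))
```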
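            <p>The one-against-all conversion can be illustrated with a short Python sketch. It is not the Matlab program referred to above: <it>select_two_class</it> is a hypothetical callback standing for any two-class gene selection routine, and the +1/-1 label coding is an assumption.</p>

```python
from typing import Callable, List, Sequence


def one_vs_rest_labels(labels: Sequence[str], positive_class: str) -> List[int]:
    """Relabel one class as +1 and group all other classes as -1."""
    return [1 if y == positive_class else -1 for y in labels]


def select_genes_multiclass(
    labels: Sequence[str],
    select_two_class: Callable[[List[int]], Sequence[str]],
) -> List[str]:
    """Run two-class gene selection once per class and combine the results."""
    selected = set()
    for cls in sorted(set(labels)):
        binary_labels = one_vs_rest_labels(labels, cls)
        selected.update(select_two_class(binary_labels))
    return sorted(selected)
```

            <p>For the four SRBCT classes (EWS, RMS, NB and BL), this scheme amounts to four two-class selection problems whose selected genes are merged into one set.</p>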
         </sec>
         <sec>
            <st>
               <p>Case analyses</p>
            </st>
            <p>In cancer research, our current goal is to develop a molecular classifier based on tissue gene expression patterns for diagnosis and subtype classification. With this in mind, we evaluate our method using well-known benchmark microarray data sets including those concerning small round blue cell tumors, colon cancer, leukemia as well as perturbed data sets.</p>
            <p>The small round blue cell tumors (SRBCTs) data set includes 63 training samples and 25 test samples derived from both tumor biopsy and cell lines <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Consistent with other reports in the literature, we used the test set of 20 samples after 5 non-SRBCT samples were removed. The data set consists of four types of childhood tumor: Ewing's sarcoma (EWS), rhabdomyosarcoma (RMS), neuroblastoma (NB), and Burkitt lymphoma (BL). After initial screening, the data set in the public domain contains 2308 genes.</p>
            <p>The colon cancer data set contains 62 tissue samples, each with 2000 gene expression values <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The tissue samples include 22 normal and 40 colon cancer cases. In this study, we used all the 62 samples in the original data.</p>
            <p>The leukemia data consist of 72 tissue samples, each with 7129 gene expression values <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The samples include 47 ALL (acute lymphoblastic leukemia) and 25 AML (acute myeloid leukemia). The original data have been divided into a training set of 38 samples and a test set of 34 samples.</p>
            <p>The reference method with which we compared our method applied a technique referred to as SVM-RFE <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> to select genes from the training data without reliability assessment. The reference method <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> is a multi-class extension of the SVM-RFE method used for two-class classification. The SVM-RFE method (two-class or multi-class) has not been applied to the SRBCT data before. We implemented the computer algorithm of the reference method for comparison with ours. The same experimental conditions were applied to both methods.</p>
         </sec>
         <sec>
            <st>
               <p>Small round blue cell tumor classification</p>
            </st>
            <p>On the SRBCT data, our method selected 19 genes (Table <tblr tid="T1">1</tblr>) from the microarray gene expression data of the 63 training samples. The SVM classifier trained on the 63 training samples using the 19 selected genes was tested on the 20 different test samples. Both the training and test predictive accuracies were 100%. That is, the trained SVM classifier can accurately predict the tumor class from the expression data of the 19 genes for both seen and unseen samples. Since the classifier may tend to fit the training data, the generalization performance of the classifier is indicated by the test accuracy.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Genes selected by our method on the microarray data set of small round blue cell tumors. Genes also selected by the methods of Tibshirani et al. [13] and Khan et al. [10] are marked by the symbol &#8226; in the respective columns.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Image ID</p>
                     </c>
                     <c ca="center">
                        <p><it>P </it>Value</p>
                     </c>
                     <c ca="left">
                        <p>Gene Description</p>
                     </c>
                     <c ca="center">
                        <p>Tibshirani et al.</p>
                     </c>
                     <c ca="center">
                        <p>Khan et al.</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>21652</p>
                     </c>
                     <c ca="left">
                        <p>2.3 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>catenin (cadherin-associated protein), alpha 1</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>878280</p>
                     </c>
                     <c ca="left">
                        <p>2.3 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>collapsin response mediator protein 1</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>377461</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>caveolin 1, caveolae protein</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>325182</p>
                     </c>
                     <c ca="left">
                        <p>2.3 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>cadherin 2, N-cadherin (neuronal)</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1435862</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>MIC2 surface antigen (CD99)</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>42558</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>L-arginine:glycine amidinotransferase</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>812105</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>transmembrane protein</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>41591</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>meningioma 1</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>810057</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>cold shock domain protein A</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>183337</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>major histocompatibility complex, class II, DM alpha</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>796258</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>sarcoglycan, alpha</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1409509</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>troponin T1, skeletal, slow</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>788107</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>amphiphysin-like</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>770394</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>Fc fragment of IgG, receptor, transporter, alpha</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>82225</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>secreted frizzled-related protein 1</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>814260</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>follicular lymphoma variant translocation 1</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>784224</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>fibroblast growth factor receptor 4</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>308163</p>
                     </c>
                     <c ca="left">
                        <p>2.3 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>ESTs</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>212542</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>cDNA DKFZp586J2118</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The reference method selected 8 genes with 100% training accuracy but with only 90% test accuracy. The reference method appears to have selected too few genes, even though the selected genes correctly classified all the training samples (an example of over-generalization). Our bootstrap-like strategy dealt with this problem adequately by taking into account both reliability and diversity in gene selection.</p>
            <p>We examined the consensus of genes selected by our method and by two other best-known methods: the method of Khan et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> based on artificial neural networks and the method of Tibshirani et al. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> based on shrunken centroids. We found high consensus between our results and theirs. Out of the 19 genes selected by our method, 18 genes were also selected by Khan's method and 16 genes by Tibshirani's method (Table <tblr tid="T1">1</tblr>). While agreement among results produced by different methods may imply similarities in the inductive biases, these two other methods use fundamentally different representational biases. Thus, such agreement should not be taken for granted and would instead serve as substantial evidence indicative of the validity and significance of our method.</p>
            <p>That the selected genes serve as meaningful markers for cancer classification was further confirmed by cluster analysis and visualization. To this end, we applied a hierarchical clustering program developed by Eisen <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> to the expression data of the selected genes. Visual inspection of the gene expression map identified four clearly separated clusters (Figure <figr fid="F1">1</figr>). Upon verification, each cluster corresponded exactly to a distinct tumor group (100% accuracy). Thus, a diagnostic chip could be designed based on the selected genes. This result provides additional evidence in support of our method.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>The gene expression map of the 19 genes selected by our method in the domain concerning classification of SRBCTs</p>
               </caption>
               <text>
                  <p>The gene expression map of the 19 genes selected by our method in the domain concerning classification of SRBCTs. The map was generated by Eisen's hierarchical clustering program called CLUSTER and viewed by the TREEVIEW program. Four sample clusters are visually recognizable, corresponding exactly to the four predefined tumor classes (NB, EWS, BL, and RMS) with 100% accuracy.</p>
               </text>
               <graphic file="1471-2105-6-67-1"/>
            </fig>
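            <p>The cluster check described above can be illustrated with a minimal sketch. It is not Eisen's CLUSTER program: it assumes average linkage and a distance of one minus the Pearson correlation, and it runs on synthetic expression profiles rather than the SRBCT data. The function names pearson_distance and average_linkage, the number of toy genes, and the noise level are all illustrative assumptions.</p>

```python
import math
import random


def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    dist = [[pearson_distance(p, q) for q in profiles] for p in profiles]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Synthetic data (assumption): 4 tumor classes x 5 samples, 8 toy genes;
# each class over-expresses its own pair of genes.
random.seed(0)
labels = ["NB", "EWS", "BL", "RMS"]
samples, truth = [], []
for k, name in enumerate(labels):
    for _ in range(5):
        profile = [random.gauss(0.0, 0.3) for _ in range(8)]
        profile[2 * k] += 3.0
        profile[2 * k + 1] += 3.0
        samples.append(profile)
        truth.append(name)

clusters = average_linkage(samples, n_clusters=4)
# A cluster is "pure" when all of its samples carry the same class label.
pure = all(len({truth[i] for i in c}) == 1 for c in clusters)
print("clusters:", clusters)
print("every cluster maps to one class:", pure)
```

            <p>On this toy data the four recovered clusters coincide with the four simulated classes. Reproducing the published map would require the measured expression values of the 19 selected genes and the actual CLUSTER settings, neither of which is assumed here.</p>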
         </sec>
         <sec>
            <st>
               <p>Colon cancer diagnosis</p>
            </st>
            <p>For performance analysis, we conducted multiple experiments with random data partitions. In each experiment, the data were randomly split into training and test sets of equal size. The training set was used for gene selection and classifier training, and the test set was used to determine the predictive performance of the classifier built on the genes selected by the given algorithm. Our method outperformed the reference method by a small margin. This result reflects the underlying fact that different methods can select different sets of genes and still yield classifiers of comparable performance.</p>
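            <p>The partitioning protocol can be sketched as follows. This is only an illustration of the evaluation loop, under stated assumptions: the data are synthetic (62 samples with the 40 cancer and 22 normal split seen in Table 3), genes are ranked by a simple difference of class means, and a nearest-centroid rule stands in for the classifier. None of these stand-ins is the selection method or the SVM used in the paper, and all function names are hypothetical.</p>

```python
import random
import statistics


def random_equal_split(n_samples, rng):
    """Shuffle sample indices and cut them into two equal halves."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]


def select_genes(X, y, train, k):
    """Rank genes by |mean difference| between classes, training rows only."""
    def score(g):
        pos = [X[i][g] for i in train if y[i] == 1]
        neg = [X[i][g] for i in train if y[i] == 0]
        return abs(statistics.mean(pos) - statistics.mean(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid(X, rows, genes):
    """Mean expression of each selected gene over the given rows."""
    return [statistics.mean(X[i][g] for i in rows) for g in genes]


def nearest_centroid_accuracy(X, y, train, test, genes):
    """Fit class centroids on the training rows, score on the test rows."""
    c1 = centroid(X, [i for i in train if y[i] == 1], genes)
    c0 = centroid(X, [i for i in train if y[i] == 0], genes)
    correct = 0
    for i in test:
        d1 = sum((X[i][g] - m) ** 2 for g, m in zip(genes, c1))
        d0 = sum((X[i][g] - m) ** 2 for g, m in zip(genes, c0))
        pred = 1 if d0 > d1 else 0
        correct += int(pred == y[i])
    return correct / len(test)


# Synthetic stand-in data (assumption): 40 cancer (1) and 22 normal (0)
# samples, 30 genes, of which only genes 0, 1 and 2 differ between classes.
rng = random.Random(1)
y = [1] * 40 + [0] * 22
X = [[rng.gauss(2.0 if (label == 1 and g in (0, 1, 2)) else 0.0, 1.0)
      for g in range(30)] for label in y]

accuracies = []
for _ in range(20):  # 20 independent random partitions
    train, test = random_equal_split(len(y), rng)
    genes = select_genes(X, y, train, k=3)  # selection never sees test rows
    accuracies.append(nearest_centroid_accuracy(X, y, train, test, genes))

mean_accuracy = statistics.mean(accuracies)
print(f"mean test accuracy over {len(accuracies)} splits: {mean_accuracy:.3f}")
```

            <p>The point of the sketch is the order of operations: gene selection happens inside each partition and uses the training half only, so the test half gives an unbiased estimate for that partition.</p>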
            <p>Our program selected 15 genes from the colon cancer data (Table <tblr tid="T2">2</tblr>). The selected genes allow cancer samples to be separated from normal samples in the gene expression map (Figure <figr fid="F2">2</figr>, Table <tblr tid="T3">3</tblr>). Some genes were selected because their activities resulted in differences in tissue composition between normal and cancer tissue. Other genes were selected because they play a role in cancer formation or cell proliferation. It was not surprising that some genes implicated in other types of cancer, such as breast and prostate cancers, were identified in the context of colon cancer, because these tissue types share similarities.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>15 genes selected from the colon cancer microarray data set (62 samples) using our method.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>Gene Accession #</p>
                     </c>
                     <c ca="center">
                        <p><it>P </it>value</p>
                     </c>
                     <c ca="center">
                        <p>Definition</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>H20709</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>myosin light chain alkali, smooth-muscle isoform</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>X57351</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>interferon-inducible protein 1-8D</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T94579</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>human chitotriosidase precursor</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T47377</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>S-100P protein (human)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T98835</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>alpha trans-inducing protein (bovine herpesvirus type 1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T61661</p>
                     </c>
                     <c ca="left">
                        <p>3.0 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>profilin I (human)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>X67325</p>
                     </c>
                     <c ca="left">
                        <p>3.0 &#215; 10<sup>-5</sup></p>
                     </c>
                     <c ca="left">
                        <p>H. sapiens p27</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T58861</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>60s ribosomal protein L30E</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T61446</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>putative DNA binding protein A20</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>H88360</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>guanine nucleotide-binding protein G(OLF), alpha subunit</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>L38810</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>Homo sapiens thyroid receptor interactor (TRIP1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T57882</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>myosin heavy chain, nonmuscle type A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>T92451</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>tropomyosin, fibroblast and epithelial muscle-type</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>J02854</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>myosin regulatory light chain 2, smooth muscle isoform</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>K03474</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>human mullerian inhibiting substance gene</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>The gene expression map of the 15 genes selected from the colon cancer microarray data set using our method</p>
               </caption>
               <text>
                  <p>The gene expression map of the 15 genes selected from the colon cancer microarray data set using our method. Two major sample clusters can be recognized by visual inspection, corresponding to normal and cancer tissue samples, respectively.</p>
               </text>
               <graphic file="1471-2105-6-67-2"/>
            </fig>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Diagnosis results of the colon cancer data samples based on 15 selected genes, in correspondence with the gene expression map.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>Normal Tissue</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer Tissue</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>
                              <ul>Sample</ul>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <ul>Diagnosis</ul>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <ul>Sample</ul>
                           </b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <ul>Diagnosis</ul>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-01</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-01</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-02</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-02</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-03</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-03</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-04</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-04</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-05</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-05</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-06</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-06</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-07</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-07</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-08</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-08</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-09</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-09</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-10</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-10</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-11</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-11</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-12</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-12</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-13</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-13</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-14</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-14</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-15</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-15</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-16</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-16</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-17</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-17</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-18</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-18</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-19</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-19</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-20</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-20</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-21</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-21</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Normal-22</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                     <c ca="center">
                        <p>Cancer-22</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-23</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-24</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-25</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-26</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-27</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-28</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-29</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-30</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-31</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-32</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-33</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-34</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-35</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-36</p>
                     </c>
                     <c ca="center">
                        <p>normal</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-37</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-38</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-39</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Cancer-40</p>
                     </c>
                     <c ca="center">
                        <p>cancer</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
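            <p>The entries of Table 3 can be tallied with a short script. The sample names below are read directly from the table; the summary rates are derived here for illustration and are not figures reported in the text. They describe agreement between the map-based diagnosis and the true tissue label, not held-out test accuracy.</p>

```python
# Samples whose diagnosis in Table 3 disagrees with their tissue label;
# every other sample in the table is diagnosed correctly.
N_NORMAL, N_CANCER = 22, 40
normal_called_cancer = ["Normal-08", "Normal-20"]
cancer_called_normal = ["Cancer-02", "Cancer-28", "Cancer-30", "Cancer-36"]

false_pos = len(normal_called_cancer)   # normal tissue diagnosed as cancer
false_neg = len(cancer_called_normal)   # cancer tissue diagnosed as normal
true_neg = N_NORMAL - false_pos
true_pos = N_CANCER - false_neg

accuracy = (true_pos + true_neg) / (N_NORMAL + N_CANCER)
sensitivity = true_pos / N_CANCER       # fraction of cancer samples recognized
specificity = true_neg / N_NORMAL       # fraction of normal samples recognized

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f}")
# prints: accuracy=0.903 sensitivity=0.900 specificity=0.909
```

            <p>In other words, 56 of the 62 samples (20 of 22 normal and 36 of 40 cancer) fall on the side of the map that matches their tissue label.</p>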
            <p>Our method is supported by the meaningful biological interpretation of the selected genes, as discussed below. New biological hypotheses can be formulated to further investigate the relationship of a particular gene with colon cancer. For example, what is the role of profilin I in colon cancer? Some of the discovered genes could potentially serve as novel targets for drugs, vaccines, or gene therapy.</p>
         </sec>
         <sec>
            <st>
               <p>Leukemia classification</p>
            </st>
            <p>On the leukemia data, our method selected four genes (Table <tblr tid="T4">4</tblr>) from the microarray gene expression data of 38 training samples. The SVM classifier trained on these 38 samples using the selected genes was then tested on 34 separate test samples. The training and test accuracies were 100% and 97.06%, respectively. In addition, the AML and ALL samples formed separate clusters in the gene expression map of the selected genes.</p>
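            <p>As a quick arithmetic check, the reported test accuracy can be converted back into a sample count. The snippet assumes that accuracy is the plain fraction of correctly classified test samples, rounded to two decimals; the error count it yields is an inference, not a number stated in the text.</p>

```python
N_TEST = 34
REPORTED = "97.06"  # test accuracy in percent, as reported

# Find every possible count of correct predictions that rounds to the
# reported percentage.
matches = [k for k in range(N_TEST + 1)
           if f"{100 * k / N_TEST:.2f}" == REPORTED]

print("correct test samples consistent with 97.06%:", matches)
print("implied misclassifications:", [N_TEST - k for k in matches])
```

            <p>Under that assumption the only consistent count is 33 of 34 correct, that is, a single misclassified test sample.</p>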
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Genes selected by our method on the leukemia microarray data set. Genes also selected by the method of Golub et al. [1] or by SVM-RFE (the reference algorithm) are marked by the symbol &#8226; in the corresponding column.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Accession Number</p>
                     </c>
                     <c ca="center">
                        <p><it>P </it>Value</p>
                     </c>
                     <c ca="left">
                        <p>Gene Description</p>
                     </c>
                     <c ca="center">
                        <p>Golub et al.</p>
                     </c>
                     <c ca="center">
                        <p>SVM-RFE</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>M27891</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>CST3 Cystatin C</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Y00787</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.000001</p>
                     </c>
                     <c ca="left">
                        <p>INTERLEUKIN-8 PRECURSOR</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>M19507</p>
                     </c>
                     <c ca="left">
                        <p>0.006</p>
                     </c>
                     <c ca="left">
                        <p>MPO Myeloperoxidase</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>&#8226;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>L20688</p>
                     </c>
                     <c ca="left">
                        <p>0.006</p>
                     </c>
                     <c ca="left">
                        <p>Ly-GDI</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The reference method also selected four genes and achieved the same level of test accuracy as our method. The original algorithm of SVM-RFE <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> selected 8 or 16 genes on this data set. The method based on shrunken centroids <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> selected 21 genes on this data set. A recent study indicated that the unbiased error estimate of a classifier using a small number of selected genes remained above zero on the leukemia data set <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Taken together, the evidence suggests that our method performed as well as or better than these alternatives in terms of both predictive performance and the number of selected genes.</p>
         </sec>
         <sec>
            <st>
               <p>Perturbed data</p>
            </st>
            <p>In practical circumstances, noise may arise during sample collection and handling, slide preparation, hybridization, or image analysis, as reflected by variations in microarray results generated from different laboratories. To address this issue, we also evaluated our gene selection method on perturbed data. Twenty data sets (ten in each of the two domains, colon cancer diagnosis and leukemia classification) were produced by randomly selecting 5% of the training cases (rounded up to the nearest integer) and reversing their class labels, while leaving the test cases intact. The average test accuracies of our method in the two domains were 85.49% and 88.61%, respectively, compared with 80.65% and 86.11% for the reference method. These results suggest that our method may be more robust to the data variations that arise from various sources in practice.</p>
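            <p>The perturbation step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name perturb_labels, the +1/-1 label encoding, the seed, and the example class sizes are all assumptions made for the example.</p>

```python
import random

def perturb_labels(labels, percent=5, seed=0):
    """Return a copy of +1/-1 labels with percent% of them (rounded up) reversed."""
    rng = random.Random(seed)
    n = len(labels)
    n_flip = (n * percent + 99) // 100          # integer ceiling of n * percent / 100
    flip_idx = set(rng.sample(range(n), n_flip))
    return [-y if i in flip_idx else y for i, y in enumerate(labels)]

# Illustrative training set of 38 cases; the class sizes here are arbitrary.
labels = [1] * 27 + [-1] * 11
perturbed = perturb_labels(labels, seed=1)
n_changed = sum(1 for a, b in zip(labels, perturbed) if a != b)
print(n_changed)
```

            <p>With 38 training cases, 5% rounds up to two reversed labels per perturbed data set; the test cases are never passed to the function and so stay intact.</p>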
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Both cross-validation and bootstrapping are standard statistical methods for arriving at an unbiased estimate of the true error rate associated with a classifying or predicting system. Bootstrapping has also been used for assessing the reliability or stability of phylogenetic trees <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> or cluster analysis <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Bootstrapping repeatedly draws random samples with replacement and estimates the error rate as the average error rate over the re-sampling iterations. Cross-validation is likewise an established method for error estimation; its application to learning the pattern in the data, however, is novel. As discussed later, stability emerges as an important issue in gene selection. Here we propose to use bootstrapping or cross-validation to analyze this issue. In our experience, cross-validation was more efficient than bootstrapping. For instance, genes selected based on a single 10-fold cross-validation were more accurate in prediction than those selected using bootstrapping with 10 re-sampling iterations. Since the SVM-based gene selection algorithm is time-consuming, we consider only cross-validation for assessment of error and stability in this study.</p>
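         <p>The difference between the two re-sampling schemes can be sketched in a few lines. This is an illustration only: the function names are ours, and evaluating a bootstrap iteration on the cases left out of the sample (the out-of-bag cases) is one common convention that we assume here.</p>

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    sample = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(sample))
    return sample, out_of_bag

def m_fold_indices(n, m, rng):
    """Shuffle 0..n-1 and deal the indices into m disjoint folds."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[k::m] for k in range(m)]

rng = random.Random(0)
sample, oob = bootstrap_indices(38, rng)   # one bootstrap iteration
folds = m_fold_indices(38, 10, rng)        # one 10-fold partition
print(len(set(sample)), len(oob), [len(f) for f in folds])
```

         <p>A bootstrap sample repeats some cases and omits others, whereas an <it>M</it>-fold partition uses every case exactly once as a held-out case.</p>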
         <p>In the original SVM-RFE algorithm <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, error estimation and gene selection are not independent processes because both are based on the same training set. However, it is important to correct for the selection bias by performing a cross-validation or applying a bootstrap external to the selection process <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B17">17</abbr></abbrgrp>. Our implementation of SVM-RFE is based on this idea.</p>
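         <p>The idea of keeping the error estimate external to gene selection can be sketched as below. The selector and classifier are deliberately trivial stand-ins (a single gene chosen by the gap between class means, and a midpoint threshold), not SVM-RFE or an SVM; all function names and the toy data are hypothetical.</p>

```python
def external_cv_error(samples, labels, folds, select_genes, train, predict):
    """Estimate error with gene selection repeated inside every training split."""
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        x_tr = [samples[i] for i in train_idx]
        y_tr = [labels[i] for i in train_idx]
        genes = select_genes(x_tr, y_tr)      # selection never sees held-out cases
        model = train([[x[g] for g in genes] for x in x_tr], y_tr)
        for i in held_out:
            if predict(model, [samples[i][g] for g in genes]) != labels[i]:
                errors += 1
    return errors / len(samples)

def class_means(x, y, g):
    """Mean expression of gene g in the +1 class and in the -1 class."""
    pos = [row[g] for row, lab in zip(x, y) if lab == 1]
    neg = [row[g] for row, lab in zip(x, y) if lab == -1]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def select_top_gene(x, y):
    """Toy selector: the single gene with the largest gap between class means."""
    def gap(g):
        mp, mn = class_means(x, y, g)
        return abs(mp - mn)
    return [max(range(len(x[0])), key=gap)]

def train_midpoint(x, y):
    """Toy classifier on one gene: threshold halfway between the class means."""
    mp, mn = class_means(x, y, 0)
    return (mp + mn) / 2, 1 if mp > mn else -1

def predict_midpoint(model, row):
    threshold, direction = model
    return direction if row[0] > threshold else -direction

# Toy data: 8 samples, 3 genes; only the middle gene separates the classes.
samples = [[0.1, 5.0, 0.3], [0.2, 5.2, 0.1], [0.0, 4.9, 0.2], [0.3, 5.1, 0.0],
           [0.2, 1.0, 0.1], [0.1, 1.2, 0.3], [0.3, 0.9, 0.2], [0.0, 1.1, 0.0]]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
folds = [[0, 4], [1, 5], [2, 6], [3, 7]]
error = external_cv_error(samples, labels, folds,
                          select_top_gene, train_midpoint, predict_midpoint)
print(error)
```

         <p>The essential point is the position of the select_genes call: it sits inside the loop over folds, so the held-out cases influence neither the selected genes nor the trained classifier.</p>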
         <p>Genes selected for cancer diagnosis or classification can be validated by their biological significance since these genes are expected to show differential expression between normal and cancer tissue or among subtypes of cancer, and as such, they are implicated in cancer-related mechanisms or pathways. Genes with unknown roles may be discovered through gene selection and later verified by biological studies.</p>
         <p>In the SRBCT data set, the genes selected by our method for a particular type of tumor against other types are generally consistent with the tumor's tissue of origin. For example, genes selected for neuroblastoma (NB) are characteristic of nerve cells, such as neuronal N-cadherin and meningioma 1; genes selected for rhabdomyosarcoma (RMS) are characteristic of muscle cells, such as alpha sarcoglycan and slow skeletal troponin T1; genes selected for Burkitt lymphoma (BL) are characteristic of lymphocytes or blood cells, such as major histocompatibility complex (class II, DM alpha). Some genes discovered by means of microarray analysis have been reported in the biological literature, e.g., over-expression of MIC2 in Ewing's sarcoma (EWS) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Some genes are over-expressed in a certain type of tumor but lack specificity. For instance, FGFR4 (fibroblast growth factor receptor 4) was noted to be highly expressed in RMS but not in normal muscle; however, it is also expressed in some other cancers and normal tissues <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. A gene that is under-expressed in a particular type of tumor compared with other types can also be selected as a diagnostic marker. For instance, cold shock domain protein A, selected for NB, was under-expressed in this tumor, consistent with the fact that this gene is expressed in B cells and skeletal muscle but not in the brain <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <p>With our method, four muscle-related genes (H20709, T57882, T92451, and J02854) were selected from the colon cancer data, reflecting the fact that normal colon tissue had higher muscle content, whereas colon cancer tissue had lower muscle content (biased toward epithelial cells) <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The selection of 60S ribosomal protein L30E agreed with an observation that ribosomal protein genes had lower expression in normal than in cancer colon tissue <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The selected interferon inducible protein 1-8D genes were found to be expressed in adenocarcinoma cell lines <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. There was a potential connection of another selected gene, human chitotriosidase, to cancer <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The links between the other selected genes and cancer are as follows. S-100 protein can stimulate cellular proliferation and may function as a tumor growth factor <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Profilin 1 protein can suppress tumorigenicity in breast cancer cells; a study showed consistently lower profilin 1 levels in tumor cells <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Reduced expression of P27 protein was linked to the possibility of colon carcinoma <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. The A20 protein can inhibit a specific apoptotic pathway <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>; recall that apoptosis is a major mechanism for tumor suppression. The guanine nucleotide-binding protein is involved in signal transduction, and its abnormality may contribute to cancer development <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. A thyroid receptor interactor could be a target gene of a certain oncogene. The alpha trans-inducing protein (bovine herpesvirus type 1) may be linked to oncogenic activity.</p>
         <p>In the related work <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, 7 genes were selected from the colon cancer data: H08393, M59040, T94579, H81558, R88740, T62947, and H64807. For all of them, a possible link to cancer was found in the biological literature. These 7 genes, however, do not include any muscle-specific gene, even though muscle content offers a discriminating index for colon cancer <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>In a typical microarray data analysis problem, the data dimensionality is high and the sample size is relatively small. Under this condition, the problem of finding a classification model is under-constrained, and the model found tends to fit the training data so closely that it fails to generalize to unseen data. To address this overfitting, the SVM can control the model complexity to the point where a satisfactory solution can be produced. On the other hand, the capacity for causal discovery of the SVM-RFE approach, or of an alternative approach, is called into question by the finding that most selected genes are selected in only one of the data splits in <it>M</it>-fold cross-validation <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. This means that the SVM is not free of the overfitting problem, at least in the context of gene selection from microarray data, and it raises the question of the stability or reliability of gene selection, which we address here.</p>
         <p>The research finding that the SVM may assign zero weights to strongly relevant variables and non-zero weights to weakly relevant (red-herring) features <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> implies a disadvantage of this approach for the discovery of causal variables associated with the target variable concerned. This, however, is understandable, since SVM-RFE aims to identify the best features for the maximum margin of separation between different classes of samples, regardless of causal implications. In reality, causal variables are not necessarily the most discriminant, as the target variable is not always categorized according to its causal factors. The issue of causality becomes even more complicated because of confounding variables leading to so-called spurious causation. The method presented here is developed in the context of cancer subtype classification and evaluated in terms of predictive performance rather than the capability of causal inference. However, some methods are both predictive and causal <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>.</p>
         <p>We emphasize the importance of holding back some data to improve generalization and diversity of the learning outcome. In application of <it>M</it>-fold cross-validation to <it>n </it>samples, <it>M </it>can assume a value ranging from 2 to <it>n</it>. A small <it>M </it>is not sufficient to assess the repeatability of selected genes, while a large <it>M </it>(e.g., <it>M = n </it>in the leave-one-out experiment) is associated with a high degree of redundancy among the training sets and a low diversity of selected genes. This argument suggests that there exists an optimum <it>M </it>value. We therefore conducted experiments to compare predictive accuracies for three cases: <it>M </it>= 5, 10, and 15. Among the three cases, 10-fold cross-validation achieved the best results, which is consistent with our intuitive analysis. However, there is no proof that 10-fold cross-validation is always the best choice. In practice, the optimum <it>M </it>value should be taken as the value associated with the best cross-validation accuracy.</p>
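         <p>The redundancy argument can be made concrete with a little arithmetic. Assuming equal-sized folds (a simplification of ours, not a condition stated above), each training set omits one fold, so any two training sets share all but two folds, that is, a fraction (<it>M</it> - 2)/(<it>M</it> - 1) of their cases. The sketch below tabulates this fraction; the function name training_overlap is hypothetical.</p>

```python
from fractions import Fraction

def training_overlap(m):
    """Fraction of one training set shared with another, for m equal-sized folds."""
    return Fraction(m - 2, m - 1)

overlaps = {m: training_overlap(m) for m in (2, 5, 10, 15)}
for m, shared in overlaps.items():
    print(m, shared, round(float(shared), 3))
```

         <p>Under this assumption the shared fraction is 0 for <it>M </it>= 2, rises to 3/4, 8/9, and 13/14 for <it>M </it>= 5, 10, and 15, and approaches 1 in the leave-one-out limit, which illustrates why a very large <it>M </it>leaves little diversity among training sets.</p>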
         <p>This study highlights the importance of reliability assessment of genes selected from large-scale microarray data. We show how to derive the <it>P </it>value of each selected gene in multiple gene selection trials based on different data partitions. The importance of a gene is indicated by its associated <it>P </it>value. The distinctive feature of our method is that gene selection is determined by both ranking and reliability analyses. Reliability analysis is conducted using <it>M</it>-fold cross-validation. Some gene selection methods <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B28">28</abbr></abbrgrp> use cross-validation to determine the number of selected genes by minimum cross-validation error, but not by optimum repeatability as in our method. Thus, reliability analysis, comprising repeatability measurement and optimum repeatability determination, defines the novelty of our method, which has enabled a more accurate and cost-effective cancer classifier to be constructed, compared with other methods. Note, however, that any argument about reliability or stability must rest on the assumption of sound performance, as is clear from the apparent stability of trivial approaches to gene selection, such as one based on lexicographic ordering of gene names. In fact, the theory behind the analytical scheme we developed is a general one and can therefore be extended to other performance-based gene selection methods.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>DNA microarray technology has become a standard tool for gathering genome-wide gene expression information. Molecular classification based on gene expression information has emerged as an important approach to cancer diagnosis. A cost-effective approach is to select a small set of genes for classifier design. Moreover, it may be ineffective to use whole microarray data for classification purposes because the data dimensionality (i.e., the number of variables/genes) is often several orders of magnitude greater than the available sample size.</p>
         <p>Experience shows that different sets of genes can be selected from different combinations of microarray data instances with the same gene selection algorithm. At the same time, it is noticed that a biologically significant gene tends to be selected repeatedly across different combinations of data instances. We have developed a method for analyzing this situation. In the domain of small round blue cell tumor subtype classification, we have demonstrated that the method we developed selected only 19 genes that provided 100% accuracy on both training and test data sets. In comparison, the approach based on artificial neural networks <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> selected 96 genes, and the shrunken centroid method <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> selected 43 genes. Thus, our method suggests a mechanism for effectively reducing the tendency of fitting local data particularities in the process of gene selection for classifier design based on microarray data.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>This section provides the details of the methods; the novel aspects are described in the "Results" section.</p>
         <sec>
            <st>
               <p>Classification based on support vector machines</p>
            </st>
            <p>We use the method of support vector machines (SVM) <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp> for classification. The SVM has been shown to be a useful tool for analyzing microarray data <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Consider <it>n </it>training samples {(<graphic file="1471-2105-6-67-i11.gif"/>, <it>y<sub>i</sub></it>) | 1 &#8804; <it>i </it>&#8804; <it>n </it>}, where <graphic file="1471-2105-6-67-i11.gif"/> is the input feature vector for the <it>i</it>th sample and <it>y</it><sub><it>i </it></sub>is the corresponding target class (output). The basic problem for training an SVM can be formulated as follows: given a set of <it>n </it>training instances, each represented as (<graphic file="1471-2105-6-67-i11.gif"/>, <it>y<sub>i</sub></it>), maximize</p>
            <p>
               <graphic file="1471-2105-6-67-i12.gif"/>
            </p>
            <p>subject to</p>
            <p>
               <graphic file="1471-2105-6-67-i13.gif"/>
            </p>
            <p>The optimal hyperplane that separates different classes of objects can be constructed from the solutions &#945;<sub><it>i</it></sub> of this maximization problem. The SVM can perform a nonlinear transformation via the inner-product kernel <graphic file="1471-2105-6-67-i14.gif"/> to map the input space into a new high-dimensional feature space where the patterns are linearly separable with high probability. The use of such a kernel function can lead to a decision function that is non-linear in the input space but whose image is linear in the transformed space. When the samples are not linearly separable, whether in the input or the transformed space, a soft-margin algorithm is available as an extension of the basic algorithm <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
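            <p>For readers without access to the equation graphics, the standard hard-margin dual problem for an SVM with kernel <it>K</it> is written out below in LaTeX. This is the textbook form and is given only for orientation; its notation is ours and may differ from that of the equations shown above.</p>

```latex
\max_{\alpha}\; Q(\alpha) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\; (1 \le i \le n)
```

            <p>In the textbook soft-margin extension, the second constraint becomes 0 &#8804; &#945;<sub><it>i</it></sub> &#8804; <it>C</it>, and for a linear kernel the weight vector of the separating hyperplane is the sum of &#945;<sub><it>i</it></sub><it>y<sub>i</sub></it> times the <it>i</it>th input vector over all training samples.</p>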
            <p>The SVM used in this study employed the linear kernel since we found that it yielded a better result than a non-linear kernel for the data under investigation, and this observation is also consistent with the literature <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. All SVM parameters were set to the standard values in accordance with the convention: s = 0 (C-SVM), t = 0, c = 100, v = 10.</p>
            <p>Data normalization in the case of cDNA arrays proceeded as follows: the local background intensity was subtracted from the value of each spot on the array; the two channels were normalized against the median values on that array; and the Cy5/Cy3 fluorescence ratios and log<sub>10</sub>-transformed ratios were calculated from the normalized values. In addition, genes that do not change significantly can be removed through a filter in a process called data filtration.</p>
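            <p>The normalization steps can be sketched as below. This is one plausible reading rather than the exact procedure used: we assume that "normalized against the median" means dividing each background-corrected channel by its own median on the array, and that all corrected intensities are positive. The function name and the intensity values are hypothetical.</p>

```python
import math
from statistics import median

def normalize_array(cy5, cy5_bg, cy3, cy3_bg):
    """Return per-spot log10(Cy5/Cy3) after background and median normalization."""
    red = [s - b for s, b in zip(cy5, cy5_bg)]       # subtract local background
    green = [s - b for s, b in zip(cy3, cy3_bg)]
    red_med, green_med = median(red), median(green)  # per-array channel medians
    ratios = [(r / red_med) / (g / green_med) for r, g in zip(red, green)]
    return [math.log10(x) for x in ratios]

# Three illustrative spots with a constant local background of 100.
log_ratios = normalize_array(cy5=[1100, 2100, 4100], cy5_bg=[100, 100, 100],
                             cy3=[600, 1100, 1100], cy3_bg=[100, 100, 100])
print([round(v, 5) for v in log_ratios])
```

            <p>In this toy example the first two spots have equal normalized intensities in both channels, giving a log ratio of 0, while the third spot is twofold higher in the Cy5 channel, giving log<sub>10</sub>(2), about 0.301.</p>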
         </sec>
         <sec>
            <st>
               <p>Gene selection</p>
            </st>
            <p>An SVM-based gene selection algorithm has two main components: gene ranking and gene selection. Gene ranking results in a sorted list of genes in decreasing order of importance for classification. This issue is complicated since some genes become important only if combined with other genes. After genes are ranked, genes are selected according to their ranks.</p>
            <p>When there are a large number of features, a conservative strategy is to eliminate the least important feature one at a time, recursively. In this work, we adopted the SVM-RFE (recursive feature elimination) algorithm <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, in which the least important feature is identified and removed in each iteration, the remaining features are re-evaluated, and the process repeats until no more features are left for consideration. For the linear kernel, the importance of a feature is determined by the associated weight magnitude, and the least important feature is the one with the smallest weight magnitude. SVM-RFE essentially implements the strategy of backward feature elimination. In principle, feature ranking becomes more accurate as less important features are removed successively. To improve the speed, a chunk of least important features was eliminated per step until 256 genes remained, from which point one gene was removed per step. The RFE ranking criterion is given by</p>
            <p><it>Rank(g<sub>i</sub>)</it> &lt; <it>Rank(g<sub>j</sub>)</it> &#8660; <it>Order-of-Elimination(g<sub>i</sub>)</it> &gt; <it>Order-of-Elimination(g<sub>j</sub>)</it></p>
            <p>That is, the later a gene is eliminated, the higher its rank (i.e., the smaller its rank number). Thus, the first-ranked gene is the last to be removed.</p>
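            <p>The elimination schedule and the ranking criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: weight_fn is a hypothetical callable standing in for retraining a linear SVM on the remaining genes, the chunk size (half of the remaining genes) is our assumption because the text does not specify it, and the toy weights are fixed rather than re-estimated.</p>

```python
def rfe_rank(genes, weight_fn, switch_at=256, chunk_fraction=0.5):
    """Return genes ordered from rank 1 (last eliminated) to last rank (first eliminated).

    weight_fn(remaining) is a hypothetical callable that would retrain a linear
    SVM on the remaining genes and return a dict mapping gene to weight.
    """
    remaining = list(genes)
    eliminated = []                              # genes in order of elimination
    while remaining:
        weights = weight_fn(remaining)
        by_magnitude = sorted(remaining, key=lambda g: abs(weights[g]))
        if len(remaining) > switch_at:
            # Chunk phase: drop several genes at once, but stop at switch_at.
            n_drop = min(int(len(remaining) * chunk_fraction),
                         len(remaining) - switch_at)
            n_drop = max(n_drop, 1)
        else:
            n_drop = 1                           # one gene per step from here on
        dropped = by_magnitude[:n_drop]          # smallest weight magnitudes first
        eliminated.extend(dropped)
        drop_set = set(dropped)
        remaining = [g for g in remaining if g not in drop_set]
    return eliminated[::-1]                      # later elimination = higher rank

# Toy example with fixed weights and a switch point of 4 instead of 256.
fixed = {"g1": -0.1, "g2": 0.9, "g3": 0.2, "g4": -0.8,
         "g5": 0.3, "g6": 0.05, "g7": -0.7, "g8": 0.4}
ranking = rfe_rank(list(fixed), lambda rem: {g: fixed[g] for g in rem}, switch_at=4)
print(ranking)
```

            <p>In the toy run, the four genes with the smallest weight magnitudes are removed together in the first step, after which genes are removed singly; reversing the order of elimination yields the ranking, so the gene removed last is ranked first.</p>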
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>L. Fu developed the method and conducted the experiments. C. Fu-Liu interpreted the data. Both authors drafted, read, and approved the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work is supported by the National Institutes of Health and the National Science Foundation under grants HL-080311 and IIS-0221954. E. S. Youn assisted in coding the algorithm.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Molecular classification of cancer: class discovery and class prediction by gene expression monitoring</p>
            </title>
            <aug>
               <au>
                  <snm>Golub</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Slonim</snm>
                  <fnm>DK</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Huard</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Gaasenbeek</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Coller</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Loh</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Downing</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Caligiuri</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Bloomfield</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1999</pubdate>
            <volume>286</volume>
            <fpage>531</fpage>
            <lpage>537</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.286.5439.531</pubid>
                  <pubid idtype="pmpid" link="fulltext">10521349</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Feature (gene) selection in gene expression-based tumor classification</p>
            </title>
            <aug>
               <au>
                  <snm>Xiong</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jin</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Boerwinkle</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Mol Genet Metab</source>
            <pubdate>2001</pubdate>
            <volume>73</volume>
            <fpage>239</fpage>
            <lpage>247</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/mgme.2001.3193</pubid>
                  <pubid idtype="pmpid" link="fulltext">11461191</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Gene selection for cancer classification using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Guyon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Weston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Barnhill</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2002</pubdate>
            <volume>46</volume>
            <fpage>389</fpage>
            <lpage>422</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1012487302797</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Improving reliability of gene selection from microarray functional-genomics data</p>
            </title>
            <aug>
               <au>
                  <snm>Fu</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Youn</snm>
                  <fnm>ES</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Information Technology in Biomedicine</source>
            <pubdate>2003</pubdate>
            <volume>7</volume>
            <fpage>191</fpage>
            <lpage>196</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1109/TITB.2003.816558</pubid>
                  <pubid idtype="pmpid">14518732</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Gene selection: a Bayesian variable selection approach</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>Sha</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>ER</fnm>
               </au>
               <au>
                  <snm>Vannucci</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mallick</snm>
                  <fnm>BK</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>90</fpage>
            <lpage>97</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/19.1.90</pubid>
                  <pubid idtype="pmpid" link="fulltext">12499298</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Selection bias in gene extraction on the basis of microarray gene-expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Ambroise</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>GJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>6562</fpage>
            <lpage>6566</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124442</pubid>
                  <pubid idtype="pmpid" link="fulltext">11983868</pubid>
                  <pubid idtype="doi">10.1073/pnas.102102699</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Significance analysis of microarrays applied to the ionizing radiation response</p>
            </title>
            <aug>
               <au>
                  <snm>Tusher</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>5116</fpage>
            <lpage>5121</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">33173</pubid>
                  <pubid idtype="pmpid" link="fulltext">11309499</pubid>
                  <pubid idtype="doi">10.1073/pnas.091062498</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Multi-class cancer subtype classification based on gene expression signatures with reliability analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Fu</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Fu-Liu</snm>
                  <fnm>CS</fnm>
               </au>
            </aug>
            <source>FEBS Lett</source>
            <pubdate>2004</pubdate>
            <volume>561</volume>
            <fpage>186</fpage>
            <lpage>190</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0014-5793(04)00175-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">15013775</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Cancer Subtype Classification Based on Gene Expression Signatures</p>
            </title>
            <aug>
               <au>
                  <snm>Fu</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <url>http://www.cise.ufl.edu/~fu/NSF/cancer_classify_GES.html</url>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks</p>
            </title>
            <aug>
               <au>
                  <snm>Khan</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wei</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Ringner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Saal</snm>
                  <fnm>LH</fnm>
               </au>
               <au>
                  <snm>Ladanyi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Westermann</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Berthold</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Schwab</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Antonescu</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Peterson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Meltzer</snm>
                  <fnm>PS</fnm>
               </au>
            </aug>
            <source>Nat Med</source>
            <pubdate>2001</pubdate>
            <volume>7</volume>
            <fpage>673</fpage>
            <lpage>679</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/89044</pubid>
                  <pubid idtype="pmpid" link="fulltext">11385503</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Alon</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Barkai</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Notterman</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ybarra</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mack</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>6745</fpage>
            <lpage>6750</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">21986</pubid>
                  <pubid idtype="pmpid" link="fulltext">10359783</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.12.6745</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Multiclass cancer diagnosis using tumor gene expression signatures</p>
            </title>
            <aug>
               <au>
                  <snm>Ramaswamy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rifkin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mukherjee</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Yeang</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Angelo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ladd</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Reich</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Latulippe</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Poggio</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Gerald</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Loda</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Golub</snm>
                  <fnm>TR</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>15149</fpage>
            <lpage>15154</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">64998</pubid>
                  <pubid idtype="pmpid" link="fulltext">11742071</pubid>
                  <pubid idtype="doi">10.1073/pnas.211566398</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Diagnosis of multiple cancer types by shrunken centroids of gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Narasimhan</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>6567</fpage>
            <lpage>6572</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124443</pubid>
                  <pubid idtype="pmpid" link="fulltext">12011421</pubid>
                  <pubid idtype="doi">10.1073/pnas.082099299</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Cluster analysis and display of genome-wide expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>14863</fpage>
            <lpage>14868</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">24541</pubid>
                  <pubid idtype="pmpid" link="fulltext">9843981</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.25.14863</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Baxevanis</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Ouellette</snm>
                  <fnm>BFF</fnm>
               </au>
            </aug>
            <publisher>New York, NY, John Wiley &amp; Sons</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments</p>
            </title>
            <aug>
               <au>
                  <snm>Kerr</snm>
                  <fnm>MK</fnm>
               </au>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>8961</fpage>
            <lpage>8965</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">55356</pubid>
                  <pubid idtype="pmpid" link="fulltext">11470909</pubid>
                  <pubid idtype="doi">10.1073/pnas.161273698</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis</p>
            </title>
            <aug>
               <au>
                  <snm>Statnikov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Aliferis</snm>
                  <fnm>CF</fnm>
               </au>
               <au>
                  <snm>Tsamardinos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Hardin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Levy</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Overexpression of the pseudoautosomal gene MIC2 in Ewing's sarcoma and peripheral primitive neuroectodermal tumor</p>
            </title>
            <aug>
               <au>
                  <snm>Kovar</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Dworzak</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Strehl</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Schnell</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Ambros</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Ambros</snm>
                  <fnm>PF</fnm>
               </au>
               <au>
                  <snm>Gadner</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Oncogene</source>
            <pubdate>1990</pubdate>
            <volume>5</volume>
            <fpage>1067</fpage>
            <lpage>1070</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1695726</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Gene expression profiling in two morphologically different uterine cervical carcinoma cell lines derived from a single donor using a human cancer cDNA array</p>
            </title>
            <aug>
               <au>
                  <snm>Fujimoto</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Nishikawa</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Iwasaki</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Akutagawa</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Teramoto</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kudo</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Gynecol Oncol</source>
            <pubdate>2004</pubdate>
            <volume>93</volume>
            <fpage>446</fpage>
            <lpage>453</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ygyno.2004.02.012</pubid>
                  <pubid idtype="pmpid" link="fulltext">15099960</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>S-100 protein stimulates cellular proliferation</p>
            </title>
            <aug>
               <au>
                  <snm>Klein</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Hoon</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Nangauyan</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Okun</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Cochran</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Cancer Immunol Immunother</source>
            <pubdate>1989</pubdate>
            <volume>29</volume>
            <fpage>133</fpage>
            <lpage>138</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF00199288</pubid>
                  <pubid idtype="pmpid">2720706</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Suppression of tumorigenicity in breast cancer cells by the microfilament protein profilin 1</p>
            </title>
            <aug>
               <au>
                  <snm>Janke</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Schluter</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Jandrig</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Theile</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kolble</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Grinstein</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Estevez-Schwarz</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Schlag</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Jockusch</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Scherneck</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J Exp Med</source>
            <pubdate>2000</pubdate>
            <volume>191</volume>
            <fpage>1675</fpage>
            <lpage>1686</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1084/jem.191.10.1675</pubid>
                  <pubid idtype="pmpid" link="fulltext">10811861</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>[Expression of P27 protein and cyclin E in colon cancer]</p>
            </title>
            <aug>
               <au>
                  <snm>Dai</snm>
                  <fnm>JY</fnm>
               </au>
               <au>
                  <snm>Liang</snm>
                  <fnm>XP</fnm>
               </au>
               <au>
                  <snm>Wen</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>CY</fnm>
               </au>
               <au>
                  <snm>Deng</snm>
                  <fnm>CZ</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>ZH</fnm>
               </au>
            </aug>
            <source>Ai Zheng</source>
            <pubdate>2003</pubdate>
            <volume>22</volume>
            <fpage>1093</fpage>
            <lpage>1095</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14558959</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A20 and A20-binding proteins as cellular inhibitors of nuclear factor-kappa B-dependent gene expression and apoptosis</p>
            </title>
            <aug>
               <au>
                  <snm>Beyaert</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heyninck</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Van Huffel</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Biochem Pharmacol</source>
            <pubdate>2000</pubdate>
            <volume>60</volume>
            <fpage>1143</fpage>
            <lpage>1151</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0006-2952(00)00404-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">11007952</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>G proteins in cancer: the prostate cancer paradigm</p>
            </title>
            <aug>
               <au>
                  <snm>Daaka</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Sci STKE</source>
            <pubdate>2004</pubdate>
            <volume>2004</volume>
            <fpage>re2</fpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14734786</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Machine Learning Models For Classification Of Lung Cancer and Selection of Genomic Markers Using Array Gene Expression Data</p>
            </title>
            <aug>
               <au>
                  <snm>Aliferis</snm>
                  <fnm>CF</fnm>
               </au>
               <au>
                  <snm>Tsamardinos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Massion</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Statnikov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Fananapazir</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hardin</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>A theoretical characterization of linear SVM-based feature selection. Banff, Alberta, Canada</p>
            </title>
            <aug>
               <au>
                  <snm>Hardin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tsamardinos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Aliferis</snm>
                  <fnm>CF</fnm>
               </au>
            </aug>
            <publisher>New York, NY, ACM Press</publisher>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Time and sample efficient discovery of Markov blankets and direct causal relations. Washington, D.C.</p>
            </title>
            <aug>
               <au>
                  <snm>Tsamardinos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Aliferis</snm>
                  <fnm>CF</fnm>
               </au>
               <au>
                  <snm>Statnikov</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <publisher/>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B28">
            <title>
               <p>New gene selection method for classification of cancer subtypes considering within-class variation</p>
            </title>
            <aug>
               <au>
                  <snm>Cho</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>IB</fnm>
               </au>
            </aug>
            <source>FEBS Lett</source>
            <pubdate>2003</pubdate>
            <volume>551</volume>
            <fpage>3</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0014-5793(03)00819-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">12965195</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Neural Networks: A Comprehensive Foundation</p>
            </title>
            <aug>
               <au>
                  <snm>Haykin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <publisher>Upper Saddle River, NJ, Prentice Hall</publisher>
            <edition>Second</edition>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Support Vector Machines</p>
            </title>
            <aug>
               <au>
                  <snm>Cristianini</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Shawe-Taylor</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <publisher>Cambridge, UK, Cambridge University Press</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Knowledge-based analysis of microarray gene expression data by using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Grundy</snm>
                  <fnm>WN</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cristianini</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sugnet</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Furey</snm>
                  <fnm>TS</fnm>
               </au>
               <au>
                  <snm>Ares</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <fpage>262</fpage>
            <lpage>267</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">26651</pubid>
                  <pubid idtype="pmpid" link="fulltext">10618406</pubid>
                  <pubid idtype="doi">10.1073/pnas.97.1.262</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Support vector networks</p>
            </title>
            <aug>
               <au>
                  <snm>Cortes</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>1995</pubdate>
            <volume>20</volume>
            <fpage>273</fpage>
            <lpage>297</lpage>
         </bibl>
      </refgrp>
   </bm>
</art>
