<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2003-4-12-r83</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>Multiclass classification of microarray data with repeated measurements: application to cancer</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Yeung</snm>
               <mnm>Yee</mnm>
               <fnm>Ka</fnm>
               <insr iid="I1"/>
               <email>kayee@u.washington.edu</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Bumgarner</snm>
               <mi>E</mi>
               <fnm>Roger</fnm>
               <insr iid="I1"/>
               <email>rogerb@u.washington.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Microbiology, Box 358070, University of Washington, Seattle, WA 98195, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>12</issue>
         <fpage>R83</fpage>
         <url>http://genomebiology.com/2003/4/12/R83</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2003-4-12-r83</pubid>
               <pubid idtype="pmpid">14659020</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>4</day>
               <month>6</month>
               <year>2003</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>14</day>
               <month>8</month>
               <year>2003</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>17</day>
               <month>10</month>
               <year>2003</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>24</day>
               <month>11</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>Yeung and Bumgarner; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <shorttitle>
         <p>Multiclass classification of microarray data with repeated measurements: application to cancer</p>
      </shorttitle>
      <shortabs>
         <p>Prediction of the diagnostic category of a tissue sample from its gene-expression profile and selection of relevant genes for class prediction have important applications in cancer research. Uncorrelated shrunken centroid and error-weighted, uncorrelated shrunken centroid algorithms have been developed that are applicable to microarray data with any number of classes.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>Prediction of the diagnostic category of a tissue sample from its gene-expression profile and selection of relevant genes for class prediction have important applications in cancer research. We have developed the uncorrelated shrunken centroid (USC) and error-weighted, uncorrelated shrunken centroid (EWUSC) algorithms that are applicable to microarray data with any number of classes. We show that removing highly correlated genes typically improves classification results using a small set of genes.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Rationale</p>
         </st>
         <p>The problem of predicting the diagnostic category of a given tissue sample is of fundamental clinical importance. Conventional diagnostic methods are based on subjective evaluation of the morphological appearance of the tissue sample, which requires a visible phenotype and a trained pathologist to interpret the view. In some cases the class is easily identified by cell morphology or cell-type distribution, but in many cases apparently similar pathologies can lead to very different clinical outcomes. Since the advent of DNA array technology <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, researchers have begun to use expression array analysis as a quantitative phenotyping tool. The potential advantage of using arrays for phenotyping is that they provide a simultaneous quantitative measure of thousands of parameters (for example, gene-expression levels), some of which are likely to have disease relevance. When array analysis is used predominantly for phenotyping, we refer to the expression pattern as an 'expression array phenotype'. Owing to the ability to quantify a large number of parameters, the use of expression arrays in phenotyping promises both more accurate class prediction and the identification of subclasses that could not be defined by traditional methods.</p>
         <p>There has been a recent explosion in the use of expression array phenotyping for identification and/or classification in a variety of diagnostic areas. Examples of diagnostic categories (or classes) include cancer versus non-cancer <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>, different subtypes of tumor <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>, and prediction of responses to various drugs or cancer prognosis <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. The prediction of the diagnostic category of a tissue sample from its expression array phenotype given the availability of similar data from tissues in identified categories is known as classification (or supervised learning). A challenge in predicting diagnostic categories using microarray data is that the number of genes is usually significantly greater than the number of tissue samples available, and only a subset of the genes is relevant in distinguishing different classes. Selection of relevant genes for classification is known as feature selection. This has three main applications: first, the classification accuracy is often improved using a subset instead of the entire set of genes; second, a small set of relevant genes is convenient for developing diagnostic tests; and third, these genes may lead to biologically interesting insights that are characteristic of the classes of interest.</p>
         <p>There have been many reports that address the classification and feature-selection problems, for example <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B14">14</abbr><abbr bid="B17">17</abbr></abbrgrp>. However, many of these methods are tailored towards binary classification in which there are only two classes <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B14">14</abbr></abbrgrp>. Moreover, there has been very limited effort to develop classification and feature-selection algorithms for microarray data with repeated measurements or error estimates. Array data is well known to be noisy; for example, Lee <it>et al</it>. <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> showed that any single microarray output is subject to substantial variability. This is particularly true for genes with low expression levels, which are more difficult to measure than genes with high expression levels. As the cost of microarray experiments is declining, more research laboratories are generating microarray data with repeated measurements <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B14">14</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. Repeated measurements not only provide improved estimates of gene-expression levels but can also be used to estimate the uncertainty or variability in the measurement. In some cases the repeated measurements are biological replicates (for example, independent samples), whereas in other cases only technical replicates are available. Regardless of the source, however, variability estimates should be taken into account in both clustering and classification algorithms, as variability estimates can potentially be exploited to improve the results.</p>
         <p>We have developed two algorithms: the uncorrelated shrunken centroid (USC) algorithm and the error-weighted, uncorrelated shrunken centroid (EWUSC) algorithm. Both USC and EWUSC are integrated feature-selection and classification algorithms that are applicable to data with any number of classes. Our primary contribution is that both USC and EWUSC exploit interdependence between genes to reduce the number of selected features. In addition, EWUSC takes advantage of variability estimates over repeated measurements to down-weight noisy genes and noisy experiments so that no <it>ad hoc </it>filtering step is necessary. On the other hand, USC is applicable to microarray datasets with or without repeated measurements.</p>
         <sec>
            <st>
               <p>Introduction to classification and feature selection</p>
            </st>
            <p>Classification is a supervised learning approach, in which the classes (or labels) of a subset of samples are inputs to the algorithm. This is in contrast to clustering, which is an unsupervised approach, in which no knowledge of the samples is assumed. A training set is a set of samples for which the classes are known. A test set is a set of samples for which the classes are assumed to be unknown to the algorithm, and the goal is to predict which classes these samples belong to. The first step in classification is to build a 'classifier' using the given training set, and the second step is to use the classifier to predict the classes of the test set.</p>
            <p>In the context of gene-expression data, the samples are usually the experiments, and the classes (or labels) are usually different types of tissue samples (for example, cancer versus non-cancer, different tumor types, rate of disease progression, and response to therapy). A typical microarray dataset consists of thousands to tens of thousands of genes, and dozens to hundreds of experiments. One challenge of classification using microarray data is that the number of genes is significantly greater than the number of samples. In this situation, it is possible to find both random and biologically relevant correlations of gene behavior with sample type. To protect against spurious results, the goal is to identify the smallest possible subset of genes that correlate most strongly with the known class labels. In addition, a small subset of genes is desirable for the development of expression-based diagnostics. The problem of selecting relevant genes (or features) for classification is known as feature selection.</p>
            <p>Cross-validation is a well-established technique used to optimize the parameters or features chosen in a classifier. In m-fold cross-validation, the training set is randomly divided into m disjoint subsets of roughly equal size. Each of these m subsets is left out in turn for evaluation, and the other (m - 1) subsets are used as inputs to the classification algorithm. In this work, we randomly divide each class into m disjoint subsets (where m is less than the size of the smallest class in the training set), so that each class is represented in the subset fed to the classification algorithm. The left-out subset of the training set is used to evaluate classification accuracy because the classes of this subset are known. The most popular form of cross-validation is leave-one-out cross-validation (LOOCV), in which m is equal to the number of samples in the training set, and each sample in the training set is left out in turn to evaluate the prediction results.</p>
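            <p>The class-wise partition described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names <it>stratified_folds</it> and <it>cross_validation_splits</it>, the seed argument and the round-robin assignment of shuffled class members to folds are all assumptions made for the example.</p>
            <p>
```python
import random
from collections import defaultdict


def stratified_folds(labels, m, seed=0):
    """Split sample indices into m disjoint folds, dividing each class separately.

    Hypothetical helper: each class is shuffled and dealt out to the m folds
    in turn, so every fold receives samples from every class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(m)]
    for label in sorted(by_class):
        members = by_class[label]
        # The text requires m to be less than the size of the smallest class.
        if not len(members) > m:
            raise ValueError("m must be less than the size of the smallest class")
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % m].append(idx)
    return folds


def cross_validation_splits(labels, m, seed=0):
    """Yield (training indices, left-out indices) for each of the m folds."""
    folds = stratified_folds(labels, m, seed)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        yield train, sorted(held_out)
```
            </p>
            <p>Each left-out fold would then be classified using a classifier built on the remaining indices, and the errors summed over the m folds.</p>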
         </sec>
         <sec>
            <st>
               <p>Related work</p>
            </st>
            <p>van 't Veer <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> recently applied a binary classification algorithm to cDNA array data with repeated measurements, and classified breast cancer patients into good and poor prognosis groups. Their classification algorithm consists of the following steps. The first step is filtering, in which only genes with both small error estimates and significant regulation relative to a reference pool of samples from all patients are chosen. The second step consists of identifying a set of genes whose behavior is highly correlated with the two sample types (for example, upregulated in one sample type but downregulated in the other). These genes are rank-ordered so that genes with the highest magnitudes of correlation with the sample types have top ranks. In the third step, the set of relevant genes is optimized by sequentially adding genes with top-ranked correlation from the second step. Leave-one-out cross-validation is used to evaluate and choose an optimal set of features. The approach of van 't Veer <it>et al</it>. takes variability estimates of repeated measurements into consideration by using error-weighted correlation. However, this method involves an <it>ad hoc </it>filtering step and does not generalize to more than two classes.</p>
            <p>Ramaswamy <it>et al</it>. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> combined support vector machines (SVMs), which are binary classifiers, to solve the multiclass classification problem. They showed that the one-versus-all approach of combining SVMs yields the minimum number of classification errors on their Affymetrix data with 14 tumor types. The one-versus-all combination approach builds k (the number of classes) binary classifiers, each of which distinguishes one class from all the other classes. Suppose binary classifier i predicts a discriminant value f<sub>i</sub>(x) for a given sample x in the test set. The combined multiclass classifier assigns sample x to the class for which the corresponding binary classifier produces the highest discriminant value. In addition to not taking variability estimates of repeated measurements into account, this approach selects different relevant features (genes) for each binary classifier.</p>
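            <p>The one-versus-all combination rule can be illustrated with a short sketch. The function name <it>one_versus_all_predict</it> and the representation of each binary classifier as a plain Python callable are assumptions made for the example; the sketch does not train any SVM and only shows how the k discriminant values are combined.</p>
            <p>
```python
def one_versus_all_predict(discriminants, sample):
    """Assign a sample to the class whose binary classifier scores highest.

    discriminants: dict mapping each class label to a callable f_i(x) that
    returns the discriminant value of the 'class i versus the rest' classifier.
    These callables are placeholders for trained binary classifiers.
    """
    scores = {label: f(sample) for label, f in discriminants.items()}
    # Pick the label with the largest discriminant value.
    return max(scores, key=scores.get)
```
            </p>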
            <p>Nguyen and Rocke <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp> used partial least squares (PLS) for feature selection, together with traditional classification algorithms such as logistic discrimination and quadratic discrimination to classify multiple tumor types from microarray data. These traditional classification algorithms require the number of samples (experiments) to be greater than the number of variables (genes), and it is therefore essential to reduce the dimensionality before applying these traditional classification techniques. PLS is a dimension-reduction technique that maximizes the covariance between the classes and a linear combination of the genes. This approach can be generalized to multiple classes, but it does not make use of variability estimates of the data. In addition, it is a multistep process that involves a filtering step (to select genes with significant mean differences) and then application of PLS to further reduce the dimensionality so that the number of samples is greater than the number of dimensions.</p>
            <p>Dudoit <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> compared the performance of different discrimination methods (including nearest neighbor classifiers, linear discriminant analysis and classification trees) for classifying multiple tumor types using gene-expression data. None of the discrimination methods they evaluated takes measurement variability into consideration, and their emphasis is on discrimination methods and not feature selection.</p>
            <p>Yeung <it>et al</it>. <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> showed that clustering algorithms that take advantage of repeated measurements (including the error-weighted approach that down-weights noisy measurements) yield more accurate and more stable clusters. Here, we will focus on the supervised learning approach, instead of the unsupervised clustering technique.</p>
            <p>Tibshirani <it>et al</it>. <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> developed a 'shrunken centroid' (SC) algorithm for classifying multiple cancer types. It is an integrated approach for feature selection and classification. Features are selected by considering one gene at a time: the difference between the class centroid (average expression level or ratio within a class) of a gene and the overall centroid (average expression level or ratio over all classes) of a gene is compared to the within-class standard deviation plus a 'shrinkage threshold' which is fixed for all genes. The intuition is that genes with at least one class centroid that is significantly different from the overall centroid are selected as relevant genes. The size of the shrinkage threshold is determined by cross-validation on the training set to minimize classification errors.</p>
         </sec>
         <sec>
            <st>
               <p>Our contributions</p>
            </st>
            <p>Our algorithms have the following desirable characteristics. Both EWUSC and USC exploit the interdependence of genes to reduce the number of selected features. EWUSC takes advantage of the variability of gene-expression data over repeated measurements, so no <it>ad hoc </it>filtering step is necessary. Both EWUSC and USC can be applied to data with any number of classes. Both EWUSC and USC adopt an integrated approach for both feature selection and classification. Both algorithms make no assumption about data distributions.</p>
            <p>We illustrate the advantage of removing correlated genes (that is, the USC algorithm) on the NCI 60 data <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> for which there is no variability information. This dataset has been extensively used in other publications for classification algorithm development <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B25">25</abbr></abbrgrp>. We illustrate and compare our USC and EWUSC algorithms on two real datasets: a multiple tumor dataset from Ramaswamy <it>et al</it>. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and a breast cancer dataset from van 't Veer <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. These two datasets were chosen as they are publicly available in a form from which we can calculate or obtain error estimates for each gene-expression level or ratio. We used a subset of the multiple tumor data <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> that consists of 7,129 genes and 11 tumor types on Affymetrix chips. There are 96 samples in the training set, and 27 samples in the test set. For the Affymetrix dataset we estimated the variability in the gene-expression levels using the robust multi-array analysis (RMA) tool <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp> from the BioConductor project <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. A subset of the published data was used as we could only obtain raw data (.cel files) for a subset. The breast cancer dataset <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> consists of 25,000 genes with four repeated measurements on cDNA arrays. There are 78 samples in the training set, 19 samples in the test set, and two classes of patients: one class with good prognosis (with more than 5 years of survival time), and another class with poor prognosis (with less than 5 years of survival time). For the breast cancer cDNA array data, published p-values as calculated by Rosetta's Resolver software were used to calculate the error estimates. In addition, we created synthetic datasets with repeated measurements and compared the performance of EWUSC, USC and SC at different noise levels.</p>
            <p>We adopted three criteria for assessing feature selection and classification algorithms: prediction accuracy, number of relevant genes and feature stability. Prediction accuracy is defined as the percentage of correct classifications on the test set. The number of relevant genes is the total number of genes used to achieve optimal prediction accuracy. Feature stability is the level of agreement of selected genes chosen over different cross-validation runs of the algorithm.</p>
            <p>Using these algorithms we obtained the following general results. Exploiting gene interdependence by removal of correlated genes typically results in comparable or higher prediction accuracy using fewer relevant genes. This is highly desirable if one wishes to develop diagnostic tools from the selected set of genes. Using error or variability estimates as weighting factors generally yields higher feature stability and reduces the number of relevant genes on real datasets. On the multiple tumor data, our EWUSC algorithm achieves a 16% increase in prediction accuracy, using only 10% of the genes as features (compared with using all the available genes in the published result). On the breast cancer data, our EWUSC algorithm produces the same number of classification errors as the published result using a larger feature set. Unlike the published algorithm for this dataset, however, the EWUSC algorithm is applicable to datasets with more than two classes.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Our integrated classification and feature-selection algorithm</p>
         </st>
         <p>As our USC and EWUSC algorithms are motivated by the shrunken centroid (SC) algorithm <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, we will briefly review the SC algorithm, and then discuss our USC and EWUSC algorithms. Details of these algorithms can be found later in the paper.</p>
         <sec>
            <st>
               <p>The SC approach</p>
            </st>
            <p>The SC approach <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> is essentially a robust version of the 'nearest centroid' approach, in which a sample is assigned to the class with the nearest average pattern. Features are selected by considering each gene individually. The overall centroid of a gene i is defined as the average expression level/ratio of gene i over all the experiments. The class centroid of a gene i in class k is defined to be the average expression level/ratio of gene i over all the samples in class k. A gene is predictive of the class if at least one of its class centroids significantly differs from its overall centroid. One obvious definition of 'significantly' in the previous sentence is 'differs by more than the variation (or standard deviation) within the class', which is essentially a modified form of a <it>t</it>-test. The shrunken centroid method adds an additional term (s<sub>0</sub>, described in <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and in the section Details of algorithms below) to the within-class standard deviation; that is, the difference between the in-class average and the overall average must exceed the in-class variation by s<sub>0</sub>. A <it>t</it>-test-like statistic, the relative difference (d<sub>ik</sub>), is defined as the difference between the class centroid and the overall centroid divided by the sum of the within-class standard deviation and s<sub>0</sub>, and the absolute value of d<sub>ik</sub> is reduced by the 'shrinkage threshold' &#916;. The threshold &#916; is determined by cross-validation such that the number of classification errors is minimized on the training set.</p>
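            <p>A simplified sketch of the relative difference and its shrinkage for a single gene is given below. It follows the verbal description above and is not the published SC implementation: the function names are hypothetical, the within-class variation is taken to be the pooled within-class standard deviation, and the class-size scaling factor used in the published definition of d<sub>ik</sub> is omitted for brevity.</p>
            <p>
```python
import math


def relative_differences(values, labels, s0):
    """Return {class: d_k} for one gene.

    Simplified: d_k = (class centroid - overall centroid) / (s + s0), where s
    is the pooled within-class standard deviation. The class-size scaling
    factor of the published statistic is deliberately left out.
    """
    n = len(values)
    classes = sorted(set(labels))
    overall = sum(values) / n
    centroids = {}
    within_ss = 0.0
    for k in classes:
        members = [v for v, lab in zip(values, labels) if lab == k]
        centroids[k] = sum(members) / len(members)
        within_ss += sum((v - centroids[k]) ** 2 for v in members)
    s = math.sqrt(within_ss / (n - len(classes)))
    return {k: (centroids[k] - overall) / (s + s0) for k in classes}


def soft_threshold(d, delta):
    """Reduce the absolute value of d by delta, setting it to zero if it would change sign."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def is_relevant(values, labels, s0, delta):
    """A gene is kept if at least one shrunken relative difference is non-zero."""
    d = relative_differences(values, labels, s0)
    return any(soft_threshold(x, delta) != 0.0 for x in d.values())
```
            </p>
            <p>Increasing the threshold in this sketch drives more shrunken differences to zero, so fewer genes remain relevant.</p>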
         </sec>
         <sec>
            <st>
               <p>The USC approach</p>
            </st>
            <p>Our USC algorithm adds a step to the SC algorithm to remove redundant, correlated genes. The benefit of removing highly correlated genes is twofold. First, it reduces the number of relevant features (genes) needed for classification. A small feature set is highly desirable if one wishes to use the results of feature selection and classification to develop diagnostic tools such as reverse transcription PCR (RT-PCR)-based tests on a small number of the most relevant genes. Second, the removal of redundant genes reduces the impact of over-fitting, and hence, potentially improves classification accuracy.</p>
            <p>The SC algorithm produces a set of relevant genes, S<sub>&#916;</sub>, for any given shrinkage threshold &#916;. As &#916; increases, the number of relevant genes in S<sub>&#916; </sub>decreases; that is, the gene list is reduced to selected genes for which the within-class centroids are farther away from the overall centroid and for which the within-class variation is small. Each gene is considered independently in the SC algorithm. Our modification exploits the correlation between genes by removing genes that are highly correlated within the set of relevant genes S<sub>&#916;</sub>. Specifically, we compute the pairwise correlation for each pair of genes (g<sub>i</sub>, g<sub>j</sub>) in S<sub>&#916; </sub>for each &#916;. If the pairwise correlation is greater than a correlation threshold &#961;<sub>0</sub>, the gene g<sub>j </sub>with the smaller relative difference is removed from the set of relevant genes. This results in a set of relevant genes S(&#916;, &#961;<sub>0</sub>) for each shrinkage threshold &#916; and each correlation threshold &#961;<sub>0</sub>. These relevant genes are used to classify new samples. The USC algorithm is equivalent to the SC algorithm when no correlated genes are removed (that is, &#961;<sub>0 </sub>= 1). We apply this USC algorithm to the training set using cross-validation to determine the number of classification errors for each &#916; and each &#961;<sub>0</sub>. The optimal parameters for &#916; and &#961;<sub>0 </sub>are chosen such that the number of cross-validation classification errors is minimized on the training set. These optimal parameters are then used to classify samples from unknown classes on the test set. Our results show that the removal of correlated genes provides a significant improvement over the SC algorithm in classification results, and hence our USC algorithm is useful for datasets in which error estimates are not available.</p>
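            <p>The removal of correlated genes can be sketched as below. This is one possible reading of the pairwise rule, not the authors' code: genes are visited in decreasing order of a per-gene score (assumed here to be the magnitude of the relative difference), and a gene is dropped if its Pearson correlation with any gene already kept exceeds the correlation threshold. The names <it>pearson</it> and <it>remove_correlated</it> are hypothetical, and genes with constant expression are assumed to have been excluded beforehand.</p>
            <p>
```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return max(-1.0, min(1.0, r))


def remove_correlated(expr, score, rho0):
    """Return the genes kept after removing highly correlated ones.

    expr:  dict mapping gene name to its expression vector over the samples.
    score: dict mapping gene name to the magnitude of its relative difference
           (assumed measure of relevance).
    rho0:  correlation threshold; with rho0 = 1 no gene is removed.
    """
    kept = []
    for g in sorted(expr, key=lambda name: score[name], reverse=True):
        # Keep g only if it is not too correlated with any higher-scoring kept gene.
        if all(rho0 >= pearson(expr[g], expr[h]) for h in kept):
            kept.append(g)
    return kept
```
            </p>
            <p>In a full procedure, this filter would be applied to the set of relevant genes for every candidate pair of shrinkage and correlation thresholds inside the cross-validation loop.</p>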
         </sec>
         <sec>
            <st>
               <p>The EWUSC approach</p>
            </st>
            <p>Our EWUSC algorithm is based on the USC algorithm with a key modification: we take advantage of error estimates or variability over repeated measurements. We define an error-weighted overall centroid, error-weighted class centroid, error-weighted relative difference, error-weighted shrunken class centroid, and error-weighted discriminant score in order to down-weight both noisy genes and noisy experiments. In addition, we adopt the error-weighted correlation in the removal of highly correlated genes to select relevant genes. Thus the EWUSC algorithm is identical to the USC algorithm except for the error-weighted definitions used to down-weight noisy genes and noisy experiments in our calculations. When all genes and all experiments have the same variability estimates, the EWUSC algorithm is equivalent to the USC algorithm. As our results show, this error-weighted approach typically reduces the number of relevant genes and improves feature stability, and thus EWUSC is usually the method of choice when error or variability estimates are available. A detailed description of the EWUSC algorithm is given later in the paper.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Datasets used</p>
         </st>
         <sec>
            <st>
               <p>National Cancer Institute NCI 60 data</p>
            </st>
            <p>In the NCI 60 data <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, cDNA microarrays were used to study gene expression in approximately 60 cell lines derived from tumors with different sites of origin (see Table <tblr tid="T1">1</tblr>). We used the same pre-processed dataset as in Dudoit <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, which consists of log expression ratios of 5,244 genes over 61 experiments. Two prostate cell lines and one unknown cell line from the original data <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> were excluded from their analysis because of their small class sizes. Only one leukemia and one breast cancer cell line were repeated three times, and hence repeated measurements and variability estimates are not available for all 61 samples. These repeated experiments of the leukemia and breast cancer cell lines are treated as individual samples. In addition, no additional test set is available for this data. To compare our results with those of Dudoit <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, we adopted their 2:1 scheme in which one third of the samples are reserved as a test set.</p>
            <tbl id="T1" hint_layout="single">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Tumor types and class sizes of the NCI 60 dataset</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Origin of cell lines</p>
                     </c>
                     <c ca="center">
                        <p>Class size (total 61 samples)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Central nervous system</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Melanoma</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-small-cell lung carcinoma</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ovarian</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Renal</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Tumor types and class sizes of the original full data with a total of 61 experiments.</p>
               </tblfn>
            </tbl>
            <p>Specifically, we randomly divided each class in the original data (61 experiments) into roughly three parts, assigning two parts to the training set (43 experiments in total) and one part to the test set (18 experiments in total). Table <tblr tid="T2">2</tblr> gives the class sizes of the training and test sets. The optimal parameters are determined using cross-validation on the 43 training samples, and these parameters are then used to classify the 18 samples in the test set. We repeated this random partition of the original data multiple times.</p>
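            <p>The per-class partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name <it>stratified_split</it>, the fixed seed, and the rule that the held-out part of each class contains the floor of one third of its samples are all assumptions. The class sizes are those of Table 1.</p>
            <p><![CDATA[
```python
import random
from collections import defaultdict


def stratified_split(labels, n_parts=3, seed=0):
    """Split sample indices into (train, test), holding out one part per class.

    Assumption: the held-out part of each class has floor(class_size / n_parts)
    samples; the text only says "roughly three parts".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                # random order within the class
        n_test = len(members) // n_parts    # one of the (roughly) three parts
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Class sizes taken from Table 1 (61 samples in total).
class_sizes = {
    "Breast": 9, "Central nervous system": 5, "Colon": 7, "Leukaemia": 8,
    "Melanoma": 8, "Non-small-cell lung carcinoma": 9, "Ovarian": 6, "Renal": 9,
}
labels = [name for name, size in class_sizes.items() for _ in range(size)]
train, test = stratified_split(labels)
print(len(train), len(test))
```
]]></p>
            <p>Under the floor rule assumed here, the split contains 43 training and 18 test samples, the same totals as in Table 2. Changing the seed gives a different random partition of the same sizes.</p>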
            <tbl id="T2" hint_layout="single">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Tumor types and class sizes of the randomly partitioned training and test sets of the NCI 60 dataset</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Origin of cell lines</p>
                     </c>
                     <c ca="center">
                        <p>Training set (total 43)</p>
                     </c>
                     <c ca="center">
                        <p>Test set (total 18)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Central nervous system</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colon</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Melanoma</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-small-cell lung carcinoma</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ovarian</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Renal</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>As no additional test set is available for the NCI 60 data, we randomly divided each class of these 61 samples into roughly three parts and reserved one third of the samples as a test set.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Multiple tumor data</p>
            </st>
            <p>The multiple tumor dataset <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> consists of a large number of tumor samples spanning 14 different tumor types hybridized to Affymetrix chips. On the Affymetrix platform, each target gene is represented by 11-20 short oligo probes of approximately 25 base-pairs (bp). Our goal is to take advantage of the variability over different oligos for the same genes using our EWUSC algorithm. We pre-processed the raw multiple tumor data with the log scale robust multi-array analysis (RMA) measure <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> implemented in the BioConductor project. The RMA measure is a summary statistic for the expression levels over all the different oligos for the same gene. The standard error of the RMA measure is a variability estimate of the expression level over the different oligos representing the same target gene. In order to obtain the RMA measures and their associated standard errors on the multiple tumor data, the raw data (.cel files) are necessary. Because we have access to only a subset of the raw multiple tumor data, we used a subset of the original data in our study. The subset of multiple tumor data we used consists of 7,129 genes, 96 samples in the training set, and 27 samples in the test set. These samples span 11 different tumor types (Table <tblr tid="T3">3</tblr>). The smallest class size is four on the training set, and hence, four-fold cross-validation (m = 4) is used on this data.</p>
            <tbl id="T3" hint_layout="single">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Tumor types and class sizes for the training set and test set of the subset of multiple tumor data used in this study</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Tumor type</p>
                     </c>
                     <c ca="center">
                        <p>Training set (total 96)</p>
                     </c>
                     <c ca="center">
                        <p>Test set (total 27)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Breast</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lung</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Colorectal</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lymphoma</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Melanoma</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Uterus</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Leukemia</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Renal</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Pancreas</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Mesothelioma</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CNS</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Breast cancer data</p>
            </st>
            <p>The breast cancer data <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> consists of primary breast tumor samples hybridized to cDNA arrays containing approximately 25,000 genes. Two hybridizations were carried out for each sample using a dye-reversal technique. Hence, there are two repeated measurements for each gene and each sample. The p-values of log expression ratios are also available. These p-values are derived from the two repeated measurements and an error model based on extensive control experiments <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. A p-value close to 1 represents low confidence that an expression ratio is significantly different from 1, while a p-value close to 0 represents high confidence that an expression ratio is significantly different from 1. We converted these p-values into error estimates of log ratios, which are used in our EWUSC algorithm.</p>
            <p>The breast cancer dataset consists of approximately 25,000 genes, 78 samples in the training set, and 19 samples in the test set. van't Veer <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> divided these samples into the good and poor prognosis groups, which have greater than 5 and less than 5 years of survival time respectively. Hence, there are two classes in this dataset (see Table <tblr tid="T4">4</tblr>). We performed 10-fold cross-validation (m = 10) on the breast cancer data.</p>
            <tbl id="T4" hint_layout="single">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Prognosis groups and class sizes of the training set and test set of the breast cancer data</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Prognosis group</p>
                     </c>
                     <c ca="center">
                        <p>Training set (total 78)</p>
                     </c>
                     <c ca="center">
                        <p>Test set (total 19)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Good (> 5 years of survival time)</p>
                     </c>
                     <c ca="center">
                        <p>44</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Poor (&#8804;5 years of survival time)</p>
                     </c>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Synthetic data</p>
            </st>
            <p>We also created synthetic datasets to compare the performance of our algorithms. Our approach is to start with 'patterned genes' which have a different expression pattern in each class, and are therefore relevant in classifying unknown samples. The next step is to introduce noise (variation in both the class and non-class values) to these patterned genes in order to reflect 'real-life' data. Finally, 'non-patterned genes', which are irrelevant in classifying samples, are added to these synthetic datasets. Even with this simple synthetic data-generation approach, generating sensible synthetic data turned out to be a nontrivial task. There are two parameters that control the noise levels in the synthetic datasets, the biological noise level (&#945;) and the technical noise level (&#955;). The biological noise level (&#945;) controls the level of biological noise within each class (and hence, the signal-to-noise ratio) such that the classes are less separable with a higher &#945;. The technical noise level (&#955;) controls the noise level over repeated measurements such that a high &#955; indicates relatively noisy repeated measurements. The primary difficulty in generating synthetic data is setting the parameters of &#945; and &#955;, and the proportion of the patterned genes. As it is not obvious how to set these parameters to reflect 'real-life' data, we experimented with different parameter settings, such as different biological noise levels: low (&#945; = 0.1 with signal-to-noise ratio approximately 20), medium (&#945; = 1 with signal-to-noise ratio approximately 2), or high (&#945; = 2 with signal-to-noise ratio approximately 1); and low (&#955; = 1) or high (&#955; = 5 or 10) technical noise. We also experimented with different proportions of patterned genes, and concluded that this parameter does not have any significant impact on the results.</p>
            <p>Another issue in generating 'realistic' synthetic data involves the generation of non-patterned genes that are irrelevant in distinguishing the classes. We addressed this issue by random sampling with replacement from a real dataset (that is, the breast cancer dataset <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>). Specifically, for each non-patterned gene, we randomly sample a gene g from the breast cancer data, and then randomly sample from the experiments of gene g in the breast cancer data such that these non-patterned genes would not show any class-specific expression patterns but would show realistic variations in expression levels over all classes.</p>
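            <p>The resampling of non-patterned genes described above might look like the following sketch. It is illustrative only: the function name <it>sample_non_patterned_gene</it>, the toy matrix <it>toy_real_data</it> (standing in for the breast cancer data), and the choice to draw the experiments of gene g with replacement are assumptions, as the text does not spell out these details.</p>
            <p><![CDATA[
```python
import random


def sample_non_patterned_gene(real_data, n_samples, rng):
    """Draw one synthetic non-patterned gene from a real expression matrix.

    real_data: list of rows, one row of expression values per real gene.
    """
    source_row = rng.choice(real_data)   # randomly pick a real gene g
    # Assumption: the experiments of gene g are sampled with replacement,
    # without regard to class labels, so no class-specific pattern arises.
    return [rng.choice(source_row) for _ in range(n_samples)]


rng = random.Random(42)
# Hypothetical toy values, NOT taken from the breast cancer dataset.
toy_real_data = [
    [0.1, -0.3, 0.5, 0.0],
    [1.2, 0.9, 1.1, 1.4],
    [-0.7, -0.2, -0.5, -0.9],
]
# 680 non-patterned genes over 80 training samples, as in the synthetic design.
synthetic = [sample_non_patterned_gene(toy_real_data, 80, rng) for _ in range(680)]
print(len(synthetic), len(synthetic[0]))
```
]]></p>
            <p>Because every value of a synthetic gene comes from a single real gene, the spread of expression levels stays realistic while any association with the class labels is broken.</p>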
            <p>In particular, our synthetic training sets consist of 1,000 genes, 80 samples, and 4 classes such that there are 20 samples in each class. Our synthetic test sets consist of 1,000 genes and 40 samples with 10 samples in each class. We generated 64 patterned genes which have a different expression pattern in each class, for example, genes that are upregulated (or downregulated) in only m of the four classes, where m = 1, 2, 3. In addition, there are five duplicates of each of these 64 patterned genes such that there are a total of 320 patterned genes and (1,000 - 320 = 680) non-patterned genes. Ideally, the perfect classification algorithm would select only one of these five copies of the patterned genes. We also investigated the effect of the number of repeated measurements by generating synthetic datasets with 1, 4 or 20 repeated measurements. These synthetic datasets are available from our supplementary website <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Assessment criteria</p>
         </st>
         <sec>
            <st>
               <p>Prediction accuracy</p>
            </st>
            <p>As the class information for the test sets is available, we define prediction accuracy as the percentage of correct classifications on the test set. The class information on a test set is used only to evaluate the performance of classification and feature-selection algorithms, and is unknown to the algorithms.</p>
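            <p>As a small illustration of this definition, prediction accuracy can be computed as below. The function name <it>prediction_accuracy</it> and the toy labels are hypothetical and are not results from this study.</p>
            <p><![CDATA[
```python
def prediction_accuracy(true_labels, predicted_labels):
    """Percentage of test samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("the test set must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical toy labels for four test samples (one misclassified).
true_labels = ["good", "poor", "poor", "good"]
predicted_labels = ["good", "poor", "good", "good"]
accuracy = prediction_accuracy(true_labels, predicted_labels)
print(accuracy)  # 75.0
```
]]></p>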
         </sec>
         <sec>
            <st>
               <p>Number of relevant features</p>
            </st>
            <p>One of the goals of classification is to select a minimal set of relevant genes (or features) that can be used in future diagnosis or classification of tissue samples. We judge each method by the total number of relevant features required for optimal classification accuracy. A small set of relevant genes is desirable because it is more cost-effective in the development of diagnostic tools based on the results of expression analysis. For example, the cost of an RT-PCR test to classify patient samples is directly proportional to the number of genes which must be tested to make the diagnosis. As shown below, both the USC and EWUSC methods usually result in a significant reduction in the numbers of selected genes for classification. We feel this represents a major advance in classification algorithms.</p>
         </sec>
         <sec>
            <st>
               <p>Feature stability</p>
            </st>
            <p>Because relevant genes are derived from the training set and the choice of the training set is often arbitrary, a set of relevant genes that is insensitive to the training sets used would be desirable. Hence, we define feature stability as the level of agreement between the set of relevant genes chosen in each fold of the cross-validation data with the set of relevant genes chosen using the full training set. Specifically, for each fold of the cross-validation data and for each set of parameters (&#916; and &#961;<sub>0</sub>), we compute the Jaccard index <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> which measures the level of agreement between the set of relevant genes chosen in this fold and the set chosen using the full training set. The Jaccard index lies between 0 and 1. A high Jaccard index (close to 1) implies high level of agreement, and hence, high feature stability (a mathematical definition of the Jaccard index can be found in the section Details of algorithms, below). We define feature stability of one cross-validation run for a given set of parameters (&#916; and &#961;<sub>0</sub>) as the average Jaccard index over all m folds of cross-validation. In our experiments, we usually have five random runs of cross-validation; hence we adopt the average Jaccard index over these five random runs of cross-validation as our measure of overall feature stability for given parameters (&#916; and &#961;<sub>0</sub>).</p>
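            <p>The feature stability measure for one cross-validation run can be sketched as follows. The sketch assumes the standard set-based form of the Jaccard index (size of the intersection divided by size of the union); the function names and the toy gene sets are hypothetical.</p>
            <p><![CDATA[
```python
def jaccard_index(genes_a, genes_b):
    """Set-based Jaccard index: |A intersect B| / |A union B|, between 0 and 1."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty gene sets are treated as identical
    return len(a & b) / len(union)


def feature_stability(fold_gene_sets, full_gene_set):
    """Average Jaccard index between each fold's genes and the full-training-set genes."""
    scores = [jaccard_index(genes, full_gene_set) for genes in fold_gene_sets]
    return sum(scores) / len(scores)


# Hypothetical gene sets for a four-fold cross-validation run.
full_set = {"g1", "g2", "g3", "g4"}
fold_sets = [
    {"g1", "g2", "g3"},          # Jaccard 3/4
    {"g1", "g2", "g3", "g4"},    # Jaccard 4/4
    {"g2", "g3", "g4", "g5"},    # Jaccard 3/5
    {"g1", "g4"},                # Jaccard 2/4
]
stability = feature_stability(fold_sets, full_set)
print(round(stability, 4))  # 0.7125
```
]]></p>
            <p>Averaging this value over the five random cross-validation runs would give the overall feature stability for one setting of the parameters.</p>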
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results on the NCI 60 data</p>
         </st>
         <p>As variability estimates are not available on the NCI 60 data, we compared the prediction accuracy from USC and SC (Figure <figr fid="F1">1</figr>; and Figure S14 of <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). We showed that USC generally produces higher prediction accuracy than SC using the same number of relevant genes (Figure <figr fid="F1">1</figr>). In particular, USC requires 44% of the available genes (2,315 out of 5,244 genes) to achieve a prediction accuracy of 72%, whereas SC requires 76% of genes (3,998 out of 5,244 genes) to achieve the same prediction accuracy. Our results show that the removal of highly correlated genes reduces the number of selected features while achieving comparable error rates.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Comparison of prediction accuracy of USC and SC on the NCI 60 data</p>
            </caption>
            <text>
               <p>Comparison of prediction accuracy of USC and SC on the NCI 60 data. The percentage of prediction accuracy is plotted against the number of relevant genes using the USC algorithm at &#961;<sub>0 </sub>= 0.6 and the SC algorithm (USC at &#961;<sub>0 </sub>= 1.0). The horizontal axis is shown on a log scale. Because no independent test set is available for this data, we randomly divided the samples in each class into roughly three parts multiple times, such that a third of the samples are reserved as a test set. Thus the training set consists of 43 samples and the test set of 18 samples. The graph represents typical results over these multiple random runs.</p>
            </text>
            <graphic file="gb-2003-4-12-r83-1"/>
         </fig>
         <p>Like Dudoit <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> we observed high error rates on this dataset (around 40-60% using 10-200 relevant genes). USC produces comparable error rates to the results reported in Dudoit <it>et al</it>. <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> using roughly the same number of relevant genes. However, our USC algorithm allows the optimal parameters (which indirectly control the number of selected genes) to be determined. In this case, the optimal parameters produce an error rate of approximately 28% on the cross-validation data. We repeated the random partition of the full dataset with 61 samples into a training set with 43 samples and a test set with 18 samples multiple times, and obtained similar results on different random partitions of the original dataset.</p>
      </sec>
      <sec>
         <st>
            <p>Results on the multiple tumor data</p>
         </st>
         <p>Figure <figr fid="F2">2</figr> shows the results of applying EWUSC to the training set, four-fold cross-validation data, and test set of the multiple tumor data over a range of shrinkage thresholds (&#916;) and correlation thresholds (&#961;<sub>0</sub>). In Figure <figr fid="F2">2a,c</figr> the percentage of classification errors is plotted against &#916; on the training and test sets respectively. In Figure <figr fid="F2">2b</figr>, the average percentage of errors is plotted against &#916; over five random runs of cross-validation. The optimal parameters (&#916; and &#961;<sub>0</sub>) are determined from the cross-validation results. Figure <figr fid="F2">2a-c</figr> shows that prediction accuracy is higher (lower percentage of errors) when &#961;<sub>0 </sub>&lt; 1 over most values of &#916; (especially 2 &#8804; &#916; &#8804; 7) on the training set, cross-validation data and test set. This shows that removing highly correlated genes increases prediction accuracy. In addition, Figure <figr fid="F2">2d</figr> shows that the number of relevant genes is drastically reduced when genes with correlation above the threshold &#961;<sub>0 </sub>= 0.9 are removed. From Figure <figr fid="F2">2b</figr>, the average cross-validation error rate gradually decreases as the correlation threshold &#961;<sub>0 </sub>is decreased from 1 to 0.9 to 0.8, but the average error rate increases when &#961;<sub>0 </sub>&lt; 0.8. (This observation also holds for &#961;<sub>0 </sub>&lt; 0.6; these results are omitted from Figure <figr fid="F2">2</figr> for clarity.) Therefore, the optimal &#961;<sub>0 </sub>is estimated to be 0.8.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Prediction accuracy on the multiple tumor data using the EWUSC algorithm over the range of &#916; from 0 to 20</p>
            </caption>
            <text>
               <p>Prediction accuracy on the multiple tumor data using the EWUSC algorithm over the range of &#916; from 0 to 20. The percentage of classification errors is plotted against &#916; on <b>(a) </b>the full training set (96 samples) and <b>(c) </b>the test set (27 samples). In <b>(b) </b>the average percentage of errors is plotted against &#916; on the cross-validation data over five random runs of fourfold cross-validation. In <b>(d)</b>, the number of relevant genes is plotted against &#916;. Different colors are used to specify different correlation thresholds (&#961;<sub>0 </sub>= 0.6, 0.7, 0.8, 0.9 or 1). Results of &#961;<sub>0 </sub>&lt; 0.6 are shown in Figure S1 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Optimal parameters are inferred from the cross-validation data in (b).</p>
            </text>
            <graphic file="gb-2003-4-12-r83-2"/>
         </fig>
         <p>EWUSC produces the minimum average number of cross-validation errors at &#916; = 0 and &#961;<sub>0 </sub>= 0.9 using 1,626 relevant genes, which achieves 78% prediction accuracy. However, &#916; = 0 is an unsatisfactory shrinkage threshold because we would prefer relevant genes to have class centroids significantly different from their overall centroids. Moreover, the average error rate starts to increase almost linearly when &#916; is greater than 6 on the cross-validation data. This 'bend' is more obvious in Figure S1(e) on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, which shows the error rate for each of the five random runs of fourfold cross-validation for &#916; = 0 to 14. The optimal &#916; is estimated to be 5.6. When &#916; = 5.6 and &#961;<sub>0 </sub>= 0.8, the prediction accuracy is 93% and the number of relevant genes is 680 (out of a total of 7,129 genes).</p>
         <p>We also applied the USC and SC algorithms to the multiple tumor data and obtained similar results, except that the error rates are generally higher. Similarly, USC produces the minimum average number of cross-validation errors at &#916; = 0 and &#961;<sub>0 </sub>= 0.9 using 1,634 relevant genes, which achieves 74% prediction accuracy. SC produces the minimum average number of cross-validation errors at &#916; = 0.4 using all 7,129 genes. On the other hand, the optimal parameters (&#916;, &#961;<sub>0</sub>) can be estimated by visual observation of 'bends' in the cross-validation curves. In particular, when &#916; = 5.6 and &#961;<sub>0 </sub>= 0.8, the prediction accuracy is 85% and the number of relevant genes is 735 using the USC algorithm (see Figure S2 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> for detailed results).</p>
         <p>We also compared feature stability of the EWUSC and USC algorithms at correlation threshold (&#961;<sub>0</sub>) = 0.8 with the SC algorithm <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> (which is equivalent to USC at &#961;<sub>0 </sub>= 1) over different numbers of relevant genes (Figure <figr fid="F3">3</figr>), and showed that EWUSC produces higher feature stability (higher average Jaccard index) than the USC and SC algorithms. The relatively high feature stability is due to relatively high numbers of common features selected in different runs of cross-validation (see Figure S5 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). We also showed that EWUSC almost always selects relatively more stable sets of relevant genes than USC (even over other correlation thresholds that are not shown). Hence, our results demonstrate that incorporating variability estimates over repeated measurements yields higher feature stability.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Comparison of feature stability of EWUSC, USC and SC on the multiple tumor data</p>
            </caption>
            <text>
               <p>Comparison of feature stability of EWUSC, USC and SC on the multiple tumor data. The average Jaccard index is plotted against the number of relevant genes over five random runs of fourfold cross-validation using EWUSC and USC at &#961;<sub>0 </sub>= 0.8 and SC. A high average Jaccard index indicates high feature stability. The EWUSC algorithm selects the most stable features. Note that the horizontal axis is shown on a log scale.</p>
            </text>
            <graphic file="gb-2003-4-12-r83-3"/>
         </fig>
         <sec>
            <st>
               <p>Comparison with published results</p>
            </st>
            <p>Ramaswamy <it>et al</it>. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> reported 78% classification accuracy on the multiple tumor data using SVMs combined with the one-versus-all approach. In contrast, our EWUSC algorithm achieves a classification accuracy of 93% on the test set of the multiple tumor data. As we used a subset of the original multiple tumor data and pre-processed the raw data using the RMA measures <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, we evaluated the performance of SVM combined with the one-versus-all method on the identical pre-processed subset of multiple tumor data used in our experiments with the EWUSC and the USC algorithms. In our comparison study, we used the signal-to-noise (S2N) measures <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> to select relevant features for each binary SVM classifier. To produce directly comparable results, we used exactly the same five splits of the training set into cross-validation data.</p>
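            <p>For readers unfamiliar with the S2N measure, the sketch below scores each gene for one class against all other classes, as a one-versus-all feature selector would. It assumes the commonly used definition, the difference of the two group means divided by the sum of the two group standard deviations. The use of sample standard deviations, the ranking by absolute score, the function names and the toy expression values are all assumptions made for illustration.</p>
            <p><![CDATA[
```python
from statistics import mean, stdev


def signal_to_noise(values, labels, target_class):
    """S2N score of one gene for target_class versus all other classes."""
    inside = [v for v, lab in zip(values, labels) if lab == target_class]
    outside = [v for v, lab in zip(values, labels) if lab != target_class]
    # Assumption: sample standard deviations (n - 1 denominator) are used.
    return (mean(inside) - mean(outside)) / (stdev(inside) + stdev(outside))


def top_genes(expression, labels, target_class, k):
    """Names of the k genes with the largest absolute S2N score (an assumption)."""
    scores = {gene: signal_to_noise(values, labels, target_class)
              for gene, values in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)[:k]


# Hypothetical toy data: three genes measured on six samples from three classes.
labels = ["A", "A", "B", "B", "C", "C"]
expression = {
    "gene1": [2.0, 4.0, 1.0, 3.0, 0.0, 2.0],
    "gene2": [9.0, 8.0, 1.0, 2.0, 1.5, 2.5],
    "gene3": [1.0, 2.0, 1.5, 2.5, 1.0, 2.0],
}
selected = top_genes(expression, labels, "A", k=2)
print(selected)  # ['gene2', 'gene1']
```
]]></p>
            <p>In a one-versus-all setting this selection would be repeated once per class, which is why the total number of distinct genes over all binary classifiers can be large.</p>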
            <p>Figure <figr fid="F4">4</figr> compares the prediction accuracy on the test set of the multiple tumor data using the EWUSC and USC algorithms at the estimated optimal correlation threshold (&#961;<sub>0 </sub>= 0.8), the SC algorithm <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and SVM (with S2N for feature selection). There are a few observations from Figure <figr fid="F4">4</figr>. First, USC produces higher prediction accuracy than SC using the same number of relevant genes. As SC is equivalent to USC at &#961;<sub>0 </sub>= 1, our results show that removing highly correlated genes reduces the number of relevant genes and improves prediction accuracy. Second, EWUSC generally produces higher prediction accuracy than USC using the same number of relevant genes, except when both the number of relevant genes and prediction accuracy are low. This shows that we can potentially improve prediction accuracy by taking advantage of error estimates in the data.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Comparison of prediction accuracy of EWUSC, USC, SVM and SC algorithms on the multiple tumor data</p>
               </caption>
               <text>
                  <p>Comparison of prediction accuracy of EWUSC, USC, SVM and SC algorithms on the multiple tumor data. The horizontal axis shows the total number of distinct genes selected over all binary SVM classifiers on a log scale. Some results are not available on the full range of the total number of genes. For example, the maximum numbers of selected genes for EWUSC and USC are roughly 1,000. The reported prediction accuracy is 78% <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> using all 16,000 available genes on the full data. The EWUSC algorithm achieves 89% prediction accuracy with only 89 genes. With 680 genes, EWUSC produces 93% prediction accuracy.</p>
               </text>
               <graphic file="gb-2003-4-12-r83-4"/>
            </fig>
            <p>Third, our SVM results (on a subset of the multiple tumor data pre-processed with RMA measures) are generally much better than the published result of 78% <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> (on the full dataset pre-processed with MAS 4). Fourth, SVM with S2N as the feature-selection method produces high prediction accuracy at the expense of using a large number of relevant genes. For example, SVM requires a total of 1,699 genes over all the binary classifiers to achieve 93% prediction accuracy, whereas our EWUSC algorithm requires only 680 relevant genes to achieve the same prediction accuracy. If we are willing to trade prediction accuracy against the number of relevant genes, EWUSC achieves 89% prediction accuracy with only 89 relevant genes.</p>
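            <p>The effect of the correlation threshold &#961;<sub>0 </sub>can be illustrated with a simplified greedy filter, sketched below in Python. This is a sketch under stated assumptions and not the full USC procedure: the function name, the use of a single score per gene to rank genes, the signed Pearson correlation and the strict comparison against the threshold are all choices made for illustration. With a strict comparison, a threshold of 1 removes no genes, which mirrors the equivalence of SC and USC at &#961;<sub>0 </sub>= 1.</p>

```python
import numpy as np


def drop_correlated(X, scores, rho0):
    """Greedy filter over genes ranked by decreasing absolute score.

    A gene is kept unless its Pearson correlation with an already kept
    gene exceeds rho0 (strict comparison, an assumption of this sketch).
    X: (samples, genes) expression matrix; scores: one score per gene.
    Returns the indices of the kept genes in ranked order.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # gene-by-gene correlation matrix
    order = np.argsort(-np.abs(np.asarray(scores, dtype=float)))
    kept = []
    for g in order:
        if not any(corr[g, k] > rho0 for k in kept):
            kept.append(int(g))
    return kept
```

            <p>Lowering the threshold removes more genes, so the number of relevant genes shrinks as &#961;<sub>0 </sub>decreases.</p>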
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results on the breast cancer data</p>
         </st>
         <p>We applied the EWUSC, USC and SC algorithms to the breast cancer data, and compared the prediction accuracy of EWUSC and USC at their optimal correlation thresholds (&#961;<sub>0 </sub>= 0.7 and 0.6, respectively) with that of the SC algorithm (USC at &#961;<sub>0 </sub>= 1). The results are shown in Figure <figr fid="F5">5</figr>. In general, EWUSC produces higher prediction accuracy than USC and SC when the number of relevant genes is less than 1,000 (which is the range of interest). In particular, EWUSC produces fewer classification errors on the test set at its optimal parameters (two errors at &#916; = 0.8 and &#961;<sub>0 </sub>= 0.7) than USC at its optimal parameters (four errors at &#916; = 1.15 and &#961;<sub>0 </sub>= 0.6).</p>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Comparison of prediction accuracy of EWUSC, USC and SC on the breast cancer data</p>
            </caption>
            <text>
               <p>Comparison of prediction accuracy of EWUSC, USC and SC on the breast cancer data. The percentage of prediction accuracy is plotted against the number of relevant genes using the EWUSC algorithm at &#961;<sub>0 </sub>= 0.7, the USC algorithm at &#961;<sub>0 </sub>= 0.6 and the SC algorithm (USC at &#961;<sub>0 </sub>= 1.0). Note that the horizontal axis is shown on a log scale.</p>
            </text>
            <graphic file="gb-2003-4-12-r83-5"/>
         </fig>
         <p>Moreover, EWUSC generally selects relevant genes with relatively small error bars (or low p-values). For example, there are two genes with p-values equal to 1 across all 78 samples in the training set. In other words, we have very low confidence that the expression ratios of these two genes are changed in any of the 78 samples of the training set. It is undesirable to classify new samples using genes that do not show any expression pattern. With EWUSC (which takes error estimates into consideration), these two genes are eliminated for all &#916; > 0. In contrast, one of these two genes is selected as a relevant gene by USC for &#916; = 0, 0.05, ..., 0.7 at &#961;<sub>0 </sub>= 1.</p>
         <p>The detailed results of applying the EWUSC and USC algorithms to the breast cancer data are shown in Figures S8 and S9 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Surprisingly, removing highly correlated genes neither produces any considerable improvement in prediction accuracy nor drastically reduces the number of relevant genes. This is probably because the numbers of classification errors on the cross-validation data are not well correlated with those on the test set (see <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). Because the test set is an additional independent dataset, there might be some heterogeneity between the training and test sets. Nevertheless, USC achieves prediction accuracy comparable to that of SC using relatively few selected genes (under 100 genes) over different correlation thresholds &#961;<sub>0</sub>.</p>
         <p>We compared the feature stability of EWUSC, USC and SC at their optimal correlation thresholds &#961;<sub>0 </sub>in Figure <figr fid="F6">6</figr>. EWUSC and SC produce relatively more stable relevant features than USC. The detailed comparison of feature stability in terms of the average numbers of true/false positives/negatives is shown in Figures S12 and S13 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The relatively high feature stability of SC is due to its relatively high true-positive rate (genes chosen both in random cross-validation and using the entire training set) and its relatively low false-negative rate (genes chosen using the entire training set but not in the cross-validation data). However, Figure S12 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> shows that this effect is pronounced at high numbers of relevant genes and is less significant at our optimal parameters, with approximately 100 to 300 relevant genes.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Comparison of feature stability of EWUSC, USC and SC on the breast cancer data</p>
            </caption>
            <text>
               <p>Comparison of feature stability of EWUSC, USC and SC on the breast cancer data. The average Jaccard index is plotted against the number of relevant genes over five random runs of 10-fold cross-validation using the EWUSC algorithm at &#961;<sub>0 </sub>= 0.7, the USC algorithm at &#961;<sub>0 </sub>= 0.6 and the SC algorithm (USC at &#961;<sub>0 </sub>= 1). The EWUSC algorithm produces relatively more stable features when the number of relevant genes is small.</p>
            </text>
            <graphic file="gb-2003-4-12-r83-6"/>
         </fig>
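         <p>The Jaccard index used as the stability measure in Figure <figr fid="F6">6</figr> is the size of the intersection of two gene sets divided by the size of their union, that is, TP / (TP + FP + FN) in terms of the true positives, false positives and false negatives described above. The Python sketch below shows the calculation; the function names and the simple averaging over cross-validation gene sets are assumptions made for illustration, and the gene identifiers in any call are hypothetical.</p>

```python
def jaccard(genes_a, genes_b):
    """Jaccard index of two gene collections: size of the intersection
    divided by size of the union (defined as 1.0 if both are empty)."""
    a, b = set(genes_a), set(genes_b)
    union = a.union(b)
    if not union:
        return 1.0
    return len(a.intersection(b)) / len(union)


def average_stability(full_training_genes, cv_gene_sets):
    """Mean Jaccard index between the genes selected on the entire
    training set and the genes selected in each cross-validation run."""
    values = [jaccard(full_training_genes, s) for s in cv_gene_sets]
    return sum(values) / len(values)
```

         <p>A value of 1 means that every cross-validation run selects exactly the genes chosen from the entire training set, and values near 0 mean that the selections barely overlap.</p>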
      </sec>
      <sec>
         <st>
            <p>Results on the synthetic data</p>
         </st>
         <p>We compared the performance of EWUSC, USC and SC on synthetic datasets with different numbers of repeated measurements and different biological and technical noise levels. As the biological noise levels of typical real microarray datasets are not known, we generated synthetic datasets with four repeated measurements at different biological noise levels (&#945; = 0.1, 1 or 2); some typical results are shown in Table <tblr tid="T5">5a</tblr>. Our complete results are shown in Tables S1, S2 and S3 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. In most cases, USC achieves prediction accuracy better than or comparable to that of SC (measured by the number of errors on the test set) using fewer relevant genes. There are a few exceptions to this observation (see <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). The optimal parameters (&#916;, &#961;<sub>0</sub>) are determined from the minimum average number of cross-validation errors. In some cases, the average numbers of cross-validation errors of two sets of parameters differ only slightly, and the set of parameters with the slightly higher average cross-validation error yields fewer relevant genes. These exceptions therefore arise because the optimal parameters are not obvious from the random cross-validation data. At low biological noise level (&#945;), the inference of optimal parameters is obvious and USC always yields fewer relevant genes than SC (see Table S2 on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). This observation demonstrates the power of removing highly correlated genes in the USC algorithm.</p>
         <p>Our results also showed that EWUSC consistently achieves the same prediction accuracy as USC using fewer relevant genes at low biological noise (&#945; = 0.1, with signal-to-noise ratio approximately 20) at different technical noise levels (Table <tblr tid="T5">5a</tblr>). However, as &#945; is increased, the performance of EWUSC relative to USC deteriorates. For example, at &#945; = 1 (with signal-to-noise ratio approximately 2), EWUSC selects more relevant genes than USC at the low technical noise level but fewer relevant genes than USC at the high technical noise level. The relative performance of EWUSC is even less favorable at high biological noise level (&#945; = 2, with signal-to-noise ratio roughly 1). The results in Table <tblr tid="T5">5a</tblr> suggest that EWUSC is the method of choice when the classes are relatively separable (at low biological noise and high signal-to-noise ratio), but USC would be the method of choice at high biological noise.</p>
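         <p>The selection of (&#916;, &#961;<sub>0</sub>) from the minimum average number of cross-validation errors, and the near-ties that make this choice ambiguous, can be sketched in Python as follows. The function names, the dictionary layout, the tie-break on gene count and the tolerance argument are hypothetical and serve only to illustrate the idea; any numbers passed to these functions in an example are invented and are not results from our experiments.</p>

```python
def select_parameters(grid):
    """Pick the (delta, rho0) pair with the fewest average CV errors.

    grid maps (delta, rho0) to (average CV errors, number of genes).
    Exact ties are broken in favour of fewer genes (an assumption).
    """
    return min(grid, key=lambda params: (grid[params][0], grid[params][1]))


def near_ties(grid, tolerance):
    """List every (delta, rho0) pair whose average CV error is within
    `tolerance` of the minimum, so near-optimal alternatives that use
    fewer genes can be inspected by hand."""
    best = min(errors for errors, _ in grid.values())
    return sorted(
        params
        for params, (errors, _) in grid.items()
        if best + tolerance >= errors
    )
```

         <p>When several parameter pairs fall within the tolerance, the pair with the minimum error need not be the one with the fewest relevant genes, which is the situation behind the exceptions noted above.</p>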
         <tbl id="T5" hint_layout="double">
            <title>
               <p>Table 5</p>
            </title>
            <caption>
               <p>Comparison of classification accuracy results from EWUSC, USC and SC on synthetic datasets at optimal parameters</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>&#945;</p>
                  </c>
                  <c ca="center">
                     <p>Number of measurements</p>
                  </c>
                  <c ca="center">
                     <p>&#955;</p>
                  </c>
                  <c ca="left">
                     <p>EWUSC</p>
                  </c>
                  <c ca="left">
                     <p>USC</p>
                  </c>
                  <c ca="left">
                     <p>SC</p>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><b>(a)</b> Different noise levels with four repeated measurements</p>
                  </c>
                  <c ca="center">
                     <p>0.1</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>10</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>24</p>
                  </c>
                  <c ca="left">
                     <p>72</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(18, 0.8)</p>
                  </c>
                  <c ca="left">
                     <p>(17, 0.7)</p>
                  </c>
                  <c ca="left">
                     <p>(17.5, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>0.1</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>8</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>16</p>
                  </c>
                  <c ca="left">
                     <p>22</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(12.5, 0.9)</p>
                  </c>
                  <c ca="left">
                     <p>(12.5, 0.9)</p>
                  </c>
                  <c ca="left">
                     <p>(12.5, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>144</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>119</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>124</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(2.8, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(3.1, 0.6)</p>
                  </c>
                  <c ca="left">
                     <p>(3.1, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>100%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>89</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>120</p>
                  </c>
                  <c ca="left">
                     <p>122</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1.9, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(2.6, 0.6)</p>
                  </c>
                  <c ca="left">
                     <p>(2.6, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>96.8%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>99.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>98.8%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>97.5%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>270</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>326</p>
                  </c>
                  <c ca="left">
                     <p>326</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1.1, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1, 0.4)</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>93.3%</p>
                  </c>
                  <c ca="left">
                     <p>98.8%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>99.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>92.5%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>97.5%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>97.5%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>186</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>159</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>159</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1, 0.7)</p>
                  </c>
                  <c ca="left">
                     <p>(1.5, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.5, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><b>(b)</b> Different numbers of repeated measurements at high biological noise levels</p>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>99.5%</p>
                  </c>
                  <c ca="left">
                     <p>99.5%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>285</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>304</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>96.5%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>95.5%</p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>92.5%</p>
                  </c>
                  <c ca="left">
                     <p>92.5%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>258</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>282</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>8</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>99.8%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>246</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>220</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>221</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1.3, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.4, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.4, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>8</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>98.3%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>99.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>99.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>97.5%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>171</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>242</p>
                  </c>
                  <c ca="left">
                     <p>245</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1, 0.4)</p>
                  </c>
                  <c ca="left">
                     <p>(1.3, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.3, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>20</p>
                  </c>
                  <c ca="center">
                     <p>Low</p>
                  </c>
                  <c ca="left">
                     <p>99.8%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>226</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>296</p>
                  </c>
                  <c ca="left">
                     <p>325</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(1.3, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 0.6)</p>
                  </c>
                  <c ca="left">
                     <p>(1.2, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>20</p>
                  </c>
                  <c ca="center">
                     <p>High</p>
                  </c>
                  <c ca="left">
                     <p>99.8%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>100.0%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Average % CV prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>100.0%</p>
                  </c>
                  <c ca="left">
                     <p>% prediction accuracy</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>221</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>252</p>
                  </c>
                  <c ca="left">
                     <p>252</p>
                  </c>
                  <c ca="left">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>(0.9, 0.6)</p>
                  </c>
                  <c ca="left">
                     <p>(1.3, 0.5)</p>
                  </c>
                  <c ca="left">
                     <p>(1.3, 1)</p>
                  </c>
                  <c ca="left">
                     <p>(&#916;, &#961;)</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Synthetic datasets were generated at different levels of biological noise (&#945;) and technical noise (&#955;). The average percentage of cross-validation (% CV) accuracy, the percentage of prediction accuracy on the test set, and the number of relevant genes at the optimal parameters (&#916;, &#961;<sub>0</sub>) are shown. For each synthetic dataset, the algorithm with the maximum average percentage of cross-validation accuracy, the maximum prediction accuracy, or the minimum number of relevant genes is shown in bold. <b>(a) </b>Typical classification accuracy results using synthetic datasets with four repeated measurements at different biological noise levels (&#945; = 0.1, 1 or 2) and different technical noise levels (&#955; = 1, 5 or 10). When the biological noise level is low (&#945; = 0.1), EWUSC consistently achieves the same prediction accuracy using fewer relevant genes at various technical noise levels. However, at the medium biological noise level (&#945; = 1), EWUSC typically outperforms USC and SC at high technical noise levels but not at low technical noise levels. When the biological noise level is high (&#945; = 2), EWUSC is often not the method of choice. <b>(b) </b>Typical classification accuracy results using synthetic datasets at the high biological noise level (&#945; = 2) with 1, 8 or 20 repeated measurements at different technical noise levels. When there are no repeated measurements (the number of repeated measurements = 1), there are no variability estimates over repeated measurements and hence EWUSC reduces to USC. The results with four repeated measurements at &#945; = 2 are shown in (a). Our results over multiple synthetic datasets showed that, at high biological noise (&#945; = 2), EWUSC outperforms USC only with a large number of repeated measurements (20). We also showed that USC typically outperforms SC by choosing a smaller number of relevant genes in most scenarios (over different biological and technical noise levels, and different numbers of repeated measurements).</p>
            </tblfn>
         </tbl>
         <p>In general, the performance of EWUSC improves as the number of repeated measurements increases. In particular, we studied the effect of the number of repeated measurements on the relative performance of EWUSC, USC and SC at high biological noise (&#945; = 2). The prediction accuracy results using 1, 8 or 20 repeated measurements are shown in Table <tblr tid="T5">5b</tblr>, and the results with four repeated measurements are shown in Table <tblr tid="T5">5a</tblr>. USC typically outperforms SC by selecting fewer relevant genes over different numbers of repeated measurements. In addition, we showed that EWUSC usually selects fewer relevant genes than USC at high biological noise when there are 20 repeated measurements. However, when the biological noise level is high (with a signal-to-noise ratio of approximately 1) and the number of repeated measurements is low (1, 4 or 8), USC usually selects fewer relevant genes than EWUSC.</p>
         <p>Table <tblr tid="T5">5a,b</tblr> shows that EWUSC produces lower prediction accuracy than USC at high biological noise when there are few repeated measurements. However, the levels of biological noise in real microarray datasets are not known. In practice, we recommend that users of our algorithms compare the average numbers of cross-validation errors and the numbers of relevant genes from the EWUSC and USC algorithms, and then select the algorithm that produces fewer average cross-validation errors using fewer relevant genes. In most cases, the prediction accuracy on the test set shows the same trend as the average number of cross-validation errors.</p>
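         <p>The comparison recommended above can be sketched in a few lines of Python. This is a minimal illustration and not part of the published software: the names CvSummary and choose_algorithm are hypothetical, the numbers are made up, and the rule used when the two criteria disagree (prefer fewer cross-validation errors, then fewer genes) is an assumption of the sketch.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvSummary:
    """Cross-validation summary for one algorithm at its chosen parameters."""
    algorithm: str
    mean_cv_errors: float   # average number of cross-validation errors
    n_relevant_genes: int   # number of relevant genes selected


def choose_algorithm(summaries):
    """Return the summary with the fewest CV errors, breaking ties by gene count.

    The lexicographic ordering is an assumption of this sketch; the text does
    not say how to trade errors against gene count when the two disagree.
    """
    if not summaries:
        raise ValueError("need at least one summary")
    return min(summaries, key=lambda s: (s.mean_cv_errors, s.n_relevant_genes))


# Illustrative numbers only, not taken from any table in this article.
candidates = [
    CvSummary("EWUSC", mean_cv_errors=2.0, n_relevant_genes=310),
    CvSummary("USC", mean_cv_errors=2.0, n_relevant_genes=250),
]
best = choose_algorithm(candidates)
print(best.algorithm, best.mean_cv_errors, best.n_relevant_genes)
```

         <p>With equal cross-validation errors, the sketch falls back on the number of relevant genes. A user who values a compact gene list more than a small gain in accuracy could reorder the key tuple accordingly.</p>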
         <p>It is interesting that prediction accuracy is not necessarily reduced and the number of relevant genes is not necessarily increased at higher technical noise levels. However, prediction accuracy is generally reduced and the number of relevant genes is typically increased at higher biological noise levels (see Additional data files Tables S1, S2 and S3 at <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). All three algorithms (EWUSC, USC and SC) produce comparable feature stability at different noise levels when the number of relevant genes is below 300 (see Figures S20, S21 at <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>).</p>
      </sec>
      <sec>
         <st>
            <p>Summary of results on real data</p>
         </st>
         <p>Table <tblr tid="T6">6</tblr> summarizes our prediction accuracy results using the EWUSC, USC and SC algorithms on the NCI 60 data, multiple tumor data and breast cancer data at optimal parameters. In general, we showed that using variability over repeated measurements to down-weight noisy genes/experiments and the removal of highly correlated genes usually reduce the number of relevant genes necessary for accurate class predictions. In addition, using variability of repeated measurements to down-weight noisy genes/experiments generally increases feature stability. Hence, our EWUSC and USC algorithms represent advances over the published SC algorithm <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
         <tbl id="T6" hint_layout="double">
            <title>
               <p>Table 6</p>
            </title>
            <caption>
               <p>Summary of prediction accuracy results</p>
            </caption>
            <tblbdy cols="6">
               <r>
                  <c ca="left">
                     <p>Data</p>
                  </c>
                  <c ca="left">
                     <p>Parameters</p>
                  </c>
                  <c ca="left">
                     <p>EWUSC</p>
                  </c>
                  <c ca="left">
                     <p>USC</p>
                  </c>
                  <c ca="left">
                     <p>SC</p>
                  </c>
                  <c ca="left">
                     <p>Published results</p>
                  </c>
               </r>
               <r>
                  <c cspan="6">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>NCI 60 data*</p>
                  </c>
                  <c ca="left">
                     <p>&#961;<sub>0</sub></p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>0.6</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>&#916;</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Number of relevant genes</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>2315</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>3998</p>
                  </c>
                  <c ca="left">
                     <p>200</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Prediction accuracy</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>72%</p>
                  </c>
                  <c ca="left">
                     <p>72%</p>
                  </c>
                  <c ca="left">
                     <p>~40-60% <abbrgrp><abbr bid="B23">23</abbr></abbrgrp></p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Multiple tumor data (estimated optimal parameters)<sup>&#8224;</sup></p>
                  </c>
                  <c ca="left">
                     <p>&#961;<sub>0</sub></p>
                  </c>
                  <c ca="left">
                     <p>0.8</p>
                  </c>
                  <c ca="left">
                     <p>0.8</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>&#916;</p>
                  </c>
                  <c ca="left">
                     <p>5.6</p>
                  </c>
                  <c ca="left">
                     <p>5.6</p>
                  </c>
                  <c ca="left">
                     <p>8.8</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Number of relevant genes</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>680</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>735</p>
                  </c>
                  <c ca="left">
                     <p>3902</p>
                  </c>
                  <c ca="left">
                     <p>All genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Prediction accuracy</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>93%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>85%</p>
                  </c>
                  <c ca="left">
                     <p>78%</p>
                  </c>
                  <c ca="left">
                     <p>78% <abbrgrp><abbr bid="B10">10</abbr></abbrgrp></p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Multiple tumor data (global optimal parameters)<sup>&#8225;</sup></p>
                  </c>
                  <c ca="left">
                     <p>&#961;<sub>0</sub></p>
                  </c>
                  <c ca="left">
                     <p>0.9</p>
                  </c>
                  <c ca="left">
                     <p>0.9</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>&#916;</p>
                  </c>
                  <c ca="left">
                     <p>0</p>
                  </c>
                  <c ca="left">
                     <p>0</p>
                  </c>
                  <c ca="left">
                     <p>0.4</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Number of relevant genes</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>1626</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>1634</p>
                  </c>
                  <c ca="left">
                     <p>7129</p>
                  </c>
                  <c ca="left">
                     <p>All genes</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Prediction accuracy</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>78%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>74%</p>
                  </c>
                  <c ca="left">
                     <p>74%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>78% </b>
                        <abbrgrp>
                           <abbr bid="B10">10</abbr>
                        </abbrgrp>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Breast cancer data</p>
                  </c>
                  <c ca="left">
                     <p>&#961;<sub>0</sub></p>
                  </c>
                  <c ca="left">
                     <p>0.7</p>
                  </c>
                  <c ca="left">
                     <p>0.6</p>
                  </c>
                  <c ca="left">
                     <p>1.0</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>&#916;</p>
                  </c>
                  <c ca="left">
                     <p>0.80</p>
                  </c>
                  <c ca="left">
                     <p>1.15</p>
                  </c>
                  <c ca="left">
                     <p>1.1</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Number of relevant genes</p>
                  </c>
                  <c ca="left">
                     <p>271</p>
                  </c>
                  <c ca="left">
                     <p>82</p>
                  </c>
                  <c ca="left">
                     <p>187</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>70</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>Prediction accuracy</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>89%</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>79%</p>
                  </c>
                  <c ca="left">
                     <p>84%</p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>89% </b>
                        <abbrgrp>
                           <abbr bid="B14">14</abbr>
                        </abbrgrp>
                     </p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>The optimal parameters (&#961;<sub>0 </sub>and &#916;), number of relevant genes chosen, and prediction accuracy for the NCI 60 data, multiple tumor data and breast cancer data are summarized here. Both EWUSC (error-weighted, uncorrelated shrunken centroid) and USC (uncorrelated shrunken centroid) were motivated by SC (shrunken centroid) <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Both EWUSC and USC take advantage of interdependence between genes by removing highly correlated relevant genes. EWUSC makes use of error estimates or variability over repeated measurements. SC <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> is equivalent to USC at &#961;<sub>0 </sub>= 1. The optimal parameters (&#916;, &#961;<sub>0</sub>) for EWUSC are estimated from the cross-validation results of EWUSC, while the optimal parameters (&#916;, &#961;<sub>0</sub>) for USC are independently estimated from the cross-validation results of USC. Entries with the minimum number of selected genes or highest prediction accuracy across all methods are highlighted in boldface type. *Since no repeated measurements or error estimates are available, EWUSC is not applicable to the NCI 60 data. In addition, because there is no separate test set available for the NCI 60 data, typical results of random partitions of the original 61 samples into training and test sets are shown. <sup>&#8224;</sup>The prediction accuracy and number of relevant genes are produced using optimal parameters (&#916;, &#961;<sub>0</sub>) estimated by visual observation of 'bends' in the random cross-validation curves. <sup>&#8225;</sup>The prediction accuracy and number of relevant genes are produced using global optimal parameters, that is, the pair (&#916;, &#961;<sub>0</sub>) that produces the minimum average number of cross-validation errors over all &#916; and all &#961;<sub>0</sub>.</p>
            </tblfn>
         </tbl>
         <p>On the NCI 60 data, USC generally produces higher prediction accuracy than SC using the same number of relevant genes. This result shows that the removal of highly correlated genes reduces the number of genes necessary for class prediction while achieving comparable or higher prediction accuracy.</p>
         <p>On the multiple tumor data, EWUSC produces higher prediction accuracy and selects fewer relevant genes than all other approaches. In particular, EWUSC achieves 93% prediction accuracy using fewer than 10% of the genes, compared to 78% prediction accuracy using all the available genes in the published results <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Moreover, each of the binary SVM classifiers chooses a different subset of relevant genes, whereas our EWUSC algorithm uses only one set of relevant genes for all classes.</p>
         <p>van't Veer <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> reported two classification errors (out of a total of 19 samples) on the test set of the breast cancer data using 70 relevant genes. Our EWUSC algorithm produces the same number of errors on the test set with 271 relevant genes. However, our EWUSC algorithm has the following advantages over the prognostic classifier used in <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>: no <it>ad hoc </it>filtering step is necessary; the EWUSC algorithm automatically avoids choosing noisy genes; and the EWUSC algorithm can be applied to data with any number of classes. This is in contrast to the prognostic classifier, which is not applicable to the multiple tumor data (which consists of 11 classes) or the NCI 60 data (which consists of 8 classes).</p>
      </sec>
      <sec>
         <st>
            <p>Comparison of USC, EWUSC and SC algorithms</p>
         </st>
         <p>The key characteristics of EWUSC, USC and SC are summarized in Table <tblr tid="T7">7</tblr>. We illustrated the EWUSC and USC algorithms on both real and synthetic datasets. Our results on real data are summarized in Table <tblr tid="T6">6</tblr>. We compared the performance of USC with SC, and showed that USC typically achieves comparable prediction accuracy using a smaller set of relevant genes on both real and synthetic datasets. We showed that the step of removing highly correlated genes in USC is effective in reducing the number of relevant genes without sacrificing prediction accuracy, and hence, USC is an improvement over SC.</p>
         <tbl id="T7" hint_layout="single">
            <title>
               <p>Table 7</p>
            </title>
            <caption>
               <p>Summary of EWUSC, USC and SC</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="left">
                     <p>Desirable features</p>
                  </c>
                  <c ca="center">
                     <p>EWUSC</p>
                  </c>
                  <c ca="center">
                     <p>USC</p>
                  </c>
                  <c ca="center">
                     <p>SC</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Make use of variability over repeated measurements</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Applicable to data with any number of classes</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Exploit dependence relationships between genes</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Integrated approach for both feature selection and classification</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>No assumption on data distributions</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
                  <c ca="center">
                     <p>+</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>We also compared the performance of EWUSC (which down-weights noisy genes and noisy experiments) with USC on both real and synthetic datasets. On real microarray datasets (multiple tumor data and breast cancer data), we showed that EWUSC usually achieves higher or comparable feature stability using a smaller set of relevant genes, and that EWUSC avoids choosing noisy relevant genes for classification of samples. Hence, we showed that using variability over repeated measurements improves classification and feature-selection results. Moreover, we compared EWUSC with other existing classification and feature-selection algorithms, and showed that EWUSC produces results that are better than, or at least comparable to, previously reported results on real datasets (see Table <tblr tid="T6">6</tblr>). On the other hand, our results on synthetic datasets showed that EWUSC is usually the method of choice when the classes are well separated (that is, when biological noise is low or signal-to-noise ratio is high).</p>
         <p>Our main contribution is that we use cross-validation to select a correlation threshold (&#961;<sub>0</sub>) for the removal of highly correlated genes. This idea is adopted in both USC and EWUSC, which in turn take advantage of the interdependence of genes without sacrificing prediction accuracy. Our second major contribution is that we adopted the error-weighted method in our integrated feature-selection and classification algorithm, EWUSC. To the best of our knowledge, EWUSC is the only classification algorithm applicable to microarray data with any number of classes that takes advantage of variability in repeated measurements.</p>
         <p>There are many directions for future work. The error-weighted idea can be applied to other distance-based classification algorithms, for example, the k-nearest neighbour, which was shown to achieve high prediction accuracy <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Our next step is to compare the performance of the EWUSC and USC algorithms with a wide range of other classification and feature selection algorithms. One problem in the literature is that researchers often use different pre-processed subsets of published array data, which makes direct comparisons of published results difficult. Therefore, there is a need to conduct a large-scale evaluation study of various classification and feature selection algorithms on microarray data.</p>
      </sec>
      <sec>
         <st>
            <p>Details of algorithms</p>
         </st>
         <sec>
            <st>
               <p>The SC algorithm of Tibshirani <it>et al. </it><abbrgrp><abbr bid="B17">17</abbr></abbrgrp></p>
            </st>
            <p>Let x<sub>ij </sub>be the expression level of gene i in sample j, where i = 1, 2, ..., p and j = 1, 2, ..., n. Suppose there are a total of K classes, and let C<sub>k </sub>be the set of all n<sub>k </sub>samples in class k. The overall centroid of gene i is, </p>
            <p><graphic file="gb-2003-4-12-r83-i1.gif"/>,</p>
            <p>and the class centroid of class k and gene i is,</p>
            <p><graphic file="gb-2003-4-12-r83-i2.gif"/>.</p>
            <p>The relative difference, d<sub>ik</sub>, is the difference in class centroid (<graphic file="gb-2003-4-12-r83-i3.gif"/>) and overall centroid (<graphic file="gb-2003-4-12-r83-i4.gif"/>), standardized by the within-class standard deviation of gene i (s<sub>i</sub>); that is,</p>
            <p><graphic file="gb-2003-4-12-r83-i5.gif"/>,</p>
            <p>where</p>
            <p><graphic file="gb-2003-4-12-r83-i6.gif"/>,<graphic file="gb-2003-4-12-r83-i7.gif"/>,</p>
            <p>and s<sub>0 </sub>is the median value of the s<sub>i</sub>s over all genes i. The relative difference d<sub>ik </sub>is similar to a <it>t</it>-statistic, comparing the class centroid to the overall centroid. The shrunken relative difference d'<sub>ik </sub>is obtained by shrinking d<sub>ik </sub>toward zero by an amount &#916; if |d<sub>ik</sub>| > &#916;; otherwise, d'<sub>ik </sub>is set to zero; that is,</p>
            <p><graphic file="gb-2003-4-12-r83-i8.gif"/>.</p>
            <p>Hence, d'<sub>ik </sub>gets rid of genes with class centroids not significantly different from the overall centroids. The amount of shrinkage &#916; is determined by m-fold cross-validation such that the number of cross-validation classification errors is minimized. Genes with at least one positive shrunken relative difference d'<sub>ik </sub>(over all classes k) are selected as relevant features. The shrunken class centroid (<graphic file="gb-2003-4-12-r83-i9.gif"/>) is defined as <graphic file="gb-2003-4-12-r83-i10.gif"/>. The discriminant score for a new sample x* and class k is defined as </p>
            <p><graphic file="gb-2003-4-12-r83-i11.gif"/>,</p>
            <p>where &#960;<sub>k </sub>= n<sub>k</sub>/n. The first term in the discriminant score represents the standardized squared distance of x* to the shrunken class centroid, and the second term represents a correction for the class prior probability. Sample x* is assigned to the class k with the minimum discriminant score.</p>
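            <p>The shrinkage step and the discriminant score described above can be sketched as follows. This is a minimal illustration rather than the published implementation: because the formulae are given only as images, the sketch assumes the soft-thresholding form d'<sub>ik </sub>= sign(d<sub>ik</sub>)(|d<sub>ik</sub>| - &#916;)<sub>+ </sub>and a score equal to the sum over genes of the squared distance to the shrunken class centroid divided by (s<sub>i </sub>+ s<sub>0</sub>)<sup>2</sup>, minus 2 log &#960;<sub>k</sub>. All function names and the data layout are hypothetical.</p>

```python
import math


def soft_threshold(d, delta):
    """Shrink d toward zero by delta; return 0.0 if abs(d) does not exceed delta."""
    magnitude = abs(d) - delta
    if magnitude > 0:
        return math.copysign(magnitude, d)
    return 0.0


def discriminant_score(x_new, shrunken_centroid, s, s0, prior):
    """Standardized squared distance to a shrunken class centroid, minus 2 log prior.

    x_new, shrunken_centroid and s are per-gene lists of equal length
    (an assumed layout); s0 is the median of s; prior is n_k / n.
    """
    distance = sum(
        (x - c) ** 2 / (si + s0) ** 2
        for x, c, si in zip(x_new, shrunken_centroid, s)
    )
    return distance - 2.0 * math.log(prior)


def predict(x_new, centroids, s, s0, priors):
    """Return the class label with the minimum discriminant score.

    centroids and priors are dicts keyed by class label (assumed layout).
    """
    return min(
        centroids,
        key=lambda k: discriminant_score(x_new, centroids[k], s, s0, priors[k]),
    )
```

            <p>In this sketch a larger &#916; drives more relative differences to exactly zero, which is how the amount of shrinkage controls the number of genes retained.</p>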
         </sec>
         <sec>
            <st>
               <p>Our EWUSC algorithm</p>
            </st>
            <sec>
               <st>
                  <p>Mathematical definitions</p>
               </st>
               <p>The EWUSC algorithm is a modification of the SC algorithm with two key differences: noisy measurements are down-weighted and redundant genes (features) are removed. Let &#963;<sub>ij </sub>be the variability estimate of gene i and sample j over repeated measurements, where i = 1, 2, ..., p and j = 1, 2, ..., n. The weighted overall centroid for gene i is defined as </p>
               <p><graphic file="gb-2003-4-12-r83-i12.gif"/>,</p>
               <p> and the weighted class centroid for gene i and class k is </p>
               <p><graphic file="gb-2003-4-12-r83-i13.gif"/>.</p>
               <p>Noisy measurements with a large variability estimate &#963;<sub>ij </sub>are down-weighted in the weighted overall and class centroids. The weighted relative difference is similarly defined as </p>
               <p><graphic file="gb-2003-4-12-r83-i14.gif"/>,</p>
               <p>where the weighted within-class standard deviation, </p>
               <p><graphic file="gb-2003-4-12-r83-i15.gif"/>,</p>
               <p>average variability estimate for class k,</p>
               <p><graphic file="gb-2003-4-12-r83-i16.gif"/>,</p>
               <p>the scaling factor </p>
               <p><graphic file="gb-2003-4-12-r83-i17.gif"/>,</p>
               <p><graphic file="gb-2003-4-12-r83-i18.gif"/> is the median of all <graphic file="gb-2003-4-12-r83-i19.gif"/>s over all genes i, and &#969;<sub>i </sub>is the median variability estimate for gene i across all n experiments. When the variability estimates are equal for all samples, that is, &#963;<sub>ij </sub>= &#963;<sub>i </sub>for j = 1, 2, ..., n, the above definitions for <graphic file="gb-2003-4-12-r83-i20.gif"/>, <graphic file="gb-2003-4-12-r83-i21.gif"/>, <graphic file="gb-2003-4-12-r83-i22.gif"/> and <graphic file="gb-2003-4-12-r83-i23.gif"/> can be simplified to the corresponding formulae from the SC algorithm. The intuition behind these error-weighted definitions is that noisy samples with large variability estimates &#963;<sub>ij </sub>are down-weighted. The median variability for gene i (&#969;<sub>i</sub>) in the denominator of the weighted relative difference (<graphic file="gb-2003-4-12-r83-i24.gif"/>) down-weights noisy genes such that genes with large variability over all samples are less likely to be selected as relevant genes. The definition of weighted shrunken relative difference <graphic file="gb-2003-4-12-r83-i25.gif"/> is very similar to that of d'<sub>ik</sub>; that is,</p>
               <p><graphic file="gb-2003-4-12-r83-i26.gif"/>,</p>
               <p>where the amount of shrinkage &#916; is determined by cross-validation. Similarly, the weighted shrunken centroid is defined as <graphic file="gb-2003-4-12-r83-i27.gif"/>, and the weighted discriminant score for a new sample x* with variability estimate &#963;<sub>i</sub>* and class k is</p>
               <p><graphic file="gb-2003-4-12-r83-i28.gif"/>.</p>
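               <p>The down-weighting of noisy measurements can be illustrated with a small sketch. Because the weighted centroid formulae are given only as images, the code assumes inverse-variance weights (each expression value weighted by 1/&#963;<sub>ij</sub><sup>2</sup>); this weighting and the function name are assumptions made for illustration, not a transcription of the definitions above.</p>

```python
def weighted_centroid(values, sigmas):
    """Inverse-variance weighted mean: a large sigma gives a small weight.

    values: expression levels of one gene over a set of samples.
    sigmas: the corresponding variability estimates (all positive).
    """
    weights = [1.0 / (sd * sd) for sd in sigmas]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)
```

               <p>Under this assumption, equal variability estimates give every sample the same weight, so the weighted centroid reduces to the ordinary mean, which is consistent with the simplification to the SC formulae noted above.</p>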
            </sec>
            <sec>
               <st>
                  <p>Error-weighted correlation</p>
               </st>
               <p>Hughes <it>et al</it>. <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> defined error-weighted correlation that weighs expression values with error estimates such that expression values with relatively high errors are down-weighted. Let &#963;<sub>ge </sub>be the error estimate of the expression level of gene g under experiment e, where g = 1, ..., p and e = 1, ..., n. The error-weighted correlation between a pair of genes i and j is defined as</p>
               <p>
                  <graphic file="gb-2003-4-12-r83-i29.gif"/>
               </p>
               <p>where</p>
               <p>
                  <graphic file="gb-2003-4-12-r83-i30.gif"/>
               </p>
               <p>is the weighted average expression level of gene i.</p>
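               <p>A minimal sketch of an error-weighted correlation is given below. It assumes that each deviation from the weighted average is divided by its error estimate and that the weighted average uses weights 1/&#963;<sup>2</sup>; since the formulae appear only as images, these details and the function names should be read as assumptions rather than as the exact definition of Hughes <it>et al</it>.</p>

```python
import math


def weighted_mean(x, sigma):
    """Average of x with weights 1 / sigma^2 (assumed weighting)."""
    weights = [1.0 / (s * s) for s in sigma]
    return sum(w * v for w, v in zip(weights, x)) / sum(weights)


def error_weighted_correlation(x_i, sigma_i, x_j, sigma_j):
    """Correlation of two genes after scaling each deviation by its error."""
    mean_i = weighted_mean(x_i, sigma_i)
    mean_j = weighted_mean(x_j, sigma_j)
    z_i = [(v - mean_i) / s for v, s in zip(x_i, sigma_i)]
    z_j = [(v - mean_j) / s for v, s in zip(x_j, sigma_j)]
    numerator = sum(a * b for a, b in zip(z_i, z_j))
    denominator = math.sqrt(sum(a * a for a in z_i) * sum(b * b for b in z_j))
    return numerator / denominator
```

               <p>With all error estimates equal, this sketch gives the same value as the ordinary Pearson correlation.</p>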
            </sec>
            <sec>
               <st>
                  <p>Algorithm outline for EWUSC</p>
               </st>
               <p><b>Inputs to the algorithm: </b>training set (with known classes) and test set</p>
               <p>For each gene i and each class k,</p>
               <p>&#160;&#160;Compute <graphic file="gb-2003-4-12-r83-i20.gif"/>, <graphic file="gb-2003-4-12-r83-i21.gif"/>, <graphic file="gb-2003-4-12-r83-i22.gif"/> and <graphic file="gb-2003-4-12-r83-i24.gif"/> using the training set.</p>
               <p>For each &#916;,</p>
               <p>&#160;&#160;Compute <graphic file="gb-2003-4-12-r83-i25.gif"/> for each gene i and class k.</p>
               <p>&#160;&#160;For each gene i, denote the maximum shrunken relative difference over all K classes by <graphic file="gb-2003-4-12-r83-i31.gif"/>.</p>
               <p>&#160;&#160;Let S<sub>&#916; </sub>be the set of genes with at least one positive shrunken relative difference over all the K classes; that is, S<sub>&#916; </sub>= {g:&#946;<sub>g </sub>> 0}.</p>
               <p>&#160;&#160;Sort the genes g in S<sub>&#916; </sub>in descending order of &#946;<sub>g</sub>. Denote this sorted set by G = {g<sub>1</sub>, g<sub>2</sub>, ..., g<sub>t</sub>}.</p>
               <p>&#160;&#160;For &#961;<sub>0 </sub>= 1, 0.9, 0.8, ..., 0.1, 0,</p>
               <p>&#160;&#160;&#160;&#160;Consider all pairs of genes (g<sub>i</sub>, g<sub>j</sub>) in G such that i &lt; j (that is, &#946;<sub>gi </sub>> &#946;<sub>gj</sub>).</p>
               <p>&#160;&#160;&#160;&#160;Compute the error-weighted correlation &#961; between (g<sub>i</sub>, g<sub>j</sub>). If &#961; &#8805; &#961;<sub>0</sub>, remove gene g<sub>j </sub>from S<sub>&#916;</sub>.</p>
               <p>&#160;&#160;Let S(&#916;, &#961;<sub>0</sub>) be the set of genes left in S<sub>&#916; </sub>after removing the highly correlated genes.</p>
               <p>&#160;&#160;Apply the discriminant score to predict the classes of samples in the test set using the relevant genes in S(&#916;, &#961;<sub>0</sub>).</p>
               <p><b>Output of the algorithm: </b>a predicted class for each sample in the test set for each &#916; and each &#961;<sub>0</sub>.</p>
               <p>The above algorithm is applied to the m-fold cross-validation data to determine the optimal parameters &#916; and &#961;<sub>0 </sub>that minimize the number of classification errors on the training set. The optimal parameters are then used to predict the classes of the unknown samples in the test set.</p>
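               <p>The step that removes highly correlated genes can be sketched as a greedy filter. The sketch reads the outline as follows: genes are visited in descending order of &#946;<sub>g</sub>, and a gene is kept only if its correlation with every gene already kept is below &#961;<sub>0</sub>. Whether a gene that has itself been removed can still cause the removal of a lower-ranked gene is not stated in the outline, so this reading is an assumption; the function name and the correlation callable are hypothetical.</p>

```python
def remove_correlated_genes(sorted_genes, correlation, rho0):
    """Keep a gene only if its correlation with every kept gene is below rho0.

    sorted_genes: gene identifiers in descending order of beta_g.
    correlation: hypothetical callable returning the (error-weighted)
                 correlation between two gene identifiers.
    rho0: correlation threshold; pairs at or above it count as redundant.
    """
    kept = []
    for gene in sorted_genes:
        if all(rho0 > correlation(other, gene) for other in kept):
            kept.append(gene)
    return kept
```

               <p>In a cross-validation loop, this filter would be called once for each candidate (&#916;, &#961;<sub>0</sub>) pair, and the pair giving the fewest classification errors would be retained.</p>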
            </sec>
         </sec>
         <sec>
            <st>
               <p>The Jaccard index as a measure of feature stability</p>
            </st>
            <p>We define feature stability as the average level of agreement between the set of relevant genes chosen in a fold of the cross-validation data and the set of relevant genes chosen using the full training set over all m folds of the cross-validation data. Let <it>S</it>(&#916;, &#961;<sub>0</sub>) be the set of relevant genes chosen using the entire training set, and let S(m, &#916;, &#961;<sub>0</sub>) be the set of relevant genes chosen in the mth fold of the cross-validation data with parameters &#916; and &#961;<sub>0</sub>. We define the number of true positives (TP) as the number of relevant genes chosen in both <it>S</it>(&#916;, &#961;<sub>0</sub>) and S(m, &#916;, &#961;<sub>0</sub>). Similarly, we define the number of false positives (FP) as the number of relevant genes chosen in S(m, &#916;, &#961;<sub>0</sub>) but not in <it>S</it>(&#916;, &#961;<sub>0</sub>), and the number of false negatives (FN) as the number of relevant genes chosen in <it>S</it>(&#916;, &#961;<sub>0</sub>) but not in S(m, &#916;, &#961;<sub>0</sub>). The Jaccard index, J(m, &#916;, &#961;<sub>0</sub>), is defined as TP/(TP + FP + FN). Intuitively, the level of agreement is high when there are many true positives, and relatively few false positives and false negatives. Hence, a high Jaccard index indicates a high level of agreement. Feature stability is the average Jaccard index over all m folds; that is, J(&#916;, &#961;<sub>0</sub>) = average of J(m, &#916;, &#961;<sub>0</sub>) over all m folds.</p>
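            <p>The definition above translates directly into set operations. The following sketch uses hypothetical function names, and it returns 1 when both gene sets are empty, a case the definition leaves open.</p>

```python
def jaccard_index(full_set, fold_set):
    """TP / (TP + FP + FN) between the full-training-set genes and one fold."""
    full_set, fold_set = set(full_set), set(fold_set)
    tp = len(full_set.intersection(fold_set))   # chosen in both
    fp = len(fold_set - full_set)               # chosen in the fold only
    fn = len(full_set - fold_set)               # chosen in the full set only
    total = tp + fp + fn
    # Assumption: two empty sets are treated as being in perfect agreement.
    return tp / total if total else 1.0


def feature_stability(full_set, fold_sets):
    """Average Jaccard index over all folds of the cross-validation data."""
    return sum(jaccard_index(full_set, s) for s in fold_sets) / len(fold_sets)
```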
         </sec>
         <sec>
            <st>
               <p>Support vector machines (SVMs)</p>
            </st>
            <p>The basic idea behind SVM <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> is that it maps data points to a high-dimensional space such that the data points are linearly separable. However, SVM avoids computations in high-dimensional space by the use of kernel functions, which allows computations in the input space. There are many different types of kernel functions, with different effects. Brown <it>et al</it>. <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> showed that the radial kernel functions work very well in classifying genes on array data.</p>
            <p>We augmented the SVM implementation by Noble <it>et al</it>. <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> to incorporate the signal-to-noise (S2N) measure for feature selection. The S2N measure is defined as the difference of the means in the two classes divided by the sum of the standard deviations of the two classes. Because we adopt the one-versus-all approach <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B36">36</abbr></abbrgrp> to combine the binary SVM classifiers, each binary classifier distinguishes samples of a given class from samples of all the other classes. The multiple tumor dataset consists of 11 classes (see Table <tblr tid="T3">3</tblr> for details), and so there is a total of 11 binary SVM classifiers for this dataset. We applied the S2N measure to select a given number of relevant genes on the fourfold cross-validation data using a binary SVM classifier (with a radial kernel function). We then combined the results from each of the 11 SVMs by assigning the sample to the class of the classifier with the maximum discriminant value. This process was repeated for each of the five random fourfold splits of the training set. The results on the cross-validation data are shown in Figure S7(a) on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, in which the average number of classification errors is plotted against the number of relevant genes chosen. The next step is to apply this process to the entire training set, and use the selected genes to classify the samples on the test set. The results on the test set are shown in Figure S7(b) on <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, in which the number of classification errors on the test set is plotted against the number of relevant genes chosen.</p>
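            <p>The S2N measure for one one-versus-all classifier can be sketched as below. The sketch is illustrative only: it uses the sample standard deviation and ranks genes by the absolute value of S2N, neither of which is specified in the text, and the function names and data layout are hypothetical. It does not reproduce the SVM implementation itself.</p>

```python
import statistics


def signal_to_noise(in_class, rest):
    """(mean of class - mean of rest) / (sd of class + sd of rest)."""
    difference = statistics.mean(in_class) - statistics.mean(rest)
    spread = statistics.stdev(in_class) + statistics.stdev(rest)
    return difference / spread


def top_genes(expression, labels, target, n_genes):
    """Rank genes for one one-versus-all classifier by absolute S2N.

    expression: dict mapping gene name to a list of values, one per sample.
    labels: class label of each sample; target: the class treated as positive.
    """
    scores = {}
    for gene, values in expression.items():
        in_class = [v for v, lab in zip(values, labels) if lab == target]
        rest = [v for v, lab in zip(values, labels) if lab != target]
        scores[gene] = abs(signal_to_noise(in_class, rest))
    return sorted(scores, key=scores.get, reverse=True)[:n_genes]
```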
         </sec>
      </sec>
      <sec>
         <st>
            <p>Details of dataset analysis</p>
         </st>
         <sec>
            <st>
               <p>Multiple tumor data</p>
            </st>
            <p>In order to process the multiple tumor data <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> with the RMA measure implemented in the Bioconductor project, we need the raw data (.cel files) which contain the expression level for each oligo (probe cell). The original multiple tumor data consists of 14 tumor types which were hybridized to both the Affymetrix Hu6800 and Hu35K chips. However, only a subset of the original '.cel' files (mostly data from the Hu6800 chips) is available. Hence, the subset of the multiple tumor data we used consists of all the 7,129 genes on the Hu6800 chips and 11 tumor types, with 96 samples in the training set and 27 samples in the test set. Table <tblr tid="T3">3</tblr> shows the tumor types and class sizes for both the training and test sets.</p>
         </sec>
         <sec>
            <st>
               <p>Error model in the breast cancer data</p>
            </st>
            <p>The log ratios and their associated p-values are available from the breast cancer data. The p-values are confidence measures that expression ratios are significantly different from 1. Using the error model documented in the 'Error Model' supplement of Hughes <it>et al</it>. <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, we converted the p-values into error estimates. Assuming the distribution of error magnitudes can be approximated by the normal distribution, significance values (or p-values) can be derived from the Gaussian error function of the ratio of an observed log expression ratio to its error estimate <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. The p-value (p) for an observed log ratio (r) is related to the error estimate of the observed log ratio (&#963;) by p = 2 * (1 - Erf(|X|)), where X represents the ratio of the observed log expression ratio (r) to its error estimate (&#963;) and Erf is the Gaussian error function. Hence, the error estimates of the log expression ratios can be derived from the p-values. However, when a p-value is equal to 1, the error estimate is arbitrarily large. Hence, we ignored the corresponding expression ratio in our EWUSC algorithm when its p-value is equal to 1.</p>
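            <p>The conversion from p-values to error estimates can be sketched by inverting the relation p = 2 * (1 - Erf(|X|)). The sketch assumes that Erf denotes the cumulative distribution function of the standard normal distribution, which is the reading under which X = 0 gives p = 1; this interpretation and the function name are assumptions, and the code is not taken from the error model supplement.</p>

```python
import math
from statistics import NormalDist


def error_from_p_value(log_ratio, p_value):
    """Solve p = 2 * (1 - Phi(abs(r) / sigma)) for sigma.

    Phi is the standard normal cumulative distribution function (assumed
    meaning of Erf). Returns math.inf when p_value equals 1, the case the
    text describes as an arbitrarily large error estimate.
    """
    if p_value > 1.0 or not p_value > 0.0:
        raise ValueError("p_value must be greater than 0 and at most 1")
    if p_value == 1.0:
        return math.inf
    # By symmetry, Phi^-1(1 - p/2) equals -Phi^-1(p/2); this form stays
    # accurate for very small p-values.
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return abs(log_ratio) / z
```

            <p>A measurement for which the function returns an infinite error estimate would receive zero weight, which corresponds to ignoring that expression ratio.</p>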
         </sec>
         <sec>
            <st>
               <p>Synthetic data</p>
            </st>
            <p>The synthetic training sets consist of 1,000 genes, 80 samples, and four classes such that there are 20 samples in each class, and the synthetic test sets consist of 1,000 genes and 40 samples with 10 samples in each class. Two parameters control the noise levels in the synthetic datasets - the biological noise level (&#945;) and the technical noise level (&#955;). Let P be the matrix of patterns with 64 rows and 4 columns such that each entry P [i,j] is the ith pattern of class j (i = 1,2,..., 64, j = 1,2,3,4). Table <tblr tid="T8">8</tblr> shows the pattern matrix P used to generate synthetic datasets in our study. Let X(i, j) be the true expression value of gene i under experiment j before technical noise is added. Let Y(i, j, r) be the rth measured expression value of gene i under experiment j, where i = 1, 2, ..., p, j = 1,2, ..., n, r = 1,2,..., R. Suppose gene i is generated from the mth patterned gene that belongs to class k. X(i, j) is generated from the random normal distribution with mean P [m,k] and standard deviation &#945;. Technical noise is randomly sampled from a real dataset. Four hybridizations were repeated on the yeast galactose data <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, and the standard deviation of each gene under each experiment is adopted as our estimated technical noise. Let &#949; be the randomly sampled technical noise (standard deviation over four repeated measurements) from the yeast galactose data <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Y(i,j,r) is generated from the random normal distribution with mean X(i,j) and standard deviation &#949;&#955;. Hence, a high technical noise level &#955; indicates noisy repeated measurements. Moreover, there are five duplicates of each of these 64 patterned genes so that there is a total of 320 patterned genes. Each of these five duplicated patterned genes is generated using the same row in the pattern matrix P.</p>
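            <p>The generation of repeated measurements for a single patterned gene can be sketched as follows. The sketch assumes that a technical noise value &#949; is drawn independently for each gene and experiment, and it uses placeholder noise values rather than the yeast galactose data; the function name, argument names and zero-based class indices are hypothetical.</p>

```python
import random


def simulate_gene(pattern_row, sample_classes, alpha, lam, noise_pool,
                  n_repeats, rng):
    """Simulate repeated measurements Y[j][r] for one patterned gene.

    pattern_row: one row of the pattern matrix P, indexed by class (0-based).
    sample_classes: class index of each sample j.
    alpha: biological noise level; lam: technical noise level.
    noise_pool: technical-noise standard deviations to sample from
                (placeholder values, not the yeast galactose data).
    rng: a random.Random instance, so that runs can be reproduced.
    """
    measurements = []
    for k in sample_classes:
        x_true = rng.gauss(pattern_row[k], alpha)   # add biological noise
        epsilon = rng.choice(noise_pool)            # sampled technical noise
        measurements.append(
            [rng.gauss(x_true, epsilon * lam) for _ in range(n_repeats)]
        )
    return measurements
```

            <p>For example, calling the function with pattern_row = [1, 0, 0, 0] and 20 samples per class would give a gene that is up-regulated only in class 1, with the spread over repeated measurements governed by &#955;.</p>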
            <tbl id="T8">
               <title>
                  <p>Table 8</p>
               </title>
               <caption>
                  <p>Pattern matrix for synthetic data</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>Class 1</p>
                     </c>
                     <c ca="center">
                        <p>Class 2</p>
                     </c>
                     <c ca="center">
                        <p>Class 3</p>
                     </c>
                     <c ca="center">
                        <p>Class 4</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>-1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Each row represents a pattern and each column represents a class, such that entry P(i, j) is the value of the ith pattern for class j. An entry of 1 means upregulated, while an entry of -1 means downregulated. For example, the first row indicates that a patterned gene is upregulated in class 1 relative to the other three classes.</p>
               </tblfn>
            </tbl>
            <p>Non-patterned genes are sampled at random from the breast cancer data in such a way that they exhibit no class-specific expression pattern. Specifically, let q be a non-patterned gene. We randomly sample a gene g and an experiment e from the breast cancer data, where E[g,e] is the expression ratio of gene g in experiment e and s[g,e] is the corresponding error estimate. For sample j in the synthetic training or test set, Y(q,j) is then drawn from a normal distribution with mean E[g,e] and standard deviation s[g,e]. All expression values of the non-patterned gene q are sampled from the same randomly chosen gene g in the breast cancer data. Because experiment e is sampled independently for each sample j, any class-specific expression pattern in the original breast cancer data is destroyed.</p>
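            <p>The sampling of a non-patterned gene described above can be illustrated with a minimal sketch. This is not the code used in this study: it assumes that the expression ratios and error estimates of the source data are held in two NumPy arrays of equal shape (genes by experiments), and the function name, array names and toy data are hypothetical.</p>

```python
import numpy as np


def sample_non_patterned_gene(E, S, n_samples, rng):
    """Draw one synthetic non-patterned gene across n_samples samples.

    E[g, e] is the expression ratio and S[g, e] the error estimate of gene g
    in experiment e of the source data (hypothetical array names).
    """
    n_genes, n_experiments = E.shape
    # One source gene g is chosen at random and reused for every sample j.
    g = rng.integers(n_genes)
    # An experiment e is drawn independently for each sample j, which
    # scrambles any class-specific pattern present in the source data.
    e = rng.integers(n_experiments, size=n_samples)
    # Y(q, j) is drawn from Normal(mean=E[g, e_j], sd=S[g, e_j]).
    y = rng.normal(loc=E[g, e], scale=S[g, e])
    return g, e, y


# Toy stand-in for the breast cancer data: 5 genes by 8 experiments.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))
S = np.full((5, 8), 0.1)
g, e, y = sample_non_patterned_gene(E, S, n_samples=6, rng=rng)
```

            <p>In this sketch the experiment indices are drawn with replacement, which is one reading of independent sampling for each sample j; the returned vector y plays the role of Y(q,j) over all samples j.</p>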
            <p>Both the synthetic training and test sets are generated using the model described above. In our experiments, we set p = 1000, &#945; = 0.1, 1 or 2, and &#955; = 1 (low technical noise) or 10 (high technical noise), with R = 1, 4 or 20 repeated measurements. We also experimented with synthetic datasets containing a higher fraction of non-patterned genes and found that these larger datasets produce similar results (data not shown).</p>
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Sridhar Ramaswamy for providing the raw (.cel) files for the multiple tumor data and Jane Fridlyand for the processed NCI 60 dataset. We also acknowledge the publicly available BioConductor project <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> and GIST <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. We thank William Noble for general discussions and Mette Peters for her suggestions on this manuscript. This work was supported by NIH-NIDDK grant 5U24DK058813-02.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Hybridization analyses of arrayed cDNA libraries.</p>
            </title>
            <aug>
               <au>
                  <snm>Lennon</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Lehrach</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>1991</pubdate>
            <volume>7</volume>
            <fpage>314</fpage>
            <lpage>317</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1781028</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones.</p>
            </title>
            <aug>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Rocha</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Granjeaud</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Baldit</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bernard</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Naquet</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>BR</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>1995</pubdate>
            <volume>29</volume>
            <fpage>207</fpage>
            <lpage>216</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.1995.1233</pubid>
                  <pubid idtype="pmpid" link="fulltext">8530073</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Novel gene transcripts preferentially expressed in human muscles revealed by quantitative hybridization of a high density cDNA array.</p>
            </title>
            <aug>
               <au>
                  <snm>Pietu</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Alibert</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Guichard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Lamy</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Bois</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Leroy</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mariage-Sampson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Houlgatte</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Soularue</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Auffray</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1996</pubdate>
            <volume>6</volume>
            <fpage>492</fpage>
            <lpage>503</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8828038</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Quantitative monitoring of gene expression patterns with a complementary DNA microarray.</p>
            </title>
            <aug>
               <au>
                  <snm>Schena</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shalon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>RW</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <fpage>467</fpage>
            <lpage>470</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7569999</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Expression monitoring by hybridization to high-density oligonucleotide arrays.</p>
            </title>
            <aug>
               <au>
                  <snm>Lockhart</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Dong</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Byrne</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Follettie</snm>
                  <fnm>MT</fnm>
               </au>
               <au>
                  <snm>Gallo</snm>
                  <fnm>MV</fnm>
               </au>
               <au>
                  <snm>Chee</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Mittmann</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kobayashi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Horton</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>1996</pubdate>
            <volume>14</volume>
            <fpage>1675</fpage>
            <lpage>1680</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9634850</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Light-generated oligonucleotide arrays for rapid DNA sequence analysis.</p>
            </title>
            <aug>
               <au>
                  <snm>Pease</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Solas</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Sullivan</snm>
                  <fnm>EJ</fnm>
               </au>
               <au>
                  <snm>Cronin</snm>
                  <fnm>MT</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Fodor</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1994</pubdate>
            <volume>91</volume>
            <fpage>5022</fpage>
            <lpage>5026</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">43922</pubid>
                  <pubid idtype="pmpid" link="fulltext">8197176</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.</p>
            </title>
            <aug>
               <au>
                  <snm>Alon</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Barkai</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Notterman</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ybarra</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mack</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>6745</fpage>
            <lpage>6750</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">21986</pubid>
                  <pubid idtype="pmpid" link="fulltext">10359783</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.12.6745</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Comparative hybridization of an array of 21500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas.</p>
            </title>
            <aug>
               <au>
                  <snm>Schummer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ng</snm>
                  <fnm>WV</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Schummer</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Bednarski</snm>
                  <fnm>DW</fnm>
               </au>
               <au>
                  <snm>Hassell</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Baldwin</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Karlan</snm>
                  <fnm>BY</fnm>
               </au>
               <au>
                  <snm>Hood</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>1999</pubdate>
            <volume>238</volume>
            <fpage>375</fpage>
            <lpage>385</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10570965</pubid>
                  <pubid idtype="doi">10.1016/S0378-1119(99)00342-X</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.</p>
            </title>
            <aug>
               <au>
                  <snm>Golub</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Slonim</snm>
                  <fnm>DK</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Huard</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Gaasenbeek</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Coller</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Loh</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Downing</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Caligiuri</snm>
                  <fnm>MA</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>1999</pubdate>
            <volume>286</volume>
            <fpage>531</fpage>
            <lpage>537</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.286.5439.531</pubid>
                  <pubid idtype="pmpid" link="fulltext">10521349</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Multiclass cancer diagnosis using tumor gene expression signatures.</p>
            </title>
            <aug>
               <au>
                  <snm>Ramaswamy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rifkin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mukherjee</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Yeang</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Angelo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ladd</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Reich</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Latulippe</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>JP</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>15149</fpage>
            <lpage>15154</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">64998</pubid>
                  <pubid idtype="pmpid" link="fulltext">11742071</pubid>
                  <pubid idtype="doi">10.1073/pnas.211566398</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.</p>
            </title>
            <aug>
               <au>
                  <snm>Alizadeh</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Ma</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lossos</snm>
                  <fnm>IS</fnm>
               </au>
               <au>
                  <snm>Rosenwald</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Boldrick</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Sabet</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Tran</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>X</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>403</volume>
            <fpage>503</fpage>
            <lpage>511</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35000501</pubid>
                  <pubid idtype="pmpid" link="fulltext">10676951</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Systematic variation in gene expression patterns in human cancer cell lines.</p>
            </title>
            <aug>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Rees</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Van de Rijn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Waltham</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>24</volume>
            <fpage>227</fpage>
            <lpage>235</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/73432</pubid>
                  <pubid idtype="pmpid" link="fulltext">10700174</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.</p>
            </title>
            <aug>
               <au>
                  <snm>Bhattacharjee</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Richards</snm>
                  <fnm>WG</fnm>
               </au>
               <au>
                  <snm>Staunton</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Monti</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Vasa</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ladd</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Beheshti</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bueno</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gillette</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>13790</fpage>
            <lpage>13795</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">61120</pubid>
                  <pubid idtype="pmpid" link="fulltext">11707567</pubid>
                  <pubid idtype="doi">10.1073/pnas.191502998</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Gene expression profiling predicts clinical outcome of breast cancer.</p>
            </title>
            <aug>
               <au>
                  <snm>van't Veer</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>van de Vijver</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Mao</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Peterse</snm>
                  <fnm>HL</fnm>
               </au>
               <au>
                  <snm>van der Kooy</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Marton</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Witteveen</snm>
                  <fnm>AT</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>415</volume>
            <fpage>530</fpage>
            <lpage>536</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/415530a</pubid>
                  <pubid idtype="pmpid" link="fulltext">11823860</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Gene expression-based classification of malignant gliomas correlates better with survival than histological classification.</p>
            </title>
            <aug>
               <au>
                  <snm>Nutt</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Mani</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Betensky</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Cairncross</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Ladd</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Pohl</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Hartmann</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>McLaughlin</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Batchelor</snm>
                  <fnm>TT</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cancer Res</source>
            <pubdate>2003</pubdate>
            <volume>63</volume>
            <fpage>1602</fpage>
            <lpage>1607</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12670911</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning.</p>
            </title>
            <aug>
               <au>
                  <snm>Shipp</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>KN</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Weng</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Kutok</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Aguiar</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Gaasenbeek</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Angelo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Reich</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Pinkus</snm>
                  <fnm>GS</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Med</source>
            <pubdate>2002</pubdate>
            <volume>8</volume>
            <fpage>68</fpage>
            <lpage>74</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nm0102-68</pubid>
                  <pubid idtype="pmpid" link="fulltext">11786909</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Diagnosis of multiple cancer types by shrunken centroids of gene expression.</p>
            </title>
            <aug>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Narasimhan</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>6567</fpage>
            <lpage>6572</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124443</pubid>
                  <pubid idtype="pmpid" link="fulltext">12011421</pubid>
                  <pubid idtype="doi">10.1073/pnas.082099299</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>MLT</fnm>
               </au>
               <au>
                  <snm>Kuo</snm>
                  <fnm>FC</fnm>
               </au>
               <au>
                  <snm>Whitmore</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Sklar</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <fpage>9834</fpage>
            <lpage>9839</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">27599</pubid>
                  <pubid idtype="pmpid" link="fulltext">10963655</pubid>
                  <pubid idtype="doi">10.1073/pnas.97.18.9834</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Identification of novel tumor markers in hepatitis C virus-associated hepatocellular carcinoma.</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Yue</snm>
                  <fnm>ZN</fnm>
               </au>
               <au>
                  <snm>Geiss</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Sadovnikova</snm>
                  <fnm>NY</fnm>
               </au>
               <au>
                  <snm>Carter</snm>
                  <fnm>VS</fnm>
               </au>
               <au>
                  <snm>Boix</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Lazaro</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Rosenberg</snm>
                  <fnm>GB</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Fausto</snm>
                  <fnm>N</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cancer Res</source>
            <pubdate>2003</pubdate>
            <volume>63</volume>
            <fpage>859</fpage>
            <lpage>864</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12591738</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4(+)-T-cell lines.</p>
            </title>
            <aug>
               <au>
                  <snm>Van't Wout</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>Lehrman</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Mikheeva</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>O'Keeffe</snm>
                  <fnm>GC</fnm>
               </au>
               <au>
                  <snm>Katze</snm>
                  <fnm>MG</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Geiss</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Mullins</snm>
                  <fnm>JI</fnm>
               </au>
            </aug>
            <source>J Virol</source>
            <pubdate>2003</pubdate>
            <volume>77</volume>
            <fpage>1392</fpage>
            <lpage>1402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">140827</pubid>
                  <pubid idtype="pmpid" link="fulltext">12502855</pubid>
                  <pubid idtype="doi">10.1128/JVI.77.2.1392-1402.2003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Tumor classification by partial least squares using microarray gene expression data.</p>
            </title>
            <aug>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>DV</fnm>
               </au>
               <au>
                  <snm>Rocke</snm>
                  <fnm>DM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>39</fpage>
            <lpage>50</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.1.39</pubid>
                  <pubid idtype="pmpid" link="fulltext">11836210</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Multi-class cancer classification via partial least squares with gene expression profiles.</p>
            </title>
            <aug>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>DV</fnm>
               </au>
               <au>
                  <snm>Rocke</snm>
                  <fnm>DM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>1216</fpage>
            <lpage>1226</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.9.1216</pubid>
                  <pubid idtype="pmpid" link="fulltext">12217913</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Comparison of discrimination methods for the classification of tumors using gene expression data.</p>
            </title>
            <aug>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fridlyand</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>J Am Stat Assoc</source>
            <pubdate>2002</pubdate>
            <volume>97</volume>
            <fpage>77</fpage>
            <lpage>87</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1198/016214502753479248</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Clustering gene-expression data with repeated measurements.</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Medvedovic</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>R34</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">156590</pubid>
                  <pubid idtype="pmpid" link="fulltext">12734014</pubid>
                  <pubid idtype="doi">10.1186/gb-2003-4-5-r34</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Supervised clustering of genes.</p>
            </title>
            <aug>
               <au>
                  <snm>Dettling</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Buhlmann</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>research0069.1</fpage>
            <lpage>0069.15</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">151171</pubid>
                  <pubid idtype="pmpid" link="fulltext">12537558</pubid>
                  <pubid idtype="doi">10.1186/gb-2002-3-12-research0069</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Exploration, normalization, and summaries of high density oligonucleotide array probe level data.</p>
            </title>
            <aug>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Hobbs</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Collin</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Beazer-Barclay</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Antonellis</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Biostatistics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>249</fpage>
            <lpage>264</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/biostatistics/4.2.249</pubid>
                  <pubid idtype="pmpid" link="fulltext">12925520</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Summaries of Affymetrix GeneChip probe level data.</p>
            </title>
            <aug>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Bolstad</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Collin</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Cope</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Hobbs</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>e15</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">150247</pubid>
                  <pubid idtype="pmpid" link="fulltext">12582260</pubid>
                  <pubid idtype="doi">10.1093/nar/gng015</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Bioconductor: open source software for bioinformatics</p>
            </title>
            <url>http://www.bioconductor.org</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Functional discovery via a compendium of expression profiles.</p>
            </title>
            <aug>
               <au>
                  <snm>Hughes</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Marton</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Stoughton</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Armour</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Bennett</snm>
                  <fnm>HA</fnm>
               </au>
               <au>
                  <snm>Coffey</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>YD</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cell</source>
            <pubdate>2000</pubdate>
            <volume>102</volume>
            <fpage>109</fpage>
            <lpage>126</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10929718</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Supplementary web site.</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
            </aug>
            <url>http://expression.washington.edu/public</url>
         </bibl>
         <bibl id="B31">
            <aug>
               <au>
                  <snm>Jain</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Dubes</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>Algorithms for Clustering Data</source>
            <publisher>Englewood Cliffs, NJ: Prentice Hall</publisher>
            <pubdate>1988</pubdate>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Integrated genomic and proteomic analyses of a systematically perturbed metabolic network.</p>
            </title>
            <aug>
               <au>
                  <snm>Ideker</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Thorsson</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Ranish</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Christmas</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Buhler</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Eng</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Bumgarner</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Goodlett</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Aebersold</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hood</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>292</volume>
            <fpage>929</fpage>
            <lpage>934</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.292.5518.929</pubid>
                  <pubid idtype="pmpid" link="fulltext">11340206</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <aug>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>VN</fnm>
               </au>
            </aug>
            <source>Statistical Learning Theory</source>
            <publisher>New York: Wiley</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Knowledge-based analysis of microarray gene expression data by using support vector machines.</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Grundy</snm>
                  <fnm>WN</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cristianini</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sugnet</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Furey</snm>
                  <fnm>TS</fnm>
               </au>
               <au>
                  <snm>Ares</snm>
                  <fnm>M</fnm>
                  <suf>Jr</suf>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <fpage>262</fpage>
            <lpage>267</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">26651</pubid>
                  <pubid idtype="pmpid" link="fulltext">10618406</pubid>
                  <pubid idtype="doi">10.1073/pnas.97.1.262</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>GIST</p>
            </title>
            <url>http://microarray.cpmc.columbia.edu/gist</url>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Molecular classification of multiple tumor types.</p>
            </title>
            <aug>
               <au>
                  <snm>Yeang</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Ramaswamy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Mukherjee</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rifkin</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Angelo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Reich</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Golub</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>Suppl 1</issue>
            <fpage>S316</fpage>
            <lpage>S322</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11473023</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
