<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-125</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Merging microarray data from separate breast cancer studies provides a robust prognostic test</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Xu</snm>
               <fnm>Lei</fnm>
               <insr iid="I1"/>
               <email>leixu@jhu.edu</email>
            </au>
            <au id="A2">
               <snm>Tan</snm>
               <mnm>Choon</mnm>
               <fnm>Aik</fnm>
               <insr iid="I1"/>
               <email>actan@jhu.edu</email>
            </au>
            <au id="A3">
               <snm>Winslow</snm>
               <mi>L</mi>
               <fnm>Raimond</fnm>
               <insr iid="I1"/>
               <email>rwinslow@jhu.edu</email>
            </au>
            <au id="A4">
               <snm>Geman</snm>
               <fnm>Donald</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>geman@jhu.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>125</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/125</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18304324</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-125</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>12</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>27</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>27</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Xu et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>There is an urgent need for new prognostic markers of breast cancer metastases to ensure that newly diagnosed patients receive appropriate therapy. Recent studies have demonstrated the potential value of gene expression signatures in assessing the risk of developing distant metastases. However, due to the small sample sizes of individual studies, the overlap among signatures is almost zero and their predictive power is often limited. Integrating microarray data from multiple studies in order to increase sample size is therefore a promising approach to the development of more robust prognostic tests.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this study, by using a highly stable data aggregation procedure based on expression comparisons, we have integrated three independent microarray gene expression data sets for breast cancer and identified a structured prognostic signature consisting of 112 genes organized into 80 pair-wise expression comparisons. A classical likelihood ratio test based on these comparisons, essentially weighted voting, achieves 88.6% sensitivity and 54.6% specificity in an independent external test set of 154 samples. The test is highly informative in assessing the risk of developing distant metastases within five years (hazard ratio 9.3 with 95% CI 2.9&#8211;29.9).</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Rank-based features provide a stable way to integrate patient data from separate microarray studies due to invariance to data normalization, and such features can be combined into a useful predictor of distant metastases in breast cancer within a statistical modeling framework which begins to capture gene-gene interactions. Upon further confirmation on large-scale independent data, such prognostic signatures and tests could provide a powerful tool to guide adjuvant systemic treatment that could greatly reduce the cost of breast cancer treatment, both in terms of toxic side effects and health care expenditures.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Breast cancer is the most common form of cancer and the second leading cause of cancer death among women in the United States, with an estimated 213,000 new cases and 41,000 deaths in 2006 <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The main cause of breast cancer death is metastasis to distant sites. Early diagnosis and adjuvant systemic therapy (hormone therapy and chemotherapy) substantially reduce the risk of distant metastases. However, adjuvant therapy has serious short- and long-term side effects and involves high medical costs <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Therefore, highly accurate prognostic tests are essential to aid clinicians in deciding which patients are at high risk of developing metastases and should receive adjuvant therapy. Currently, the most widely used treatment guidelines, the St. Gallen <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and the US National Institutes of Health (NIH) <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> consensus criteria, assess a patient's risk of distant metastases based on clinical prognostic factors such as tumor size, lymph node status, and histologic grade. These guidelines cannot accurately identify at-risk patients, and about 70&#8211;80% of patients defined as being at risk by these criteria and receiving adjuvant therapy would have survived without it <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. In addition, many patients who would be cured by local or regional treatment alone are "over-treated" and suffer toxic side effects of adjuvant therapy unnecessarily. Therefore, there is an urgent need for new prognostic tests to precisely define a patient's risk of developing metastases to ensure that the patient receives appropriate therapy.</p>
         <p>The advent of DNA microarray technology provides a powerful tool in various aspects of cancer research. Simultaneous assessment of the expression of thousands of genes in a single experiment could allow better understanding of the complex and heterogeneous molecular properties of breast cancer. Such information may lead to more accurate prognostic signatures for prediction of metastasis risk in breast cancer patients. Over the past few years, a number of studies have identified prognostic gene expression signatures and proposed corresponding prognostic tests based on these genes. In many cases, the prediction of breast cancer outcome is superior to that of conventional prognostic tests <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Among these studies, the two largest have attempted to identify gene expression signatures and prognostic tests strongly predictive of distant metastases. van't Veer <it>et al</it>. applied a supervised method to identify a 70-gene signature, and a correlation-based test capable of predicting a short interval to distant metastases, in a cohort of 78 young breast cancer patients (&lt;55 years of age) with lymph-node-negative tumors <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The test was applied to a cohort of 295 patients with either lymph-node-negative or lymph-node-positive breast tumors <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Using a different microarray platform, Wang <it>et al</it>. derived a 76-gene prognostic signature from 115 lymph-node-negative patients who had not received adjuvant systemic treatment. The signature could be used to predict distant metastasis within five years in breast cancer patients of all age groups with lymph-node-negative tumors and was subsequently applied to a set of 171 lymph-node-negative patients <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. 
These studies have shown that tests based on gene expression signatures would result in a substantial reduction of the number of patients receiving unnecessary adjuvant systemic treatment, thereby preventing over-treatment in a considerable number of breast cancer patients.</p>
         <p>The most striking observation when comparing the signatures from different studies is the lack of overlap of signature genes. For instance, in the studies of van't Veer <it>et al</it>. and Wang <it>et al</it>., despite the similar clinical and statistical designs, there is an overlap of only three genes in the two gene signature lists. These diverse results make it difficult to identify the most predictive genes for breast cancer prognosis. The disagreements in gene signatures may be partly due to the use of different microarray platforms and differences in patient selection, normalization procedures and other experimental choices. Moreover, in a recent study <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, reanalysis of the van't Veer data has shown that, even within a particular study, the prognostic signature is strongly influenced by the subset of patients used for signature selection. This observation indicates that given the small number of samples in the training sets, many genes might show what appear to be significant correlations with clinical outcome and the differences among these correlations might be small. Therefore, it is possible to combine genes in many ways to generate different signatures with similar predictive power when validated on internal test sets <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Moreover, in general, these prognostic tests are not robust, meaning that they cannot be validated on independent, external data sets <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Independent reanalysis on other microarray data sets has shown very similar findings <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Given the large numbers of features (~10,000 to 40,000 genes) in microarray data and the relatively small numbers of samples (~100 patients) used in the training set of each study, it is quite possible to find, purely by chance, a set of genes with good predictive power on internal test sets. 
This is the type of "over-fitting" that is typical when the number of observed variables far exceeds the number of samples. In light of this general "small-sample dilemma" in statistical learning and the particular observations from the two reanalysis studies mentioned above, the disagreements in gene signatures obtained from different data sets are not surprising. We believe that much larger numbers of samples (patients), perhaps thousands, are needed to develop more robust prognostic tests and signatures.</p>
         <p>The rapid accumulation of microarray gene expression data suggests that combining microarray data from different studies may be a useful way to increase sample size and diversity. In particular, "meta-analyses" have recently been used to merge different studies in order to develop prognostic gene expression signatures for breast cancer <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. However, effectively integrating microarray data from different studies is not straightforward due to several issues of compatibility, such as differing microarray platforms, experimental protocols and data preprocessing methods. Instead of directly integrating microarray gene expression values, meta-analyses combine results (e.g. <it>t </it>statistics) of individual studies to increase statistical power. The major limitation of meta-analyses is that the small sample sizes typical of individual studies, coupled with variation due to differences in study protocols, inevitably degrade the results. Also, deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.</p>
         <p>In contrast to the meta-analysis approach, in which the results of individual studies are combined at an interpretative level, other methods, such as the Z-score method and Distance Weighted Discrimination (DWD), integrate microarray data from different studies at the expression value level after transforming the expression values to numerically comparable measures <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. In general, the procedure involves the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted based on cross-referencing the annotation of each probe set represented on the microarrays. Cross-referencing of expression data is usually achieved using the UniGene database <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Next, for each individual data set, numerically comparable quantities are derived from the expression values of genes in the common list by applying specific data transformation and normalization methods. Finally, the newly derived quantities from individual data sets are combined to increase sample size and statistical methods are applied to the combined data to build diagnostic and prognostic signatures. One major limitation of these direct integration methods is that there is still no consensus on how best to perform data transformation and normalization.</p>
         <p>In our previous work <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, we proposed a novel method for molecular classification which builds predictors from <it>relative </it>expression values, which can be directly applied to integrated microarray data and which generates very simple decision rules. Because this method is based only on the ranks of the expression values within a profile (sample), there is no need to prepare the data for integration; in particular, there is no need for data normalization, since ranks are invariant to all types of within-array monotonic preprocessing. This approach to data integration was validated on prostate cancer data <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, resulting in a powerful two-gene diagnostic classifier. It has also been applied recently to differentiating between gastrointestinal stromal tumors and leiomyosarcomas <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Here, we extend this method to predict distant metastases in breast cancer, and attempt to overcome the limitations of previous study-specific methods and meta-analyses.</p>
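         <p>The rank invariance property can be checked with a minimal sketch. The intensities and the two transformations below are hypothetical and chosen only for illustration; the point is that a strictly increasing within-array transformation leaves the within-profile ranks, and hence every pairwise comparison, unchanged.</p>
         <p><![CDATA[
```python
import math


def ranks(profile):
    """Return the rank (0 = lowest) of each value within one profile."""
    order = sorted(range(len(profile)), key=lambda i: profile[i])
    result = [0] * len(profile)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


# Hypothetical raw intensities for five probes on a single array.
raw = [120.0, 35.5, 980.2, 410.0, 7.3]

# Two strictly increasing within-array transformations (illustrative choices).
log_scaled = [math.log2(x) for x in raw]
affine = [2.5 * x + 10.0 for x in raw]

ranks_raw = ranks(raw)
ranks_log = ranks(log_scaled)
ranks_affine = ranks(affine)

# All three rank vectors are identical.
print(ranks_raw, ranks_log, ranks_affine)
```
]]></p>
         <p>Note that the sketch covers only strictly increasing transformations applied within one array; procedures that mix information across arrays or across probes are outside what it demonstrates.</p>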
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Summary</p>
            </st>
            <p>We integrate three independent microarray gene expression data sets to obtain an integrated training set of 358 samples and identify a set of features for predicting distant metastases. All the samples included in this study are from lymph-node-negative patients who have not received adjuvant systemic treatment. Each feature is based on an ordered pair of genes and assumes the value one if the first gene is expressed less than the second gene, and assumes the value zero otherwise. These genes may not all be highly differentially expressed, and one gene in the pair may serve as a "reference" for the other one. Since the features are rank-based, no data normalization is needed before data integration. A classical likelihood ratio test is used to classify patients as either poor-outcome, meaning that they are likely to develop distant metastases, or good-outcome, meaning that they are unlikely to develop distant metastases. The choice of features is motivated by achieving the highest possible specificity at an acceptable level of sensitivity, taken here to be 90% in accordance with the St. Gallen and NIH treatment guidelines. The number of features chosen in the prognostic signature, as well as the threshold in the likelihood ratio test (LRT), is optimized with <it>k</it>-fold cross-validation on the integrated training set. The optimal feature number is estimated to be 80, corresponding to 112 genes (since some genes appear in more than one feature). The prognostic test based on this signature is validated using an independent microarray data set. Upon further validation on large-scale independent data, the prognostic gene expression signature could support other breast cancer prognostic tests with high enough specificity to help avoid over-treatment of newly diagnosed patients.</p>
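            <p>The pair-wise features and a likelihood ratio test of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: the features are treated as conditionally independent given the outcome class, the class-conditional probabilities are estimated with add-one smoothing, and the threshold is set to zero. The toy profiles, gene indices and function names are hypothetical.</p>
            <p><![CDATA[
```python
import math


def pair_features(profile, pairs):
    """Binary features: 1 if the first gene is expressed less than the second, else 0."""
    return [1 if profile[i] < profile[j] else 0 for i, j in pairs]


def estimate_probs(samples, pairs, pseudo=1.0):
    """Estimate P(feature = 1) within one outcome class.

    Add-one smoothing (pseudo) is an assumption made for this sketch.
    """
    counts = [0] * len(pairs)
    for profile in samples:
        for k, value in enumerate(pair_features(profile, pairs)):
            counts[k] += value
    return [(c + pseudo) / (len(samples) + 2.0 * pseudo) for c in counts]


def log_likelihood_ratio(profile, pairs, p_poor, p_good):
    """Sum of per-feature log ratios (weighted voting).

    Assumes conditional independence of features, which is not taken from the text.
    """
    total = 0.0
    for x, pp, pg in zip(pair_features(profile, pairs), p_poor, p_good):
        if x == 1:
            total += math.log(pp / pg)
        else:
            total += math.log((1.0 - pp) / (1.0 - pg))
    return total


def classify(profile, pairs, p_poor, p_good, threshold=0.0):
    """Call poor-outcome when the log likelihood ratio exceeds the threshold."""
    llr = log_likelihood_ratio(profile, pairs, p_poor, p_good)
    return "poor-outcome" if llr > threshold else "good-outcome"


# Hypothetical toy data: three genes per profile, two ordered gene pairs.
pairs = [(0, 1), (2, 1)]
poor_train = [[1.0, 5.0, 3.0], [2.0, 6.0, 9.0], [0.5, 4.0, 1.0]]
good_train = [[5.0, 1.0, 3.0], [7.0, 2.0, 1.0], [6.0, 3.0, 8.0]]

p_poor = estimate_probs(poor_train, pairs)
p_good = estimate_probs(good_train, pairs)

new_profile = [1.0, 4.0, 2.0]
llr_new = log_likelihood_ratio(new_profile, pairs, p_poor, p_good)
label_new = classify(new_profile, pairs, p_poor, p_good)
print(label_new, round(llr_new, 3))
```
]]></p>
            <p>In this sketch each feature contributes a vote whose weight is the log ratio of its estimated class-conditional probabilities; raising the threshold trades sensitivity for specificity.</p>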
         </sec>
         <sec>
            <st>
               <p>Study data</p>
            </st>
            <p>Four breast cancer microarray data sets are included in this study. Each data set has been downloaded from publicly available gene expression repositories (e.g. Gene Expression Omnibus) or supporting web sites <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B11">11</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. All four data sets were generated on the same Affymetrix HG-U133A microarray platform. Here, the names of the first authors of individual studies are used as the names of the data sets. Three data sets, Miller (251 patients), Sotiriou (189 patients) and Wang (286 patients), are used as training data and the other one, Pawitan (159 patients), is used as independent test data. The reason for this division into training and test data is that detailed clinical information has been provided for the Miller, Sotiriou and Wang data sets and this information has been used to select specific patients for training, whereas little clinical information is provided for the Pawitan study. For the Miller, Sotiriou and Pawitan studies, because the published gene expression data have undergone cross-sample normalization, we have downloaded the raw CEL files and calculated expression values using the Affymetrix GeneChip Operating Software version 1.4. There is an 85-patient overlap between the Miller and Sotiriou data sets, so we have excluded the replicate samples from our study. Detailed patient information in each study has been described in the corresponding literature.</p>
            <p>Motivated by a recent study <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, we employ the idea of restricting training data to extreme patient samples, which are more informative in identifying a prognostic signature. Extreme patients are either short-term survivors with poor-outcome within a short period or long-term survivors who maintain a good-outcome after a long follow-up time. Specifically, we select patients who developed distant metastases (relapse) within five years as poor-outcome samples and patients who were free of distant metastases (relapse) during the follow-up for a period of at least eight years as good-outcome samples. The sharp contrast between short-term and long-term survivors should identify more informative and reliable genes for a prognostic signature. Only early stage lymph-node-negative patients who had not received adjuvant systemic treatment are included in the training data because adjuvant treatment is likely to modify patient outcome. The selection is irrespective of age, tumor size and other clinical parameters. After applying the above selection criteria, a total of 358 patients are identified from the three training data sets and used to learn a prognostic signature and prognostic test. The numbers of selected patients from each training data set are listed in Table <tblr tid="T1">1</tblr>.</p>
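            <p>The selection rule for extreme patients can be sketched as a simple filter. The record fields and function name are hypothetical, and two boundary choices are assumptions of this sketch rather than statements from the text: a metastasis at exactly five years counts as poor-outcome, and patients who developed metastases after five years are left out of both groups.</p>
            <p><![CDATA[
```python
def label_extreme(patient, poor_within_years=5.0, good_after_years=8.0):
    """Return 'poor-outcome', 'good-outcome', or None if the patient is not selected.

    patient is a dict with hypothetical keys:
      node_negative (bool), adjuvant_treatment (bool),
      metastasis (bool), years (time to metastasis or to last follow-up).
    """
    # Only lymph-node-negative patients without adjuvant systemic treatment are eligible.
    if not patient["node_negative"] or patient["adjuvant_treatment"]:
        return None
    if patient["metastasis"]:
        # Assumption: the five-year boundary is inclusive.
        if patient["years"] <= poor_within_years:
            return "poor-outcome"
        # Assumption: later metastases are excluded from training.
        return None
    if patient["years"] >= good_after_years:
        return "good-outcome"
    # Metastasis-free but followed for less than eight years: not extreme.
    return None


# Hypothetical patient records for illustration only.
cohort = [
    {"node_negative": True, "adjuvant_treatment": False, "metastasis": True, "years": 2.1},
    {"node_negative": True, "adjuvant_treatment": False, "metastasis": False, "years": 10.4},
    {"node_negative": True, "adjuvant_treatment": False, "metastasis": False, "years": 6.0},
    {"node_negative": True, "adjuvant_treatment": False, "metastasis": True, "years": 7.2},
    {"node_negative": True, "adjuvant_treatment": True, "metastasis": True, "years": 1.0},
    {"node_negative": False, "adjuvant_treatment": False, "metastasis": False, "years": 12.0},
]

labels = [label_extreme(p) for p in cohort]
print(labels)
```
]]></p>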
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Training data sets: lymph-node-negative patients with no adjuvant treatment</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Data Set</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>No. of Patients</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>No. of Good-outcome</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>No. of Poor-outcome</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Miller [25]</p>
                     </c>
                     <c ca="left">
                        <p>106</p>
                     </c>
                     <c ca="left">
                        <p>92</p>
                     </c>
                     <c ca="left">
                        <p>14</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sotiriou [11]</p>
                     </c>
                     <c ca="left">
                        <p>43</p>
                     </c>
                     <c ca="left">
                        <p>30</p>
                     </c>
                     <c ca="left">
                        <p>13</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wang [7]</p>
                     </c>
                     <c ca="left">
                        <p>209</p>
                     </c>
                     <c ca="left">
                        <p>114</p>
                     </c>
                     <c ca="left">
                        <p>95</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total</p>
                     </c>
                     <c ca="left">
                        <p>358</p>
                     </c>
                     <c ca="left">
                        <p>236</p>
                     </c>
                     <c ca="left">
                        <p>122</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>A prognostic signature from integrated data</p>
            </st>
            <p>We directly merge the three microarray data sets in Table <tblr tid="T1">1</tblr>, using the 22,283 probe sets on the Affymetrix HG-U133A microarray, to form an integrated training data set. The integrated data set consists of 122 extreme poor-outcome samples (distant metastases within five years after surgery) and 236 extreme good-outcome samples (free of distant metastases during the follow-up for a period of at least eight years after surgery). Recall that each feature is based on a pair of genes. The integrated training set is used to estimate the relationship between the number <it>m </it>of features in a prognostic classifier and the specificity at the 90% sensitivity level, evaluated by 40-fold cross-validation, as described in 'Methods'. The result is plotted in Figure <figr fid="F1">1</figr>. As can be seen, the specificity is nearly constant after about 80 features are included. Our final prognostic signature then consists of the 80 top-ranked features (gene pairs) from the feature list generated from the original integrated training data, using the feature selection and transformation procedures described in 'Methods'. Because some genes appear in more than one feature, the 80 top-ranked gene pairs in our prognostic signature include 112 distinct genes (Table <tblr tid="T2">2</tblr>). To illustrate the behavior of the 80 features in the signature on the Wang data set (part of the integrated training data), we show the difference in expression between the two genes in each of the 80 gene pairs in the form of a heat map in Figure <figr fid="F2">2</figr>. Distinct patterns of expression differences can be observed for good- and poor-outcome samples.</p>
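            <p>The two quantities involved in choosing the signature size can be sketched as follows: the specificity obtained when the threshold is set to reach a target sensitivity, and the smallest number of features whose specificity is close to the maximum. The scores, the specificity curve and the tolerance of 0.01 are invented for illustration and are not values from this study; the cross-validation procedure itself is described in 'Methods' and is not reproduced here.</p>
            <p><![CDATA[
```python
import math


def specificity_at_sensitivity(poor_scores, good_scores, target=0.9):
    """Specificity at the highest threshold that keeps sensitivity >= target.

    A sample is called poor-outcome when its score is >= the threshold.
    """
    ranked = sorted(poor_scores, reverse=True)
    needed = math.ceil(target * len(ranked))  # poor samples that must be detected
    threshold = ranked[needed - 1]
    true_negatives = sum(1 for s in good_scores if s < threshold)
    return true_negatives / len(good_scores)


def smallest_near_max(spec_by_m, tolerance=0.01):
    """Smallest feature count whose specificity is within tolerance of the maximum.

    The tolerance is an assumption of this sketch, standing in for 'roughly maximum'.
    """
    best = max(spec_by_m.values())
    return min(m for m, s in spec_by_m.items() if s >= best - tolerance)


# Hypothetical test scores (e.g. log likelihood ratios) for illustration only.
poor_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1, -0.5, -3.0]
good_scores = [-2.0, -1.0, -0.6, 0.3, -4.0]
spec = specificity_at_sensitivity(poor_scores, good_scores)

# Hypothetical specificity curve indexed by number of features m.
curve = {10: 0.20, 20: 0.28, 40: 0.35, 80: 0.415, 120: 0.42, 160: 0.418}
m_opt = smallest_near_max(curve)
print(spec, m_opt)
```
]]></p>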
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Choosing size of the signature</p>
               </caption>
               <text>
                  <p><b>Choosing size of the signature</b>. The relationship between the number of features in a prognostic signature and the specificity at 90% sensitivity of the corresponding prognostic test, evaluated by 40-fold cross-validation. We select <it>m</it><sub><it>opt </it></sub>= 80, the smallest value that achieves roughly maximum specificity at the 90% sensitivity level. The specificity observed on the validation set is in fact higher.</p>
               </text>
               <graphic file="1471-2105-9-125-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>The heat map of the 80 signature gene pairs</p>
               </caption>
               <text>
                  <p><b>The heat map of the 80 signature gene pairs</b>. The Wang data set is used to illustrate the gene expression values of the signature genes. A heat map is generated using the matrix2png software [34]. There are 80 rows corresponding to the 80 gene pairs; the displayed intensities are the differences between the expression values of the two genes in each pair. The expression value for each difference is normalized across the samples to zero mean and one standard deviation (SD) for visualization purposes. Differences with expression levels greater than the mean are colored in red and those below the mean are colored in green. The scale indicates the number of SDs above or below the mean.</p>
               </text>
               <graphic file="1471-2105-9-125-2"/>
            </fig>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Genes in the identified prognostic signature. For each probe set the first column lists the subset of the eighty pairs which contain it. The pairs are ordered from 1 to 80 by their scores.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Pair Rank</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Probe Set</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Gene Symbol</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Gene Title</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1, 43</p>
                     </c>
                     <c ca="left">
                        <p>91816_f_at</p>
                     </c>
                     <c ca="left">
                        <p>RKHD1</p>
                     </c>
                     <c ca="left">
                        <p>ring finger and KH domain containing 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1, 6, 73</p>
                     </c>
                     <c ca="left">
                        <p>204641_at</p>
                     </c>
                     <c ca="left">
                        <p>NEK2</p>
                     </c>
                     <c ca="left">
                        <p>NIMA (never in mitosis gene a)-related kinase 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>213139_at</p>
                     </c>
                     <c ca="left">
                        <p>SNAI2</p>
                     </c>
                     <c ca="left">
                        <p>snail homolog 2 (Drosophila)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2, 4, 9, 33</p>
                     </c>
                     <c ca="left">
                        <p>212188_at</p>
                     </c>
                     <c ca="left">
                        <p>KCTD12</p>
                     </c>
                     <c ca="left">
                        <p>potassium channel tetramerisation domain containing 12</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>212022_s_at</p>
                     </c>
                     <c ca="left">
                        <p>MKI67</p>
                     </c>
                     <c ca="left">
                        <p>antigen identified by monoclonal antibody Ki-67</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3, 61, 80</p>
                     </c>
                     <c ca="left">
                        <p>219716_at</p>
                     </c>
                     <c ca="left">
                        <p>APOL6</p>
                     </c>
                     <c ca="left">
                        <p>apolipoprotein L, 6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>205264_at</p>
                     </c>
                     <c ca="left">
                        <p>CD3EAP</p>
                     </c>
                     <c ca="left">
                        <p>CD3e molecule, epsilon associated protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>5</p>
                     </c>
                     <c ca="left">
                        <p>206687_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PTPN6</p>
                     </c>
                     <c ca="left">
                        <p>protein tyrosine phosphatase, non-receptor type 6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>5, 67</p>
                     </c>
                     <c ca="left">
                        <p>218009_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PRC1</p>
                     </c>
                     <c ca="left">
                        <p>protein regulator of cytokinesis 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>6, 35, 39, 55</p>
                     </c>
                     <c ca="left">
                        <p>219579_at</p>
                     </c>
                     <c ca="left">
                        <p>RAB3IL1</p>
                     </c>
                     <c ca="left">
                        <p>RAB3A interacting protein (rabin3)-like 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>7</p>
                     </c>
                     <c ca="left">
                        <p>221824_s_at</p>
                     </c>
                     <c ca="left">
                        <p>MARCH8</p>
                     </c>
                     <c ca="left">
                        <p>membrane-associated ring finger (C3HC4) 8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>7</p>
                     </c>
                     <c ca="left">
                        <p>209574_s_at</p>
                     </c>
                     <c ca="left">
                        <p>C18orf1</p>
                     </c>
                     <c ca="left">
                        <p>chromosome 18 open reading frame 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>8</p>
                     </c>
                     <c ca="left">
                        <p>210199_at</p>
                     </c>
                     <c ca="left">
                        <p>CRYAA</p>
                     </c>
                     <c ca="left">
                        <p>crystallin, alpha A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>8, 24, 26, 31</p>
                     </c>
                     <c ca="left">
                        <p>219493_at</p>
                     </c>
                     <c ca="left">
                        <p>SHCBP1</p>
                     </c>
                     <c ca="left">
                        <p>SHC SH2-domain binding protein 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>9</p>
                     </c>
                     <c ca="left">
                        <p>204177_s_at</p>
                     </c>
                     <c ca="left">
                        <p>KLHL20</p>
                     </c>
                     <c ca="left">
                        <p>kelch-like 20 (Drosophila)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>10, 34</p>
                     </c>
                     <c ca="left">
                        <p>203010_at</p>
                     </c>
                     <c ca="left">
                        <p>STAT5A</p>
                     </c>
                     <c ca="left">
                        <p>signal transducer and activator of transcription 5A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p>212747_at</p>
                     </c>
                     <c ca="left">
                        <p>ANKS1A</p>
                     </c>
                     <c ca="left">
                        <p>ankyrin repeat and sterile alpha motif domain containing 1A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>11, 19, 21</p>
                     </c>
                     <c ca="left">
                        <p>205034_at</p>
                     </c>
                     <c ca="left">
                        <p>CCNE2</p>
                     </c>
                     <c ca="left">
                        <p>cyclin E2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>11, 65</p>
                     </c>
                     <c ca="left">
                        <p>217427_s_at</p>
                     </c>
                     <c ca="left">
                        <p>HIRA</p>
                     </c>
                     <c ca="left">
                        <p>HIR histone cell cycle regulation defective homolog A (S. cerevisiae)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>12, 46, 54, 74</p>
                     </c>
                     <c ca="left">
                        <p>222077_s_at</p>
                     </c>
                     <c ca="left">
                        <p>RACGAP1</p>
                     </c>
                     <c ca="left">
                        <p>Rac GTPase activating protein 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>12, 62</p>
                     </c>
                     <c ca="left">
                        <p>36545_s_at</p>
                     </c>
                     <c ca="left">
                        <p>SFI1</p>
                     </c>
                     <c ca="left">
                        <p>Sfi1 homolog, spindle assembly associated (yeast)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>13, 17, 72</p>
                     </c>
                     <c ca="left">
                        <p>218883_s_at</p>
                     </c>
                     <c ca="left">
                        <p>MLF1IP</p>
                     </c>
                     <c ca="left">
                        <p>MLF1 interacting protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>13</p>
                     </c>
                     <c ca="left">
                        <p>203332_s_at</p>
                     </c>
                     <c ca="left">
                        <p>INPP5D</p>
                     </c>
                     <c ca="left">
                        <p>inositol polyphosphate-5-phosphatase, 145kDa</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>14, 15</p>
                     </c>
                     <c ca="left">
                        <p>211584_s_at</p>
                     </c>
                     <c ca="left">
                        <p>NPAT</p>
                     </c>
                     <c ca="left">
                        <p>nuclear protein, ataxia-telangiectasia locus</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>14</p>
                     </c>
                     <c ca="left">
                        <p>219512_at</p>
                     </c>
                     <c ca="left">
                        <p>C20orf172</p>
                     </c>
                     <c ca="left">
                        <p>chromosome 20 open reading frame 172</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>15</p>
                     </c>
                     <c ca="left">
                        <p>221193_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ZCCHC10</p>
                     </c>
                     <c ca="left">
                        <p>zinc finger, CCHC domain containing 10</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>16</p>
                     </c>
                     <c ca="left">
                        <p>221521_s_at</p>
                     </c>
                     <c ca="left">
                        <p>GINS2</p>
                     </c>
                     <c ca="left">
                        <p>GINS complex subunit 2 (Psf2 homolog)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>16</p>
                     </c>
                     <c ca="left">
                        <p>209671_x_at</p>
                     </c>
                     <c ca="left">
                        <p>TRA@///TRAC</p>
                     </c>
                     <c ca="left">
                        <p>T cell receptor alpha locus///T cell receptor alpha locus</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>17</p>
                     </c>
                     <c ca="left">
                        <p>208952_s_at</p>
                     </c>
                     <c ca="left">
                        <p>LARP5</p>
                     </c>
                     <c ca="left">
                        <p>La ribonucleoprotein domain family, member 5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>18, 30</p>
                     </c>
                     <c ca="left">
                        <p>218726_at</p>
                     </c>
                     <c ca="left">
                        <p>DKFZp762E1312</p>
                     </c>
                     <c ca="left">
                        <p>hypothetical protein DKFZp762E1312</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>18, 51</p>
                     </c>
                     <c ca="left">
                        <p>211581_x_at</p>
                     </c>
                     <c ca="left">
                        <p>LST1</p>
                     </c>
                     <c ca="left">
                        <p>leukocyte specific transcript 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>19</p>
                     </c>
                     <c ca="left">
                        <p>221273_s_at</p>
                     </c>
                     <c ca="left">
                        <p>DKFZP761H1710</p>
                     </c>
                     <c ca="left">
                        <p>hypothetical protein DKFZp761H1710</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>20</p>
                     </c>
                     <c ca="left">
                        <p>205395_s_at</p>
                     </c>
                     <c ca="left">
                        <p>MRE11A</p>
                     </c>
                     <c ca="left">
                        <p>MRE11 meiotic recombination 11 homolog A (S. cerevisiae)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>20, 59</p>
                     </c>
                     <c ca="left">
                        <p>214973_x_at</p>
                     </c>
                     <c ca="left">
                        <p>IGHD</p>
                     </c>
                     <c ca="left">
                        <p>immunoglobulin heavy constant delta</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>21, 27</p>
                     </c>
                     <c ca="left">
                        <p>211881_x_at</p>
                     </c>
                     <c ca="left">
                        <p>IGLJ3</p>
                     </c>
                     <c ca="left">
                        <p>immunoglobulin lambda joining 3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22</p>
                     </c>
                     <c ca="left">
                        <p>202602_s_at</p>
                     </c>
                     <c ca="left">
                        <p>HTATSF1</p>
                     </c>
                     <c ca="left">
                        <p>HIV-1 Tat specific factor 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22</p>
                     </c>
                     <c ca="left">
                        <p>218143_s_at</p>
                     </c>
                     <c ca="left">
                        <p>SCAMP2</p>
                     </c>
                     <c ca="left">
                        <p>secretory carrier membrane protein 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>23</p>
                     </c>
                     <c ca="left">
                        <p>212911_at</p>
                     </c>
                     <c ca="left">
                        <p>DNAJC16</p>
                     </c>
                     <c ca="left">
                        <p>DnaJ (Hsp40) homolog, subfamily C, member 16</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>23</p>
                     </c>
                     <c ca="left">
                        <p>204817_at</p>
                     </c>
                     <c ca="left">
                        <p>ESPL1</p>
                     </c>
                     <c ca="left">
                        <p>extra spindle poles like 1 (S. cerevisiae)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>24</p>
                     </c>
                     <c ca="left">
                        <p>215783_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ALPL</p>
                     </c>
                     <c ca="left">
                        <p>alkaline phosphatase, liver/bone/kidney</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>25, 38, 39, 44, 52, 71</p>
                     </c>
                     <c ca="left">
                        <p>204825_at</p>
                     </c>
                     <c ca="left">
                        <p>MELK</p>
                     </c>
                     <c ca="left">
                        <p>maternal embryonic leucine zipper kinase</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>25</p>
                     </c>
                     <c ca="left">
                        <p>213689_x_at</p>
                     </c>
                     <c ca="left">
                        <p>RPL5</p>
                     </c>
                     <c ca="left">
                        <p>ribosomal protein L5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>26</p>
                     </c>
                     <c ca="left">
                        <p>206545_at</p>
                     </c>
                     <c ca="left">
                        <p>CD28</p>
                     </c>
                     <c ca="left">
                        <p>CD28 molecule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>27</p>
                     </c>
                     <c ca="left">
                        <p>206364_at</p>
                     </c>
                     <c ca="left">
                        <p>KIF14</p>
                     </c>
                     <c ca="left">
                        <p>kinesin family member 14</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>28, 60, 61</p>
                     </c>
                     <c ca="left">
                        <p>208079_s_at</p>
                     </c>
                     <c ca="left">
                        <p>AURKA</p>
                     </c>
                     <c ca="left">
                        <p>aurora kinase A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>28</p>
                     </c>
                     <c ca="left">
                        <p>214955_at</p>
                     </c>
                     <c ca="left">
                        <p>TMPRSS6</p>
                     </c>
                     <c ca="left">
                        <p>transmembrane protease, serine 6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>29</p>
                     </c>
                     <c ca="left">
                        <p>210966_x_at</p>
                     </c>
                     <c ca="left">
                        <p>LARP1</p>
                     </c>
                     <c ca="left">
                        <p>La ribonucleoprotein domain family, member 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>29</p>
                     </c>
                     <c ca="left">
                        <p>218830_at</p>
                     </c>
                     <c ca="left">
                        <p>RPL26L1</p>
                     </c>
                     <c ca="left">
                        <p>ribosomal protein L26-like 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>30</p>
                     </c>
                     <c ca="left">
                        <p>204498_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ADCY9</p>
                     </c>
                     <c ca="left">
                        <p>adenylate cyclase 9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>31</p>
                     </c>
                     <c ca="left">
                        <p>206211_at</p>
                     </c>
                     <c ca="left">
                        <p>SELE</p>
                     </c>
                     <c ca="left">
                        <p>selectin E (endothelial adhesion molecule 1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>32, 34, 69</p>
                     </c>
                     <c ca="left">
                        <p>201890_at</p>
                     </c>
                     <c ca="left">
                        <p>RRM2</p>
                     </c>
                     <c ca="left">
                        <p>ribonucleotide reductase M2 polypeptide</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>32</p>
                     </c>
                     <c ca="left">
                        <p>219298_at</p>
                     </c>
                     <c ca="left">
                        <p>ECHDC3</p>
                     </c>
                     <c ca="left">
                        <p>enoyl Coenzyme A hydratase domain containing 3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>33</p>
                     </c>
                     <c ca="left">
                        <p>204847_at</p>
                     </c>
                     <c ca="left">
                        <p>ZBTB11</p>
                     </c>
                     <c ca="left">
                        <p>zinc finger and BTB domain containing 11</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>35, 62</p>
                     </c>
                     <c ca="left">
                        <p>203214_x_at</p>
                     </c>
                     <c ca="left">
                        <p>CDC2</p>
                     </c>
                     <c ca="left">
                        <p>cell division cycle 2, G1 to S and G2 to M</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>36</p>
                     </c>
                     <c ca="left">
                        <p>204605_at</p>
                     </c>
                     <c ca="left">
                        <p>CGRRF1</p>
                     </c>
                     <c ca="left">
                        <p>cell growth regulator with ring finger domain 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>36</p>
                     </c>
                     <c ca="left">
                        <p>211251_x_at</p>
                     </c>
                     <c ca="left">
                        <p>NFYC</p>
                     </c>
                     <c ca="left">
                        <p>nuclear transcription factor Y, gamma</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>37, 65</p>
                     </c>
                     <c ca="left">
                        <p>213008_at</p>
                     </c>
                     <c ca="left">
                        <p>KIAA1794</p>
                     </c>
                     <c ca="left">
                        <p>KIAA1794</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>37, 73</p>
                     </c>
                     <c ca="left">
                        <p>210042_s_at</p>
                     </c>
                     <c ca="left">
                        <p>CTSZ</p>
                     </c>
                     <c ca="left">
                        <p>cathepsin Z</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>38</p>
                     </c>
                     <c ca="left">
                        <p>203595_s_at</p>
                     </c>
                     <c ca="left">
                        <p>IFIT5</p>
                     </c>
                     <c ca="left">
                        <p>interferon-induced protein with tetratricopeptide repeats 5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>40</p>
                     </c>
                     <c ca="left">
                        <p>221529_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PLVAP</p>
                     </c>
                     <c ca="left">
                        <p>plasmalemma vesicle associated protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>40</p>
                     </c>
                     <c ca="left">
                        <p>202114_at</p>
                     </c>
                     <c ca="left">
                        <p>SNX2</p>
                     </c>
                     <c ca="left">
                        <p>sorting nexin 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>41</p>
                     </c>
                     <c ca="left">
                        <p>211779_x_at</p>
                     </c>
                     <c ca="left">
                        <p>AP2A2</p>
                     </c>
                     <c ca="left">
                        <p>adaptor-related protein complex 2, alpha 2 subunit</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>41, 63</p>
                     </c>
                     <c ca="left">
                        <p>202324_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ACBD3</p>
                     </c>
                     <c ca="left">
                        <p>acyl-Coenzyme A binding domain containing 3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>42, 57</p>
                     </c>
                     <c ca="left">
                        <p>201821_s_at</p>
                     </c>
                     <c ca="left">
                        <p>TIMM17A</p>
                     </c>
                     <c ca="left">
                        <p>translocase of inner mitochondrial membrane 17 homolog A (yeast)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>42</p>
                     </c>
                     <c ca="left">
                        <p>201551_s_at</p>
                     </c>
                     <c ca="left">
                        <p>LAMP1</p>
                     </c>
                     <c ca="left">
                        <p>lysosomal-associated membrane protein 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>43</p>
                     </c>
                     <c ca="left">
                        <p>48808_at</p>
                     </c>
                     <c ca="left">
                        <p>DHFR</p>
                     </c>
                     <c ca="left">
                        <p>dihydrofolate reductase</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>44</p>
                     </c>
                     <c ca="left">
                        <p>211643_x_at</p>
                     </c>
                     <c ca="left">
                        <p>LOC651961</p>
                     </c>
                     <c ca="left">
                        <p>Myosin-reactive immunoglobulin light chain variable region</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>45</p>
                     </c>
                     <c ca="left">
                        <p>210396_s_at</p>
                     </c>
                     <c ca="left">
                        <p>LOC440354</p>
                     </c>
                     <c ca="left">
                        <p>PI-3-kinase-related kinase SMG-1 pseudogene</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>45</p>
                     </c>
                     <c ca="left">
                        <p>201070_x_at</p>
                     </c>
                     <c ca="left">
                        <p>SF3B1</p>
                     </c>
                     <c ca="left">
                        <p>splicing factor 3b, subunit 1, 155kDa</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>46</p>
                     </c>
                     <c ca="left">
                        <p>207391_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PIP5K1A</p>
                     </c>
                     <c ca="left">
                        <p>phosphatidylinositol-4-phosphate 5-kinase, type I, alpha</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>47</p>
                     </c>
                     <c ca="left">
                        <p>200800_s_at</p>
                     </c>
                     <c ca="left">
                        <p>HSPA1A</p>
                     </c>
                     <c ca="left">
                        <p>heat shock 70 kDa protein 1A</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>47</p>
                     </c>
                     <c ca="left">
                        <p>201009_s_at</p>
                     </c>
                     <c ca="left">
                        <p>TXNIP</p>
                     </c>
                     <c ca="left">
                        <p>thioredoxin interacting protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>48</p>
                     </c>
                     <c ca="left">
                        <p>203530_s_at</p>
                     </c>
                     <c ca="left">
                        <p>STX4</p>
                     </c>
                     <c ca="left">
                        <p>syntaxin 4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>48, 50</p>
                     </c>
                     <c ca="left">
                        <p>218085_at</p>
                     </c>
                     <c ca="left">
                        <p>CHMP5</p>
                     </c>
                     <c ca="left">
                        <p>chromatin modifying protein 5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>49, 68, 70</p>
                     </c>
                     <c ca="left">
                        <p>219555_s_at</p>
                     </c>
                     <c ca="left">
                        <p>C16orf60</p>
                     </c>
                     <c ca="left">
                        <p>chromosome 16 open reading frame 60</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>49</p>
                     </c>
                     <c ca="left">
                        <p>210419_at</p>
                     </c>
                     <c ca="left">
                        <p>BARX2</p>
                     </c>
                     <c ca="left">
                        <p>BarH-like homeobox 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>50</p>
                     </c>
                     <c ca="left">
                        <p>214119_s_at</p>
                     </c>
                     <c ca="left">
                        <p>FKBP1A</p>
                     </c>
                     <c ca="left">
                        <p>FK506 binding protein 1A, 12 kDa</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>51, 58</p>
                     </c>
                     <c ca="left">
                        <p>203362_s_at</p>
                     </c>
                     <c ca="left">
                        <p>MAD2L1</p>
                     </c>
                     <c ca="left">
                        <p>MAD2 mitotic arrest deficient-like 1 (yeast)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>52</p>
                     </c>
                     <c ca="left">
                        <p>218910_at</p>
                     </c>
                     <c ca="left">
                        <p>TMEM16K</p>
                     </c>
                     <c ca="left">
                        <p>transmembrane protein 16K</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>53</p>
                     </c>
                     <c ca="left">
                        <p>208838_at</p>
                     </c>
                     <c ca="left">
                        <p>KIAA0829</p>
                     </c>
                     <c ca="left">
                        <p>KIAA0829 protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>53</p>
                     </c>
                     <c ca="left">
                        <p>212081_x_at</p>
                     </c>
                     <c ca="left">
                        <p>BAT2</p>
                     </c>
                     <c ca="left">
                        <p>HLA-B associated transcript 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>54</p>
                     </c>
                     <c ca="left">
                        <p>202115_s_at</p>
                     </c>
                     <c ca="left">
                        <p>NOC2L</p>
                     </c>
                     <c ca="left">
                        <p>nucleolar complex associated 2 homolog (S. cerevisiae)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>55</p>
                     </c>
                     <c ca="left">
                        <p>209714_s_at</p>
                     </c>
                     <c ca="left">
                        <p>CDKN3</p>
                     </c>
                     <c ca="left">
                        <p>cyclin-dependent kinase inhibitor 3 (CDK2-associated dual specificity phosphatase)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>56</p>
                     </c>
                     <c ca="left">
                        <p>205701_at</p>
                     </c>
                     <c ca="left">
                        <p>IPO8</p>
                     </c>
                     <c ca="left">
                        <p>importin 8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>56</p>
                     </c>
                     <c ca="left">
                        <p>205063_at</p>
                     </c>
                     <c ca="left">
                        <p>SIP1</p>
                     </c>
                     <c ca="left">
                        <p>survival of motor neuron protein interacting protein 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>57</p>
                     </c>
                     <c ca="left">
                        <p>200918_s_at</p>
                     </c>
                     <c ca="left">
                        <p>SRPR</p>
                     </c>
                     <c ca="left">
                        <p>signal recognition particle receptor ('docking protein')</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>58</p>
                     </c>
                     <c ca="left">
                        <p>212527_at</p>
                     </c>
                     <c ca="left">
                        <p>D15Wsu75e</p>
                     </c>
                     <c ca="left">
                        <p>DNA segment, Chr 15, Wayne State University 75, expressed</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>59</p>
                     </c>
                     <c ca="left">
                        <p>204244_s_at</p>
                     </c>
                     <c ca="left">
                        <p>DBF4</p>
                     </c>
                     <c ca="left">
                        <p>DBF4 homolog (S. cerevisiae)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>60</p>
                     </c>
                     <c ca="left">
                        <p>214508_x_at</p>
                     </c>
                     <c ca="left">
                        <p>CREM</p>
                     </c>
                     <c ca="left">
                        <p>cAMP responsive element modulator</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>63</p>
                     </c>
                     <c ca="left">
                        <p>200787_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PEA15</p>
                     </c>
                     <c ca="left">
                        <p>phosphoprotein enriched in astrocytes 15</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>64</p>
                     </c>
                     <c ca="left">
                        <p>203764_at</p>
                     </c>
                     <c ca="left">
                        <p>DLG7</p>
                     </c>
                     <c ca="left">
                        <p>discs, large homolog 7 (Drosophila)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>64</p>
                     </c>
                     <c ca="left">
                        <p>205877_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ZC3H7B</p>
                     </c>
                     <c ca="left">
                        <p>zinc finger CCCH-type containing 7B</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>66</p>
                     </c>
                     <c ca="left">
                        <p>200848_at</p>
                     </c>
                     <c ca="left">
                        <p>AHCYL1</p>
                     </c>
                     <c ca="left">
                        <p>S-adenosylhomocysteine hydrolase-like 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>66</p>
                     </c>
                     <c ca="left">
                        <p>201091_s_at</p>
                     </c>
                     <c ca="left">
                        <p>CBX3</p>
                     </c>
                     <c ca="left">
                        <p>chromobox homolog 3 (HP1 gamma homolog, Drosophila)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>67</p>
                     </c>
                     <c ca="left">
                        <p>64064_at</p>
                     </c>
                     <c ca="left">
                        <p>GIMAP5</p>
                     </c>
                     <c ca="left">
                        <p>GTPase, IMAP family member 5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>68</p>
                     </c>
                     <c ca="left">
                        <p>211649_x_at</p>
                     </c>
                     <c ca="left">
                        <p>IGHG1</p>
                     </c>
                     <c ca="left">
                        <p>Immunoglobulin heavy constant gamma 1 (G1m marker)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>69</p>
                     </c>
                     <c ca="left">
                        <p>204398_s_at</p>
                     </c>
                     <c ca="left">
                        <p>EML2</p>
                     </c>
                     <c ca="left">
                        <p>echinoderm microtubule associated protein like 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>70</p>
                     </c>
                     <c ca="left">
                        <p>220433_at</p>
                     </c>
                     <c ca="left">
                        <p>PRRG3</p>
                     </c>
                     <c ca="left">
                        <p>proline rich Gla (G-carboxyglutamic acid) 3 (transmembrane)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>71</p>
                     </c>
                     <c ca="left">
                        <p>219169_s_at</p>
                     </c>
                     <c ca="left">
                        <p>TFB1M</p>
                     </c>
                     <c ca="left">
                        <p>transcription factor B1, mitochondrial</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>72</p>
                     </c>
                     <c ca="left">
                        <p>34689_at</p>
                     </c>
                     <c ca="left">
                        <p>TREX1</p>
                     </c>
                     <c ca="left">
                        <p>three prime repair exonuclease 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>74</p>
                     </c>
                     <c ca="left">
                        <p>212604_at</p>
                     </c>
                     <c ca="left">
                        <p>MRPS31</p>
                     </c>
                     <c ca="left">
                        <p>mitochondrial ribosomal protein S31</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>75</p>
                     </c>
                     <c ca="left">
                        <p>213907_at</p>
                     </c>
                     <c ca="left">
                        <p>EEF1E1</p>
                     </c>
                     <c ca="left">
                        <p>Eukaryotic translation elongation factor 1 epsilon 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>75</p>
                     </c>
                     <c ca="left">
                        <p>209622_at</p>
                     </c>
                     <c ca="left">
                        <p>STK16</p>
                     </c>
                     <c ca="left">
                        <p>serine/threonine kinase 16</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>76</p>
                     </c>
                     <c ca="left">
                        <p>209716_at</p>
                     </c>
                     <c ca="left">
                        <p>CSF1</p>
                     </c>
                     <c ca="left">
                        <p>colony stimulating factor 1 (macrophage)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>76</p>
                     </c>
                     <c ca="left">
                        <p>219575_s_at</p>
                     </c>
                     <c ca="left">
                        <p>PDF</p>
                     </c>
                     <c ca="left">
                        <p>peptide deformylase (mitochondrial)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>77</p>
                     </c>
                     <c ca="left">
                        <p>219328_at</p>
                     </c>
                     <c ca="left">
                        <p>DDX31</p>
                     </c>
                     <c ca="left">
                        <p>DEAD (Asp-Glu-Ala-Asp) box polypeptide 31</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>77</p>
                     </c>
                     <c ca="left">
                        <p>213121_at</p>
                     </c>
                     <c ca="left">
                        <p>SNRP70</p>
                     </c>
                     <c ca="left">
                        <p>small nuclear ribonucleoprotein 70 kDa polypeptide (RNP antigen)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>78</p>
                     </c>
                     <c ca="left">
                        <p>218870_at</p>
                     </c>
                     <c ca="left">
                        <p>ARHGAP15</p>
                     </c>
                     <c ca="left">
                        <p>Rho GTPase activating protein 15</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>78</p>
                     </c>
                     <c ca="left">
                        <p>219105_x_at</p>
                     </c>
                     <c ca="left">
                        <p>ORC6L</p>
                     </c>
                     <c ca="left">
                        <p>origin recognition complex, subunit 6 like (yeast)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>79</p>
                     </c>
                     <c ca="left">
                        <p>216510_x_at</p>
                     </c>
                     <c ca="left">
                        <p>IGHA1</p>
                     </c>
                     <c ca="left">
                        <p>immunoglobulin heavy constant alpha 1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>79</p>
                     </c>
                     <c ca="left">
                        <p>215207_x_at</p>
                     </c>
                     <c ca="left">
                        <p>YDD19</p>
                     </c>
                     <c ca="left">
                        <p>YDD19 protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>80</p>
                     </c>
                     <c ca="left">
                        <p>219918_s_at</p>
                     </c>
                     <c ca="left">
                        <p>ASPM</p>
                     </c>
                     <c ca="left">
                        <p>asp (abnormal spindle)-like, microcephaly associated (Drosophila)</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>In order to evaluate the reproducibility of the 112-gene signature, we repeat the same feature selection process with several re-samplings of 300 patients out of the 358 patients in the integrated data set. The average overlap is 39.0%. This is not surprising in view of the still modest sample size and the fact that most of the changes occur in the second half of the ranked list of gene pairs.</p>
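            <p>The re-sampling check described above can be sketched in a few lines. This is only an illustration: the text does not define how the overlap is measured, so the sketch assumes it is the percentage of the reference signature genes recovered by each re-sampled signature. The function names, the seed and the toy gene sets are hypothetical, and the feature selection step itself is not reproduced.</p>

```python
import random


def percent_overlap(reference_genes, resampled_genes):
    """Percentage of the reference signature genes found in a re-sampled signature."""
    reference = set(reference_genes)
    return 100.0 * len(reference.intersection(resampled_genes)) / len(reference)


def average_overlap(reference_genes, resampled_signatures):
    """Mean percentage overlap across all re-sampled signatures."""
    overlaps = [percent_overlap(reference_genes, s) for s in resampled_signatures]
    return sum(overlaps) / len(overlaps)


def draw_subsample(n_patients=358, n_keep=300, seed=0):
    """Indices of one re-sampling of n_keep patients out of n_patients, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_patients), n_keep))


# Toy illustration only: these gene sets are NOT results from the study.
reference = ["CDKN3", "MAD2L1", "ASPM", "DBF4"]
resampled = [{"CDKN3", "MAD2L1"}, {"CDKN3", "MAD2L1", "ASPM", "DBF4"}]
print(average_overlap(reference, resampled))  # 75.0
```

            <p>In practice each index set returned by the hypothetical draw_subsample helper would be passed to the same feature selection procedure used for the full data set, and the resulting gene lists compared with the original signature.</p>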
         </sec>
         <sec>
            <st>
               <p>Validation of the prognostic test on independent data</p>
            </st>
            <p>To validate the prognostic test, we compute its sensitivity and specificity on an independent set of samples, the Pawitan data set <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, which consists of 159 primary breast cancer patients. This test set includes patients with lymph-node-negative and lymph-node-positive tumors, some of whom had received adjuvant systemic therapy and some of whom had not. Following the practice in most of the literature, our objective is to predict the development of distant metastases within five years. Of the 159 patients, 35 developed distant metastases (relapse) within five years ("poor-outcome"), and 119 were free of distant metastases (no relapse) over a follow-up period of at least five years ("good-outcome"). Note that the definition of good-outcome for patients in the validating data differs from the definition in the training data because we have used extreme samples to identify the prognostic signature.</p>
            <p>Our prognostic test is the classical likelihood ratio test, determined by assuming that the features are conditionally independent under both classes, namely "poor outcome" (the null hypothesis) and "good outcome" (the alternative hypothesis); see 'Methods'. The LRT reduces to comparing a weighted average of the 80 features to a threshold. The weights depend on the statistics of the individual features under both classes and are estimated from the training data; the threshold is also estimated from the training set, using cross-validation. The LRT built from the prognostic signature achieves a sensitivity of 88.6% (31 out of the 35 poor-outcome samples) and a specificity of 54.6% (65 out of the 119 good-outcome samples) on the 154 samples included in the validating data set. The remaining five patients, who either developed distant metastases after five years or were free of distant metastases with a follow-up period of less than five years, are not included in the validating data set. We compute the odds ratio of the prognostic test for developing metastases within five years between the patients in the poor-outcome group and in the good-outcome group as determined by the prognostic test. The prognostic test has a high odds ratio of 9.3 (95% confidence interval: 3.1 &#8211; 28.1) with a Fisher's exact test <it>p</it>-value &lt; 0.00001. To make the results easier to understand, we have included in the additional files the heat maps of the two-group (good- and poor-outcome) supervised clusters of the integrated training data and test data for the 112 signature genes (see Additional files <supplr sid="S1">1</supplr> and <supplr sid="S2">2</supplr>).</p>
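            <p>As an illustration of why a likelihood ratio test with conditionally independent features reduces to a weighted sum compared with a threshold, the following sketch assumes binary (0/1) features. That assumption, the function names, the per-feature probabilities and the threshold are all hypothetical; the actual feature definitions and estimates are those given in 'Methods'.</p>

```python
import math


def lrt_weights(p_good, p_poor):
    """Per-feature weights and constant offset of the log likelihood ratio
    (good-outcome over poor-outcome) for independent binary features."""
    weights = [
        math.log((pg * (1.0 - pp)) / (pp * (1.0 - pg)))
        for pg, pp in zip(p_good, p_poor)
    ]
    offset = sum(math.log((1.0 - pg) / (1.0 - pp)) for pg, pp in zip(p_good, p_poor))
    return weights, offset


def log_likelihood_ratio(features, p_good, p_poor):
    """Weighted sum of the binary features plus a feature-independent constant."""
    weights, offset = lrt_weights(p_good, p_poor)
    return offset + sum(f * w for f, w in zip(features, weights))


def classify(features, p_good, p_poor, threshold):
    """Reject the null hypothesis (poor outcome) only if the ratio exceeds the threshold."""
    llr = log_likelihood_ratio(features, p_good, p_poor)
    return "good-outcome" if llr > threshold else "poor-outcome"


# Hypothetical probabilities P(feature = 1 | class) for three toy features.
p_good = [0.8, 0.7, 0.6]
p_poor = [0.3, 0.4, 0.5]
print(classify([1, 1, 1], p_good, p_poor, threshold=0.0))  # good-outcome
print(classify([0, 0, 0], p_good, p_poor, threshold=0.0))  # poor-outcome
```

            <p>Raising the threshold in this sketch makes the rule call fewer samples good-outcome, which trades specificity for sensitivity to poor outcomes.</p>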
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p>Clustering of the training data. Shown is the heat map of the two-group (good- and poor-outcome) supervised clusters of the integrated training data for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.</p>
               </text>
               <file name="1471-2105-9-125-S1.jpeg">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p>Clustering of the test data. Shown is the heat map of the two-group (good- and poor-outcome) supervised clusters of the test data (Pawitan) for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.</p>
               </text>
               <file name="1471-2105-9-125-S2.jpeg">
                  <p>Click here for file</p>
               </file>
            </suppl>
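            <p>The sensitivity, specificity and odds ratio quoted for the validation set can be re-derived from the reported counts (31 of 35 poor-outcome and 65 of 119 good-outcome samples correctly classified). The sketch below is a minimal check, not the authors' code. The text does not say how the confidence interval was obtained, so the normal approximation on the log odds ratio (Woolf's method) is an assumption, and the function name is hypothetical. The Fisher's exact test <it>p</it>-value is not recomputed.</p>

```python
import math


def confusion_metrics(tp, fn, fp, tn, z=1.96):
    """Sensitivity, specificity, odds ratio and an approximate 95% interval.

    tp: poor-outcome samples called poor;  fn: poor-outcome samples called good
    fp: good-outcome samples called poor;  tn: good-outcome samples called good
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    odds_ratio = (tp * tn) / (fn * fp)
    # Assumed interval method: standard error of the log odds ratio (Woolf).
    se_log_or = math.sqrt(1.0 / tp + 1.0 / fn + 1.0 / fp + 1.0 / tn)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "odds_ratio": odds_ratio,
        "ci": (lower, upper),
    }


# Counts taken from the validation results reported in the text.
m = confusion_metrics(tp=31, fn=35 - 31, fp=119 - 65, tn=65)
print(round(100 * m["sensitivity"], 1), round(100 * m["specificity"], 1))
print(round(m["odds_ratio"], 1), [round(x, 1) for x in m["ci"]])
```

            <p>Under these assumptions the sketch gives an odds ratio of about 9.3 with an interval of roughly 3.1 to 28.1, in line with the reported values.</p>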
            <p>It is noteworthy that performance of the LRT on the validation data is actually somewhat <it>better </it>than the performance on the training set (which is estimated by cross-validation). Specifically, from Figure <figr fid="F1">1</figr> (see also 'Methods'), the specificity of the LRT prognostic test is around 43% at approximately 90% sensitivity when estimated from the training data, whereas a specificity of approximately 55% at about the same sensitivity is achieved on the independent validation set.</p>
            <p>To obtain another useful estimate of the clinical outcome, we apply the LRT built from the prognostic signature to all of the 159 samples in the Pawitan data set and calculate the probability of remaining free of distant metastases according to the prognostic signature by using Kaplan-Meier analysis. The Kaplan-Meier curve of the prognostic signature shows a significant difference (<it>p</it>-value &lt; 0.001) in the probability of remaining free of distant metastases between the patients in the poor-outcome group and those in the good-outcome group (Figure <figr fid="F3">3A</figr>). The <it>p</it>-value is computed with the log-rank test. The Mantel-Cox estimate of the hazard ratio for distant metastases within five years in the poor-outcome group as compared to the good-outcome group is 9.3 (95% confidence interval: 2.9 &#8211; 29.9, <it>p</it>-value &lt; 0.001).</p>
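            <p>For readers unfamiliar with the Kaplan-Meier estimate, the sketch below shows the product-limit calculation on a small made-up data set. The function name and the toy follow-up times are hypothetical and are not taken from the Pawitan data. Patients censored at an event time are treated as still at risk at that time, a common convention that is assumed here. The log-rank test and the hazard ratio are not reproduced.</p>

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the probability of remaining event-free.

    times:  follow-up time of each patient
    events: 1 if the event (e.g. distant metastasis) was observed, 0 if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    records = sorted(zip(times, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while len(records) > i:
        t = records[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at time t.
        while len(records) > i and records[i][0] == t:
            n_events += records[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Toy data (years of follow-up, event indicator); purely illustrative.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 2))
```

            <p>Computing one such curve per predicted group and plotting them together gives the kind of comparison shown in Figure <figr fid="F3">3</figr>.</p>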
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>The Kaplan-Meier analysis</p>
               </caption>
               <text>
                  <p><b>The Kaplan-Meier analysis</b>. Kaplan-Meier analysis of the probability of remaining free of distant metastases among 159 Pawitan patients between the good-outcome group and the poor-outcome group. The LRT is based on the integrated data in (A) and the single, Wang data set in (B). CI denotes confidence interval and the <it>p</it>-value is calculated by the log-rank test.</p>
               </text>
               <graphic file="1471-2105-9-125-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Comparison of the prognostic signature to study-specific signatures</p>
            </st>
            <p>To evaluate the potential statistical power gained by integrating multiple data sets to increase diversity and sample size, we compare the predictive power of our integrated prognostic signature with each of the three separate study-specific prognostic signatures identified from the three data sets in Table <tblr tid="T1">1</tblr>. We use exactly the same method we used for the integrated data and each of the resulting three prognostic tests is applied to the same independent test data, the Pawitan data. The results are reported in Table <tblr tid="T3">3</tblr>. In the case of the Sotiriou data, we do not achieve the targeted sensitivity of at least ninety percent due to the very small sample size; the estimate of the threshold in the LRT does not generalize to the Pawitan test set. For the Miller and Wang data sets, the desired sensitivity is achieved but the specificity is far lower than for the classifier learned from the integrated data set.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Test results on Pawitan data (154 patients)</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Training Data</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>No. of Patients</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Sensitivity (%)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Specificity (%)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sotiriou</p>
                     </c>
                     <c ca="left">
                        <p>43</p>
                     </c>
                     <c ca="left">
                        <p>51.4</p>
                     </c>
                     <c ca="left">
                        <p>47.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Miller</p>
                     </c>
                     <c ca="left">
                        <p>106</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                     <c ca="left">
                        <p>15.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wang</p>
                     </c>
                     <c ca="left">
                        <p>209</p>
                     </c>
                     <c ca="left">
                        <p>94.3</p>
                     </c>
                     <c ca="left">
                        <p>10.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Integrated</p>
                     </c>
                     <c ca="left">
                        <p>358</p>
                     </c>
                     <c ca="left">
                        <p>88.6</p>
                     </c>
                     <c ca="left">
                        <p>54.6</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The Wang data set is the largest. Using 40-fold cross-validation, the optimal number of gene pairs for the prognostic signature is <it>m</it><sub><it>opt </it></sub>= 60. The 94.3% sensitivity on the test set (33 out of the 35 poor-outcome samples) is close to the target of 90%. The specificity of the classifier is 10.1% (12 out of the 119 good-outcome samples), substantially lower than that of the classifier based on the integrated training set, albeit at somewhat higher sensitivity. (Indeed, the performance of the prognostic LRT based on the Wang data alone is barely better than the completely randomized, data-independent procedure which chooses poor-outcome with probability 0.9 and good-outcome with probability 0.1, independently from sample to sample.) The odds ratio of this test is 1.9 (95% confidence interval: 0.4 &#8211; 8.7, Fisher's exact test <it>p</it>-value = 0.74), and the Kaplan-Meier curve (Figure <figr fid="F3">3B</figr>) shows a less significant difference between the patients in the poor-outcome and good-outcome groups than that of the signature from the integrated data. Finally, the estimated hazard ratio of 1.6 (95% confidence interval: 0.4 &#8211; 6.8, 0.01 &lt; <it>p</it>-value &lt; 0.05) is much lower than that of the prognostic test from the integrated data.</p>
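            <p>The comparison with a randomized procedure can be made concrete with a short simulation. The sketch below assigns "poor-outcome" with probability 0.9 to each of the 35 poor-outcome and 119 good-outcome test samples, ignoring the expression data entirely, and averages the resulting sensitivity and specificity over many repetitions. The function name, number of trials and seed are arbitrary choices made for illustration.</p>

```python
import random


def simulate_random_rule(n_poor=35, n_good=119, p_call_poor=0.9,
                         n_trials=2000, seed=0):
    """Average sensitivity and specificity of a data-independent random rule."""
    rng = random.Random(seed)
    sens_total = 0.0
    spec_total = 0.0
    for _ in range(n_trials):
        # Poor-outcome samples that the random rule happens to call poor.
        hits = sum(1 for i in range(n_poor) if p_call_poor > rng.random())
        # Good-outcome samples that the random rule happens to call good.
        correct_good = sum(1 for j in range(n_good) if rng.random() >= p_call_poor)
        sens_total += hits / n_poor
        spec_total += correct_good / n_good
    return sens_total / n_trials, spec_total / n_trials


mean_sens, mean_spec = simulate_random_rule()
print(round(mean_sens, 2), round(mean_spec, 2))
```

            <p>The averages settle near 90% sensitivity and 10% specificity, which is the reference point against which the 94.3% and 10.1% obtained from the Wang data alone are being judged.</p>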
            <p>These comparisons demonstrate that the prognostic test derived from the integrated data is superior to the prognostic test derived from any of the individual studies and highlight the value of data integration. By integrating several microarray data sets with our rank-based methods, study-specific effects are reduced and more features of breast cancer prognosis are captured.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Using a rank-based method for feature selection, we integrate three independent microarray gene expression data sets of extreme samples and identify a 112-gene breast cancer prognostic signature. The signature is invariant to standard within-array preprocessing and data normalization. All of the patients in the integrated training set had lymph-node-negative tumors and had not received adjuvant systemic treatment, so the identification of the prognostic signature is not subject to potential confounding factors related to lymph node status or systemic treatment. An LRT constructed from the prognostic signature is used to predict whether a breast cancer patient will develop distant metastases within five years after initial treatment. This prognostic test achieves a sensitivity of 88.6% and a specificity of 54.6% on an independent test data set of 154 samples. The test set includes patients who had and who had not received adjuvant systemic treatment, and those with both lymph-node-negative and lymph-node-positive tumors, indicating that our prognostic signature could possibly be applied to all breast cancer patients independently of age, tumor size, tumor grade, lymph node status, and systemic treatment. It should be pointed out that, somewhat paradoxically, one reason for this ability to generalize is that, as with all machine learning methods, the feature selection process is not guided by specific biological knowledge about the underlying processes and pathways.</p>
         <p>One motivation for using the LRT is simplicity: under the assumption of independent features, the test statistic is a weighted average of the feature values and the test itself reduces to comparing this average to a fixed threshold. Another motivation stems from the Neyman-Pearson lemma of statistical hypothesis testing <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, which states that the LRT achieves optimal specificity at any given level of sensitivity. However, we cannot claim optimal specificity (at roughly ninety percent sensitivity) for our prognostic test since our LRT is constructed by assuming the 80 binary comparison features are statistically independent in each class, which is likely to be violated in practice due to correlations among the genes and genes appearing in multiple pairs. But this approach does offer a rigorous statistical framework for constructing prognostic tests at a given sensitivity. It also provides a direction towards more powerful procedures. Evidently, increasingly better approximations to the "true" LRT, and hence to optimal specificity, would be obtained by accounting for more and more of the dependency structure among the features. Indeed, accounting for pair-wise correlations alone would be a significant step in this direction.</p>
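         <p>The reduction of the LRT to a thresholded weighted sum can be illustrated with a minimal sketch. It assumes conditionally independent binary features with <it>p</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 1) and <it>q</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 2) as defined in the Methods; the probabilities, the threshold, and the function names below are hypothetical and chosen only for illustration.</p>
         <p>
```python
import math


def log_likelihood_ratio(z, p, q):
    """Log of P(z | Y=1) / P(z | Y=2) for independent binary features."""
    total = 0.0
    for z_m, p_m, q_m in zip(z, p, q):
        if z_m == 1:
            total += math.log(p_m / q_m)
        else:
            total += math.log((1.0 - p_m) / (1.0 - q_m))
    return total


def linear_form(p, q):
    """Weights w_m and offset c such that log LR(z) = sum(w_m * z_m) + c."""
    weights = [math.log(p_m * (1.0 - q_m) / (q_m * (1.0 - p_m)))
               for p_m, q_m in zip(p, q)]
    offset = sum(math.log((1.0 - p_m) / (1.0 - q_m)) for p_m, q_m in zip(p, q))
    return weights, offset


def classify(z, p, q, log_t):
    # Returns 1 (poor outcome) when the log likelihood ratio exceeds the
    # log threshold, and 2 (good outcome) otherwise.
    return 1 if log_likelihood_ratio(z, p, q) > log_t else 2


# Hypothetical probabilities for three features, each with p_m > 1/2 > q_m.
p = [0.8, 0.7, 0.6]
q = [0.3, 0.4, 0.2]

weights, offset = linear_form(p, q)
z = [1, 0, 1]
weighted_sum = sum(w * z_m for w, z_m in zip(weights, z)) + offset

print(classify([1, 1, 1], p, q, log_t=0.0))
print(classify([0, 0, 0], p, q, log_t=0.0))
print(math.isclose(weighted_sum, log_likelihood_ratio(z, p, q)))
```
         </p>
         <p>Working on the log scale turns the product of per-feature ratios into a sum, so each feature contributes a fixed weight when it is observed to be 1. Accounting for dependencies among features, as discussed above, would require replacing this product form with a richer model.</p>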
         <p>Comparison with the conventional treatment guidelines (e.g. St. Gallen and NIH) is instructive. While maintaining almost the same level of sensitivity (~90%), our prognostic test achieves a specificity which is well above the 10&#8211;30% range of the St. Gallen and NIH targets. This means that our test can spare a significant number of good-outcome patients from unnecessary adjuvant therapy, while ensuring roughly the same percentage of poor-outcome patients receive adjuvant therapy as recommended by the treatment guidelines. Therefore, our prognostic test and signature, if further validated on large-scale independent data, could potentially provide a useful means of guiding adjuvant systemic treatment, reducing cost and improving the quality of patients' lives.</p>
         <p>Other strengths of our study, compared with previous ones, are the larger number of homogeneous patients (lymph-node-negative tumors without adjuvant systemic treatment) in the training set, and an external independent test set. In each of the two major breast cancer prognostic studies <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>, the training and validation data are extracted from the same study group from the same population. More specifically, the entire data set is randomly divided into two pieces, one serving as a training set and the other as a validation or test set. In this case, the training data and the validation data are likely to have similar properties. Therefore, the study-specific prognostic test identified from the training data usually gives over-promising results when assessed using the "internal" validation data. (Similar remarks apply to methods which measure performance using cross validation.) This argument may explain why the two major prognostic signatures, although validated internally with about 90% sensitivity and about 50% specificity, cannot be validated externally with an independent data set <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. In addition, splitting the original data set into two pieces only aggravates the small-sample problem, as well as producing other sources of bias <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. In our study, we increase diversity and sample size by integrating several microarray data sets involving patients from different populations. By selecting a homogeneous subgroup of patients and combining data from multiple studies, the derived prognostic test and signature are less sensitive to study-specific factors. An intriguing advantage of inter-study data integration is that it increases the statistical power to capture essential prognostic features which might be masked by study-specific features and the small sample sizes of individual data sets. In this sense, our prognostic test is more robust to inter-study variability and may facilitate external validation.</p>
         <p>Comparison of our prognostic signature with the two major signatures of van't Veer <it>et al</it>. and Wang <it>et al</it>. is not straightforward because of differences in patients, microarray platforms, and algorithms. The study of van't Veer <it>et al</it>. uses an Agilent array platform and our study uses an Affymetrix array platform. Only 46 out of the 112 genes in our prognostic signature are present on the Agilent Hu25K array and only 36 of the 70 genes in the van't Veer signature are present on the Affymetrix HG-U133A array. Therefore, we can neither validate the van't Veer prognostic test on our validation data nor validate our test on their data set. There is a three-gene overlap between the van't Veer signature and our signature (CCNE2, ORC6L, and PRC1). Since the data set in Wang <it>et al</it>. is included in our training set, we cannot validate our test on that data set. On the other hand, in order to validate the test proposed by Wang <it>et al</it>., we need to know the estrogen receptor (ER) status of our test samples because the classification rule based on their signature depends on ER status, which is absent from our validation data. There is also a four-gene overlap between the Wang signature and our signature (AP2A2, CBX3, CCNE2, and MLF1IP). It is noteworthy that the gene CCNE2 is included in all of the three signatures and is reported to be related to breast cancer <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. CCNE2 could be a potential target for the rational development of new cancer drugs.</p>
         <p>Using the program DAVID <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, according to the gene ontology biological process categories, the 112-gene signature is highly enriched in cell cycle (<it>P</it>-value = 1.45E-5) and cell division (<it>P</it>-value = 5.9E-4). To pinpoint the role of some of the genes in our signature, the cell cycle pathway is displayed in the additional files with our signature genes shown in red (see Additional file <supplr sid="S3">3</supplr>). These findings demonstrate that deregulation of these pathways has a direct impact on tumor progression and indicate that the 112-gene signature is biologically relevant.</p>
         <suppl id="S3">
            <title>
               <p>Additional file 3</p>
            </title>
            <text>
               <p>The cell cycle pathway. Our signature genes which appear in the cell cycle pathway are shown in red.</p>
            </text>
            <file name="1471-2105-9-125-S3.jpeg">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>To assess the benefit of data integration, we compared the predictive power of our signature with that of three study-specific signatures identified from the Sotiriou, Miller and Wang data sets using the same LRT procedure. When applied to the same independent test data, our prognostic test consistently outperforms the study-specific tests, and the test from the largest study (Wang) in particular, in terms of specificity (54.6% vs. 10.1%) at roughly the same 90% sensitivity level, odds ratio (9.3 vs. 1.9), hazard ratio (9.3 vs. 1.6), and Kaplan-Meier analysis. These findings again suggest that a prognostic test derived from a single data set may be overly tailored to that study and may perform poorly on external data. In contrast, a prognostic test derived from integrated data is likely to be more robust to study-specific factors and to be validated satisfactorily on external data.</p>
         <p>Recently, some studies have shown that combining gene expression data and conventional clinical data (e.g. tumor size, grade, ER status) could lead to improved breast cancer prognosis <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>. An approach based on solid statistical principles can accommodate aggregating data of multiple types, e.g., combining gene expression signatures with traditional clinical factors. In this study, due to the lack of clinical information for some of the training samples, we could not incorporate such information into the development of our prognostic test. As clinical information becomes publicly available, it might be combined with the integrated gene expression data to further improve prognosis.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The opinion expressed in recent studies that gene expression information can be useful in breast cancer prognosis seems to be well-founded. However, due to the small sample sizes relative to the complexity of the entire expression profile, existing methods suffer certain limitations, namely the prevalence of study-specific signatures and difficulties in validating the prognostic tests constructed from these signatures on independent data. Integrating data from multiple studies to obtain more samples appears to be a promising way to overcome these limitations.</p>
         <p>We have integrated several gene expression data sets and developed a likelihood ratio test for predicting distant metastases that correctly signals a poor outcome in approximately ninety percent of test cases while maintaining about fifty-five percent specificity for good outcome patients. This well exceeds the St. Gallen and NIH guidelines and compares favorably with the best results previously reported (although not yet validated on external test data). As more and more gene (and protein) expression data is generated and made publicly available, modeling the interactions among genes (and gene products) will become increasingly feasible, and is likely to be crucial in designing prognostic tests which achieve high sensitivity without sacrificing specificity.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Data integration</p>
            </st>
            <p>Recently, our group has developed a family of statistical molecular classification methods based on relative expression reversals <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B33">33</abbr></abbrgrp>, and applied one variant based on a two-gene classifier to microarray data integration <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. These methods use only the ranks of gene expression values within each profile and have performed well in both molecular classification and microarray data integration. An important feature of rank-based methods is that they are invariant to order-preserving (monotonically increasing) transformations of the expression data within an array, such as those used in most array normalization and other pre-processing methods. This property makes these methods especially useful for combining data from separate studies: because the primary features extracted from the data are comparisons of mRNA concentration between pairs of genes, there is no need to standardize the data before aggregation. Consequently, we directly merge the gene expression data of the patients from the three training data sets in Table <tblr tid="T1">1</tblr>, using the 22283 probe sets on the Affymetrix HG-U133A microarray, to form an integrated training data set of 358 samples. After aggregation, we extract a list of pair-wise comparisons; each of these "features" corresponds to a pair of genes and is assigned the value zero or one depending on the observed ordering of expressions; see the following section. The number of features retained is much smaller than the number of genes on the array. The procedure is now described in more detail.</p>
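            <p>The invariance property can be illustrated with a small sketch. The intensities, the gene pairs, and the two transformations below are hypothetical; they stand in for any strictly increasing within-array transformation and are not taken from the data sets used in this study.</p>
            <p>
```python
import math


def comparisons(profile, pairs):
    """Binary features: 1 if gene i is expressed below gene j, else 0."""
    return [1 if profile[j] > profile[i] else 0 for i, j in pairs]


# Hypothetical raw intensities for four probe sets on one array.
raw = [120.0, 35.5, 980.0, 410.0]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Two strictly increasing transformations applied within the array:
# a log2 transform, and a positive rescaling with an additive shift.
log_scaled = [math.log2(v) for v in raw]
rescaled = [0.37 * v + 5.0 for v in raw]

features_raw = comparisons(raw, pairs)
same = (features_raw == comparisons(log_scaled, pairs)
        and features_raw == comparisons(rescaled, pairs))

print(features_raw, same)
```
            </p>
            <p>Because the pairwise orderings are unchanged, profiles pre-processed differently in separate studies yield the same binary features, provided each pre-processing step preserves the order of values within an array.</p>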
         </sec>
         <sec>
            <st>
               <p>Feature selection and transformation</p>
            </st>
            <p>Consider <it>G </it>genes whose expression values <b><it>X </it></b>= {<it>X</it><sub>1</sub>, <it>X</it><sub>2</sub>, ..., <it>X</it><sub><it>G</it></sub>} are measured using a DNA microarray and regarded as random variables. The class label <it>Y </it>for each profile <b><it>X </it></b>is a discrete random variable taking on one of two possible values corresponding to the two prognostic states or hypotheses of interest, namely "poor-outcome," denoted <it>Y </it>= 1, and "good-outcome," denoted <it>Y </it>= 2. The integrated training microarray data represent the observed values of <b><it>X </it></b>and comprise a <it>G </it>&#215; <it>N </it>matrix <b><it>x </it></b>= [<it>x</it><sub><it>gn</it></sub>], <it>g </it>= 1, 2, ..., <it>G </it>and <it>n </it>= 1, 2, ..., <it>N</it>, where <it>G </it>is the number of genes in each profile and <it>N </it>is the number of samples (profiles) in the integrated data set. Each column <it>n </it>represents a gene expression profile of <it>G </it>genes with a class label <it>y</it><sub><it>n </it></sub>= 1 (poor-outcome) or <it>y</it><sub><it>n </it></sub>= 2 (good-outcome) for the two-class problem in our study. Among the <it>N </it>samples, there are <it>N</it><sub>1 </sub>(respectively, <it>N</it><sub>2</sub>) samples labeled as class 1 (respectively, class 2) with <it>N </it>= <it>N</it><sub>1 </sub>+ <it>N</it><sub>2</sub>.</p>
            <p>For each pair of genes (<it>i</it>, <it>j</it>), where <it>i</it>, <it>j </it>= 1, 2, ..., <it>G</it>, <it>i </it>&#8800; <it>j</it>, let <it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y </it>= <it>k</it>), <it>k </it>= 1,2, denote the conditional probability of the event {<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub><it>} </it>given <it>Y </it>= <it>k</it>. We define a score by</p>
            <p>
               <display-formula id="M1"><it>&#916;</it><sub><it>ij </it></sub>= |<it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y </it>= 1) - <it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y </it>= 2)|</display-formula>
            </p>
            <p>and estimate the score of pair (<it>i</it>, <it>j</it>) based on the training set <b><it>x </it></b>by</p>
            <p>
               <display-formula id="M2">
                  <m:math name="1471-2105-9-125-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>&#916;</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>&#8776;</m:mo>
                           <m:mrow>
                              <m:mo>|</m:mo>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:msubsup>
                                          <m:mi>N</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:msubsup>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>N</m:mi>
                                          <m:mn>1</m:mn>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:msubsup>
                                          <m:mi>N</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mn>2</m:mn>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:msubsup>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>N</m:mi>
                                          <m:mn>2</m:mn>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                              <m:mo>|</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeuiLdq0aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqGHijYUdaabdaqaaKqbaoaalaaabaGaemOta40aa0baaeaacqWGPbqAcqWGQbGAaeaacqGGOaakcqaIXaqmcqGGPaqkaaaabaGaemOta40aaSbaaeaacqaIXaqmaeqaaaaakiabgkHiTKqbaoaalaaabaGaemOta40aa0baaeaacqWGPbqAcqWGQbGAaeaacqGGOaakcqaIYaGmcqGGPaqkaaaabaGaemOta40aaSbaaeaacqaIYaGmaeqaaaaaaOGaay5bSlaawIa7aaaa@4937@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-125-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:msubsup>
                                          <m:mi>N</m:mi>
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>j</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:msubsup>
                                       <m:mo>=</m:mo>
                                       <m:mrow>
                                          <m:mo>|</m:mo>
                                          <m:mrow>
                                             <m:mo>{</m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>:</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>N</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mrow>
                                                   <m:mi>i</m:mi>
                                                   <m:mi>n</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                             <m:mo>&lt;</m:mo>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mrow>
                                                   <m:mi>j</m:mi>
                                                   <m:mi>n</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                             <m:mo>,</m:mo>
                                             <m:msub>
                                                <m:mi>y</m:mi>
                                                <m:mi>n</m:mi>
                                             </m:msub>
                                             <m:mo>=</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>}</m:mo>
                                          </m:mrow>
                                          <m:mo>|</m:mo>
                                       </m:mrow>
                                       <m:mo>,</m:mo>
                                    </m:mrow>
                                 </m:mtd>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>k</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:mn>2</m:mn>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                           <m:mo>.</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiabd6eaonaaDaaaleaacqWGPbqAcqWGQbGAaeaacqGGOaakcqWGRbWAcqGGPaqkaaGccqGH9aqpdaabdaqaaiabcUha7jabd6gaUjabcQda6iabigdaXiabgsMiJkabd6gaUjabgsMiJkabd6eaojabcYcaSiabdIha4naaBaaaleaacqWGPbqAcqWGUbGBaeqaaOGaeyipaWJaemiEaG3aaSbaaSqaaiabdQgaQjabd6gaUbqabaGccqGGSaalcqWG5bqEdaWgaaWcbaGaemOBa4gabeaakiabg2da9iabdUgaRjabc2ha9bGaay5bSlaawIa7aiabcYcaSaqaaiabdUgaRjabg2da9iabigdaXiabcYcaSiabikdaYaaacqGGUaGlaaa@5BCA@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>In other words, the estimated score is simply the absolute difference between the fraction of poor-outcome patients for which gene <it>i </it>is expressed less than gene <it>j </it>and the same fraction in the good-outcome examples. The feature selection procedure consists of forming a list of gene pairs, sorted from the largest to the smallest according to their scores <it>&#916;</it><sub><it>ij</it></sub>, and selecting the top <it>M </it>pairs. The <it>M </it>top-ranked gene pairs are then considered to be the most discriminating candidate gene pairs for breast cancer prognosis if only relative expressions are taken into account. During the process, we have transformed the original feature vector <b><it>X </it></b>= {<it>X</it><sub>1</sub>, <it>X</it><sub>2</sub>, ..., <it>X</it><sub><it>G</it></sub><it>} </it>(<it>G </it>= 22283 in this study), each of which assumes scalar values, to a new ordered feature vector <b><it>Z </it></b>= {<it>Z</it><sub>1</sub>, <it>Z</it><sub>2</sub>, ..., <it>Z</it><sub><it>M</it></sub><it>} </it>(usually, <it>M </it>&lt;&lt; <it>G</it>), each of which assumes only two values.</p>
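            <p>The scoring and ranking step can be sketched as follows. This is a minimal illustration on a hypothetical toy matrix of three genes and four samples, not the implementation used in the study; for simplicity it scores each unordered pair once, and the function names are illustrative.</p>
            <p>
```python
from itertools import combinations


def pair_scores(x, y):
    """x[g][n]: expression of gene g in sample n; y[n]: label 1 or 2.

    Returns a dict mapping each gene pair (i, j), i before j, to the
    estimated score |N_ij(1)/N_1 - N_ij(2)/N_2|.
    """
    n1 = sum(1 for label in y if label == 1)
    n2 = sum(1 for label in y if label == 2)
    scores = {}
    for i, j in combinations(range(len(x)), 2):
        # Count samples of each class in which gene i is below gene j.
        c1 = sum(1 for n, label in enumerate(y)
                 if label == 1 and x[j][n] > x[i][n])
        c2 = sum(1 for n, label in enumerate(y)
                 if label == 2 and x[j][n] > x[i][n])
        scores[(i, j)] = abs(c1 / n1 - c2 / n2)
    return scores


def top_pairs(scores, m):
    # Sort pairs from the largest to the smallest score and keep the first m.
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hypothetical toy data: rows are genes, columns are samples.
x = [[1, 2, 5, 6],
     [3, 4, 1, 2],
     [2, 1, 6, 1]]
y = [1, 1, 2, 2]   # 1 = poor outcome, 2 = good outcome

scores = pair_scores(x, y)
selected = top_pairs(scores, 2)
print(scores)
print(selected)
```
            </p>
            <p>With <it>G </it>= 22283 probe sets the number of pairs is on the order of 10<sup>8</sup>, so a practical implementation would need a vectorized or otherwise optimized version of this double loop.</p>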
            <p>Suppose <it>Z</it><sub><it>m</it></sub>, <it>m </it>= 1, 2, ..., <it>M</it>, corresponds to the gene pair {<it>i</it>, <it>j</it>}. For convenience, the ordering (<it>i</it>, <it>j</it>) will signify which probability in Equation (1) is larger. The reason for this is to facilitate interpretation of the results, as will become apparent. If <it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y = 1</it>) &#8805; <it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y = 2</it>), as estimated by the fractions in (2), we will write (<it>i</it>, <it>j</it>) and if <it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y = 1</it>) &lt;<it>P</it>(<it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j</it></sub>|<it>Y = 2</it>) we will denote the pair by (<it>j</it>, <it>i</it>). The value assumed by <it>Z</it><sub><it>m </it></sub>is then set to be 1 if we observe <it>X</it><sub><it>i </it></sub>&lt;<it>X</it><sub><it>j </it></sub>and set to 0 otherwise, i.e., if we observe <it>X</it><sub><it>i </it></sub>&#8805; <it>X</it><sub><it>j</it></sub>. Of course the same definition is applied to each feature in the training set. In this way, observing <it>Z</it><sub><it>m </it></sub>= 1 (resp., <it>Z</it><sub><it>m </it></sub>= 0) represents an indicator of the poor outcome (resp., good outcome) class in the sense that <it>p</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 1) &#8805; <it>q</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 2). For all the features selected in our signature we in fact have <it>p</it><sub><it>m </it></sub>> 1/2 > <it>q</it><sub><it>m</it></sub>.</p>
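            <p>The orientation convention and the resulting binary features can be sketched in the same way. The toy data and function names below are hypothetical; the sketch only assumes the rule stated above, namely that each pair is ordered so that observing the first gene below the second is at least as frequent in the poor-outcome class as in the good-outcome class.</p>
            <p>
```python
def fraction_below(x, y, a, b, k):
    # Fraction of class-k samples in which gene a is expressed below gene b.
    members = [n for n, label in enumerate(y) if label == k]
    return sum(1 for n in members if x[b][n] > x[a][n]) / len(members)


def orient_pair(x, y, i, j):
    """Order the pair so that 'first below second' is at least as frequent
    in class 1 (poor outcome) as in class 2 (good outcome)."""
    if fraction_below(x, y, i, j, 1) >= fraction_below(x, y, i, j, 2):
        return (i, j)
    return (j, i)


def binarize(profile, oriented_pairs):
    # Z_m = 1 if the first gene of the pair is expressed below the second.
    return [1 if profile[b] > profile[a] else 0 for a, b in oriented_pairs]


# Hypothetical toy data: rows are genes, columns are samples.
x = [[1, 2, 5, 6],
     [3, 4, 1, 2],
     [2, 1, 6, 1]]
y = [1, 1, 2, 2]   # 1 = poor outcome, 2 = good outcome

oriented = [orient_pair(x, y, 0, 1), orient_pair(x, y, 1, 2)]

# Profiles of the first (poor-outcome) and third (good-outcome) samples.
poor_profile = [row[0] for row in x]
good_profile = [row[2] for row in x]

print(oriented)
print(binarize(poor_profile, oriented), binarize(good_profile, oriented))
```
            </p>
            <p>After orientation, a value of 1 for any feature points towards the poor-outcome class, which is what makes the signs of the weights in the likelihood ratio test easy to interpret.</p>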
            <p>After this procedure, the original <it>G </it>&#215; <it>N </it>data matrix is reduced to an <it>M </it>&#215; <it>N </it>data matrix. The number of distinct genes in a prognostic signature is at most <it>2M</it>. In practice, there are always more than <it>M </it>distinct genes among the top <it>M </it>gene pairs. Given that the numbers of genes in published breast cancer prognostic signatures are mostly fewer than 100, we fix <it>M </it>= 200 in this study to make sure we can identify a prognostic feature signature based on a reasonable number of genes.</p>
         </sec>
         <sec>
            <st>
               <p>Likelihood ratio test</p>
            </st>
            <p>The classical likelihood ratio test (LRT) is a statistical procedure for distinguishing between two hypotheses, each constraining the distribution of a random vector <b><it>Z </it></b>= {<it>Z</it><sub>1</sub>, <it>Z</it><sub>2</sub>, ..., <it>Z</it><sub><it>M</it></sub>}. In our case the variables <it>Z</it><sub><it>m </it></sub>are the simple, binary functions of the gene expression profile defined above.</p>
            <p>The LRT is based on the likelihood ratio</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-125-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>L</m:mi>
                           <m:mi>R</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>z</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>z</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>Y</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>z</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>Y</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>2</m:mn>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemitaWKaemOuaiLaeiikaGIaemOEaONaeiykaKIaeyypa0tcfa4aaSaaaeaacqWGqbaucqGGOaakcqWG6bGEcqGG8baFcqWGzbqwcqGH9aqpcqaIXaqmcqGGPaqkaeaacqWGqbaucqGGOaakcqWG6bGEcqGG8baFcqWGzbqwcqGH9aqpcqaIYaGmcqGGPaqkaaaaaa@4556@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <b><it>z </it></b>= {<it>z</it><sub>1</sub>, <it>z</it><sub>2</sub>, ..., <it>z</it><sub><it>M</it></sub><it>} </it>are the observed values in a new sample. Notice that <b><it>z </it></b>assumes values in {0, 1}<sup><it>M</it></sup>, the set of binary strings of length <it>M</it>. The LRT chooses hypothesis <it>Y </it>= 1 if <it>LR</it>(<b><it>z</it></b>) > <it>t </it>and chooses <it>Y </it>= 2 otherwise, i.e., if <it>LR</it>(<b><it>z</it></b>) &#8804; <it>t</it>. The threshold <it>t </it>is adjusted to provide a desired tradeoff between type I error and type II error, i.e., between sensitivity and specificity. Choosing <it>t </it>small provides high sensitivity at the expense of specificity and choosing <it>t </it>large promotes the opposite effect.</p>
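            <p>The effect of the threshold can be illustrated with a brief sketch. The likelihood-ratio values and labels below are hypothetical, and the sweep over three thresholds is only an illustration of the tradeoff; it is not the procedure used in this study to set <it>t</it>.</p>
            <p>
```python
def sensitivity_specificity(scores, labels, t):
    """Sensitivity and specificity of the rule 'call poor outcome if LR > t'."""
    poor = [s for s, lab in zip(scores, labels) if lab == 1]
    good = [s for s, lab in zip(scores, labels) if lab == 2]
    sens = sum(1 for s in poor if s > t) / len(poor)
    spec = sum(1 for s in good if not s > t) / len(good)
    return sens, spec


# Hypothetical likelihood-ratio values: five poor-outcome samples (label 1)
# followed by five good-outcome samples (label 2).
scores = [8.0, 5.0, 3.0, 1.5, 0.6, 2.0, 0.9, 0.5, 0.3, 0.1]
labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

results = {t: sensitivity_specificity(scores, labels, t)
           for t in (0.2, 1.0, 4.0)}

for t, (sens, spec) in results.items():
    print(f"t = {t}: sensitivity = {sens:.1f}, specificity = {spec:.1f}")
```
            </p>
            <p>In this toy example, raising <it>t </it>from 0.2 to 4.0 lowers the sensitivity from 1.0 to 0.4 while raising the specificity from 0.2 to 1.0, in line with the tradeoff described above.</p>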
            <sec>
               <st>
                  <p>Naive Bayes Classifier</p>
               </st>
               <p>In the special case in which the random variables <it>Z</it><sub>1</sub>, ..., <it>Z</it><sub><it>M </it></sub>are binary (as here) and are assumed to be conditionally independent given <it>Y</it>, the LRT has a particularly simple form. It reduces to comparing a linear combination of the variables to a threshold. Recall that <it>p</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 1) and <it>q</it><sub><it>m </it></sub>= <it>P</it>(<it>Z</it><sub><it>m </it></sub>= 1|<it>Y </it>= 2), <it>m </it>= 1, 2, ..., <it>M</it>. In this case,</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2105-9-125-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>z</m:mi>
                              <m:mo>|</m:mo>
                              <m:mi>Y</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8719;</m:mo>
                                    <m:mrow>
                                       <m:mi>m</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mi>M</m:mi>
                                 </m:munderover>
                                 <m: