Abstract
Background
There is an urgent need for new prognostic markers of breast cancer metastases to ensure that newly diagnosed patients receive appropriate therapy. Recent studies have demonstrated the potential value of gene expression signatures in assessing the risk of developing distant metastases. However, due to the small sample sizes of individual studies, the overlap among signatures is almost zero and their predictive power is often limited. Integrating microarray data from multiple studies in order to increase sample size is therefore a promising approach to the development of more robust prognostic tests.
Results
In this study, by using a highly stable data aggregation procedure based on expression comparisons, we have integrated three independent microarray gene expression data sets for breast cancer and identified a structured prognostic signature consisting of 112 genes organized into 80 pairwise expression comparisons. A classical likelihood ratio test based on these comparisons, essentially weighted voting, achieves 88.6% sensitivity and 54.6% specificity in an independent external test set of 154 samples. The test is highly informative in assessing the risk of developing distant metastases within five years (hazard ratio 9.3 with 95% CI 2.9–29.9).
Conclusion
Rankbased features provide a stable way to integrate patient data from separate microarray studies due to invariance to data normalization, and such features can be combined into a useful predictor of distant metastases in breast cancer within a statistical modeling framework which begins to capture genegene interactions. Upon further confirmation on largescale independent data, such prognostic signatures and tests could provide a powerful tool to guide adjuvant systemic treatment that could greatly reduce the cost of breast cancer treatment, both in terms of toxic side effects and health care expenditures.
Background
Breast cancer is the most common form of cancer and the second leading cause of cancer death among women in the United States, with an estimated ~213,000 new cases and ~41,000 deaths in 2006 [1]. The main cause of breast cancer death comes from its metastases to distant sites. Early diagnosis and adjuvant systemic therapy (hormone therapy and chemotherapy) substantially reduce the risk of distant metastases. However, adjuvant therapy has serious short and longterm side effects and involves high medical costs [2]. Therefore, highly accurate prognostic tests are essential to aid clinicians in deciding which patients are at high risk of developing metastases and should receive adjuvant therapy. Currently, the most widely used treatment guidelines, St. Gallen [3] and the US National Institutes of Health (NIH) [2] consensus criteria, assess a patient's risk of distant metastases based on clinical prognostic factors such as tumor size, lymph node status, and histologic grade. These guidelines cannot accurately identify atrisk patients and about 70–80% of patients defined as being at risk by these criteria and receiving adjuvant therapy would have survived without it [4]. In addition, many patients who would be cured by local or regional treatment alone are "overtreated" and suffer toxic side effects of adjuvant therapy unnecessarily. Therefore, there is an urgent need for new prognostic tests to precisely define a patient's risk of developing metastases to ensure that the patient receives appropriate therapy.
The advent of DNA microarray technology provides a powerful tool in various aspects of cancer research. Simultaneous assessment of the expression of thousands of genes in a single experiment could allow better understanding of the complex and heterogeneous molecular properties of breast cancer. Such information may lead to more accurate prognostic signatures for prediction of metastasis risk in breast cancer patients. Over the past few years, a number of studies have identified prognostic gene expression signatures and proposed corresponding prognostic tests based on these genes. In many cases, the prediction of breast cancer outcome is superior to conventional prognostic tests [511]. Among these studies, the two largest have attempted to identify gene expression signatures and prognostic tests strongly predictive of distant metastases. van't Veer et al. applied a supervised method to identify a 70gene signature, and a correlationbased test capable of predicting a short interval to distant metastases, in a cohort of 78 young breast cancer patients (<55 years of age) with lymphnodenegative tumors [6]. The test was applied to a cohort of 295 patients with either lymphnodenegative or lymphnodepositive breast tumors [5]. Using a different microarray platform, Wang et al. derived a 76gene prognostic signature from 115 lymphnodenegative patients who had not received adjuvant systemic treatment. The signature could be used to predict distant metastasis within five years in breast cancer patients of all age groups with lymphnodenegative tumors and was subsequently applied to a set of 171 lymphnodenegative patients [7]. These studies have shown that tests based on gene expression signatures would result in a substantial reduction of the number of patients receiving unnecessary adjuvant systemic treatment, thereby preventing overtreatment in a considerable number of breast cancer patients.
The most striking observation when comparing the signatures from different studies is the lack of overlap of signature genes. For instance, in the studies of van't Veer et al. and Wang et al., despite the similar clinical and statistical designs, there is an overlap of only three genes in the two gene signature lists. These diverse results make it difficult to identify the most predictive genes for breast cancer prognosis. The disagreements in gene signatures may be partly due to the use of different microarray platforms and differences in patient selection, normalization procedures and other experimental choices. Moreover, in a recent study [12], reanalysis of the van't Veer data has shown that the prognostic signature is even strongly influenced by the subset of the patients used for signature selection within a particular study. This observation indicates that given the small number of samples in the training sets, many genes might show what appear to be significant correlations with clinical outcome and the differences among these correlations might be small. Therefore, it is possible to combine genes in many ways to generate different signatures with similar predictive power when validated on internal test sets [12]. Moreover, in general, these prognostic tests are not robust, meaning that they cannot be validated on independent, external data sets [9]. Independent reanalysis on other microarray data sets has shown very similar findings [13]. Given the large numbers of features (~10,000 to 40,000 genes) in microarray data and the relatively small numbers of samples (~100 patients) used in the training set of each study, it is highly possible to accidentally find a set of genes with good predictive power on internal test sets. This is the type of "overfitting" that is typical when the number of observed variables far exceeds the number of samples. In light of this general "smallsample dilemma" in statistical learning and the particular observations from the two reanalysis studies mentioned above, the disagreements in gene signatures obtained from different data sets are not surprising. We believe that much larger numbers of samples (patients), perhaps thousands, are needed to develop more robust prognostic tests and signatures.
The rapid accumulation of microarray gene expression data suggests that combining microarray data from different studies may be a useful way to increase sample size and diversity. In particular, "metaanalyses" have recently been used to merge different studies in order to develop prognostic gene expression signatures for breast cancer [14,15]. However, effectively integrating microarray data from different studies is not straightforward due to several issues of compatibility, such as differing microarray platforms, experimental protocols and data preprocessing methods. Instead of directly integrating microarray gene expression values, metaanalyses combine results (e.g. t statistics) of individual studies to increase statistical power. The major limitation of metaanalyses is that the small sample sizes typical of individual studies, coupled with variation due to differences in study protocols, inevitably degrades the results. Also, deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.
In contrast to the metaanalysis approach, in which the results of individual studies are combined at an interpretative level, other methods, such as Zscore, Distance Weighted Discrimination (DWD), integrate microarray data from different studies at the expression value level after transforming the expressions to numerically comparable measures [14,1620]. In general, the procedure involves the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted based on crossreferencing the annotation of each probe set represented on the microarrays. Crossreferencing of expression data is usually achieved using the UniGene database [21]. Next, for each individual data set, numerically comparable quantities are derived from the expression values of genes in the common list by applying specific data transformation and normalization methods. Finally, the newly derived quantities from individual data sets are combined to increase sample size and statistical methods are applied to the combined data to build diagnostic and prognostic signatures. One major limitation of these direct integration methods is that there is still no consensus on how best to perform data transformation and normalization.
In our previous work [22], we proposed a novel method for molecular classification which builds predictors from relative expression values, which can be directly applied to integrated microarray data and which generates very simple decision rules. Because this method is based only on the ranks of the expression values within a profile (sample), there is no need to prepare the data for integration, in particular there is no need for data normalization, since ranks are invariant to all types of withinarray monotonic preprocessing. This approach to data integration was validated on prostate cancer data [23], resulting in a powerful twogene diagnostic classifier. It has also been applied recently to differentiating between gastrointestinal stromal tumors and leiomyosarcomas [24]. Here, we extend this method to predict distant metastases in breast cancer, and attempt to overcome the limitations of previous studyspecific methods and metaanalyses.
Results
Summary
We integrate three independent microarray gene expression data sets to obtain an integrated training set of 358 samples and identify a set of features for predicting distant metastases. All the samples included in this study are from lymphnodenegative patients who have not received adjuvant systemic treatment. Each feature is based on an ordered pair of genes and assumes the value one if the first gene is expressed less than the second gene, and assumes the value zero otherwise. These genes may not all be highly differentially expressed, and one gene in the pair may serve as a "reference" for the other one. Since the features are rankbased, no data normalization is needed before data integration. A classical likelihood ratio test is used to classify patients as either pooroutcome, meaning they are likely to metastasize, or goodoutcome, meaning that they are unlikely to develop distant metastases. The choice of features is motivated by achieving the highest possible specificity at an acceptable level of sensitivity, taken here to be 90% in accordance with the St. Gallen and NIH treatment guidelines. The number of features chosen in the prognostic signature, as well as the threshold in the likelihood ratio test (LRT), is optimized with kfold crossvalidation on the integrated training set. The optimal feature number is estimated to be 80, corresponding to 112 genes (since some genes appear in more than one feature). The prognostic test based on this signature is validated using an independent microarray data set. Upon further validation on largescale independent data, the prognostic gene expression signature could support other breast cancer prognostic tests with high enough specificity to help avoid overtreatment of newly diagnosed patients.
Study data
Four breast cancer microarray data sets are included in this study. Each data set has been downloaded from publicly available gene expression repositories (e.g. Gene Expression Omnibus) or supporting web sites [7,11,25,26]. All four data sets are generated from the same Affymetrix HGU133A microarray platform. Here, the names of the first authors of individual studies are used as the names of the data sets. Three data sets, Miller (251 patients), Sotiriou (189 patients) and Wang (286 patients), are used as training data and the other one, Pawitan (159 patients), is used as independent test data. The reason for this division into training and test data is that detailed clinical information has been provided for the Miller, Sotiriou and Wang data sets and this information has been used to select specific patients for training, whereas little clinical information is provided for the Pawitan study. For the Miller, Sotiriou and Pawitan studies, because the gene expression data sets provided by them have undergone crosssample normalization, we have downloaded the raw CEL files and calculated expression values using the Affymetrix GeneChip Operating Software version 1.4. There is an 85patient overlap between Miller and Sotiriou data sets, so we have excluded the replicate samples from our study. Detailed patient information in each study has been described in the corresponding literature.
Motivated by a recent study [27], we employ the idea of restricting training data to extreme patient samples, which are more informative in identifying a prognostic signature. Extreme patients are either shortterm survivors with pooroutcome within a short period or longterm survivors who maintain a goodoutcome after a long followup time. Specifically, we select patients who developed distant metastases (relapse) within five years as pooroutcome samples and patients who were free of distant metastases (relapse) during the followup for a period of at least eight years as goodoutcome samples. The sharp contrast between shortterm and longterm survivors should identify more informative and reliable genes for a prognostic signature. Only early stage lymphnodenegative patients who had not received adjuvant systemic treatment are included in the training data because adjuvant treatment is likely to modify patient outcome. The selection is irrespective of age, tumor size and other clinical parameters. After applying the above selection criteria, a total of 358 patients are identified from the three training data sets and used to learn a prognostic signature and prognostic test. The numbers of selected patients from each training data set are listed in Table 1.
Table 1. Training data sets: lymphnodenegative patients with no adjuvant treatment
A prognostic signature from integrated data
We directly merge the three microarray data sets in Table 1, using the 22283 probe sets on Affymetrix HGU133A microarray, to form an integrated training data set. The integrated data set consists of 122 extreme pooroutcome samples (distant metastases within five years after surgery) and 236 extreme goodoutcome samples (free of distant metastases during the followup for a period of at least eight years after surgery). Recall that each feature is based on a pair of genes. The integrated training set is used to estimate the relationship between the number m of features in a prognostic classifier and the specificity at 90% sensitivity level, evaluated by the 40fold crossvalidation, as described in 'Methods'. The result is plotted in Figure 1. As can be seen, the specificity is nearly constant after about 80 features are included. Our final prognostic signature then consists of the 80 topranked features (gene pairs) from the feature list generated from the original integrated training data, using the feature selection and transformation procedures described in 'Methods'. Because some genes appear in more than one feature, the 80 topranked gene pairs in our prognostic signature include 112 distinct genes (Table 2). To illustrate the behavior of the 80 features in the signature on the Wang data set (part of the integrated training data), we show the difference in expression between the two genes in each of the 80 gene pairs in the form of a heat map in Figure 2. Distinct patterns of expression differences can be observed for good and pooroutcome samples.
Figure 1. Choosing size of the signature. The relationship between the number of features in a prognostic signature and the specificity at 90% sensitivity of the corresponding prognostic test, evaluated by 40fold crossvalidation. We select m_{opt }= 80, the smallest value that achieves roughly maximum specificity at the 90% sensitivity level. The specificity observed on the validation set is in fact higher.
Figure 2. The heat map of the 80 signature gene pairs. The Wang data set is used to illustrate the gene expression values of the signature genes. A heat map is generated using the matrix2png software [34]. There are 80 rows corresponding to the 80 gene pairs; the displayed intensities are the differences between the expression values of the two genes in each pair. The expression value for each difference is normalized across the samples to zero mean and one standard deviation (SD) for visualization purposes. Differences with expression levels greater than the mean are colored in red and those below the mean are colored in green. The scale indicates the number of SDs above or below the mean.
Table 2. Genes in the identified prognostic signature. For each probe set the first column lists the subset of the eighty pairs which contain it. The pairs are ordered from 1 to 80 by their scores.
In order to evaluate the reproducibility of the 112gene signature, we repeat the same feature selection process with several resamplings of 300 patients out of the 358 patients in the integrated data set. The average overlap is 39.0%. This is not surprising in view of the still modest sample size and the fact that most of the changes occur in the second half of the ranked list of gene pairs.
Validation of the prognostic test on independent data
To validate the prognostic test, we compute its sensitivity and specificity on an independent set of samples, the Pawitan data set [26], which consists of 159 primary breast cancer patients. This test set includes both patients with lymphnodenegative tumors and patients with lymphnodepositive tumors, and who had or had not received adjuvant systemic therapy. Following the practice in most of the literature, our objective is to predict the development of distant metastases within five years. Of the 159 patients, 35 patients developed distant metastases (relapse) within five years ("pooroutcome"), and 119 patients were free of distant metastases (no relapse) during the followup for a period of at least five years ("goodoutcome"). Note that the definition of goodoutcome for patients in the validating data is different from the definition in the training data because we have used extreme samples to identify the prognostic signature.
Our prognostic test is the classical likelihood ratio test, determined by assuming that the features are conditionally independent under both classes, namely "poor outcome" (the null hypothesis) and "good outcome" (the alternative hypothesis); see 'Methods'. The LRT reduces to comparing a weighted average of the 80 features to a threshold. The weights depend on the statistics of the individual features under both classes and are estimated from the training data; the threshold is also estimated from the training set, using crossvalidation. The LRT built from the prognostic signature achieves a sensitivity of 88.6% (31 out of the 35 pooroutcome samples) and a specificity of 54.6% (65 out of the 119 goodoutcome samples) on the 154 samples included in the validating data set. The remaining five patients, who either developed distant metastases after five years or were free of distant metastases with a followup period less than five years, are not included in the validating data set. We compute the odds ratio of the prognostic test for developing metastases within five years between the patients in the pooroutcome group and in the goodoutcome group as determined by the prognostic test. The prognostic test has a high odds ratio of 9.3 (95% confidence interval: 3.1 – 28.1) with a Fisher's exact test pvalue < 0.00001. To make the results easier to understand, we have included in the additional files the heat maps of the twogroup (good and pooroutcomes) supervised clusters of the integrated training data and test data for the 112signature genes (see Additional file 1 and file 2).
Additional file 1. Clustering of the training data. Shown is the heat map of the twogroup (good and pooroutcome) supervised clusters of the integrated training data for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.
Format: JPEG Size: 759KB Download file
Additional file 2. Clustering of the test data. Shown is the heat map of the twogroup (good and pooroutcome) supervised clusters of the test data (Pawitan) for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.
Format: JPEG Size: 368KB Download file
It is noteworthy that performance of the LRT on the validation data is actually somewhat better than the performance on the training set (which is estimated by crossvalidation). Specifically, from Figure 1 (see also 'Methods'), the specificity of the LRT prognostic test is around 43% at approximately 90% sensitivity when estimated from the training data, whereas a specificity of approximately 55% at about the same sensitivity is achieved on the independent validation set.
To obtain another useful estimate of the clinical outcome, we apply the LRT built from the prognostic signature to all of the 159 samples in the Pawitan data set and calculate the probability of remaining free of distant metastases according to the prognostic signature by using KaplanMeier analysis. The KaplanMeier curve of the prognostic signature shows a significant difference (pvalue < 0.001) in the probability of remaining free of distant metastases between the patients in the pooroutcome group and those in the goodoutcome groups (Figure 3A). The pvalue is computed by the use of logrank test. The MantelCox estimation of hazard ratio for distant metastases within five years in the pooroutcome group as compared to the goodoutcome group is 9.3 (95% confidence interval: 2.9 – 29.9, pvalue < 0.001).
Figure 3. The KaplanMeier analysis. KaplanMeier analysis of the probability of remaining free of distant metastases among 159 Pawitan patients between the goodoutcome group and the pooroutcome group. The LRT is based on the integrated data in (A) and the single, Wang data set in (B). CI denotes confidence interval and the pvalue is calculated by the logrank test.
Comparison of the prognostic signature to studyspecific signatures
To evaluate the potential statistical power gained by integrating multiple data sets to increase diversity and sample size, we compare the predictive power of our integrated prognostic signature with each of the three separate studyspecific prognostic signatures identified from the three data sets in Table 1. We use exactly the same method we used for the integrated data and each of the resulting three prognostic tests is applied to the same independent test data, the Pawitan data. The results are reported in Table 3. In the case of the Sotiriou data, we do not achieve the targeted sensitivity of at least ninety percent due to the very small sample size; the estimate of the threshold in the LRT does not generalize to the Pawitan test set. For the Miller and Wang data sets, the desired sensitivity is achieved but the specificity is far lower than for the classifier learned from the integrated data set.
Table 3. Test results on Pawitan data (154 patients)
The Wang data set is the largest. Using 40fold crossvalidation, the optimal feature number of gene pairs for the prognostic signature is m_{opt }= 60. The 94.3% sensitivity on the test set (33 out of the 35 pooroutcome samples) is close to the target of 90%. The specificity of the classifier is 10.1% (12 out of the 119 pooroutcome samples), substantially lower than the classifier based on the integrated training set, albeit at somewhat higher sensitivity. (Indeed, the performance of the prognostic LRT test based on the Wang data alone is barely better than the completely randomized, dataindependent procedure which chooses pooroutcome with probability 0.9 and good outcome with probability 0.1, independently from sample to sample.) The odds ratio of this test is 1.9 (95% confidence interval: 0.4 – 8.7, Fisher's exact test pvalue = 0.74), and the KaplanMeier curve (Figure 3B) shows a less significant difference between the patients in the pooroutcome and goodoutcome groups than that of the signature from the integrated data. Finally, the estimated hazard ratio of 1.6 (95% confidence interval: 0.4 – 6.8, 0.01 < pvalue < 0.05) is much lower than that of the prognostic test from the integrated data.
These comparisons demonstrate that the prognostic test derived from the integrated data is superior to the prognostic test derived from any of the individual studies and highlight the value of data integration. By integrating several microarray data sets with our rankbased methods, studyspecific effects are reduced and more features of breast cancer prognosis are captured.
Discussion
Using a rankbased method for feature selection, we integrate three independent microarray gene expression data sets of extreme samples and identify a 112gene breast cancer prognostic signature. The signature is invariant to standard withinarray preprocessing and data normalization. All of the patients in the integrated training set had lymphnodenegative tumors and had not received adjuvant systemic treatment, so the identification of the prognostic signature is not subject to potential confounding factors related to lymph node status or systemic treatment. A LRT constructed from the prognostic signature is used to predict whether a breast cancer patient will develop distant metastases within five years after initial treatment. This prognostic test achieves a sensitivity of 88.6% and a specificity of 54.6% on an independent test data set of 154 samples. The test set includes patients who had and who had not received adjuvant systemic treatment, and those with both lymphnodenegative and lymphnodepositive tumors, indicating that our prognostic signature could possibly be applied to all breast cancer patients independently of age, tumor size, tumor grade, lymph mode status, and systemic treatment. It should be pointed out that, somewhat paradoxically, one reason for this ability to generalize is that, as with all machine learning methods, the feature seleciton process is not guided by specific biological knowledge about the underlying processes and pathways.
One motivation for using the LRT is simplicity: under the assumption of independent features, the test statistic is a weighted average of the feature values and the test itself reduces to comparing this average to a fixed threshold. Another motivation stems from the NeymanPearson lemma of statistical hypothesis testing [28], which states that the LRT achieves optimal specificity at any given level of sensitivity. However, we cannot claim optimal specificity (at roughly ninety percent sensitivity) for our prognostic test since our LRT is constructed by assuming the 80 binary comparison features are statistically independent in each class, which is likely to be violated in practice due to correlations among the genes and genes appearing in multiple pairs. But this approach does offer a rigorous statistical framework for constructing prognostic tests at a given sensitivity. It also provides a direction towards more powerful procedures. Evidently, increasingly better approximations to the "true" LRT, and hence to optimal specificity, would be obtained by accounting for more and more of the dependency structure among the features. Indeed, accounting for pairwise correlations alone would be a significant step in this direction.
Comparison with the conventional treatment guidelines (e.g. St. Gallen and NIH) is instructive. While maintaining almost the same level of sensitivity (~90%), our prognostic test achieves a specificity which is well above the 10–30% range of the St. Gallen and NIH targets. This means that our test can spare a significant number of goodoutcome patients from unnecessary adjuvant therapy, while ensuring roughly the same percentage of pooroutcome patients receive adjuvant therapy as recommended by the treatment guidelines. Therefore, our prognostic test and signature, if further validated on largescale independent data, could potentially provide a useful means of guiding adjuvant systemic treatment, reducing cost and improving the quality of patients' lives.
Other strengths of our study, compared with previous ones, are the larger number of homogeneous patients (lymphnodenegative tumors without adjuvant systemic treatment) in the training set, and an external independent test set. In each of the two major breast cancer prognostic studies [6,7], the training and validation data are extracted from the same study group from the same population. More specifically, the entire data set is randomly divided into two pieces, one serving as a training set and the other as a validation or test set. In this case, the training data and the validation data are likely to have similar properties. Therefore, the studyspecific prognostic test identified from the training data usually gives overpromising results when assessed using the "internal" validation data. (Similar remarks apply to methods which measure performance using cross validation.) This argument may explain why the two major prognostic signatures, although validated internally with about 90% sensitivity and about 50% specificity, cannot be validated externally with an independent data set [9]. In addition, splitting the original data set into two pieces only aggravates the smallsample problem, as well as producing other sources of bias [12]. In our study, we increase diversity and sample size by integrating several microaray data sets involving patients from different populations. By selecting a homogeneous subgroup of patients and combining data from multiple studies, the derived prognostic test and signature is less sensitive to studyspecific factors. An intriguing advantage of interstudy data integration is that it increases the statistical power to capture essential prognostic features which might be masked by studyspecific features and the small sample sizes of individual data sets. In this sense, our prognostic test is more robust to interstudy variability and may facilitate external validation.
Comparison of our prognostic signature with the two major signatures of van't Veer et al. and Wang et al. is not straightforward because of differences in patients, microarray platforms, and algorithms. The study of van't Veer et al. uses an Agilent array platform and our study uses an Affymetrix array platform. Only 46 out of the 112 genes in our prognostic signature are present on the Agilent Hu25K array and only 36 of the 70 genes in the van't Veer signature are present on the Affymetrix HGU133A array. Therefore, we can neither validate the van't Veer prognostic test on our validation data nor validate our test on their data set. There is a threegene overlap between the van't Veer signature and our signature (CCNE2, ORC6L, and PRC1). Since the data set in Wang et al. is included in our training set, we cannot validate our test on that data set. On the other hand, in order to validate the test proposed by Wang et al., we need to know the estrogen receptor (ER) status of our test samples because the classification rule based on their signature is depend on ER status, which is absent from our validation data. Again, there is a fourgene overlap between the Wang signature and our signature (AP2A2, CBX3, CCNE2, and MLF1IP). It is noteworthy that the gene CCNE2 is included in all of the three signatures and is reported to be related to breast cancer [29]. CCNE2 could be a potential target for the rational development of new cancer drugs.
Using the program DAVID [30], according to the gene ontology biological process categories, the 112gene signature is highly enriched in cell cycle (Pvalue = 1.45E5) and cell division (Pvalue = 5.9E4). To pinpoint the role of some of the genes in our signature, the cell cycle pathway is displayed in the additional files with our signature genes shown in red (see Additional file 3). These findings demonstrate that deregulation of these pathways has a direct impact on tumor progression and indicate that the 112gene signature is biologically relevant.
Additional file 3. The cell cycle pathway. Our signature genes which appear in the cell cycle pathway are shown in red.
Format: JPEG Size: 147KB Download file
To assess the benefit of data integration, we compared the predictive power of our signature with that of three studyspecific signatures identified from the Sotiriou, Miller and Wang data sets using the same LRT procedure. When applied to the same independent test data, our prognostic test consistently outperforms the studyspecific tests and the largest study (Wang) in particular, in terms of specificity (54.6% vs. 10.1%) at roughly the same 90% sensitivity level, odds ratio (9.3 vs. 1.9), hazard ratio (9.3 vs. 1.6), and KaplanMeier analysis. These findings again suggest a prognostic test derived from a single data set may be overdedicated and might perform weakly on external data. In contrast, a prognostic test derived from integrated data is more likely to be more robust to studyspecific factors and to be validated satisfactorily on external data.
Recently, some studies have shown that combining gene expression data and conventional clinical data (e.g. tumor size, grade, ER status) could lead to improved breast cancer prognosis [31,32]. An approach based on solid statistical principles can accommodate aggregating data of multiple types, e.g., combining gene expression signatures with traditional clinical factors. In this study, due to the lack of clinical information for some of the training samples, we could not incorporate such information into the development of our prognostic test. As clinical information becomes publicly available, it might be combined with the integrated gene expression data to further improve prognosis.
Conclusion
The opinion expressed in recent studies that gene expression information can be useful in breast cancer prognosis seems to be wellfounded. However, due to the small sample sizes relative to the complexity of the entire expression profile, existing methods suffer certain limitations, namely the prevalence of studyspecific signatures and difficulties in validating the prognostic tests constructed from these signatures on independent data. Integrating data from multiple studies to obtain more samples appears to be a promising way to overcome these limitations.
We have integrated several gene expression data sets and developed a likelihood ratio test for predicting distant metastases that correctly signals a poor outcome in approximately ninety percent of test cases while maintaining about fiftyfive percent specificity for good outcome patients. This well exceeds the St. Gallen and NIH guidelines and compares favorably with the best results previously reported (although not yet validated on external test data). As more and more gene (and protein) expression data is generated and made publicly available, modeling the interactions among genes (and gene products) will become increasingly feasible, and is likely to be crucial in designing prognostic tests which achieve high sensitivity without sacrificing specificity.
Methods
Data integration
Recently, our group has developed a family of statistical molecular classification methods based on relative expression reversals [22,33], and applied one variant based on a twogene classifier to microarray data integration [23]. These methods only use the ranks of gene expression values within each profile and achieve impressive results in both molecular classification and microarray data integration. An important feature of rankbased methods is that they are invariant to monotonic transformations of the expression data within an array, such as those used in most array normalization and other preprocessing methods. This property makes these methods especially useful for combining data from separate studies since the nature of the primary features extracted from the data, namely comparisons of mRNA concentration between pairs of genes, eliminates the need to standardize the data before aggregation. Specifically, the ranks of gene expression values are invariant to monotonic data transformations within each microarray. Consequently, we directly merge gene expression data of the patients from three training data sets in Table 1, using the 22283 probe sets on Affymetrix HGU133A microarray, to form an integrated training data set of 358 samples. After aggregation, we extract a list of pairwise comparisons; each of these "features" corresponds to a pair of genes and is assigned the value zero or one depending on the observed ordering of expressions; see the following section. The number of features retained is much smaller than the number of genes on the array. The procedure is now described in more detail.
Feature selection and transformation
Consider G genes whose expression values X = {X_{1}, X_{2}, ..., X_{G}} are measured using a DNA microarray and regarded as random variables. The class label Y for each profile X is a discrete random variable taking on one of two possible values corresponding to the two prognostic states or hypotheses of interest, namely "pooroutcome," denoted Y = 1, and "goodoutcome," denoted Y = 2. The integrated training microarray data represent the observed values of X and comprise a G × N matrix x = [x_{gn}], g = 1, 2, ..., G and n = 1, 2, ..., N, where G is the number of genes in each profile and N is the number of samples (profiles) in the integrated data set. Each column n represents a gene expression profile of G genes with a class label y_{n }= 1 (pooroutcome) or y_{n }= 2 (goodoutcome) for the twoclass problem in our study. Among the N samples, there are N_{1 }(respectively, N_{2}) samples labeled as class 1 (respectively, class 2) with N = N_{1 }+ N_{2}.
For each pair of genes (i, j), where i, j = 1, 2, ..., G, i ≠ j, let P(X_{i }<X_{j}Y = k), k = 1,2, denote the conditional probability of the event {X_{i }<X_{j}} given Y = k. We define a score by
and estimate the score of pair (i, j) based on the training set x by
where
In other words, the estimated score is simply the absolute difference between the fraction of pooroutcome patients for which gene i is expressed less than gene j and the same fraction in the goodoutcome examples. The feature selection procedure consists of forming a list of gene pairs, sorted from the largest to the smallest according to their scores Δ_{ij}, and selecting the top M pairs. The M topranked gene pairs are then considered to be the most discriminating candidate gene pairs for breast cancer prognosis if only relative expressions are taken into account. During the process, we have transformed the original feature vector X = {X_{1}, X_{2}, ..., X_{G}}(G = 22283 in this study), each of which assumes scalar values, to a new ordered feature vector Z = {Z_{1}, Z_{2}, ..., Z_{M}}(usually, M <<G), each of which assumes only two values.
Suppose Z_{m}, m = 1, 2, ..., M, corresponds to the gene pair {i, j}. For convenience, the ordering (i, j) will signify which probability in Equation (1) is larger. The reason for this is to facilitate interpretation of the results, as will become apparent. If P(X_{i }<X_{j}Y = 1) ≥ P(X_{i }<X_{j}Y = 2), as estimated by the fractions in (2), we will write (i, j) and if P(X_{i }<X_{j}Y = 1) <P(X_{i }<X_{j}Y = 2) we will denote the pair by (j, i). The value assumed by Z_{m }is then set to be 1 if we observe X_{i }<X_{j }and set to 0 otherwise, i.e., if we observe X_{i }≥ X_{j}. Of course the same definition is applied to each feature in the training set. In this way, observing Z_{m }= 1 (resp., Z_{m }= 0) represents an indicator of the poor outcome (resp., good outcome) class in the sense that p_{m }= P(Z_{m }= 1Y = 1) ≥ q_{m }= P(Z_{m }= 1Y = 2). For all the features selected in our signature we in fact have p_{m }> 1/2 > q_{m}.
After this procedure, the original G × N data matrix is reduced to an M × N data matrix. The number of distinct genes in a prognostic signature is obviously fewer than 2M. In our practice, there are always more than M distinct genes among the top M gene pairs. Given that the numbers of genes in published breast cancer prognostic signatures are mostly fewer than 100, we fix M = 200 in this study to maker sure we can identify a prognostic feature signature based on a reasonable number of genes.
Likelihood ratio test
The classical likelihood ratio test (LRT) is a statistical procedure for distinguishing between two hypotheses, each constraining the distribution of a random vector Z = {Z_{1}, Z_{2}, ..., Z_{M}}. In our case the variables Z_{m }are the simple, binary functions of the gene expression profile defined above.
The LRT is based on the likelihood ratio
where z = {z_{1}, z_{2}, ..., z_{M}}are the observed values in a new sample. Notice that z assumes values in {0, 1}^{M}, the set of binary strings of length M. The LRT chooses hypothesis Y = 1 if LR(z) > t and chooses Y = 2 otherwise, i.e., if LR(z) ≤ t. The threshold t is adjusted to provide a desired tradeoff between type I error and type II error, i.e., between sensitivity and specificity. Choosing t small provides high sensitivity at the expense of specificity and choosing t large promotes the opposite effect.
Naive Bayes Classifier
In the special case in which the random variables Z_{1}, ..., Z_{M }are binary (as here) and are assumed to be conditionally independent given Y, the LRT has a particularly simple form. It reduces to comparing a linear combination of the variables to a threshold. Recall that p_{m }= P(Z_{m }= 1Y = 1) and q_{m }= P(Z_{m }= 1Y = 2), m = 1, 2, ..., M. In this case,
and a similar expression holds for P(zY = 2) with p_{m }replaced by q_{m}.
It follows that
and consequently
The LRT then reduces to the form: Choose Y = 1 if
and choose Y = 2 otherwise, where
Since p_{m }> q_{m}, all these coefficients in Equation (4) are positive and the decision rule in Equation (3) reduces to weighted voting among the pairwise comparisons: every observed instance of z_{m }= 1 is a vote for the poor outcome class with weight λ_{m}. Moreover, under the two assumptions of i) conditional independence and ii) equal a priori class probabilities (i.e., P(Y = 1) = P(Y = 2)), this is in fact the Bayes classifier (which is optimal) for the threshold t = 0.
Sensitivity vs. Specificity
Since our interest lies in high sensitivity at the expense of specificity if necessary, we do not choose t = 0. Since we want a very high likelihood of detecting the pooroutcome class, we choose the threshold t to achieve high sensitivity, defined to be above 90%. Let t_{α }denote the (largest) threshold achieving sensitivity 1  α. That is, suppose
(We explain how to estimate t_{α }from the training data in the next sections.) Then, from the NeymanPearson lemma, we know that our decision rule achieves the maximum possible specificity at this level of sensitivity. More precisely, this threshold maximizes
which is the probability of choosing goodoutcome when in fact goodoutcome is the true hypothesis.
Of course this is only a theoretical guarantee and depends very strongly on the conditional independence assumption which is surely violated in practice; indeed, some genes are common to several of the variables Z_{m}. Still, the LRT does provide a framework in which there are clearly stated hypotheses under which specificity can be optimized at a given sensitivity. Moreover, it provides a very simple test and the parameters p_{m}, q_{m }are easily estimated with available sample sizes. Most importantly, the decision procedure dictated by the LRT does indeed work well on independent test data (see 'Results').
Signature identification and class prediction
In clinical practice, when selecting breast cancer patients for adjuvant systemic therapy, it is of evident importance to limit the number of pooroutcome patients assigned to the goodoutcome category. The conventional guidelines (e.g., St. Gallen and NIH) for breast cancer treatment usually call for at least 90% sensitivity and 10–30% specificity. Therefore our objective in selecting the threshold t is to maintain high sensitivity (~90%); the specificity is then determined by the sample size and the information content in the features. In order to meet these criteria, we employ kfold crossvalidation to estimate the threshold which maximizes specificity at ~90% sensitivity for each signature size for our likelihood ratio test.
The idea is to use kfold cross validation to estimate the sensitivity and specificity for each possible value of m = 5, 10, 15, ..., 200, (the number of features in the LRT) and t = 1, 2, ..., 200 (the threshold in Equation (3)). For each such m we then compute the specificity at the largest threshold t(m) achieving at least 90% sensitivity; this function is plotted in Figure 1. (Obtaining 90% sensitivity can always be achieved by selecting a small enough threshold.) Finally, we then choose the smallest value m_{opt }which (approximately) maximizes specificity; the threshold is then t_{opt }= t(m_{opt}). From Figure 1, we see that m_{opt }= 80.
Specifically, the steps are as follows: 1) Divide the integrated training data set into k disjoint subsets of approximately equal sample size; 2) Leave out one subset and combine the other k1 subsets to form a training set; 3) Generate a feature list of M gene pairs ranked from most to least discriminating according the score defined in Equation (1) and compute the corresponding binary feature vector of length M for every training sample and every leftout sample; 4) Starting from the top five features, sequentially add five features at a time from the ranked list, generating series of 40 feature signatures of sizes m = 5, 10, ..., 200; 5) For each signature in 4), classify the leftout samples using the LRT in Equation (3) for each possible integer threshold t = 1, 2, ..., 200 and record the numbers of misclassified pooroutcome and misclassified goodoutcome samples; 6) Repeat steps 1)5) exhaustively for all k divisions into training and testing in step 1); 7) Calculate the sensitivity and specificity for the prognostic LRT test for each of the 40 signatures, and keep only the largest threshold for which the sensitivity exceeds 90%. The optimal number of features, m_{opt}, is the smallest number which effectively maximizes specificity.
The final prognostic signature is the m_{opt }topranked features (gene pairs) generated from the original integrated training set. The final prognostic test is the LRT with these features and the corresponding threshold t_{opt }= t(m_{opt}); this is the classifier which is applied to the validation set and yields the error rates reported in 'Results'.
Additional statistical analysis
We compute the odds ratio of our prognostic test for developing distant metastases within five years between the patients in the pooroutcome group and goodoutcome group as determined by LRT classifier. The pvalues associated with odds ratios are calculated by Fisher's exact text. We also plot the KaplanMeier curve of the signature on the independent test data with pvalues calculated by logrank test. The MantelCox estimation of hazard ratio of distant metastases within five years for the signature is also reported. All the statistical analyses are performed using MATLAB.
Authors' contributions
LX, under the supervision of RLW and DG, collected the microarray data sets and implemented the algorithms; all authors developed the methodology and contributed to the final manuscript.
Acknowledgements
We thank Dr. Yijun Sun for providing the MATLAB code to compute the hazard ratio. This work was supported by HL72488 and HL085343.
References

Jemal A, Siegel R, Ward E, Murray T, Xu J, Smigal C, Thun MJ: Cancer Statistics, 2006.
CA Cancer J Clin 2006, 56(2):106130. PubMed Abstract  Publisher Full Text

Eifel P, Axelson JA, Crowley J, Curran WJ, Deshler A, Fulton S, Hendricks CB, Kemeny M: National Institutes of Health Consensus Development Conference Statement: Adjuvant Therapy for Breast Cancer, November 13, 2000.
J Natl Cancer Inst 2001, 93(13):979989. PubMed Abstract  Publisher Full Text

Goldhirsch A, Glick JH, Gelber RD, Coates AS, Thurlimann B, Senn HJ, and Panel M: Meeting Highlights: International Expert Consensus on the Primary Therapy of Early Breast Cancer 2005.
Ann Oncol 2005, 16(10):15691583. PubMed Abstract  Publisher Full Text

Early Breast Cancer Trialists' Collaborative G: Polychemotherapy for early breast cancer: an overview of the randomised trials.
The Lancet 1998, 352(9132):930. Publisher Full Text

van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A GeneExpression Signature as a Predictor of Survival in Breast Cancer.
N Engl J Med 2002, 347(25):19992009. PubMed Abstract  Publisher Full Text

van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer.
Lancet 2005, 365:671679. PubMed Abstract  Publisher Full Text

Ma XJ, Wang Z, Ryan PD, Isakoff SJ, Barmettler A, Fuller A, Muir B, Mohapatra G, Salunga R, Tuggle JT: A twogene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen.
Cancer Cell 2004, 5(6):607. PubMed Abstract  Publisher Full Text

Naderi A, Teschendorff AE, BarbosaMorais NL, Pinder SE, Green AR, Powe DG, Robertson JFR, Aparicio S, Ellis IO, Brenton JD, Caldas C: A geneexpression signature to predict survival in breast cancer across independent data sets.
Oncogene 2007, 26(10):150716. PubMed Abstract  Publisher Full Text

Chang HY, Nuyten DSA, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H, van de Rijn M, Brown PO, van de Vijver MJ: From The Cover: Robustness, scalability, and integration of a woundresponse gene expression signature in predicting breast cancer survival.
PNAS 2005, 102(10):37383743. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis.
J Natl Cancer Inst 2006, 98(4):262272. PubMed Abstract  Publisher Full Text

EinDor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?
Bioinformatics 2005, 21(2):171178. PubMed Abstract  Publisher Full Text

Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy.
The Lancet 2005, 365(9458):488. Publisher Full Text

Shen R, Ghosh D, Chinnaiyan A: Prognostic metasignature of breast cancer developed by twostage mixture modeling of microarray data.
BMC Genomics 2004, 5(1):94. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Teschendorff A, Naderi A, BarbosaMorais N, Pinder S, Ellis I, Aparicio S, Brenton J, Caldas C: A consensus prognostic gene expression classifier for ER positive breast cancer.
Genome Biology 2006, 7(10):R101. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors.
Nat Genet 2003, 33(1):49. PubMed Abstract  Publisher Full Text

Warnat P, Eils R, Brors B: Crossplatform analysis of cancer microarray data improves gene expression based classification of phenotypes.
BMC Bioinformatics 2005, 6(1):265. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, Marron JS: Adjustment of systematic microarray data biases.
Bioinformatics 2004, 20(1):105114. PubMed Abstract  Publisher Full Text

Cheadle C, Vawter MP, Freed WJ, Becker KG: Analysis of Microarray Data Using Z Score Transformation.
J Mol Diagn 2003, 5(2):7381. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jiang H, Deng Y, Chen HS, Tao L, Sha Q, Chen J, Tsai CJ, Zhang S: Joint analysis of two microarray geneexpression data sets to select lung adenocarcinoma marker genes.
BMC Bioinformatics 2004, 5(1):81. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res 2005, 33(Database issue):D3945. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Geman D, d'Avignon C, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparison.
Statistical Applications in Genetics and Molecular Biology 2004, 3(1):19. Publisher Full Text

Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL: Robust prostate cancer marker genes emerge from direct integration of interstudy microarray data.
Bioinformatics 2005, 21(20):39053911. PubMed Abstract  Publisher Full Text

Price ND, Trent J, ElNaggar AK, Cogdell D, Taylor E, Hunt KK, Pollock RE, Hood L, Shmulevich I, Zhang W: Highly accurate twogene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas.
PNAS 2007, 104(9):34143419. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: From The Cover: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival.
PNAS 2005, 102(38):1355013555. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu E, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw P, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts.
Breast Cancer Research 2005, 7(6):R953  R964. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liu H, Li J, Wong L: Use of extreme patient samples for outcome prediction from gene expression data.
Bioinformatics 2005, 21(16):33773384. PubMed Abstract  Publisher Full Text

Doksum KA, Bickel PJ: Mathematical statistics : basic ideas and selected topics . Volume 1. 2nd edition. Prentice Hall; 2006.

Payton M, Scully S, Chung G, Coats S: Deregulation of cyclin E2 expression and associated kinase activity in primary breast tumors.
Oncogene 2002, 21(55):85298534. PubMed Abstract  Publisher Full Text

Dennis G, Sherman B, Hosack D, Yang J, Gao W, Lane HC, Lempicki R: DAVID: Database for Annotation, Visualization, and Integrated Discovery.
Genome Biology 2003, 4(5):P3. PubMed Abstract  BioMed Central Full Text

Pittman J, Huang E, Dressman H, Horng CF, Cheng SH, Tsou MH, Chen CM, Bild A, Iversen ES, Huang AT, Nevins JR, West M: Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes.
PNAS 2004, 101(22):84318436. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer prognosis through the combination of clinical and genetic markers.
Bioinformatics 2007, 23(1):307. PubMed Abstract  Publisher Full Text

Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles.
Bioinformatics 2005, 21(20):38963904. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pavlidis P, Noble WS: Matrix2png: a utility for visualizing matrix data.
Bioinformatics 2003, 19(2):295296. PubMed Abstract  Publisher Full Text