The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA

Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA

Abstract

Background

There is an urgent need for new prognostic markers of breast cancer metastases to ensure that newly diagnosed patients receive appropriate therapy. Recent studies have demonstrated the potential value of gene expression signatures in assessing the risk of developing distant metastases. However, due to the small sample sizes of individual studies, the overlap among signatures is almost zero and their predictive power is often limited. Integrating microarray data from multiple studies in order to increase sample size is therefore a promising approach to the development of more robust prognostic tests.

Results

In this study, by using a highly stable data aggregation procedure based on expression comparisons, we have integrated three independent microarray gene expression data sets for breast cancer and identified a structured prognostic signature consisting of 112 genes organized into 80 pair-wise expression comparisons. A classical likelihood ratio test based on these comparisons, essentially weighted voting, achieves 88.6% sensitivity and 54.6% specificity in an independent external test set of 154 samples. The test is highly informative in assessing the risk of developing distant metastases within five years (hazard ratio 9.3 with 95% CI 2.9–29.9).

Conclusion

Rank-based features provide a stable way to integrate patient data from separate microarray studies due to invariance to data normalization, and such features can be combined into a useful predictor of distant metastases in breast cancer within a statistical modeling framework which begins to capture gene-gene interactions. Upon further confirmation on large-scale independent data, such prognostic signatures and tests could provide a powerful tool to guide adjuvant systemic treatment that could greatly reduce the cost of breast cancer treatment, both in terms of toxic side effects and health care expenditures.

Background

Breast cancer is the most common form of cancer and the second leading cause of cancer death among women in the United States, with an estimated ~213,000 new cases and ~41,000 deaths in 2006

The advent of DNA microarray technology provides a powerful tool in various aspects of cancer research. Simultaneous assessment of the expression of thousands of genes in a single experiment could allow better understanding of the complex and heterogeneous molecular properties of breast cancer. Such information may lead to more accurate prognostic signatures for prediction of metastasis risk in breast cancer patients. Over the past few years, a number of studies have identified prognostic gene expression signatures and proposed corresponding prognostic tests based on these genes. In many cases, the prediction of breast cancer outcome is superior to conventional prognostic tests

The most striking observation when comparing the signatures from different studies is the lack of overlap of signature genes. For instance, in the studies of van't Veer

The rapid accumulation of microarray gene expression data suggests that combining microarray data from different studies may be a useful way to increase sample size and diversity. In particular, "meta-analyses" have recently been used to merge different studies in order to develop prognostic gene expression signatures for breast cancer

In contrast to the meta-analysis approach, in which the results of individual studies are combined at an interpretative level, other methods, such as Z-score, Distance Weighted Discrimination (DWD), integrate microarray data from different studies at the expression value level after transforming the expressions to numerically comparable measures

In our previous work

Results

Summary

We integrate three independent microarray gene expression data sets to obtain an integrated training set of 358 samples and identify a set of features for predicting distant metastases. All the samples included in this study are from lymph-node-negative patients who have not received adjuvant systemic treatment. Each feature is based on an ordered pair of genes and assumes the value one if the first gene is expressed less than the second gene, and assumes the value zero otherwise. These genes may not all be highly differentially expressed, and one gene in the pair may serve as a "reference" for the other one. Since the features are rank-based, no data normalization is needed before data integration. A classical likelihood ratio test is used to classify patients as either poor-outcome, meaning they are likely to metastasize, or good-outcome, meaning that they are unlikely to develop distant metastases. The choice of features is motivated by achieving the highest possible specificity at an acceptable level of sensitivity, taken here to be 90% in accordance with the St. Gallen and NIH treatment guidelines. The number of features chosen in the prognostic signature, as well as the threshold in the likelihood ratio test (LRT), is optimized with

Study data

Four breast cancer microarray data sets are included in this study. Each data set has been downloaded from publicly available gene expression repositories (e.g. Gene Expression Omnibus) or supporting web sites

Motivated by a recent study

Training data sets: lymph-node-negative patients with no adjuvant treatment

| **Data Set** | **No. of Patients** | **No. of Good-outcome** | **No. of Poor-outcome** |
| --- | --- | --- | --- |
| Miller [25] | 106 | 92 | 14 |
| Sotiriou [11] | 43 | 30 | 13 |
| Wang [7] | 209 | 114 | 95 |
| Total | 358 | 236 | 122 |

A prognostic signature from integrated data

We directly merge the three microarray data sets in Table

**Choosing size of the signature**. The relationship between the number of features in a prognostic signature and the specificity at 90% sensitivity of the corresponding prognostic test, evaluated by 40-fold cross-validation. We select $M_{opt} = 80$, the smallest value that achieves roughly maximum specificity at the 90% sensitivity level. The specificity observed on the validation set is in fact higher.

**The heat map of the 80 signature gene pairs**. The Wang data set is used to illustrate the gene expression values of the signature genes. A heat map is generated using the matrix2png software [34]. There are 80 rows corresponding to the 80 gene pairs; the displayed intensities are the differences between the expression values of the two genes in each pair. The expression value for each difference is normalized across the samples to zero mean and one standard deviation (SD) for visualization purposes. Differences with expression levels greater than the mean are colored in red and those below the mean are colored in green. The scale indicates the number of SDs above or below the mean.

Genes in the identified prognostic signature. For each probe set the first column lists the subset of the eighty pairs which contain it. The pairs are ordered from 1 to 80 by their scores.

| **Pair Rank** | **Probe Set** | **Gene Symbol** | **Gene Title** |
| --- | --- | --- | --- |
| 1, 43 | 91816_f_at | RKHD1 | ring finger and KH domain containing 1 |
| 1, 6, 73 | 204641_at | NEK2 | NIMA (never in mitosis gene a)-related kinase 2 |
| 2 | 213139_at | SNAI2 | snail homolog 2 (Drosophila) |
| 2, 4, 9, 33 | 212188_at | KCTD12 | potassium channel tetramerisation domain containing 12 |
| 3 | 212022_s_at | MKI67 | antigen identified by monoclonal antibody Ki-67 |
| 3, 61, 80 | 219716_at | APOL6 | apolipoprotein L, 6 |
| 4 | 205264_at | CD3EAP | CD3e molecule, epsilon associated protein |
| 5 | 206687_s_at | PTPN6 | protein tyrosine phosphatase, non-receptor type 6 |
| 5, 67 | 218009_s_at | PRC1 | protein regulator of cytokinesis 1 |
| 6, 35, 39, 55 | 219579_at | RAB3IL1 | RAB3A interacting protein (rabin3)-like 1 |
| 7 | 221824_s_at | MARCH8 | membrane-associated ring finger (C3HC4) 8 |
| 7 | 209574_s_at | C18orf1 | chromosome 18 open reading frame 1 |
| 8 | 210199_at | CRYAA | crystallin, alpha A |
| 8, 24, 26, 31 | 219493_at | SHCBP1 | SHC SH2-domain binding protein 1 |
| 9 | 204177_s_at | KLHL20 | kelch-like 20 (Drosophila) |
| 10, 34 | 203010_at | STAT5A | signal transducer and activator of transcription 5A |
| 10 | 212747_at | ANKS1A | ankyrin repeat and sterile alpha motif domain containing 1A |
| 11, 19, 21 | 205034_at | CCNE2 | cyclin E2 |
| 11, 65 | 217427_s_at | HIRA | HIR histone cell cycle regulation defective homolog A (S. cerevisiae) |
| 12, 46, 54, 74 | 222077_s_at | RACGAP1 | Rac GTPase activating protein 1 |
| 12, 62 | 36545_s_at | SFI1 | Sfi1 homolog, spindle assembly associated (yeast) |
| 13, 17, 72 | 218883_s_at | MLF1IP | MLF1 interacting protein |
| 13 | 203332_s_at | INPP5D | inositol polyphosphate-5-phosphatase, 145kDa |
| 14, 15 | 211584_s_at | NPAT | nuclear protein, ataxia-telangiectasia locus |
| 14 | 219512_at | C20orf172 | chromosome 20 open reading frame 172 |
| 15 | 221193_s_at | ZCCHC10 | zinc finger, CCHC domain containing 10 |
| 16 | 221521_s_at | GINS2 | GINS complex subunit 2 (Psf2 homolog) |
| 16 | 209671_x_at | TRA@///TRAC | T cell receptor alpha locus///T cell receptor alpha locus |
| 17 | 208952_s_at | LARP5 | La ribonucleoprotein domain family, member 5 |
| 18, 30 | 218726_at | DKFZp762E1312 | hypothetical protein DKFZp762E1312 |
| 18, 51 | 211581_x_at | LST1 | leukocyte specific transcript 1 |
| 19 | 221273_s_at | DKFZP761H1710 | hypothetical protein DKFZp761H1710 |
| 20 | 205395_s_at | MRE11A | MRE11 meiotic recombination 11 homolog A (S. cerevisiae) |
| 20, 59 | 214973_x_at | IGHD | immunoglobulin heavy constant delta |
| 21, 27 | 211881_x_at | IGLJ3 | immunoglobulin lambda joining 3 |
| 22 | 202602_s_at | HTATSF1 | HIV-1 Tat specific factor 1 |
| 22 | 218143_s_at | SCAMP2 | secretory carrier membrane protein 2 |
| 23 | 212911_at | DNAJC16 | DnaJ (Hsp40) homolog, subfamily C, member 16 |
| 23 | 204817_at | ESPL1 | extra spindle poles like 1 (S. cerevisiae) |
| 24 | 215783_s_at | ALPL | alkaline phosphatase, liver/bone/kidney |
| 25, 38, 39, 44, 52, 71 | 204825_at | MELK | maternal embryonic leucine zipper kinase |
| 25 | 213689_x_at | RPL5 | Ribosomal protein L5 |
| 26 | 206545_at | CD28 | CD28 molecule |
| 27 | 206364_at | KIF14 | kinesin family member 14 |
| 28, 60, 61 | 208079_s_at | AURKA | aurora kinase A |
| 28 | 214955_at | TMPRSS6 | transmembrane protease, serine 6 |
| 29 | 210966_x_at | LARP1 | La ribonucleoprotein domain family, member 1 |
| 29 | 218830_at | RPL26L1 | ribosomal protein L26-like 1 |
| 30 | 204498_s_at | ADCY9 | adenylate cyclase 9 |
| 31 | 206211_at | SELE | selectin E (endothelial adhesion molecule 1) |
| 32, 34, 69 | 201890_at | RRM2 | ribonucleotide reductase M2 polypeptide |
| 32 | 219298_at | ECHDC3 | enoyl Coenzyme A hydratase domain containing 3 |
| 33 | 204847_at | ZBTB11 | zinc finger and BTB domain containing 11 |
| 35, 62 | 203214_x_at | CDC2 | cell division cycle 2, G1 to S and G2 to M |
| 36 | 204605_at | CGRRF1 | cell growth regulator with ring finger domain 1 |
| 36 | 211251_x_at | NFYC | nuclear transcription factor Y, gamma |
| 37, 65 | 213008_at | KIAA1794 | KIAA1794 |
| 37, 73 | 210042_s_at | CTSZ | cathepsin Z |
| 38 | 203595_s_at | IFIT5 | interferon-induced protein with tetratricopeptide repeats 5 |
| 40 | 221529_s_at | PLVAP | plasmalemma vesicle associated protein |
| 40 | 202114_at | SNX2 | sorting nexin 2 |
| 41 | 211779_x_at | AP2A2 | adaptor-related protein complex 2, alpha 2 subunit |
| 41, 63 | 202324_s_at | ACBD3 | acyl-Coenzyme A binding domain containing 3 |
| 42, 57 | 201821_s_at | TIMM17A | translocase of inner mitochondrial membrane 17 homolog A (yeast) |
| 42 | 201551_s_at | LAMP1 | lysosomal-associated membrane protein 1 |
| 43 | 48808_at | DHFR | dihydrofolate reductase |
| 44 | 211643_x_at | LOC651961 | Myosin-reactive immunoglobulin light chain variable region |
| 45 | 210396_s_at | LOC440354 | PI-3-kinase-related kinase SMG-1 pseudogene |
| 45 | 201070_x_at | SF3B1 | splicing factor 3b, subunit 1, 155kDa |
| 46 | 207391_s_at | PIP5K1A | phosphatidylinositol-4-phosphate 5-kinase, type I, alpha |
| 47 | 200800_s_at | HSPA1A | heat shock 70 kDa protein 1A |
| 47 | 201009_s_at | TXNIP | thioredoxin interacting protein |
| 48 | 203530_s_at | STX4 | syntaxin 4 |
| 48, 50 | 218085_at | CHMP5 | chromatin modifying protein 5 |
| 49, 68, 70 | 219555_s_at | C16orf60 | chromosome 16 open reading frame 60 |
| 49 | 210419_at | BARX2 | BarH-like homeobox 2 |
| 50 | 214119_s_at | FKBP1A | FK506 binding protein 1A, 12 kDa |
| 51, 58 | 203362_s_at | MAD2L1 | MAD2 mitotic arrest deficient-like 1 (yeast) |
| 52 | 218910_at | TMEM16K | transmembrane protein 16K |
| 53 | 208838_at | KIAA0829 | KIAA0829 protein |
| 53 | 212081_x_at | BAT2 | HLA-B associated transcript 2 |
| 54 | 202115_s_at | NOC2L | nucleolar complex associated 2 homolog (S. cerevisiae) |
| 55 | 209714_s_at | CDKN3 | cyclin-dependent kinase inhibitor 3 (CDK2-associated dual specificity phosphatase) |
| 56 | 205701_at | IPO8 | importin 8 |
| 56 | 205063_at | SIP1 | survival of motor neuron protein interacting protein 1 |
| 57 | 200918_s_at | SRPR | signal recognition particle receptor ('docking protein') |
| 58 | 212527_at | D15Wsu75e | DNA segment, Chr 15, Wayne State University 75, expressed |
| 59 | 204244_s_at | DBF4 | DBF4 homolog (S. cerevisiae) |
| 60 | 214508_x_at | CREM | cAMP responsive element modulator |
| 63 | 200787_s_at | PEA15 | phosphoprotein enriched in astrocytes 15 |
| 64 | 203764_at | DLG7 | discs, large homolog 7 (Drosophila) |
| 64 | 205877_s_at | ZC3H7B | zinc finger CCCH-type containing 7B |
| 66 | 200848_at | AHCYL1 | S-adenosylhomocysteine hydrolase-like 1 |
| 66 | 201091_s_at | CBX3 | chromobox homolog 3 (HP1 gamma homolog, Drosophila) |
| 67 | 64064_at | GIMAP5 | GTPase, IMAP family member 5 |
| 68 | 211649_x_at | IGHG1 | Immunoglobulin heavy constant gamma 1 (G1m marker) |
| 69 | 204398_s_at | EML2 | echinoderm microtubule associated protein like 2 |
| 70 | 220433_at | PRRG3 | proline rich Gla (G-carboxyglutamic acid) 3 (transmembrane) |
| 71 | 219169_s_at | TFB1M | transcription factor B1, mitochondrial |
| 72 | 34689_at | TREX1 | three prime repair exonuclease 1 |
| 74 | 212604_at | MRPS31 | mitochondrial ribosomal protein S31 |
| 75 | 213907_at | EEF1E1 | Eukaryotic translation elongation factor 1 epsilon 1 |
| 75 | 209622_at | STK16 | serine/threonine kinase 16 |
| 76 | 209716_at | CSF1 | colony stimulating factor 1 (macrophage) |
| 76 | 219575_s_at | | peptide deformylase (mitochondrial) |
| 77 | 219328_at | DDX31 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 31 |
| 77 | 213121_at | SNRP70 | small nuclear ribonucleoprotein 70 kDa polypeptide (RNP antigen) |
| 78 | 218870_at | ARHGAP15 | Rho GTPase activating protein 15 |
| 78 | 219105_x_at | ORC6L | origin recognition complex, subunit 6 like (yeast) |
| 79 | 216510_x_at | IGHA1 | immunoglobulin heavy constant alpha 1 |
| 79 | 215207_x_at | YDD19 | YDD19 protein |
| 80 | 219918_s_at | ASPM | asp (abnormal spindle)-like, microcephaly associated (Drosophila) |

In order to evaluate the reproducibility of the 112-gene signature, we repeat the same feature selection process with several re-samplings of 300 patients out of the 358 patients in the integrated data set. The average overlap is 39.0%. This is not surprising in view of the still modest sample size and the fact that most of the changes occur in the second half of the ranked list of gene pairs.

Validation of the prognostic test on independent data

To validate the prognostic test, we compute its sensitivity and specificity on an independent set of samples, the Pawitan data set

Our prognostic test is the classical likelihood ratio test, determined by assuming that the features are conditionally independent under both classes, namely "poor outcome" (the null hypothesis) and "good outcome" (the alternative hypothesis); see 'Methods'. The LRT reduces to comparing a weighted average of the 80 features to a threshold. The weights depend on the statistics of the individual features under both classes and are estimated from the training data; the threshold is also estimated from the training set, using cross-validation. The LRT built from the prognostic signature achieves a sensitivity of 88.6% (31 out of the 35 poor-outcome samples) and a specificity of 54.6% (65 out of the 119 good-outcome samples) on the 154 samples included in the validating data set. The remaining five patients, who either developed distant metastases after five years or were free of distant metastases with a follow-up period less than five years, are not included in the validating data set. We compute the odds ratio of the prognostic test for developing metastases within five years between the patients in the poor-outcome group and in the good-outcome group as determined by the prognostic test. The prognostic test has a high odds ratio of 9.3 (95% confidence interval: 3.1 – 28.1) with a Fisher's exact test

Clustering of the training data. Shown is the heat map of the two-group (good- and poor-outcome) supervised clusters of the integrated training data for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.

Clustering of the test data. Shown is the heat map of the two-group (good- and poor-outcome) supervised clusters of the test data (Pawitan) for the 112 signature genes. Those genes which appear in multiple pairs among the 80 gene pairs in the signature will appear multiple times in the heat map. The total number of the rows is 160.

It is noteworthy that performance of the LRT on the validation data is actually somewhat

To obtain another useful estimate of the clinical outcome, we apply the LRT built from the prognostic signature to all of the 159 samples in the Pawitan data set and calculate the probability of remaining free of distant metastases according to the prognostic signature by using Kaplan-Meier analysis. The Kaplan-Meier curve of the prognostic signature shows a significant difference (

**The Kaplan-Meier analysis**. Kaplan-Meier analysis of the probability of remaining free of distant metastases among 159 Pawitan patients between the good-outcome group and the poor-outcome group. The LRT is based on the integrated data in (A) and the single, Wang data set in (B). CI denotes confidence interval and the

Comparison of the prognostic signature to study-specific signatures

To evaluate the potential statistical power gained by integrating multiple data sets to increase diversity and sample size, we compare the predictive power of our integrated prognostic signature with each of the three separate study-specific prognostic signatures identified from the three data sets in Table

Test results on Pawitan data (154 patients)

| **Training Data** | **No. of Training Patients** | **Sensitivity (%)** | **Specificity (%)** |
| --- | --- | --- | --- |
| Sotiriou | 43 | 51.4 | 47.1 |
| Miller | 106 | 100.0 | 15.1 |
| Wang | 209 | 94.3 | 10.1 |
| Integrated | 358 | 88.6 | 54.6 |

The Wang data set is the largest. Using 40-fold cross-validation, the optimal number of gene pairs for the prognostic signature is $M_{opt} = 60$. The 94.3% sensitivity on the test set (33 out of the 35 poor-outcome samples) is close to the target of 90%. The specificity of the classifier is 10.1% (12 out of the 119 good-outcome samples), substantially lower than that of the classifier based on the integrated training set, albeit at somewhat higher sensitivity. (Indeed, the performance of the prognostic LRT based on the Wang data alone is barely better than the completely randomized, data-independent procedure which chooses poor-outcome with probability 0.9 and good-outcome with probability 0.1, independently from sample to sample.) The odds ratio of this test is 1.9 (95% confidence interval: 0.4 – 8.7, Fisher's exact test p-value = 0.74), and the Kaplan-Meier curve (Figure

These comparisons demonstrate that the prognostic test derived from the integrated data is superior to the prognostic test derived from any of the individual studies and highlight the value of data integration. By integrating several microarray data sets with our rank-based methods, study-specific effects are reduced and more features of breast cancer prognosis are captured.
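The sensitivity, specificity and odds ratio figures quoted for the integrated test and the Wang-only test follow directly from the 2×2 counts on the 154 test samples (35 poor-outcome, 119 good-outcome). The short Python sketch below recomputes them. The function name `summarize` is ours, and the confidence interval uses the Woolf (log-scale normal) approximation, which is an assumption: the text does not say how the intervals were obtained, although this approximation rounds to the reported values. Fisher's exact p-values are not recomputed here.

```python
import math

def summarize(tp, fn, tn, fp, z=1.96):
    """Sensitivity, specificity, odds ratio and a Woolf-type 95% interval.

    tp/fn: poor-outcome samples called poor/good by the test.
    tn/fp: good-outcome samples called good/poor by the test.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    odds_ratio = (tp * tn) / (fn * fp)
    # Standard error of log(OR) under the Woolf approximation (assumed method).
    se_log_or = math.sqrt(1 / tp + 1 / fn + 1 / tn + 1 / fp)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return sensitivity, specificity, odds_ratio, (lower, upper)

# Counts implied by the reported results: 31/35 and 65/119 for the integrated
# test, 33/35 and 12/119 for the test trained on the Wang data alone.
integrated = summarize(tp=31, fn=4, tn=65, fp=54)
wang_only = summarize(tp=33, fn=2, tn=12, fp=107)

for name, (sens, spec, oratio, (lo, hi)) in [("integrated", integrated),
                                             ("Wang only", wang_only)]:
    print(f"{name}: sensitivity={sens:.1%} specificity={spec:.1%} "
          f"OR={oratio:.1f} CI=({lo:.1f}, {hi:.1f})")
```

For comparison, the randomized procedure mentioned above, which calls poor-outcome with probability 0.9 regardless of the data, has expected sensitivity 90% and expected specificity 10%, which is essentially where the Wang-only test lands.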

Discussion

Using a rank-based method for feature selection, we integrate three independent microarray gene expression data sets of extreme samples and identify a 112-gene breast cancer prognostic signature. The signature is invariant to standard within-array preprocessing and data normalization. All of the patients in the integrated training set had lymph-node-negative tumors and had not received adjuvant systemic treatment, so the identification of the prognostic signature is not subject to potential confounding factors related to lymph node status or systemic treatment. An LRT constructed from the prognostic signature is used to predict whether a breast cancer patient will develop distant metastases within five years after initial treatment. This prognostic test achieves a sensitivity of 88.6% and a specificity of 54.6% on an independent test data set of 154 samples. The test set includes patients who had and who had not received adjuvant systemic treatment, and those with both lymph-node-negative and lymph-node-positive tumors, indicating that our prognostic signature could possibly be applied to all breast cancer patients independently of age, tumor size, tumor grade, lymph node status, and systemic treatment. It should be pointed out that, somewhat paradoxically, one reason for this ability to generalize is that, as with all machine learning methods, the feature selection process is not guided by specific biological knowledge about the underlying processes and pathways.
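The invariance to within-array normalization noted above can be checked on a toy example: a pair-wise comparison only depends on the ordering of two genes inside one array, so any strictly increasing transformation applied array by array leaves it unchanged. The sketch below uses made-up intensities and two such transformations (a log transform and a per-array scale factor); the helper name `pair_features` and all numbers are hypothetical.

```python
import numpy as np

def pair_features(expr, pairs):
    """expr: genes x samples array; pairs: list of (i, j) gene indices.

    Returns a pairs x samples 0/1 array with entry 1 where gene i is
    expressed less than gene j in that sample.
    """
    return np.array([(expr[i] < expr[j]).astype(int) for i, j in pairs])

rng = np.random.default_rng(0)
# Hypothetical raw intensities: 6 genes measured on 4 arrays.
raw = rng.lognormal(mean=5.0, sigma=1.0, size=(6, 4))
pairs = [(0, 1), (2, 3), (4, 5)]

# Two array-wise monotone transformations.
logged = np.log2(raw)
scaled = raw * rng.uniform(0.5, 2.0, size=(1, 4))  # one positive factor per array

base = pair_features(raw, pairs)
same_after_log = np.array_equal(base, pair_features(logged, pairs))
same_after_scale = np.array_equal(base, pair_features(scaled, pairs))
print(same_after_log, same_after_scale)
```

Both comparisons should report identical feature matrices. Transformations that mix information across genes non-monotonically within an array would not enjoy this guarantee.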

One motivation for using the LRT is simplicity: under the assumption of independent features, the test statistic is a weighted average of the feature values and the test itself reduces to comparing this average to a fixed threshold. Another motivation stems from the Neyman-Pearson lemma of statistical hypothesis testing

Comparison with the conventional treatment guidelines (e.g. St. Gallen and NIH) is instructive. While maintaining almost the same level of sensitivity (~90%), our prognostic test achieves a specificity which is well above the 10–30% range of the St. Gallen and NIH targets. This means that our test can spare a significant number of good-outcome patients from unnecessary adjuvant therapy, while ensuring roughly the same percentage of poor-outcome patients receive adjuvant therapy as recommended by the treatment guidelines. Therefore, our prognostic test and signature, if further validated on large-scale independent data, could potentially provide a useful means of guiding adjuvant systemic treatment, reducing cost and improving the quality of patients' lives.

Other strengths of our study, compared with previous ones, are the larger number of homogeneous patients (lymph-node-negative tumors without adjuvant systemic treatment) in the training set, and an external independent test set. In each of the two major breast cancer prognostic studies

Comparison of our prognostic signature with the two major signatures of van't Veer

Using the program DAVID

The cell cycle pathway. Our signature genes which appear in the cell cycle pathway are shown in red.

To assess the benefit of data integration, we compared the predictive power of our signature with that of three study-specific signatures identified from the Sotiriou, Miller and Wang data sets using the same LRT procedure. When applied to the same independent test data, our prognostic test outperforms the study-specific tests, and the test from the largest study (Wang) in particular, in terms of specificity (54.6% vs. 10.1%) at roughly the same 90% sensitivity level, odds ratio (9.3 vs. 1.9), hazard ratio (9.3 vs. 1.6), and Kaplan-Meier analysis. These findings again suggest that a prognostic test derived from a single data set may be overly tailored to that data set and might perform weakly on external data. In contrast, a prognostic test derived from integrated data is likely to be more robust to study-specific factors and to be validated satisfactorily on external data.

Recently, some studies have shown that combining gene expression data and conventional clinical data (e.g. tumor size, grade, ER status) could lead to improved breast cancer prognosis

Conclusion

The opinion expressed in recent studies that gene expression information can be useful in breast cancer prognosis seems to be well-founded. However, due to the small sample sizes relative to the complexity of the entire expression profile, existing methods suffer certain limitations, namely the prevalence of study-specific signatures and difficulties in validating the prognostic tests constructed from these signatures on independent data. Integrating data from multiple studies to obtain more samples appears to be a promising way to overcome these limitations.

We have integrated several gene expression data sets and developed a likelihood ratio test for predicting distant metastases that correctly signals a poor outcome in approximately ninety percent of test cases while maintaining about fifty-five percent specificity for good outcome patients. This well exceeds the St. Gallen and NIH guidelines and compares favorably with the best results previously reported (although not yet validated on external test data). As more and more gene (and protein) expression data is generated and made publicly available, modeling the interactions among genes (and gene products) will become increasingly feasible, and is likely to be crucial in designing prognostic tests which achieve high sensitivity without sacrificing specificity.

Methods

Data integration

Recently, our group has developed a family of statistical molecular classification methods based on relative expression reversals

Feature selection and transformation

Consider ** X **= {

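For each pair of genes $(i, j)$, $i \neq j$, let $P(X_i < X_j \mid \text{poor})$ and $P(X_i < X_j \mid \text{good})$ denote the conditional probabilities of the event $\{X_i < X_j\}$ in the poor-outcome and good-outcome classes, where $X_i$ is the expression value of gene $i$. We define the score of the pair $(i, j)$ as

$$\Delta_{ij} = \left| P(X_i < X_j \mid \text{poor}) - P(X_i < X_j \mid \text{good}) \right|$$

and estimate the score of pair $(i, j)$ from the training set $\mathbf{x}$ by

$$\hat{\Delta}_{ij} = \left| \frac{1}{N_{\text{poor}}} \sum_{n \in \text{poor}} I(x_{i,n} < x_{j,n}) - \frac{1}{N_{\text{good}}} \sum_{n \in \text{good}} I(x_{i,n} < x_{j,n}) \right|$$

where $x_{i,n}$ is the expression value of gene $i$ in training sample $n$, $N_{\text{poor}}$ and $N_{\text{good}}$ are the numbers of training samples in the two classes, and $I$ is the indicator function.

In other words, the estimated score is simply the absolute difference between the fraction of poor-outcome patients for which gene $i$ is expressed less than gene $j$ and the same fraction among good-outcome patients. Features are obtained by ranking the gene pairs by their scores $\hat{\Delta}_{ij}$ and selecting the top-ranked pairs.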

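Suppose $Z_m$, $m = 1, \ldots, M$, is the feature associated with the $m$-th selected pair $(i, j)$. The two genes are ordered so that $P(X_i < X_j \mid \text{poor}) > P(X_i < X_j \mid \text{good})$; if instead $P(X_i < X_j \mid \text{poor}) < P(X_i < X_j \mid \text{good})$, the roles of $i$ and $j$ are exchanged. $Z_m$ is then set to be 1 if we observe $X_i < X_j$ and set to 0 otherwise, i.e., if we observe $X_i \geq X_j$. Of course the same definition is applied to each feature in the training set. In this way, observing $Z_m = 1$ (resp., $Z_m = 0$) represents an indicator of the poor outcome (resp., good outcome) class in the sense that, writing $p_m = P(Z_m = 1 \mid \text{poor})$ and $q_m = P(Z_m = 1 \mid \text{good})$, we have $p_m > 1/2 > q_m$.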
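The scoring, ranking and orientation steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `top_scoring_pairs` and `binarize`, the toy expression matrix and the stable tie-breaking are ours, and the all-pairs array used here is only practical for a small number of genes.

```python
import numpy as np

def top_scoring_pairs(expr, is_poor, n_pairs):
    """Rank gene pairs by the absolute difference in P(gene i < gene j).

    expr: genes x samples array; is_poor: boolean label per sample.
    Returns [(i, j, score)], oriented so that the event expr[i] < expr[j]
    is the more frequent one in the poor-outcome class.
    """
    expr = np.asarray(expr, dtype=float)
    is_poor = np.asarray(is_poor, dtype=bool)
    poor, good = expr[:, is_poor], expr[:, ~is_poor]
    # less[i, j] = fraction of samples in which gene i is below gene j.
    less_poor = (poor[:, None, :] < poor[None, :, :]).mean(axis=2)
    less_good = (good[:, None, :] < good[None, :, :]).mean(axis=2)
    diff = less_poor - less_good
    iu, ju = np.triu_indices(expr.shape[0], k=1)  # each unordered pair once
    scores = np.abs(diff[iu, ju])
    order = np.argsort(-scores, kind="stable")[:n_pairs]
    selected = []
    for k in order:
        i, j = int(iu[k]), int(ju[k])
        if diff[i, j] < 0:  # swap so that a feature value of 1 points to poor outcome
            i, j = j, i
        selected.append((i, j, float(scores[k])))
    return selected

def binarize(expr, pairs):
    """Feature value 1 where the first gene of the pair is below the second."""
    expr = np.asarray(expr, dtype=float)
    return np.array([(expr[i] < expr[j]).astype(int) for i, j, _ in pairs])

# Toy data (hypothetical): 3 genes, 3 poor-outcome then 3 good-outcome samples.
expr = np.array([
    [1.0, 2.0, 1.5, 9.0, 8.0, 7.0],  # gene 0
    [5.0, 6.0, 5.5, 3.0, 2.0, 1.0],  # gene 1
    [3.0, 1.0, 6.0, 8.5, 2.5, 0.5],  # gene 2
])
is_poor = np.array([True, True, True, False, False, False])
pairs = top_scoring_pairs(expr, is_poor, n_pairs=2)
features = binarize(expr, pairs)
print(pairs)
print(features)
```

In the toy data, gene 0 lies below gene 1 in every poor-outcome sample and never in a good-outcome sample, so that pair receives the maximal score of 1 and is ranked first.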

After this procedure, the original

Likelihood ratio test

The classical likelihood ratio test (LRT) is a statistical procedure for distinguishing between two hypotheses, each constraining the distribution of a random vector ** Z **= {

The LRT is based on the likelihood ratio

where ** z **= {

Naive Bayes Classifier

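In the special case in which the random variables $Z_1, \ldots, Z_M$ are binary (as here) and are assumed to be conditionally independent given the class, write $p_m = P(Z_m = 1 \mid \text{poor})$ and $q_m = P(Z_m = 1 \mid \text{good})$. Then

$$P(\mathbf{Z} = \mathbf{z} \mid \text{poor}) = \prod_{m=1}^{M} p_m^{z_m} (1 - p_m)^{1 - z_m}$$

and a similar expression holds for $P(\mathbf{Z} = \mathbf{z} \mid \text{good})$, with $q_m$ in place of $p_m$.

It follows that

$$\frac{P(\mathbf{Z} = \mathbf{z} \mid \text{poor})}{P(\mathbf{Z} = \mathbf{z} \mid \text{good})} = \prod_{m=1}^{M} \left( \frac{p_m}{q_m} \right)^{z_m} \left( \frac{1 - p_m}{1 - q_m} \right)^{1 - z_m}$$

and consequently

$$\log \frac{P(\mathbf{Z} = \mathbf{z} \mid \text{poor})}{P(\mathbf{Z} = \mathbf{z} \mid \text{good})} = \sum_{m=1}^{M} z_m \log \frac{p_m (1 - q_m)}{q_m (1 - p_m)} + \sum_{m=1}^{M} \log \frac{1 - p_m}{1 - q_m}.$$

The second sum does not depend on $\mathbf{z}$ and can be absorbed into the threshold. The LRT then reduces to the form: Choose poor-outcome if

$$\sum_{m=1}^{M} \lambda_m z_m \geq t \qquad (3)$$

and choose good-outcome otherwise, where

$$\lambda_m = \log \frac{p_m (1 - q_m)}{q_m (1 - p_m)}. \qquad (4)$$

Since $p_m > q_m$, all these coefficients in Equation (4) are positive and the decision rule in Equation (3) reduces to weighted voting among the pair-wise comparisons: every observed instance of $Z_m = 1$ is a vote for the poor outcome class with weight $\lambda_m$. Moreover, under the two assumptions of i) conditional independence and ii) equal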

Sensitivity vs. Specificity

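Since our interest lies in high sensitivity at the expense of specificity if necessary, we do not fix the threshold in advance. Instead, let $t_\alpha$ denote the (largest) threshold achieving sensitivity $1 - \alpha$. That is, suppose

$$P\left( \sum_{m=1}^{M} \lambda_m Z_m \geq t_\alpha \,\Big|\, \text{poor} \right) = 1 - \alpha,$$

where $\lambda_m$ is the weight attached to feature $Z_m$ in the LRT.

(We explain how to estimate $t_\alpha$ from the training data in the next sections.) Then, from the Neyman-Pearson lemma, we know that our decision rule achieves the maximum possible specificity at this level of sensitivity. More precisely, this threshold maximizes

$$P\left( \sum_{m=1}^{M} \lambda_m Z_m < t_\alpha \,\Big|\, \text{good} \right)$$

which is the probability of choosing good-outcome when in fact good-outcome is the true hypothesis.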

Of course this is only a theoretical guarantee and depends very strongly on the conditional independence assumption which is surely violated in practice; indeed, some genes are common to several of the variables $Z_m$. Still, the LRT does provide a framework in which there are clearly stated hypotheses under which specificity can be optimized at a given sensitivity. Moreover, it provides a very simple test and the parameters $p_m$, $q_m$ (the probabilities that $Z_m = 1$ in the poor-outcome and good-outcome classes) are easily estimated with available sample sizes. Most importantly, the decision procedure dictated by the LRT does indeed work well on independent test data (see 'Results').
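To make the weighted-voting form of the test concrete, the sketch below estimates the per-feature probabilities from binary training features, converts them to log-odds weights, and picks the largest threshold that keeps a target fraction of poor-outcome training samples at or above it. Several details are assumptions made for illustration only: the names `fit_lrt` and `predict_poor`, the pseudo-count that keeps the logarithms finite, the "score at or above threshold means poor outcome" convention, and the use of the training samples themselves (rather than cross-validation) to set the threshold.

```python
import numpy as np

def fit_lrt(z_train, is_poor, target_sensitivity=0.90, pseudo_count=1.0):
    """z_train: features x samples 0/1 array. Returns (weights, threshold)."""
    z_train = np.asarray(z_train)
    is_poor = np.asarray(is_poor, dtype=bool)
    n_poor, n_good = int(is_poor.sum()), int((~is_poor).sum())
    # Smoothed frequency of feature value 1 in each class (pseudo-count assumed).
    p = (z_train[:, is_poor].sum(axis=1) + pseudo_count) / (n_poor + 2 * pseudo_count)
    q = (z_train[:, ~is_poor].sum(axis=1) + pseudo_count) / (n_good + 2 * pseudo_count)
    # Log-odds weight of each vote; positive whenever p > q.
    weights = np.log(p * (1 - q) / (q * (1 - p)))
    # Largest threshold such that at least the target fraction of poor-outcome
    # training samples has a weighted vote at or above it.
    poor_scores = np.sort(weights @ z_train[:, is_poor])[::-1]
    k = int(np.ceil(target_sensitivity * n_poor))
    return weights, float(poor_scores[k - 1])

def predict_poor(z, weights, threshold):
    """True where the weighted vote reaches the threshold (poor-outcome call)."""
    return (weights @ np.asarray(z)) >= threshold

# Toy features (hypothetical): 3 features, 4 poor-outcome then 4 good-outcome samples.
z_train = np.array([
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
])
is_poor = np.array([True] * 4 + [False] * 4)
weights, threshold = fit_lrt(z_train, is_poor)
calls = predict_poor(z_train, weights, threshold)
print(weights, threshold, calls)
```

In this toy case every feature votes for poor outcome in three of four poor-outcome samples and one of four good-outcome samples, so all weights are equal and the rule amounts to requiring at least two votes.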

Signature identification and class prediction

In clinical practice, when selecting breast cancer patients for adjuvant systemic therapy, it is of evident importance to limit the number of poor-outcome patients assigned to the good-outcome category. The conventional guidelines (e.g., St. Gallen and NIH) for breast cancer treatment usually call for at least 90% sensitivity and 10–30% specificity. Therefore our objective in selecting the threshold

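The idea is to use cross-validation to select the number of features $M_{opt}$ which (approximately) maximizes specificity at the target sensitivity; the threshold is then $t_{opt} = t_\alpha(M_{opt})$. From the figure "Choosing size of the signature", $M_{opt} = 80$.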

Specifically, the steps are as follows: 1) Divide the integrated training data set into _{opt}, is the smallest number which effectively maximizes specificity.

The final prognostic signature is the $M_{opt}$ top-ranked features (gene pairs) generated from the original integrated training set. The final prognostic test is the LRT with these features and the corresponding threshold $t_{opt} = t_\alpha(M_{opt})$; this is the classifier which is applied to the validation set and yields the error rates reported in 'Results'.
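The last selection step, taking the smallest signature size that "effectively" maximizes cross-validated specificity, can be sketched as follows. The tolerance that defines "effectively", the function name `smallest_near_max`, and the specificity values are all hypothetical and chosen only to illustrate the rule; they are not results from the study.

```python
def smallest_near_max(spec_by_m, tolerance=0.01):
    """Smallest number of features whose specificity is within `tolerance` of the best.

    spec_by_m: dict mapping a candidate number of features M to the
    cross-validated specificity obtained at the target sensitivity.
    """
    best = max(spec_by_m.values())
    return min(m for m, spec in spec_by_m.items() if spec >= best - tolerance)

# Hypothetical cross-validated specificities for four candidate sizes.
cv_specificity = {10: 0.31, 20: 0.42, 30: 0.515, 40: 0.52}
m_opt = smallest_near_max(cv_specificity)
print(m_opt)
```

With these illustrative numbers the curve has essentially flattened by 30 features, so 30 is returned even though 40 is marginally better; a larger tolerance favours smaller signatures.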

Additional statistical analysis

We compute the odds ratio of our prognostic test for developing distant metastases within five years between the patients in the poor-outcome group and good-outcome group as determined by LRT classifier. The

Authors' contributions

LX, under the supervision of RLW and DG, collected the microarray data sets and implemented the algorithms; all authors developed the methodology and contributed to the final manuscript.

Acknowledgements

We thank Dr. Yijun Sun for providing the MATLAB code to compute the hazard ratio. This work was supported by HL72488 and HL085343.