Abstract
Background
Cancer survival studies are commonly analyzed using survivaltime prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survivaltime threshold, and binary classifiers are applied to predict each patient’s class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments.
Methods
We compare the survivaltime prediction and survivaltime threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and crossvalidation methods to enable unambiguous statistical inferences for several performance metrics used with the survivaltime prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models from combinations of clinical and gene expression models.
Results
A public breast cancer dataset was used to compare several performance metrics using five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the twogroup comparison was insignificant, and vice versa. 2) The randomization test and crossvalidation were generally consistent with the pvalues obtained from the standard performance metrics. 3) Binary classifiers highly depended on how the risk groups were defined; a slight change of the survival threshold for assignment of classes led to very different prediction results.
Conclusions
1) Different performance metrics for evaluation of a survival prediction model may give different conclusions in its discriminatory ability. 2) Evaluation using a highrisk versus lowrisk group comparison depends on the selected riskscore threshold; a plot of pvalues from all possible thresholds can show the sensitivity of the threshold selection. 3) A randomization test of the significance of Somers’ rank correlation can be used for further evaluation of performance of a prediction model. 4) The crossvalidated power of survival prediction models decreases as the training and test sets become less balanced.
Background
The Cox proportional hazards model [1] is the most common survival prediction model for cancer prognosis. Often, demographic and clinical covariates are combined in a Cox model with staging information from the American Joint Committee on Cancer (AJCC) staging system to predict a patient’s survival to improve treatment recommendations [27]. Because microarray studies have shown an association between patient survival and gene expression profiles [810], some recent papers have investigated the use of microarray gene expression data alone or in combination with clinical covariates [1114] as an improvement to estimate patient survival risk. Dimensionality reduction techniques are often performed prior to applying the Cox model to improve prediction performance. A practical approach is to apply a selection technique to select a smaller set of relevant genes from the entire gene set as initial step; a dimensionality reduction technique is then applied to the selected gene set [15].
Evaluation of the ability of a survival model to predict future data is the most important consideration in the development of prediction model. Common metrics to assess the performance of survival prediction models include hazard ratios between high and lowrisk groups defined by dichotomized risk scores, and tests for significant differences in the two groups’ KaplanMeier survival curves. Metrics that measure the strength of the relationship between risk scores and survival, including the simple hazard ratio, the coefficient of determination R^{2}[16], the concordance index and Somers’ rank correlation D_{xy}[17,18], are also used. Additional metrics include receiver operating characteristic (ROC) curves [19] and area under the ROC curve (AUC) [2022] defined over a range of riskscore cutoffs given a fixed survival threshold. Hielscher et al. [23] recently compared several common R^{2}type measures for evaluation of survival models. In addition, the Brier scores and Schemper/Henderson measure which were also developed to assess the survival prediction models were investigated by Schumacher et al. [24] and Dunkler et al. [25], respectively. To our knowledge no comprehensive evaluation and comparison among different performance metrics for different survival modeling approaches has been reported.
In contrast to the commonly used survivaltime prediction model approach, much research in gene expression profiling of cancer data has focused on binary class prediction, where patients’ survival times have been dichotomized to form two classes [11,23,2637]. With this approach, a prediction model is built and used to distinguish between the “lowrisk” and ‘highrisk” classes. The performance of a binary classifier is generally evaluated in terms of the overall predictive accuracy, along with positive and negative predictive values, etc. [28]. Dupuy and Simon [27] discussed several drawbacks of using this approach with survival data. Mainly, it does not take the information on survival and censored times into consideration. Furthermore, binary classification highly depends on the survival threshold used to define the two classes. A slight change of the threshold can lead to very different prediction accuracy and interpretation. Binder et al. [30] applied three different survival thresholds to evaluate a binary classifier based on gene expression, and showed how the choice of threshold affected the predictions. They concluded that using the binary modeling approach can result in loss of efficiency and potential bias in high dimensional settings.
In this study, we evaluate and compare commonly used metrics to assess performance of five prediction models. These five models are based on established approaches to modeling clinical variables and microarray gene expression data. We propose using randomization tests to compute the pvalues of certain performance metrics, including D_{xy}, R^{2}, and AUC, and crossvalidation to evaluate the power of the prediction models. We also present an analysis to illustrate differences between the survivaltime threshold (binary classification) and survivaltime prediction (survival risk score) models in the analysis of survival data.
Methods
Models for survival outcomes developed from training dataset
Five survival prediction models to estimate patient’s survival risks were considered. These five models included one clinical model, two gene expression models, and two models based on the combinations of the clinical and two gene expression models.
The clinical model (Model A) was the Cox proportional hazards model, derived by fitting the Cox model to all clinical variables, TNM stage, gender, age, and others, and selecting the clinical variables most relevant to the training dataset. The gene expression data were first analyzed using the univariate Cox model to select a set of “significant” genes, based on a predetermined statistical criterion which is p < 0.001 in this paper. For the set of selected genes, two gene expression models were developed using the Cox model: Model B used the first five principal components of the set of significant genes as signature variables, and Model C used the top 10 ranked genes as signature predictors. Each of the gene expression models was combined with the clinical model (Model A) additively to develop two clinical and gene expression models (D = A + C and E = A + D). A summary of the five models is given in Additional file 1: Table S1.
Additional file 1. Supplementary method.
Format: DOC Size: 275KB Download file
This file can be viewed with: Microsoft Word Viewer
Assessment of predicted risk scores for the patients in test dataset
The regression coefficients of the fitted Cox models (AE) developed from the training data were used to compute the predictive risk scores for each patient in the test dataset. The predictive risk scores were then used to compute performance metrics to evaluate the performance of the prediction models built from the training data. We considered the following commonly used performance metrics. A more detailed description of these metrics is given in the Additional file 1.
Simple hazard ratio and R^{2} (Cox Model I)
A Cox model was fit using the predictive risk scores as an independent variable with survival time as the outcome variable. The exponent of the regression coefficient was the simple hazard ratio. The performance metrics included: estimated hazard ratio, 95% confidence limits on hazard ratio, pvalue for significance of hazard ratio, and R^{2}[23,38].
Twogroup hazard ratio and brier score (Cox Model II)
The test data were first segregated into highrisk and lowrisk groups by the median of training risk scores. A Cox model was fit using the risk group as an independent variable with survival time as the outcome variable. The exponent of the regression coefficient was the twogroup hazard ratio. The performance metrics included the estimated hazard ratio, 95% confidence limits on the hazard ratio, and pvalue for significance of hazard ratio.
The Brier score [31], which measures average discrepancies between true disease status and estimated predictive values, was also calculated to assess the predictive risk scores in risk group stratification. The Brier score can be calculated for a specific time point or for an overall error measure across all time points. A larger Brier score means a higher inaccuracy of a prognostic classification scheme. However, baseline estimation is required for computing predicted riskfree probability to estimate Brier score, and different methods used could result in different Brier scores. Therefore we applied the method developed by Graf et al. [31] to compute integrated Brier score (IBS) in the twogroup stratification without baseline estimation, where the test data are stratified into two groups according to the training model and the riskfree probability for each sample is estimated from the KaplanMeier estimate for the corresponding group.
Logrank test
The logrank test was used to compare the survival curves between the patients in the high risk and low risk groups defined by the predicted risk scores. The performance metric was the pvalue of the test.
Somers’ rank correlation D_{xy}
The concordance index between predicted risk score and observed survival time in the test dataset was computed by a rank correlation adjusted for censored time [17,18]. This index was reexpressed equivalently as a correlation measure, known as the Somers’ D_{xy} rank correlation. The performance metrics included the calculated D_{xy} and the pvalue of a randomization test of its significance.
Time dependent Receiver Operating Characteristic (ROC) Curve and the Area under the ROC Curve (AUC)
For a given survival threshold, t, ROC(t) was plotted as sensitivity(t) versus 1specificity(t) for all values of the risk score cutoff used to define binary classes [19]. Performance metrics included the plotted ROC(t), the associated AUC(t) [2022], and the pvalue of a randomization test of its significance.
Randomization test
The randomization test is a nonparametric test by permuting the survival times of the training data to generate the null dataset that patients’ survival times are not associated with the clinical and gene expression variables. The prediction model was fit to the null dataset, and performance metrics were computed on the test dataset and compared to the corresponding metrics calculated from the observed data. The procedure was repeated 10,000 times. The proportion of the estimated metrics calculated from the null dataset that exceeded the metric calculated from the observed dataset was the pvalue of the randomization test. The metrics obtained by the randomization test were D_{xy}, pvalue of Cox model, R^{2} and AUC(t).
Power validation
Crossvalidation and bootstrapping are two methods commonly used to assess performance of a prediction model. Both methods are based on resampling techniques Cross validation involves repeatedly splitting the data into a training set and test set, where the training set is used for model development and the test set is for model validation and performance assessment. The predictive performance is the average of the numerous trainingtest partitions. In particular, a split sample validation refers to splitting the entire data into a training set and a test set, and only the test set is used to evaluate once without "crossing". Bootstrapping analyzes subsamples repeatedly, where each subsample is a random sample with replacement from the entire data. Various bootstrap methods such as the ordinary bootstrap, the leaveoneout bootstrap and the .632+ bootstrap are proposed and compared by Efron [39] and Efron and Tibshirani [40,41]. The power of the prediction models were evaluated by 2fold cross validation [42], and the procedure was repeated 5,000 times. The proportion of pvalues less than or equal to 0.05 were calculated as an estimate of the power.
Assessment of binary classification of patients in test dataset
Binary classifiers for the five models with the same signatures selected from the risk prediction models were developed using the support vector machine (SVM), random forest classification (RF) algorithms, and logistic regression. These three algorithms are the most frequently used algorithms and have been shown to perform well in the analysis of microarray data. Performance metrics were the numbers of misclassified samples for each metastasisfree survival threshold.
Results
The dataset of van 't Veer et al. [26] contained 78 primary breast cancers (34 from patients who developed distant metastases within 5 years (poor prognosis) and 44 from patients who continue to be diseasefree (good prognosis) after a period of at least 5 years). The available clinical variables included age, diameter, tumor grade, angioinvasion, oestrogen and progesterone receptor status, and lymphocytic infiltration. The 78 patients were used as training data to develop prediction models; an additional 19 patients including 7 with good prognoses and 12 with poor prognoses were used as test data. Although this dataset is small, its size represents many existing datasets that have a cancerrelated endpoint as the outcome variable with many genes as predictor variables.
Results of survival prediction
Table 1 shows the estimates of performance metrics for the five models from two fittings of the Cox model. Model I used the risk scores as an independent variable (Columns 3–7). Model II used the risk groups (high versus low risk group) as an independent variable (Columns 8–12), based on the median of the training scores. Table 1 also shows the calculated values of Somers’ rank correlation coefficient, D_{xy}. The predicted risk scores and risk rankings are shown in Additional file 1: Table S2.
Table 1. Performance metrics of the five prediction models for the breast cancer data: Somers’ correlation (D_{xy}); estimates of the hazard ratio (HR) with 95% confidence limits (CI), and pvalue for Cox Models I and II; and R^{2}for Cox Model I and Brier score (IBS) for Cox Model II
Additional file 1: Figure S1 shows the Kaplan–Meier survival curves with the pvalues from the logrank test, and the Brier scores of the five models for each followup time points are also shown in Additional file 1: Figure S2. As expected, the pvalues from the logrank test and Model II analysis are very close. ROC and AUC analyses at the 4, 5, 6 metastasisfree times for the five models show similar results to one another (Figure 1).
Figure 1. ROC curves for patients’ survival with AUC measures evaluated at 4, 5, and 6 years metastasisfree times for the five models.
The performance estimates obtained from Models I and II appear to be inconsistent for Models C and D (Table 1). Model C shows a small HR estimate and small absolute value of D_{xy} from the Model I analysis, but a significant HR estimate from the Model II analysis, while Model D shows the opposite. In all analyses, Model A has the smallest pvalues and the largest absolute D_{xy}, R^{2}, and AUC for all three time points. The estimates for D_{xy}, R,^{2} and AUC are useful for comparison between two prediction models, but the actual values, such as D_{xy} of −0.333, R^{2} of 0.311, and AUC of 0.84, are uninformative to infer the significance of the prediction.
The randomization (permutation) test was used to assess statistical significance of the observed performance measures shown in Table 1 and Figure 1, including the significance of the pvalue of the hazard ratio itself. The randomization test generated the null distribution of no association (no predictability) between the 78 training patients and 19 test patients to assess the significance of the risk scores predicted by a prediction model. We illustrated an analysis for D_{xy}, Pvalue, R^{2} and AUC metrics for Model I. In the permutation test, the survival times of the 78 patients were randomly permutated to generate a null dataset. Five prediction models, AE, were developed from the null dataset. In each of the five models, two predictive models were developed. M1 was developed using the same signature predictors developed from the original 78 patient training dataset. M2 was developed by generating new predictors based on the null dataset. In M1 the null distributions were generated conditionally on the same signature, while in M2 the null distributions were generated unconditionally. The null hypothesis under both models was that the signature developed does not associate with the test data. Each prediction model was applied to the 19 test patients; the performance metrics D_{xy}, Pvalue, R^{2} and AUC evaluated at 4, 5, and 6 year metastasisfree times were estimated. The procedure was repeated 10,000 times to generate the null distributions of the metrics. The proportion of the estimated metrics calculated from the null dataset that exceeded the metric calculated from the observed dataset was the pvalue of the randomization test shown in Tables 2 and 3. In Model A, the pvalues from M1 and M2 are identical in Table 2 since both models used all clinical variables.
Table 2. Pvalues of randomization test based on 10,000 permutations for the three measures: Somers’ correlation (D_{xy}), pvalue of the hazard ratio, and R^{2}from fitting the Cox proportional hazards model using the risk scores as independent variable
Table 3. Pvalues of randomization test based on 10,000 permutations for the AUC measures evaluated at 4, 5, and 6 years metastasisfree times
In Table 2, the pvalues estimated from the randomization test for the metrics Pvalue and R^{2} are very similar since both metrics measure the association under the Cox model. For Model B, the randomization test pvalues for D_{xy} and R^{2} appear to be different. The results from the left panel (original predictors) and right panels (reselected predictors) are similar. In general, the pvalues from the randomization test generally agree with the results based on asymptotic parametric tests from Model I of Table 1. The pvalues for the AUC metrics in Table 3 are generally higher than the pvalues for the metrics in Table 2. The AUC test was significance only at the 5year metastasisfree time (for Model A). In both M1 and M2, Model A has the smallest pvalues for D_{xy}, P, R^{2} and AUC at all three time points.
A twofold cross validation was used to estimate the power to detect an association between a prediction model and survival time. First the 78 and 19 patients were pooled. The 97 total patients were randomly split into a training set of 49 patients and a test set of 48 patients. Five prediction models were developed from the training set and then applied to the test set. (Since the original signature of each model was developed based on the 78 test samples, the original signature was no longer applicable in this analysis.) In crossvalidation, the pvalues were computed using the Cox models I and II, and the logrank test. The procedure was repeated 5,000 times. The proportion of pvalues less than or equal to 0.05 were calculated as an estimate of the power. The results are shown in Table 4. The results agree qualitatively with the results from the randomization test. Again, Model A appears to perform the best. Box plots of the empirical distributions of the pvalues are given in Additional file 1: Figure S3.
Table 4. The 97 total patients were randomly split into a training set and a test set
We further investigated the effect of the training and test set sizes on the power estimation. The numbers of patients for the training set investigated were 78, 65, 32, 25, and 19. Only the results from Model A are shown (Table 5). The results from other models are given in Additional file 1: Table S3S7. Table 5 shows that (78:19) and (19:78) are among the poorest performance. It appears that a small test sample size (78:19) will reduce the power. Steyerberg [43] also discusses this issue. On the other hand, a small training size (19:78) may affect the fitting of the model. The 2fold or 3fold cross validation can be used for power performance evaluation.
Table 5. Effect of training and test set sizes on the power for Model A
Binary classification and survivaltime prediction
Binary classification
The breast cancer dataset was first presented to develop a binary classifier of 70 signature genes based on 5year distant metastases [26]. The classifier misclassified 2 of the 19 test patients using both optimal accuracy and sensitivity threshold strategies. For the 19 patients in the test dataset, there were four patients (11, 12, 13, and 14) who had metastasisfree times between 4.77 and 5.23, around 5 years. We illustrate the analysis using 4year, 5year, and 6year metastasisfree times to define high and low risk groups.
Table 6 shows the numbers of misclassifications using the thresholds of 4year, 5year, and 6year metastasisfree times. The classification errors can be very different if different thresholds are used. Among the five models, Model B, as binary classifier, appears to have the best overall performance. The logistic regression performs better than the SVM and RF in model A. For the SVM algorithm, the numbers of misclassification errors based on the 4year, 5year, and 6year survival thresholds are 8, 3, and 3, respectively; the numbers are 7, 3, and 2 for the RF algorithm; the numbers are 7, 7, and 4 for logistic regression. The misclassified patients by the three algorithms are given in Additional file 1: Table S8. The misclassification errors between 4year and 5year differ substantially in SVM and random forest.
Table 6. The numbers of misclassifications for five binary classifiers using the support vector machine (SVM), random forest (RF) and logistic regression (LR) classification algorithms
Survivaltime prediction
Although a survival prediction model is developed to predict survival risks of patients based on their predictor profiles, it can also be used as a binary prediction model. Figure 2 shows the plot of patients’ survival times and their ranked predicted risks for the five survival prediction models, where patients are ranked according to their survival times (Additional file 1: Table S2). Thus, Patient #1 (at the top) had the shortest survival time and Patient #19 (at the bottom) had the longest survival time. The horizontal axis represents the patient’s rank according to the estimated risk score from a prediction model, where a rank of 1 corresponds to the highest estimated risk score, etc. The patients on the left have high risk scores and on the right have low risk scores. For example, in Model A, Patient #6 (ranked 1st) has the highest estimated risk score and Patient #19 (ranked 19th) has the lowest estimated risk score. The vertical line is the median of the training scores that separate the patients into the high and low risk groups for a twogroup comparison. This separation into two groups implies a binary classification. In fact, the ROC approach relies on an induced binary classification at each riskscore cutoff. With a ROC approach at the 5year metastasisfree time (Figure 2, horizontal line), the patients on the upper left region have longer survival times but are categorized in the high risk group; the patients on the lower right region have shorter survival times but are categorized in the low risk group. Thus, Patients #1, #3, #7, and #11 are misclassified in Model A. Different horizontal lines can be plotted to evaluate predictive performance for different time points. The ROC curves constructed by enumerating all 19 vertical cutoffs with the AUC measure are shown in Figure 1. The ROC is a line connecting all the points without smoothing, and there are fewer jumps than 19 because some of the 19 points have same true positive rate or false positive rate. Figure 3 shows plots of pvalues of the logrank test using all possible cutoffs for Models AE. The minimum pvalues occur generally in the range when the numbers of patients in the lowrisk group are between 8 and 11. It also indicates that the dichotomization of the survival risk scores into two groups to evaluate predictability could lead to different conclusions if different thresholds are applied.
Figure 2. Plot of metastasisfree survival time (vertical axis) of the 19 test patients versus the rank of the estimated risk score (horizontal axis) from the five risk prediction models. The patients were numbered according to the ranks of their survival times. The patients on the left have high estimated risk scores (low ranks) and on the right have low estimated risk scores. Performance of a risk prediction model can be assessed by analyzing the relationship between survival times and risk scores (see text). For example, the horizontal line represents a cutoff at the 5 year metastasisfree time and the vertical line is the median of the training scores. A ROC curve can be constructed by enumerating all 19 vertical cutoffs and AUC can be computed (Figure 1).
Figure 3. Logerank pvalues of the five models for all possible sizes of lowrisk group.
In summary, a binary classifier is developed from a training dataset where each patient is preassigned into either a lowrisk or highrisk group, while a survival risk prediction model is developed, based on the patients’ survival times without preassigning the patients into two groups. A binary classifier predicts a new patient as either high or low risk. A risk prediction model provides an estimate of risk score of a new patient; the estimated risk score can be compared with the median of the training scores to determine the patient’s risk group (high or low). The main deficiency in the use of a binary classifier to analyze survival time is the presence of censored observations. However, if there is no censoring, and the purpose is to classify patients’ survival risks at a specific time point of interest, then a binary classifier should be more powerful than a survival prediction model.
Discussion and Conclusions
The development of prediction models using the cancer TNM staging system combined with the basic clinical covariates and microarray gene expression variables for identifying highrisk and lowrisk patients for treatment recommendations has been an important goal in clinical oncology research. Several recent publications have investigated the use of microarray gene expression data to improve accuracy in estimating patient risk. However, the use of prediction models for clinical decision making still has many challenges to be overcome. A recent critical evaluation of published studies on lung cancer found little evidence that any of the reported gene expression signatures are ready for clinical application [21].
A prediction model is developed to predict survival risk of new patients which may come from different medical centers or different times. The ability to predict patients from different centers involves many factors, such as study protocols, microarray platforms, sample processing, and data preprocessing, etc. This study considers prediction of new patients assuming they are from the same study protocol. We focus on the assessment of performance of survival prediction models using five established prediction models.
Performance of a prediction model depends on the set of predictive signatures used in the model. Since the number of clinical variables is typically small, all clinical variables can be considered to develop a prediction model. On the other hand, since gene expression levels are often correlated, the set of predictors selected may vary substantially among different training samples, although the models predict about equally well [28]. It may not be feasible to come up with a general procedure to determine an optimal set of predictors (genes and clinical variables) for a “best” performance under the Cox model.
A common practice to assess performance of a survival risk prediction model is to evaluate its ability to separate the predicted risk scores of patients into low and high risk groups based on a particular cutoff threshold. However, the threshold has been defined differently; some researchers used the median or other percentiles of training scores as the cutoff [21,3234] and others used the median or other percentiles of the test scores [3537,44]. Different cutoffs to segregate the testing data could lead to different conclusions, and it also occurred in the binary classifiers such as SVM and random forest algorithm. A more fundamental issue is that a prediction model is developed, based on training of the available dataset, to predict new sample(s), classifying new patients as high or low risk based on the available data. Therefore, the median or other percentiles of the training scores should be used as a cutoff. In multiple center studies where a prediction model is developed from one center to predict patients of another center, comparing the medians of training and test scores will be useful to understand the underlying survival distributions of the two centers.
The survival time endpoint for risk prediction has been analyzed as a class prediction problem by dividing patients into two classes according to a survivaltime threshold such as the breast cancer data [26]. The binary response approach provides natural performance metrics such as positive and negative predictive values to enable unambiguous assessments. The binary response approach addresses the question of whether the patient will survive up to a specific time, say, t*, while the survivaltime risk prediction approach estimates the patient’s risk score. These two approaches address two different questions. The survivaltime prediction approach is generally more appropriate and natural for modeling survival data in the presence of censored observations. This paper illustrates that binary classifiers highly depended on how the risk groups were defined. Binder et al. [30] investigated the effects of the choice of threshold on the predictions and showed that there is little overlap of selected genes between an early and median threshold cutoffs, which might be due to shortterm and longterm effects of genes or the censoring pattern.
Performance of a risk prediction model is assessed by analyzing the relationship between survival times and risk scores. Many ROC studies mainly address a specific time point of interest [11,45,46]. Sun et al. [36] and van Belle et al. [47] showed time varying AUC measures for two different models to show an improvement of using gene expression data for predicting lung cancer survival, but the AUC measure from one model may not be consistently higher than the AUC measure from the other model across all time points. The assessment of the ROC curves for all time points might be needed. However, this can be impractical. Although accuracy comparison method developed by Moskowitz and Pepe [46] could be useful to assess performance among different models, this measure itself is inapplicable to assess the performance of a single model.
The Somers' index D_{xy} is a correlation measure for an overall concordance between predicted risk scores and observed survival times for the test data [11,44,4852]. A high correlation implies that the predicted patients’ risk scores are in good concordance with the patients’ survival times. In most studies that presented D_{xy} values [11,44,4850], they were used to show improvement of a new model [52] or to compare different models [34,53], without making inference to statistical significance. A few studies did report confidence limits [47,53]. Unlike R^{2}, D_{xy} does not depend on the fitting of the Cox model.
Hielscher et al. [23] compared seven existing R^{2}type measures and showed their behavior in simulation examples and a gene expression microarray dataset. This paper evaluated several measures that have commonly been used for the evaluation in clinical oncology, including pvalues of hazard ratios and logrank test, AUC, and three R^{2}type measures. A main conclusion in our analysis is that these existing metrics for evaluating the discriminatory ability of survival prediction models may lead to discordant results. In the lymphoma application, the seven R^{2}type measures reported in Table two of Hielscher et al. [23] were in agreement. They provided a summary of references of seven R^{2}type measures and available R software in Table three.
Cross validation of binary classifiers in gene expression data has been investigated extensively [54]. Cross validation of survival prediction models has not commonly been conducted. Recently, Subramanian and Simon [46] compared several resampling techniques for assessment of accuracy of risk prediction models, and their investigation covers various settings, including sample sizes, null model, number of kfold partitions, etc. Although they only evaluated the AUC(t) at t = 180 months, they recommended 5 or 10fold crossvalidation which has good balance between bias and variability in the different settings. Simon et al. [14] also showed how to utilize crossvalidation for the evaluation of prediction models using time dependent ROC curves. The cross validation to estimate power illustrated in this paper is similar to the approach used by Subramanian and Simon [46].
The pvalues of the hazard ratios or logrank test are commonly used to evaluate performance of risk prediction models. These pvalues provide direct assessment of significance of the measures of predictability; however, some models can give inconsistent conclusions. D_{xy} measures an overall concordance between the patients’ survival times and predicted risk scores. AUC provides a probability measure of predictive ability at a given time point. The pvalues of these two measures can be computed using the proposed randomization test, which cannot be derived theoretically. Both measures are very useful to assess performance of a single model or to compare different models.
Competing interest
The authors declare that they have no competing interests.
Authors’ contributions
JJC conceived the study and wrote the manuscript. HCC, RLK, and JJC improved the concepts. HCC developed and implemented the methodology and performed the analysis. RLK critically revised the context. KFC helped draft the manuscript. All authors read and approved the final manuscript.
Authors’ information
The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.
Acknowledgements
The authors are very grateful to reviewers for much helpful comments and suggestions for revising and improving this paper. The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.
References

Cox DR, Oakes D: Analysis of survival data. Chapman and Hall, London; 1984.

Gimotty PA, Guerry D, Ming ME, et al.: Thin primary cutaneous malignant melanoma: a prognostic tree for 10year metastasis is more accurate than American Joint Committee on Cancer staging.
J Clin Oncol 2004, 22:36683676. PubMed Abstract  Publisher Full Text

RadespielTroger M, Hohenberger W, Reingruber B: Improved prediction of recurrence after curative resection of colon carcinoma using treebased risk stratification.
Cancer 2004, 100:958967. PubMed Abstract  Publisher Full Text

Huang X, Soong SJ, McCarthy WH, Urist MM, Balch CM: Classification of localized melanoma by the exponential survival trees method.
Cancer 1997, 79:11221128. PubMed Abstract  Publisher Full Text

Banerjee M, Biswas D, Sakr W, Wood DP: Recursive partitioning for prognostic grouping of patients with clinically localized prostate carcinoma.
Cancer 2000, 89:404411. PubMed Abstract  Publisher Full Text

Segal MR, Bloch DA: A comparison of estimated proportional hazards models and regression trees.
Stat Med 1989, 8:539550. PubMed Abstract  Publisher Full Text

Segal MR: Features of treestructured survival analysis.
Epidemiology 1997, 8:344346. PubMed Abstract  Publisher Full Text

Alizadeh AA, Elsen MB, Davis RE, et al.: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403:503511. PubMed Abstract  Publisher Full Text

Waldman SA, Hyslop T, Schulz S, et al.: Association of GUCY2C expression in lymph nodes with time to recurrence and diseasefree survival in pN0 colorectal cancer.
JAMA 2009, 301:745752. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gordon GJ, Jensen RV, Hsiao LL, et al.: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma.
Cancer Res 2002, 62:49634967. PubMed Abstract  Publisher Full Text

Shedden K, Taylor JM, Enkemann SA, et al.: Gene expressionbased survival prediction in lung adenocarcinoma: a multisite, blinded validation study.
Nat Med 2008, 14:822827. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

FernandezTeijeiro A, Betensky RA, Sturla LM, Kim JYH, Tamayo P, Pomeroy SL: Combining Gene Expression Profiles and Clinical Parameters for Risk Stratification in Medulloblastomas.
J Clin Oncol 2004, 22:994998. PubMed Abstract  Publisher Full Text

Habermann TM, Wang SS, Maurer MJ, et al.: Host immune gene polymorphisms in combination with clinical and demographic factors predict late survival in diffuse large Bcell lymphoma patients in the prerituximab era.
Blood 2008, 112:26942702. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Simon RM, Subramanian J, Li MC, Menezes S: Using crossvalidation to evaluate predictive accuracy of survival risk classifiers based on highdimensional data. Briefings in Bioinformatics.
Brief Bioinform 2011, 12:203214. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bair E, Tibshirani R: Semisupervised methods to predict patient survival from gene expression data.

Simon RM: Interpretation of Genomic Data: Questions and Answers.
Semin Hematol 2008, 45:196204. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the Yield of Medical Tests.
JAMA 1982, 247:25432546. PubMed Abstract  Publisher Full Text

Newson R: Confidence intervals for rank statistics: Somers' D and extensions.

Heagerty PJ, Lumley T, Pepe MS: Timedependent ROC curves for censored survival data and a diagnostic marker.
Biometrics 2000, 56:337344. PubMed Abstract  Publisher Full Text

Segal MR: Microarray gene expression data with linked survival phenotypes: diffuse largeBcell lymphoma revisited.
Biostatistics 2006, 7:268285. PubMed Abstract  Publisher Full Text

Buyse M, Loi S, van’t Vee L, et al.: Validation and clinical utility of a 70gene prognostic signature for women with nodenegative breast cancer.

Subramanian J, Simon RM: Gene expressionbased prognostic signatures in lung cancer ready for clinical use?
J Natl Cancer Inst 2010, 102:464474. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hielscher T, Zucknick M, Werft W, Benner A: On the prognostic value of survival models with application to gene expression signatures.

Schumacher M, Binder H, Gerds T: Assessment of survival prediction models based on microarray data.
Bioinformatics 2007, 23:17681774. PubMed Abstract  Publisher Full Text

Dunkler D, Michiels S, Schemper M: Gene expression profiling: does it add predictive accuracy to clinical characteristics in cancer prognosis?
Eur J Cancer 2007, 43:745751. PubMed Abstract  Publisher Full Text

van ‘tVeer LJ, Dai H, van de Vijver MJ, et al.: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Dupuy A, Simon RM: Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting.
J Natl Cancer Inst 2007, 99:147157. PubMed Abstract  Publisher Full Text

Zhu ZH, Sun BY, Ma Y, et al.: Three Immunomarker support vector machines–based prognostic classifiers for Stage IB Non–SmallCell Lung Cancer.
J Clin Oncol 2009, 27:10911099. PubMed Abstract  Publisher Full Text

Drozdov I, Kidd M, Nadler B, et al.: Predicting neuroendocrine tumor (carcinoid) neoplasia using gene expression profiling and supervised machine learning.
Cancer 2009, 115:16381650. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Binder H, Porzelius C, Schumacher M: An overview of techniques for linking highdimensional molecular data to timetoevent endpoints by risk prediction models.
Biom J 2011, 53(2):170189. PubMed Abstract  Publisher Full Text

Graf E, Schmoor C, Sauerbrei W, Schumacher M: Assessment and comparison of prognostic classification schemes for survival data.
Stat Med 1999, 18:25292545. PubMed Abstract  Publisher Full Text

Yu SL, Chen HY, Chang GC, et al.: MicroRNA signature predicts survival and relapse in lung cancer.
Cancer Cell 2008, 13:4857. PubMed Abstract  Publisher Full Text

Hu Z, Chen X, Zhao Y, et al.: Serum MicroRNA signatures identified in a genomewide serum MicroRNA expression profiling predict survival of non–smallcell lung cancer.
J Clin Oncol 2010, 28:17211726. PubMed Abstract  Publisher Full Text

Cho JY, Lim JY, Cheong JH, et al.: Gene expression signaturebased prognostic risk scores in gastric cancer.
Clin Cancer Res 2011, 17:18501857. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Decaux O, Lode L, Magrangeas F, et al.: Prediction of survival in multiple myeloma based on gene expression profiles reveals cell cycle and chromosomal instability signatures in highrisk patients and hyperdiploid signatures in lowrisk patients: a study of the intergroupe francophone du Myelome.
J Clin Oncol 2008, 26:47984805. PubMed Abstract  Publisher Full Text

Sun Z, Wigle DA, Yang P: Nonoverlapping and noncelltypespecific gene expression signatures predict lung cancer survival.
J Clin Oncol 2008, 26:877883. PubMed Abstract  Publisher Full Text

Korkola JE, Houldsworth J, Feldman DR, et al.: Identification and validation of a gene expression signature that predicts outcome in adult men with germ cell tumors.
J Clin Oncol 2009, 27:52405247. PubMed Abstract  Publisher Full Text

Schemper M: The relative importance of prognostic factors in studies of survival.
Stat Med 1993, 12:23772382. PubMed Abstract  Publisher Full Text

Efron B: Estimating the error rate of a prediction rule: improvement on cross validation.
J Am Stat Assoc 1983, 78:316331. Publisher Full Text

Efron B, Tibshirani R: Improvement on crossvalidation: the.632+ bootstrap method.

Efron B, Tibshirani R: An Introduction to the Bootstrap. Chapman and Hall, New York; 1998.

Baek S, Tsai CA, Chen JJ: Development of biomarker classifiers from highdimensional data.
Brief Bioinfor 2009, 10:537546. Publisher Full Text

Steyerberg EW: Clinical Prediction Models, A Practical Approach to Development, Validation and Updating. Springer, New York; 2009.
Section 19.7: 352–357

Smith JJ, Deane NG, Wu F, et al.: Experimentally derived metastasis gene expression profile predicts recurrence and death in patients with colon cancer.
Gastroenterology 2010, 138:958968. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Loi S, HaibeKains B, Desmedt C, et al.: Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade.
J Clin Oncol 2007, 25:12391246. PubMed Abstract  Publisher Full Text

Subramanian J, Simon R: An evaluation of resampling methods for assessment of survival risk prediction in highdimensional settings.
Stat Med 2011, 30:642653. PubMed Abstract  Publisher Full Text

van Belle V, van Calster B, Brouckaert O, et al.: Qualitative assessment of the progesterone receptor and HER2 improves the nottingham prognostic index up to 5 years after breast cancer diagnosis.
J Clin Oncol 2010, 28:41294134. PubMed Abstract  Publisher Full Text

Wierda WG, O’Brien S, Wang X, et al.: Prognostic nomogram and index for overall survival in previously untreated patients with chronic lymphocytic leukemia.
Blood 2007, 109:46794685. PubMed Abstract  Publisher Full Text

Kattan MW, Karpeh MS, Mazumdar M, Brennan MF: Postoperative Nomogram for DiseaseSpecific Survival After an R0 Resection for Gastric Carcinoma.
J Clin Oncol 2003, 21:36473650. PubMed Abstract  Publisher Full Text

Hoster E, Dreyling M, Klapper W, et al.: Anew prognostic index (MIPI) for patients with advancedstage mantle cell lymphoma.
Blood 2008, 111:558565. PubMed Abstract  Publisher Full Text

Lau SK, Boutros PC, Pintilie M, et al.: ThreeGene Prognostic Classifier for EarlyStage Non–SmallCell Lung Cancer.
J Clin Oncol 2007, 25:55625569. PubMed Abstract  Publisher Full Text

Amstrong AJ, GarrettMayer E, de Wit Ronald , Tannock I, Eisenberger M: Prediction of Survival following FirstLine Chemotherapy in Men with CastrationResistant Metastatic Prostate Cancer.
Clin Cancer Res 2010, 16:203211. PubMed Abstract  Publisher Full Text

Moskowitz CS, Pepe MS: Quantifying and comparing the accuracy of binary biomarkers when predicting a failure time outcome.
Stat Med 2004, 23:15551570. PubMed Abstract  Publisher Full Text

Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods.
Bioinformatics 2005, 21:33013307. PubMed Abstract  Publisher Full Text
Prepublication history
The prepublication history for this paper can be accessed here: