Dealing with high dimensional markers, such as gene expression data obtained using microarray chip technology or genomics studies, is a key challenge because the numbers of features greatly exceeds the number of biological samples. After selecting biologically relevant genes, how to summarize the expression of selected genes and then further build predicted model is an important issue in medical applications. One intuitive method of addressing this challenge assigns different weights to different features, subsequently combining this information into a single score, named the compound covariate. Investigators commonly employ this score to assess whether an association exists between the compound covariate and clinical outcomes adjusted for baseline covariates. However, we found that some clinical papers concerned with such analysis report bias p-values based on flawed compound covariate in their training data set.
We correct this flaw in the analysis and we also propose treating the compound score as a random covariate, to achieve more appropriate results and significantly improve study power for survival outcomes. With this proposed method, we thoroughly assess the performance of two commonly used estimated gene weights through simulation studies. When the sample size is 100, and censoring rates are 50%, 30%, and 10%, power is increased by 10.6%, 3.5%, and 0.4%, respectively, by treating the compound score as a random covariate rather than a fixed covariate. Finally, we assess our proposed method using two publicly available microarray data sets.
In this article, we correct this flaw in the analysis and the propose method, treating the compound score as a random covariate, can achieve more appropriate results and improve study power for survival outcomes.
High-dimensional omics data
Personalized medicine is expected to enable a more predictive discipline, in which therapies are targeted toward the molecular constitution of individual patients and their disease; thus, molecular biomarkers are widely expected to revolutionize the current practice of medicine. For example, the progress of genomics has made it possible to evaluate molecular signatures to predict cancer metastasis [1,2]. Various technological breakthroughs have led to a plethora of high-dimensional omics data to support personalized medicine, and these data have a common characteristic: the numbers of features greatly exceeds the number of biological samples. Because biological phenomena are the result of sets of features (e.g., concerted expression of multiple genes), the analysis of a group of related features (e.g., genes) may be more effective and may provide more directly interpretable results than the analysis of individual genes.
As high-dimensional omics research has advanced, the compound covariate (or compound score) has generally been held as a simpler and more straightforward approach. After selecting biologically relevent genes in training cohort, such a score is often a useful device in medical applications to define the information contained in a single set of data and to summarize the association of a set of variables with disease. Tukey  first advocated the use of compound covariates in the clinical trial setting. To develop a compound score, the individual covariates are summed; the association between such a compound covariate and outcome then is evaluated via regression analysis. Tomasson  used a compound score for binary outcomes, via fitting a logistic regression. Later, Hedenfalk  successfully applied the compound covariate method to class prediction analysis for breast cancer data. Because the use of the compound covariate is intuitive and seems useful, many other leading researchers also have applied this method for analyzing omics data sets [6-9].
A compound covariate is a linear combination of the basic covariates being studied, with each covariate having its own coefficient or weight. For survival outcomes, a commonly used scheme is to 1) compute the univariate Cox regression  for each gene of interest, 2) assign a weight to each gene (typically, the estimated regression coefficients or Wald statistics from the univariate Cox regressions), and 3) combine the weighted genes in a linear model that incorporates gene expression levels in each sample. This method of modeling weighted genes is believed to reflect the importance of each individual gene to the outcome; the higher the weight assigned, the more significant a particular gene is.
However, the linear combination of the group of genes, with each gene having its own estimated weight, should not be treated as an observed covariate or fixed covariate. Because selected weights are estimated through computing the univariate Cox regression of each individual gene, the compound "covariate" should be treated as a random covariate that includes estimated error. Besides, for the purpose of assessing whether an association exists between the compound covariate and survival outcomes, Cox regression is typically used to evaluate the significance level of the parameter of the compound score. However, bias concerns arise when the same data set, training cohort, is used for a double purpose: to construct the compound covariate and then to test it. This framework results in an over-fitting problem. As shown in Figure 1, we simulate 50 observations with 3, 5, 10, and 30 non-causal genes used to create a compound covariate. The Cox regression model is then used to test whether an association exists between the compound scores and survival outcomes in the same dataset. The distribution of p-values should be uniform in the interval [0, 1]. In our simulation, however, p-values are concentrated around zero, especially as the number of genes increases. This demonstrates that type I error rates are inflated and consequently not controlled. Obviously, double using the training cohort casue overfitting problem and bias p-values arised. We found that some medical papers report inappropriate p-values for testing training cohort data [11,6]. Although the proposed bias p-values in their training cohort do not affact their final research results inferred from another independent testing data set, these observations motive us to study a more appropriate testing procedure for the compound covariate if the investagtors whish to report a testing result in the training cohort.
Figure 1. P-value distribution under the null hypothesis with nominal level 0.05.
In this paper, we first contend the compound covariate should be treated as a random observation. Our idea is based on that proposed by Prentice , who analyzed covariates with measurement error and used a partial likelihood function technique to infer whether the parameter for the covariate was significant. In addition, if a training data set is used for a double purpose (i.e., to construct the compound covariate and then to test it), the resulting over-fitting means the p-value is not reliable when testing the regression parameter. Therefore, we use a 2-fold method (e.g., [13,14]), splitting all observations in the training cohort into two parts, one part for assigning gene weights, and another part for testing the regression parameter through a partial likelihood score test. The remainder of this paper is organized as follows: We outline creation of the compound score using a random covariate approach. Then, we investigate the accuracy of the asymptotic distribution of the proposed tests. We thoroughly assess the performance of two commonly used estimated weights, "estimated coefficient" and "Wald statistic", for the Cox proportional hazards model. Finally, we illustrate the implementation of the proposed methods through two real data sets, and offer concluding remarks.
The proposed method
The compound covariate
In this section, we formally define some notations for compound covariate and introduce a procedure to identify whether a set of genes is truly associated with survival times in the training cohort. Let Tj denote the survival time and Cj denote the censoring time independent of Tj for the jth patient, with j = 1, 2, ..., n. When some observations are right-censored, one can observe only random variables Xj = min(Tj, Cj) and δj = I(Tj ≤ Cj), where I(A) is an indicator function of event A, assuming the value 1 if event A occurs and the value 0 otherwise. Let xj = (xj1, xj2, ..., xjp) be the standardized selected gene's intensity in the jth patient, where p is the size of the gene set. For creating the compound covariate, one first fits the univariate Cox regression model for each individual gene, that is,
where h0k(t) is a baseline hazard function of each gene k, and βk is a corresponding parameter to be estimated. Let be the estimators of β1, β2, ..., βp, and be the Wald statistics obtained from (1). A compound covariate commonly used by clinicians is defined as
or another possible weighting policy depends on Wald statistics,
for each patient j, j = 1, 2, ..., n. From the perspective of biology, the weighting policy is believed to reflect the importance of each individual gene to survival outcome, the higher the weight, the more important the gene is. In other words, the score can be regarded as a condensed index, representing the collective effects of gene expression.
To identify whether this gene expression pattern is truly associated with survival in training cohort, investigators prefer to use Cox regression analysis. That is, after fitting model (1), they construct
where h0(t) is a baseline hazard function and γ0 is a corresponding parameter for the compound covariate z; they then use the same data set to test the null hypothesis H0 :γ0 = 0. Under the null hypothesis, however, the method results in uncontrolled type I error, because the training data set has been used twice, both for building the model and for testing the regression parameter. If independent data are available, carry from training data set and test on another independent data set is possible, allowing unbiased model validation to prevent over-fitting. However, if investagtors whish to report a testing result in the training cohort, an alternative, though less optimal, study design is using k-fold method or split the training cohort data randomly (2-fold), with 50% of the data being assigned to develop the score, compound covariate, and 50% to evaluate its performance. The limitation of this approach is that it requires a relatively large sample size. With this method, Kaplan-Meier survival curves  for the two sets should be examined to ensure no significant difference by the random selection of those two sets from training cohort data.
Cox regression with a random compound covariate
The measuring mechanism makes the compound covariate an estimation and not a fixed observable. Naturally, such a covariate should be treated as a random covariate, and the variance of each score needs to be taken into account. To fit a Cox regression model with a random covariate, we use the idea advocated by Prentice, which presents the Cox model as a multiplicative hazards model, with a relative risk at time t,
This is a weighted average of a possible relative risk given the covariate z. The Cox regression model then can be written as
Because omics data sets involve a large number of features, we assume the distributions of the scores follow normal distribution. That is, for each patient j, assuming zj is drawn from a normal distribution with mean μj and variance , (2) can be derived as
Thus, a quadratic term, , is added to the relative risk function, accounting for the variance in the random covariate. In addition, with both random covariates z (in this case, the compound covariate) and observed covariates w with dimension d (in this case, the clinical variable), it is easy to incorporate the fixed covariate effects into the Cox model, as:
A partial likelihood function and score test
Suppose now that t1 < ... <tl are the ordered distinct survival times in the sample, and let R(ti) and F(ti) denote the risk set prior to ti and the set of subjects failing at ti, respectively. The partial likelihood function is:
where mi is the number of failures at time ti(i = 1, 2,...l). Let , and where (The explicit forms of a, b and c are shown in Additional file 1). The score statistic then can be derived as
with the observed information matrix V
Consequently, under the null hypothesis , the partial likelihood score test vTV-1 v will have an asymptotic distribution when V is nonsingular. In addition, (3) can be used in a standard manner for estimation. In practice, we can calculate a p-value by replacing μj with zj and with Var(zj) where
by using Wald statistics as weight. We show the derivation in more detail in Additional file 2.
Multiple gene sets
Further, we extended the compound covariate to multiple gene sets. If there are q given independent gene sets for the jth patient, , where , s = 1, 2, ..., q , the compound covariates can be written as vector
where ps is the number of genes of the sth gene set. Then, the partial likelihood function can be written as
Let , and , where . The score statistic and the observed information matrix can be further derived as (3) and (4) as well. Consequently, under the null hypothesis , the partial likelihood score test vTV-1 v has an asymptotic distribution when V is nonsingular. If we reject the null hypothesis, we can conclude that the covariate vector is associated with survival time.
To assess the performance of the proposed testing procedure for compound covariate, we conducted simulation studies under various scenarios to study type I error rate and power. For the scenario of split training data set as two parts and the consideration of compound scores as random covariates, we denoted the compound score using as a weight function as SRCB, and the compound score using the Wald statistic as a weight function as SRCW. The corresponding notations, SCB and SCW, refer to split data but without treating the compound covariate as a random covariate (i.e., typical Cox regression ). The notations DCB and DCW refer to compound scores with double use of the training data set, both for building the model and then testing for the same data set. Assume there are p selected genes in a gene set. Gene expression data were generated from a multivariate normal distribution with mean vector β = [β1, β2, ..., βp]T, and variance-covariance matrix equal to the identity matrix for n cases. Generated survival times were associated with gene expression via a proportional hazard model, exp(xTβ). All tests with nominal significance level 0.05 were applied and empirical rejection probability was obtained based on 2000 simulation runs.
For comparing empirical type I error rates, the value of β was set to 0. The total sample size was set to 50, 75, or 100. After gene selection process, we assume the total number of disease relative genes was set to 10, 30, 50, or 70. Censoring times (denoted as cen.) were generated from an exponential distribution, and the overall censoring fraction in either setup was fixed at 10% or 40%. Table 1 shows the empirical type I error rates. As shown, the proposed procedure for SRCB and SRCW preserve reasonable type I error rates. The methods DCB and DCW, however, which make double use of the data set, do not control type I error. Note that we do not show type I error for SCB and SCW methods, because these methods are typical Cox regression.
Table 1. Empirical type I error rates
Because type I error rates are preserved for both the SRCB/SRCW and SCB/SCW methods, we compared the power under each method in Table 2. In this simulation, censoring times were also generated from an exponential distribution, and the overall censoring fraction in either setup was fixed at 10%, 30%, or 50%. We then simulated 30 total genes in one gene set under two different scenarios. The first scenario considers 30 disease related genes. We designed two different levels of effect, strong effect and low effect, as
Table 2. Power comparison under two different scenario
respectively. The second scenario considers 3 disease related genes, with the other 27 genes considered "noise" (i.e., no effect). Strong effect and low effect in this case were set as
Results are shown in Table 2.
For scenario 1, all 30 genes have effects. As expected, the power of the tests increases with increase in total sample size and gene effect, but decreases as the censoring proportion grows. Under the first scenario, the power of the SRCB method is always better than that of SCB, and SRCW is always better than SCW. This result indicates that treating the compound score as a random covariate yields higher power than treating the score as a fixed covariate. When the sample size is 100, the power average increases 10.6, 3.5, and 0.4 percentage points for 50%, 30%, and 10% censoring, respectively. This is a reasonable result, because the random covariate approach involves fitting a quadratic Cox regression model, , instead of exp(γ0, μj). The quadratic form takes into account the variance of each score, use of the compound score without acknowledgement of covariate error yields lower power.
To further illustrate the effect of treating the compound score as a random covariate, in Figure 2, we show the power curves of the SRCB, SCB, SRCW, and SCW methods for gene effect varying from -1 to 1. The total sample size was set to 100, and the censoring fraction was fixed at 50%. As shown in Figure 2, difference in power between SRCB and SCB and between SRCW and SCW increases as gene effect size increases, when gene effect size increases, the quadratic term can more accurately account for variance in the effect. Thus, treating the compound score as a random covariate in the Cox regression model provides greater power.
Figure 2. Power curves with varying gene effect.
For the first scenario, tests based on SRCB have higher power than those based on SRCW. In the second scenario, however, SRCW yields higher power than SRCB. This pattern change seems to relate to the increase in noise in the gene set. For Figure 3, we fixed the total number of genes at 30, sample size, 50; censoring fraction, 10% in one gene set, and show the power curves as gene effect and number of noise genes increase. There are eight lines in Figure 3. All solid lines indicate power curves for SRCB while all long-dash lines indicate power curves for SRCW. Situation a has 30 disease related genes, b has 10 disease related genes and 20 noise genes; c has 5 disease related genes and 25 noise genes, and d has 3 disease related genes and 27 noise genes. As gene effect increases, all powers increase. As the number of noise genes increases (from a to d), however, the powers of SRCB and SRCW decrease. With no noise genes (case a), the power of SRCB is always greater than that of SRCW. As the number of noise genes increases, however, the power of SRCW gradually improves over that of SRCB. This results from the fact that the compound covariate SRCW takes into account the variance of . Similar results were obtained with larger sample size (75, 100) and censoring fraction (30%, 50%). Consequently, if prior biological knowledge indicates many noise genes in a given gene set/pathway, we recommend use of the compound covariate SRCW over SRCB.
Figure 3. Power curves with varying gene effect and number of noise genes (sample size, 50, censoring fraction, 10%).
Figure 4 shows the power curve for SRCB with different sample sizes and different numbers of disease related genes. In this simulation design, the censoring fraction was fixed at 30%. The effect of all disease related genes was set to 0.5. As expected, the power increases as sample size grows. Power decreases, however, as the number of disease genes increase. Under this specified setting, if we need 80% power and have sample size 100, we can include about 90 disease related genes in this analysis. If we have sample size 60, however, we can only include fewer than 30 disease related genes. This result indicates the need for greater sample size to preserve power when a a gene set includes a large number of disease genes.
Figure 4. Power curves with different numbers of genes and sample sizes.
In this section, we demonstrate our methodology using two examples, an Amsterdam 70-gene breast cancer gene signature  and a data set involving two pathways for non-small-cell lung cancer. All tests with nominal level 0.05 were applied to the training cohort. The R code for obtaining p-values for the proposed testing procedure is available from the authors upon request.
Breast cancer data set
The well-known Amsterdam 70-gene breast cancer gene signature was published by Van't Veer . To evaluate the previously established 70-gene prognosis file, Van de Vijver  further classified an additional 295 consecutive patients with stage I or II breast cancer to validate the breast cancer gene signature. Because the 295 patients are independent of the original data, we re-analyzed this data set using our methodology. In this data set, patients were followed for a median of 7.2 years, with 79 observed deaths. The survival curve is shown in Figure 5 (a) and the testing results, including estimated coefficients (Coef.), relative risk (RR), and p-values are given in Table 3.
Although all coefficients and relative risks are very close, the p-values are very different. When using DCB and DCW, the p-values are 8.6 × 10-13 and 1.1 × 10-13, respectively. When treating the compound covariate as fixed, the p-values of SCB and SCW are 1.1 × 10-7 and 1.3 × 10-7. When using our procedure, the p-values of SRCB and SRCW are 1.9 × 10-8 and 1.8 × 10-8. Although the results remain significant regardless of method, we achieve appropriate p-values for the training cohort, showing that the 70-gene prognosis signature can be used to evaluate early events in breast cancer patients. We get consistent conclusion with the other researches [17,16].
Non-small cell lung cancer data set
We also tested our method by applying it to a publicly available non-small-cell microarray data set downloaded from National Center for Biotechnology Information Gene Expression Omnibus (GSE14814). There are 90 gene expression profiling conducted on mRNA isolated from frozen tumor samples. In this example, two well-known cancer-related pathways were used to test association with survival outcomes for demonstration purposes. The first signaling pathway is the p53 pathway, which is induced by a number of stress signals, including DNA damage, oxidative stress, and activated ontogenesis. The other pathway is the NOD-like receptor signaling pathway, which has been associated with an increased risk for the development of different types of cancer . There are 61 genes in p53 pathway and 54 genes in NOD-like receptor signaling pathway, respectively. The median follow-up time of these patients was 5.4 years, and the number of observed deaths was 29. Figure 5 (b) shows the survival curve, and the test results are given in Table 4.
Table 4. Non-small-cell lung cancer data set analysis
To summarize all the information, two compound covariates were used. As shown, conventional Cox regression yields overall p-values that are strongly statistically significant (2.29 × 10-6 for DCB and 1.85 × 10-5 for DCW). When treating the compound score as a fixed covariate and using a split data set, however, the p-values of SCB and SCW become 0.432 for both. When treating the compound score as a random covariate, the p-values of SRCB and SRCW become 0.236 and 0.358, respectively. Such divergent p-values suggest that an inappropriate method may well lead to misleading results.
In this paper, we focused on survival outcomes and proposed a feasible and correct method for testing the compound covariate to evaluate its association with survival outcomes for training cohort data. We have described the use of a random covariate, SRCB/SRCW, to achieve correct testing results for training cohort data and moderately improve power as compared to the use of SCB/SCW. Simulation study shows that our proposed method performs consistently better than SCB/SCW, because the quadratic term utilized in the SRCB/SRCW method takes into account error in the compound covariate. We further found that an increase in sample size improves power when there is a high proportion of censored data. If the gene set of interest includes noise genes, we suggest that the compound covariate SRCW is a better choice than SRCB; whether noise genes or non-functionally related genes are hidden in gene set is a judgment call for a geneticist. In addition, we contend that a flaw of biomedical papers concerned with such topics report an bias p-value based on flawed compound covariate analysis for the same training data set. In this paper, we use a well-known 2-fold concept, with one part of the data to built compound covariate and the remainder part for testing if there is association between survival outcomes and the score to ensure correct p-values in the training data set. Note that we need to check the proportional hazards assumption.
Our method can simultaneously test for more than one gene set in a training cohort data. More generally, this procedure can be applied not only for survival outcomes, but also for binary or continuous outcomes. The weighted flexible compound covariate method WFCCM , an extension of the compound covariate, also allows for use of the method of statistical analysis presented here. In addition, this method can easily be extended to consider the interaction between random covariates and clinical observed covariates, as
using the same analysis procedure. The chosen weight or can be adjusted by the other clinical observed covariates in the proposed framework. Our method, however, cannot be used to test the interaction between two random covariates, because of the complexity of specifying the distribution for the interaction between two random covariates; this is an area worthy of further investigation. Another one potential practical concern of the proposed method is that sample size must not be too small, higher fractions of censored data create the need for further increased sample size. Similarly, to achieve high power when studying a large number of genes, greater sample size is needed. When studying a large number of genes, ignoring the covariance that exists between genes does not influence the type I error rate, however, taking the covariance into account may increase power. Further research is required to address these limitations. Note that the permutation test (e.g., ) might be another method to calculate an appropriate p-value for the training dataset, however, with the permutation test, weights are not easily adjusted by the other covariates. Even for a small gene set, this approach may appear too expensive in computer time.
The authors declare that they have no competing interests.
PFS developed the mathematical derivations, designed and performed the simulation, and drafted the manuscript. Xi and HC performed the experimence about microarray analysis and gave valuable advice related to the compound score issues. YS supervised the research and managed this project. All authors read and approved the final manuscript.
The authors wish to thank editorial assistants, Lynne Berry and Yvonne Poindexter, for their editorial work on this manuscript. This work was supported by National Cancer Institute (grant numbers P30 CA068485, P50 CA095103, P50 CA090949, P50 CA098131).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer.
Wang Y, Klijn J, Zhang Y, Sieuwerts A, Look M, Yang F, Talantov D, Timmermans M, Meijer-van Gelder M, Yu J, Jatkoe T, Berns E, Atkins D, Foekens J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.
Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G: Gene-expression profiles in hereditary breast cancer.
Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, Cheng CL, Wang CH, Terng HJ, Kao SF, Chan WK, Li HN, Liu CC, Singh S, Chen WJ, Chen JJ, Yang PC: A five-gene signature and clinical outcome in non-small-cell lung cancer.
Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JMG, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma.
Salmon S, Chen H, Chen S, Herbst R, et al.: Classification by mass spectrometry can accurately and reliably predict outcome in patients with non-small cell lung cancer treated with erlotinib-containing regimen.
Biometrika 1982, 69:331-342. Publisher Full Text
Ann Appl Stat 2007, 1:107-129. Publisher Full Text
J Amer Stat Assoc 1958, 53:457-481. Publisher Full Text
van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer.
Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, Saghatchian d'Assignies M, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer.
Shyr Y, Kim K: Weighted flexible compound covariate method for classifying microarray data. In A practical approach to microarray data analysis. Norwell: Kluwer Academic Publishers, New York; 2003:186-200.