Abstract
Background
Dealing with high dimensional markers, such as gene expression data obtained using microarray chip technology or genomics studies, is a key challenge because the numbers of features greatly exceeds the number of biological samples. After selecting biologically relevant genes, how to summarize the expression of selected genes and then further build predicted model is an important issue in medical applications. One intuitive method of addressing this challenge assigns different weights to different features, subsequently combining this information into a single score, named the compound covariate. Investigators commonly employ this score to assess whether an association exists between the compound covariate and clinical outcomes adjusted for baseline covariates. However, we found that some clinical papers concerned with such analysis report bias pvalues based on flawed compound covariate in their training data set.
Results
We correct this flaw in the analysis and we also propose treating the compound score as a random covariate, to achieve more appropriate results and significantly improve study power for survival outcomes. With this proposed method, we thoroughly assess the performance of two commonly used estimated gene weights through simulation studies. When the sample size is 100, and censoring rates are 50%, 30%, and 10%, power is increased by 10.6%, 3.5%, and 0.4%, respectively, by treating the compound score as a random covariate rather than a fixed covariate. Finally, we assess our proposed method using two publicly available microarray data sets.
Conclusion
In this article, we correct this flaw in the analysis and the propose method, treating the compound score as a random covariate, can achieve more appropriate results and improve study power for survival outcomes.
Introduction
Highdimensional omics data
Personalized medicine is expected to enable a more predictive discipline, in which therapies are targeted toward the molecular constitution of individual patients and their disease; thus, molecular biomarkers are widely expected to revolutionize the current practice of medicine. For example, the progress of genomics has made it possible to evaluate molecular signatures to predict cancer metastasis [1,2]. Various technological breakthroughs have led to a plethora of highdimensional omics data to support personalized medicine, and these data have a common characteristic: the numbers of features greatly exceeds the number of biological samples. Because biological phenomena are the result of sets of features (e.g., concerted expression of multiple genes), the analysis of a group of related features (e.g., genes) may be more effective and may provide more directly interpretable results than the analysis of individual genes.
As highdimensional omics research has advanced, the compound covariate (or compound score) has generally been held as a simpler and more straightforward approach. After selecting biologically relevent genes in training cohort, such a score is often a useful device in medical applications to define the information contained in a single set of data and to summarize the association of a set of variables with disease. Tukey [3] first advocated the use of compound covariates in the clinical trial setting. To develop a compound score, the individual covariates are summed; the association between such a compound covariate and outcome then is evaluated via regression analysis. Tomasson [4] used a compound score for binary outcomes, via fitting a logistic regression. Later, Hedenfalk [5] successfully applied the compound covariate method to class prediction analysis for breast cancer data. Because the use of the compound covariate is intuitive and seems useful, many other leading researchers also have applied this method for analyzing omics data sets [69].
Problem statements
A compound covariate is a linear combination of the basic covariates being studied, with each covariate having its own coefficient or weight. For survival outcomes, a commonly used scheme is to 1) compute the univariate Cox regression [10] for each gene of interest, 2) assign a weight to each gene (typically, the estimated regression coefficients or Wald statistics from the univariate Cox regressions), and 3) combine the weighted genes in a linear model that incorporates gene expression levels in each sample. This method of modeling weighted genes is believed to reflect the importance of each individual gene to the outcome; the higher the weight assigned, the more significant a particular gene is.
However, the linear combination of the group of genes, with each gene having its own estimated weight, should not be treated as an observed covariate or fixed covariate. Because selected weights are estimated through computing the univariate Cox regression of each individual gene, the compound "covariate" should be treated as a random covariate that includes estimated error. Besides, for the purpose of assessing whether an association exists between the compound covariate and survival outcomes, Cox regression is typically used to evaluate the significance level of the parameter of the compound score. However, bias concerns arise when the same data set, training cohort, is used for a double purpose: to construct the compound covariate and then to test it. This framework results in an overfitting problem. As shown in Figure 1, we simulate 50 observations with 3, 5, 10, and 30 noncausal genes used to create a compound covariate. The Cox regression model is then used to test whether an association exists between the compound scores and survival outcomes in the same dataset. The distribution of pvalues should be uniform in the interval [0, 1]. In our simulation, however, pvalues are concentrated around zero, especially as the number of genes increases. This demonstrates that type I error rates are inflated and consequently not controlled. Obviously, double using the training cohort casue overfitting problem and bias pvalues arised. We found that some medical papers report inappropriate pvalues for testing training cohort data [11,6]. Although the proposed bias pvalues in their training cohort do not affact their final research results inferred from another independent testing data set, these observations motive us to study a more appropriate testing procedure for the compound covariate if the investagtors whish to report a testing result in the training cohort.
Figure 1. Pvalue distribution under the null hypothesis with nominal level 0.05.
In this paper, we first contend the compound covariate should be treated as a random observation. Our idea is based on that proposed by Prentice [12], who analyzed covariates with measurement error and used a partial likelihood function technique to infer whether the parameter for the covariate was significant. In addition, if a training data set is used for a double purpose (i.e., to construct the compound covariate and then to test it), the resulting overfitting means the pvalue is not reliable when testing the regression parameter. Therefore, we use a 2fold method (e.g., [13,14]), splitting all observations in the training cohort into two parts, one part for assigning gene weights, and another part for testing the regression parameter through a partial likelihood score test. The remainder of this paper is organized as follows: We outline creation of the compound score using a random covariate approach. Then, we investigate the accuracy of the asymptotic distribution of the proposed tests. We thoroughly assess the performance of two commonly used estimated weights, "estimated coefficient" and "Wald statistic", for the Cox proportional hazards model. Finally, we illustrate the implementation of the proposed methods through two real data sets, and offer concluding remarks.
The proposed method
The compound covariate
In this section, we formally define some notations for compound covariate and introduce a procedure to identify whether a set of genes is truly associated with survival times in the training cohort. Let T_{j }denote the survival time and C_{j }denote the censoring time independent of T_{j }for the jth patient, with j = 1, 2, ..., n. When some observations are rightcensored, one can observe only random variables X_{j }= min(T_{j}, C_{j}) and δ_{j }= I(T_{j }≤ C_{j}), where I(A) is an indicator function of event A, assuming the value 1 if event A occurs and the value 0 otherwise. Let x_{j }= (x_{j1}, x_{j2}, ..., x_{jp}) be the standardized selected gene's intensity in the jth patient, where p is the size of the gene set. For creating the compound covariate, one first fits the univariate Cox regression model for each individual gene, that is,
where h_{0k}(t) is a baseline hazard function of each gene k, and β_{k }is a corresponding parameter to be estimated. Let
or another possible weighting policy depends on Wald statistics,
for each patient j, j = 1, 2, ..., n. From the perspective of biology, the weighting policy is believed to reflect the importance of each individual gene to survival outcome, the higher the weight, the more important the gene is. In other words, the score can be regarded as a condensed index, representing the collective effects of gene expression.
To identify whether this gene expression pattern is truly associated with survival in training cohort, investigators prefer to use Cox regression analysis. That is, after fitting model (1), they construct
where h_{0}(t) is a baseline hazard function and γ_{0 }is a corresponding parameter for the compound covariate z; they then use the same data set to test the null hypothesis H_{0 }:γ_{0 }= 0. Under the null hypothesis, however, the method results in uncontrolled type I
error, because the training data set has been used twice, both for building the model
and for testing the regression parameter. If independent data are available, carry
Cox regression with a random compound covariate
The measuring mechanism makes the compound covariate an estimation and not a fixed observable. Naturally, such a covariate should be treated as a random covariate, and the variance of each score needs to be taken into account. To fit a Cox regression model with a random covariate, we use the idea advocated by Prentice, which presents the Cox model as a multiplicative hazards model, with a relative risk at time t,
This is a weighted average of a possible relative risk given the covariate z. The Cox regression model then can be written as
Because omics data sets involve a large number of features, we assume the distributions
of the scores follow normal distribution. That is, for each patient j, assuming z_{j }is drawn from a normal distribution with mean μ_{j }and variance
Thus, a quadratic term,
where γ_{0 }is the parameter for the compound covariate,
A partial likelihood function and score test
Suppose now that t_{1 }< ... <t_{l }are the ordered distinct survival times in the sample, and let R(t_{i}) and F(t_{i}) denote the risk set prior to t_{i }and the set of subjects failing at t_{i}, respectively. The partial likelihood function is:
where m_{i }is the number of failures at time t_{i}(i = 1, 2,...l). Let
Additional file 1. The explicit forms of a, b and c. Additional file 1 is a PDF file which shows the explicit forms of a, b and c. Then, the score statistic can be derived.
Format: PDF Size: 117KB Download file
This file can be viewed with: Adobe Acrobat Reader
with the observed information matrix V
Consequently, under the null hypothesis
is derived based on approximation and
by using Wald statistics as weight. We show the derivation in more detail in Additional file 2.
Additional file 2. The derivation of variance for compound covariates. Additional file 2 is a PDF file which shows the derivation of variance for compound covariates.
Format: PDF Size: 82KB Download file
This file can be viewed with: Adobe Acrobat Reader
Multiple gene sets
Further, we extended the compound covariate to multiple gene sets. If there are q given independent gene sets for the jth patient,
where p_{s }is the number of genes of the sth gene set. Then, the partial likelihood function can be written as
Let
Simulation results
To assess the performance of the proposed testing procedure for compound covariate,
we conducted simulation studies under various scenarios to study type I error rate
and power. For the scenario of split training data set as two parts and the consideration
of compound scores as random covariates, we denoted the compound score using
For comparing empirical type I error rates, the value of β was set to 0. The total sample size was set to 50, 75, or 100. After gene selection process, we assume the total number of disease relative genes was set to 10, 30, 50, or 70. Censoring times (denoted as cen.) were generated from an exponential distribution, and the overall censoring fraction in either setup was fixed at 10% or 40%. Table 1 shows the empirical type I error rates. As shown, the proposed procedure for SRC_{B }and SRC_{W }preserve reasonable type I error rates. The methods DC_{B }and DC_{W}, however, which make double use of the data set, do not control type I error. Note that we do not show type I error for SC_{B }and SC_{W }methods, because these methods are typical Cox regression.
Table 1. Empirical type I error rates
Because type I error rates are preserved for both the SRC_{B}/SRC_{W }and SC_{B}/SC_{W }methods, we compared the power under each method in Table 2. In this simulation, censoring times were also generated from an exponential distribution, and the overall censoring fraction in either setup was fixed at 10%, 30%, or 50%. We then simulated 30 total genes in one gene set under two different scenarios. The first scenario considers 30 disease related genes. We designed two different levels of effect, strong effect and low effect, as
Table 2. Power comparison under two different scenario
respectively. The second scenario considers 3 disease related genes, with the other 27 genes considered "noise" (i.e., no effect). Strong effect and low effect in this case were set as
Results are shown in Table 2.
For scenario 1, all 30 genes have effects. As expected, the power of the tests increases
with increase in total sample size and gene effect, but decreases as the censoring
proportion grows. Under the first scenario, the power of the SRC_{B }method is always better than that of SC_{B}, and SRC_{W }is always better than SC_{W}. This result indicates that treating the compound score as a random covariate yields
higher power than treating the score as a fixed covariate. When the sample size is
100, the power average increases 10.6, 3.5, and 0.4 percentage points for 50%, 30%,
and 10% censoring, respectively. This is a reasonable result, because the random covariate
approach involves fitting a quadratic Cox regression model,
To further illustrate the effect of treating the compound score as a random covariate, in Figure 2, we show the power curves of the SRC_{B}, SC_{B}, SRC_{W}, and SC_{W }methods for gene effect varying from 1 to 1. The total sample size was set to 100, and the censoring fraction was fixed at 50%. As shown in Figure 2, difference in power between SRC_{B }and SC_{B }and between SRC_{W }and SC_{W }increases as gene effect size increases, when gene effect size increases, the quadratic term can more accurately account for variance in the effect. Thus, treating the compound score as a random covariate in the Cox regression model provides greater power.
Figure 2. Power curves with varying gene effect.
For the first scenario, tests based on SRC_{B }have higher power than those based on SRC_{W}. In the second scenario, however, SRC_{W }yields higher power than SRC_{B}. This pattern change seems to relate to the increase in noise in the gene set. For
Figure 3, we fixed the total number of genes at 30, sample size, 50; censoring fraction, 10%
in one gene set, and show the power curves as gene effect and number of noise genes
increase. There are eight lines in Figure 3. All solid lines indicate power curves for SRC_{B }while all longdash lines indicate power curves for SRC_{W}. Situation a has 30 disease related genes, b has 10 disease related genes and 20 noise genes; c has 5 disease related genes and 25 noise genes, and d has 3 disease related genes and 27 noise genes. As gene effect increases, all powers
increase. As the number of noise genes increases (from a to d), however, the powers of SRC_{B }and SRC_{W }decrease. With no noise genes (case a), the power of SRC_{B }is always greater than that of SRC_{W}. As the number of noise genes increases, however, the power of SRC_{W }gradually improves over that of SRC_{B}. This results from the fact that the compound covariate SRC_{W }takes into account the variance of
Figure 3. Power curves with varying gene effect and number of noise genes (sample size, 50, censoring fraction, 10%).
Figure 4 shows the power curve for SRC_{B }with different sample sizes and different numbers of disease related genes. In this simulation design, the censoring fraction was fixed at 30%. The effect of all disease related genes was set to 0.5. As expected, the power increases as sample size grows. Power decreases, however, as the number of disease genes increase. Under this specified setting, if we need 80% power and have sample size 100, we can include about 90 disease related genes in this analysis. If we have sample size 60, however, we can only include fewer than 30 disease related genes. This result indicates the need for greater sample size to preserve power when a a gene set includes a large number of disease genes.
Figure 4. Power curves with different numbers of genes and sample sizes.
Examples
In this section, we demonstrate our methodology using two examples, an Amsterdam 70gene breast cancer gene signature [1] and a data set involving two pathways for nonsmallcell lung cancer. All tests with nominal level 0.05 were applied to the training cohort. The R code for obtaining pvalues for the proposed testing procedure is available from the authors upon request.
Breast cancer data set
The wellknown Amsterdam 70gene breast cancer gene signature was published by Van't Veer [1]. To evaluate the previously established 70gene prognosis file, Van de Vijver [16] further classified an additional 295 consecutive patients with stage I or II breast cancer to validate the breast cancer gene signature. Because the 295 patients are independent of the original data, we reanalyzed this data set using our methodology. In this data set, patients were followed for a median of 7.2 years, with 79 observed deaths. The survival curve is shown in Figure 5 (a) and the testing results, including estimated coefficients (Coef.), relative risk (RR), and pvalues are given in Table 3.
Although all coefficients and relative risks are very close, the pvalues are very different. When using DC_{B }and DC_{W}, the pvalues are 8.6 × 10^{13 }and 1.1 × 10^{13}, respectively. When treating the compound covariate as fixed, the pvalues of SC_{B }and SC_{W }are 1.1 × 10^{7 }and 1.3 × 10^{7}. When using our procedure, the pvalues of SRC_{B }and SRC_{W }are 1.9 × 10^{8 }and 1.8 × 10^{8}. Although the results remain significant regardless of method, we achieve appropriate pvalues for the training cohort, showing that the 70gene prognosis signature can be used to evaluate early events in breast cancer patients. We get consistent conclusion with the other researches [17,16].
Nonsmall cell lung cancer data set
We also tested our method by applying it to a publicly available nonsmallcell microarray data set downloaded from National Center for Biotechnology Information Gene Expression Omnibus (GSE14814). There are 90 gene expression profiling conducted on mRNA isolated from frozen tumor samples. In this example, two wellknown cancerrelated pathways were used to test association with survival outcomes for demonstration purposes. The first signaling pathway is the p53 pathway, which is induced by a number of stress signals, including DNA damage, oxidative stress, and activated ontogenesis. The other pathway is the NODlike receptor signaling pathway, which has been associated with an increased risk for the development of different types of cancer [18]. There are 61 genes in p53 pathway and 54 genes in NODlike receptor signaling pathway, respectively. The median followup time of these patients was 5.4 years, and the number of observed deaths was 29. Figure 5 (b) shows the survival curve, and the test results are given in Table 4.
Table 4. Nonsmallcell lung cancer data set analysis
To summarize all the information, two compound covariates were used. As shown, conventional Cox regression yields overall pvalues that are strongly statistically significant (2.29 × 10^{6 }for DC_{B }and 1.85 × 10^{5 }for DC_{W}). When treating the compound score as a fixed covariate and using a split data set, however, the pvalues of SC_{B }and SC_{W }become 0.432 for both. When treating the compound score as a random covariate, the pvalues of SRC_{B }and SRC_{W }become 0.236 and 0.358, respectively. Such divergent pvalues suggest that an inappropriate method may well lead to misleading results.
Concluding remarks
In this paper, we focused on survival outcomes and proposed a feasible and correct method for testing the compound covariate to evaluate its association with survival outcomes for training cohort data. We have described the use of a random covariate, SRC_{B}/SRC_{W}, to achieve correct testing results for training cohort data and moderately improve power as compared to the use of SC_{B}/SC_{W}. Simulation study shows that our proposed method performs consistently better than SC_{B}/SC_{W}, because the quadratic term utilized in the SRC_{B}/SRC_{W }method takes into account error in the compound covariate. We further found that an increase in sample size improves power when there is a high proportion of censored data. If the gene set of interest includes noise genes, we suggest that the compound covariate SRC_{W }is a better choice than SRC_{B}; whether noise genes or nonfunctionally related genes are hidden in gene set is a judgment call for a geneticist. In addition, we contend that a flaw of biomedical papers concerned with such topics report an bias pvalue based on flawed compound covariate analysis for the same training data set. In this paper, we use a wellknown 2fold concept, with one part of the data to built compound covariate and the remainder part for testing if there is association between survival outcomes and the score to ensure correct pvalues in the training data set. Note that we need to check the proportional hazards assumption.
Our method can simultaneously test for more than one gene set in a training cohort data. More generally, this procedure can be applied not only for survival outcomes, but also for binary or continuous outcomes. The weighted flexible compound covariate method WFCCM [19], an extension of the compound covariate, also allows for use of the method of statistical analysis presented here. In addition, this method can easily be extended to consider the interaction between random covariates and clinical observed covariates, as
using the same analysis procedure. The chosen weight
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PFS developed the mathematical derivations, designed and performed the simulation, and drafted the manuscript. Xi and HC performed the experimence about microarray analysis and gave valuable advice related to the compound score issues. YS supervised the research and managed this project. All authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank editorial assistants, Lynne Berry and Yvonne Poindexter, for their editorial work on this manuscript. This work was supported by National Cancer Institute (grant numbers P30 CA068485, P50 CA095103, P50 CA090949, P50 CA098131).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM)  Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
References

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Wang Y, Klijn J, Zhang Y, Sieuwerts A, Look M, Yang F, Talantov D, Timmermans M, Meijervan Gelder M, Yu J, Jatkoe T, Berns E, Atkins D, Foekens J: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer.
Lancet 2005, 365:671679. PubMed Abstract  Publisher Full Text

Tukey JW: Tighening the clinical trial.
Control Clin Trials 1993, 14:266285. PubMed Abstract  Publisher Full Text

Tomasson H: Risk scores from logistic regression: unbiased estimates of relative and attributable risk.
Stat Med 1995, 14:13311339. PubMed Abstract  Publisher Full Text

Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, BenDor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G: Geneexpression profiles in hereditary breast cancer.
N Engl J Med 2001, 344:539548. PubMed Abstract  Publisher Full Text

Lossos IS, Czerwinski DK, Alizadeh AA: Prediction of survival in diffuse largeBcell lymphoma based on the expression of six genes.
N Engl J Med 2004, 350:18281837. PubMed Abstract  Publisher Full Text

Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, Cheng CL, Wang CH, Terng HJ, Kao SF, Chan WK, Li HN, Liu CC, Singh S, Chen WJ, Chen JJ, Yang PC: A fivegene signature and clinical outcome in nonsmallcell lung cancer.
N Engl J Med 2007, 356:1120. PubMed Abstract  Publisher Full Text

Hsu YC, Yuan S, Chen HY, Yu SL, Liu CH, Hsu PY, Wu G, Lin CH, Chang GC, Li KC, Yang PC: A fourgene signature from NCI60 cell line for survival prediction in nonsmall cell lung cancer.
Clin Cancer Res 2009, 15:73097315. PubMed Abstract  Publisher Full Text

Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JMG, Iannettoni MD, Orringer MB, Hanash S: Geneexpression profiles predict survival of patients with lung adenocarcinoma.
Nat Med 2002, 8:816824. PubMed Abstract  Publisher Full Text

Salmon S, Chen H, Chen S, Herbst R, et al.: Classification by mass spectrometry can accurately and reliably predict outcome in patients with nonsmall cell lung cancer treated with erlotinibcontaining regimen.
J Thorac Oncol 2009, 4:689696. PubMed Abstract  Publisher Full Text

Prentice RL: Covariate measurement errors and parameter estimation in a failure time regression model.
Biometrika 1982, 69:331342. Publisher Full Text

Zhao H, Tibshirani R, Brooks J: Gene expression profiling predicts survival in conventional renal cell carcinoma.
PLoS Med 2005, 3:115124. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R: On testing the significance of sets of genes.
Ann Appl Stat 2007, 1:107129. Publisher Full Text

Kaplan EL, Meier P: Nonparametric estimator from incomplete observations.
J Amer Stat Assoc 1958, 53:457481. Publisher Full Text

van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A geneexpression signature as a predictor of survival in breast cancer.
N Engl J Med 2002, 347:19992009. PubMed Abstract  Publisher Full Text

Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, Saghatchian d'Assignies M, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70gene prognostic signature for women with nodenegative breast cancer.
J Nat Cancer Inst 2006, 98:11831192. PubMed Abstract  Publisher Full Text

Rosenstiel P, Till A, Schreiber S: NODlike receptors and human diseases.
Microb Infect 2007, 9:648657. PubMed Abstract  Publisher Full Text

Shyr Y, Kim K: Weighted flexible compound covariate method for classifying microarray data. In A practical approach to microarray data analysis. Norwell: Kluwer Academic Publishers, New York; 2003:186200.

Rdmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles.
J Comput Biol 2002, 9:505511. PubMed Abstract  Publisher Full Text