Center for Quantitative Sciences, Vanderbilt University, Nashville, TN, USA

Department of Biostatistics, Vanderbilt University, Nashville, TN, USA

Abstract

Background

Dealing with high-dimensional markers, such as gene expression data obtained using microarray chip technology or genomics studies, is a key challenge because the number of features greatly exceeds the number of biological samples. After biologically relevant genes have been selected, how to summarize their expression and then build a predictive model is an important issue in medical applications. One intuitive method of addressing this challenge assigns different weights to different features and then combines this information into a single score, named the compound covariate. Investigators commonly employ this score to assess whether an association exists between the compound covariate and clinical outcomes, adjusted for baseline covariates. However, we found that some clinical papers concerned with such analysis report biased p-values based on a flawed compound covariate in their training data set.

Results

We correct this flaw in the analysis and we also propose treating the compound score as a random covariate, to achieve more appropriate results and significantly improve study power for survival outcomes. With this proposed method, we thoroughly assess the performance of two commonly used estimated gene weights through simulation studies. When the sample size is 100, and censoring rates are 50%, 30%, and 10%, power is increased by 10.6%, 3.5%, and 0.4%, respectively, by treating the compound score as a random covariate rather than a fixed covariate. Finally, we assess our proposed method using two publicly available microarray data sets.

Conclusion

In this article, we correct this flaw in the analysis; the proposed method, which treats the compound score as a random covariate, achieves more appropriate results and improves study power for survival outcomes.

Introduction

High-dimensional omics data

Personalized medicine is expected to enable a more predictive discipline, in which therapies are targeted toward the molecular constitution of individual patients and their disease; thus, molecular biomarkers are widely expected to revolutionize the current practice of medicine. For example, the progress of genomics has made it possible to evaluate molecular signatures to predict cancer metastasis

As high-dimensional omics research has advanced, the compound covariate (or compound score) has generally been held as a simpler and more straightforward approach. After biologically relevant genes have been selected in the training cohort, such a score is often a useful device in medical applications to define the information contained in a single set of data and to summarize the association of a set of variables with disease. Tukey

Problem statements

A compound covariate is a linear combination of the basic covariates being studied, with each covariate having its own coefficient or weight. For survival outcomes, a commonly used scheme is to 1) compute the univariate Cox regression

However, the linear combination of the group of genes, with each gene having its own estimated weight, should not be treated as an observed or fixed covariate. Because the weights are estimated from the univariate Cox regression of each individual gene, the compound "covariate" should be treated as a random covariate that includes estimation error. Moreover, to assess whether an association exists between the compound covariate and survival outcomes, Cox regression is typically used to evaluate the significance of the parameter of the compound score. Bias arises, however, when the same data set, the training cohort, is used for a double purpose: to construct the compound covariate and then to test it. This framework results in an over-fitting problem. As shown in Figure

P-value distribution under the null hypothesis with nominal level 0.05

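The inflation caused by using the training cohort twice can be reproduced with a small simulation. The sketch below is not the authors' code: it assumes independent standard normal gene expression, exponential survival and censoring times, coefficient weights from univariate Cox fits, and a Wald test at the 0.05 level. The function names (`cox_wald`, `one_run`) are hypothetical.

```python
import math
import random


def cox_wald(times, events, z, n_iter=25):
    """One-covariate Cox fit by Newton-Raphson (assumes no tied times).

    Returns (beta_hat, wald_z).
    """
    order = sorted(range(len(times)), key=lambda i: -times[i])  # latest first
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        s0 = s1 = s2 = 0.0  # risk-set sums of exp(bz), z*exp(bz), z^2*exp(bz)
        score = info = 0.0
        for i in order:  # moving back in time, each subject joins the risk set
            r = math.exp(beta * z[i])
            s0 += r
            s1 += r * z[i]
            s2 += r * z[i] * z[i]
            if events[i]:
                mean = s1 / s0
                score += z[i] - mean
                info += s2 / s0 - mean * mean
        if info <= 1e-12:
            return beta, 0.0
        step = max(-1.0, min(1.0, score / info))  # damped step for stability
        beta += step
        if abs(step) < 1e-8:
            break
    return beta, beta * math.sqrt(info)


def one_run(n, p, rng):
    """Simulate one null data set; return (reject_double_use, reject_split)."""
    genes = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    t_event = [rng.expovariate(1.0) for _ in range(n)]  # no gene effect
    t_cens = [rng.expovariate(1.0 / 9.0) for _ in range(n)]  # about 10% censored
    times = [min(a, b) for a, b in zip(t_event, t_cens)]
    events = [a <= b for a, b in zip(t_event, t_cens)]

    def weights(idx):
        sub_t = [times[i] for i in idx]
        sub_e = [events[i] for i in idx]
        return [cox_wald(sub_t, sub_e, [genes[i][k] for i in idx])[0]
                for k in range(p)]

    def reject(w, idx):
        score = [sum(a * x for a, x in zip(w, genes[i])) for i in idx]
        _, z = cox_wald([times[i] for i in idx], [events[i] for i in idx], score)
        return abs(z) > 1.96

    everyone = list(range(n))
    first, second = everyone[: n // 2], everyone[n // 2:]
    return reject(weights(everyone), everyone), reject(weights(first), second)


rng = random.Random(2024)
runs = [one_run(50, 10, rng) for _ in range(200)]
double_rate = sum(d for d, _ in runs) / len(runs)
split_rate = sum(s for _, s in runs) / len(runs)
print(f"double use: {double_rate:.2f}, split: {split_rate:.2f}")
```

Under these assumptions the double-use rejection rate should sit far above the nominal 0.05, while building the weights on one half and testing on the other keeps it close to nominal.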

In this paper, we first contend the compound covariate should be treated as a random observation. Our idea is based on that proposed by Prentice

The proposed method

The compound covariate

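In this section, we define notation for the compound covariate and introduce a procedure to identify whether a set of genes is truly associated with survival times in the training cohort. For patient $j$ ($j = 1, \ldots, n$), let $T_{j}$ be the survival time and $C_{j}$ the censoring time; we observe $X_{j} = \min(T_{j}, C_{j})$ and the event indicator $\delta_{j} = I(T_{j} \le C_{j})$, together with the expression values of the $p$ selected genes, $\mathbf{Z}_{j} = (Z_{j1}, Z_{j2}, \ldots, Z_{jp})$. For each gene $k$, a univariate Cox regression model is fitted,

$$\lambda_{k}(t \mid Z_{jk}) = \lambda_{0k}(t)\exp(\beta_{k} Z_{jk}), \qquad (1)$$

where $\lambda_{0k}(t)$ is an unspecified baseline hazard function and $\beta_{k}$ is the regression parameter for gene $k$, with estimates $\hat{\beta}_{1}, \hat{\beta}_{2}, \ldots, \hat{\beta}_{p}$. The compound covariate can be defined using the estimated coefficients as weights,

$$B_{j} = \sum_{k=1}^{p} \hat{\beta}_{k} Z_{jk},$$

or another possible weighting policy depends on Wald statistics,

$$W_{j} = \sum_{k=1}^{p} \frac{\hat{\beta}_{k}}{\widehat{\mathrm{se}}(\hat{\beta}_{k})} Z_{jk},$$

for each patient $j$.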
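The two weighting schemes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the univariate Cox fit is a plain Newton-Raphson routine that assumes no tied event times, the Wald weight is taken to be the estimate divided by its standard error, and the function names and the toy data (one truly associated gene out of five) are hypothetical.

```python
import math
import random


def cox_univariate(times, events, z, n_iter=25):
    """One-covariate Cox fit by Newton-Raphson (assumes no tied times).

    Returns (beta_hat, standard_error).
    """
    order = sorted(range(len(times)), key=lambda i: -times[i])  # latest first
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        s0 = s1 = s2 = 0.0  # risk-set sums of exp(bz), z*exp(bz), z^2*exp(bz)
        score = info = 0.0
        for i in order:  # moving back in time, each subject joins the risk set
            r = math.exp(beta * z[i])
            s0 += r
            s1 += r * z[i]
            s2 += r * z[i] * z[i]
            if events[i]:
                mean = s1 / s0
                score += z[i] - mean
                info += s2 / s0 - mean * mean
        if info <= 1e-12:
            break
        step = max(-1.0, min(1.0, score / info))  # damped step for stability
        beta += step
        if abs(step) < 1e-8:
            break
    se = 1.0 / math.sqrt(info) if info > 1e-12 else float("inf")
    return beta, se


def compound_scores(times, events, genes):
    """genes[j][k] is the expression of gene k for patient j."""
    p = len(genes[0])
    fits = [cox_univariate(times, events, [row[k] for row in genes])
            for k in range(p)]
    w_beta = [b for b, _ in fits]        # coefficient weights
    w_wald = [b / se for b, se in fits]  # Wald-statistic weights (assumed form)
    score_b = [sum(w * x for w, x in zip(w_beta, row)) for row in genes]
    score_w = [sum(w * x for w, x in zip(w_wald, row)) for row in genes]
    return w_beta, w_wald, score_b, score_w


random.seed(1)
n, p = 60, 5
genes = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
# Hypothetical truth: only gene 0 affects the hazard (log hazard ratio 1.0).
t_event = [random.expovariate(math.exp(1.0 * row[0])) for row in genes]
t_cens = [random.expovariate(0.25) for _ in range(n)]
times = [min(a, b) for a, b in zip(t_event, t_cens)]
events = [a <= b for a, b in zip(t_event, t_cens)]
w_beta, w_wald, score_b, score_w = compound_scores(times, events, genes)
print([round(w, 2) for w in w_beta])
```

Both scores are linear combinations of the same expression values; they differ only in whether each estimated coefficient is scaled by its standard error.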

To identify whether this gene expression pattern is truly associated with survival in the training cohort, investigators typically use Cox regression analysis. That is, after fitting model (1), they construct

$$\lambda(t \mid B_{j}) = \lambda_{0}(t)\exp(\gamma_{0} B_{j}),$$

where $B_{j}$ denotes the compound covariate of patient $j$, $\lambda_{0}(t)$ is the baseline hazard function, and $\gamma_{0}$ is the corresponding parameter for the compound covariate, and they test $H_{0}: \gamma_{0} = 0$. Under the null hypothesis, however, this method results in uncontrolled type I error, because the training data set has been used twice, both for building the model and for testing the regression parameter. If independent data are available, carry

Cox regression with a random compound covariate

The measuring mechanism makes the compound covariate an estimation and not a fixed observable. Naturally, such a covariate should be treated as a random covariate, and the variance of each score needs to be taken into account. To fit a Cox regression model with a random covariate, we use the idea advocated by Prentice, which presents the Cox model as a multiplicative hazards model, with a relative risk at time

This is a weighted average of a possible relative risk given the covariate

Because omics data sets involve a large number of features, we assume the distributions of the scores follow normal distribution. That is, for each patient _{j }_{j }

Thus, a quadratic term, **w **with dimension

where _{0 }is the parameter for the compound covariate,
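The origin of the quadratic term can be seen from a short calculation. As a sketch in our own notation (not necessarily the authors'), assume that the compound score $S_{j}$ of patient $j$ is normally distributed around its estimated value $\hat{S}_{j}$ with variance $\sigma_{j}^{2}$, and that this distribution does not change appreciably among patients still at risk. The induced relative risk is then the normal moment generating function evaluated at the regression parameter $\gamma_{0}$:

```latex
\begin{aligned}
E\left[\exp(\gamma_{0} S_{j})\right]
  &= \int \exp(\gamma_{0} s)\,
     \frac{1}{\sqrt{2\pi\sigma_{j}^{2}}}
     \exp\!\left\{-\frac{(s-\hat{S}_{j})^{2}}{2\sigma_{j}^{2}}\right\} ds \\
  &= \exp\!\left(\gamma_{0}\hat{S}_{j} + \tfrac{1}{2}\gamma_{0}^{2}\sigma_{j}^{2}\right).
\end{aligned}
```

The log relative risk therefore contains the usual linear term plus a term that is quadratic in $\gamma_{0}$ and proportional to the variance of the score. Setting $\sigma_{j}^{2} = 0$ recovers the ordinary Cox model with a fixed covariate.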

A partial likelihood function and score test

Suppose now that _{1 }< ... <_{l }_{i}_{i}_{i }_{i}

where _{i }_{i}

**The explicit forms of a, b and c**. Additional file 1 is a PDF file which shows the explicit forms of


with the observed information matrix

Consequently, under the null hypothesis ^{T}^{-1 }** V **is nonsingular. In addition, (3) can be used in a standard manner for

is derived based on approximation and

by using Wald statistics as weight. We show the derivation in more detail in Additional file

**The derivation of variance for compound covariates**. Additional file 2 is a PDF file which shows the derivation of variance for compound covariates.


Multiple gene sets

Further, we extended the compound covariate to multiple gene sets. If there are

where _{s }

Let ^{T}^{-1 }** V **is nonsingular. If we reject the null hypothesis, we can conclude that the covariate vector is associated with survival time.
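For two gene sets, a score test of this kind reduces to a quadratic form in a two-dimensional score vector. The sketch below assumes that the statistic has the form of the score vector times the inverse information matrix times the score vector, and that its reference distribution is chi-square with two degrees of freedom (one per gene set); the input numbers are made up for illustration and `score_test_2d` is a hypothetical helper.

```python
import math


def score_test_2d(u, v):
    """Score statistic u' V^{-1} u and its chi-square(2) p-value."""
    (a, b), (c, d) = v
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("information matrix is singular")
    # Apply the explicit 2 x 2 inverse of V to u.
    x0 = (d * u[0] - b * u[1]) / det
    x1 = (-c * u[0] + a * u[1]) / det
    stat = u[0] * x0 + u[1] * x1
    p_value = math.exp(-stat / 2.0)  # survival function of chi-square, 2 df
    return stat, p_value


# Hypothetical score vector and observed information matrix.
stat, p_value = score_test_2d([2.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
print(round(stat, 3), round(p_value, 4))
```

The singularity check mirrors the requirement in the text that the information matrix be nonsingular before the statistic is formed.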

Simulation results

To assess the performance of the proposed testing procedure for the compound covariate, we conducted simulation studies under various scenarios to study the type I error rate and power. Tests that split the training data set into two parts and treat the compound scores as random covariates are denoted SRC_{B} and SRC_{W}; tests that split the training data set but treat the compound scores as fixed covariates are denoted SC_{B} and SC_{W}; and tests that use the whole training data set twice are denoted DC_{B} and DC_{W}. Gene expression values were generated with a $p$-dimensional mean vector and a variance-covariance matrix equal to the identity matrix, and survival times depended on the genes through the linear predictor $\mathbf{Z}^{T}\boldsymbol{\beta}$. All tests were applied at the nominal significance level 0.05, and empirical rejection probabilities were obtained from 2000 simulation runs.

To compare empirical type I error rates, the value of **β** was set to 0. The total sample size was set to 50, 75, or 100. After the gene selection process, the total number of disease-related genes was assumed to be 10, 30, 50, or 70. Censoring times (denoted as cen.) were generated from an exponential distribution, and the overall censoring fraction was fixed at 10% or 40%. Table

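Empirical type I error rates

| Method | n | cen. | 10 genes | 30 genes | 50 genes | 70 genes |
|---|---|---|---|---|---|---|
| SRC_{B} | 50 | 10% | 0.052 | 0.057 | 0.051 | 0.048 |
| | | 40% | 0.041 | 0.047 | 0.045 | 0.046 |
| | 75 | 10% | 0.052 | 0.048 | 0.045 | 0.046 |
| | | 40% | 0.044 | 0.046 | 0.050 | 0.046 |
| | 100 | 10% | 0.056 | 0.049 | 0.052 | 0.052 |
| | | 40% | 0.045 | 0.044 | 0.048 | 0.050 |
| SRC_{W} | 50 | 10% | 0.058 | 0.046 | 0.052 | 0.050 |
| | | 40% | 0.034 | 0.046 | 0.036 | 0.043 |
| | 75 | 10% | 0.046 | 0.042 | 0.051 | 0.051 |
| | | 40% | 0.044 | 0.038 | 0.044 | 0.040 |
| | 100 | 10% | 0.051 | 0.046 | 0.048 | 0.060 |
| | | 40% | 0.044 | 0.041 | 0.046 | 0.048 |
| DC_{B} | 50 | 10% | 0.937 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.910 | 1.000 | 1.000 | 1.000 |
| | 75 | 10% | 0.944 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.946 | 1.000 | 1.000 | 1.000 |
| | 100 | 10% | 0.957 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.952 | 1.000 | 1.000 | 1.000 |
| DC_{W} | 50 | 10% | 0.926 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.916 | 1.000 | 1.000 | 1.000 |
| | 75 | 10% | 0.920 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.936 | 1.000 | 1.000 | 1.000 |
| | 100 | 10% | 0.929 | 1.000 | 1.000 | 1.000 |
| | | 40% | 0.933 | 1.000 | 1.000 | 1.000 |

The last four columns correspond to the total number of genes. Empirical type I error rates for comparing SRC_{B}, SRC_{W}, DC_{B}, and DC_{W}.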
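The null simulation design can be sketched as follows. Only the exponential censoring distribution, the identity covariance, and the target censoring fractions come from the text; the zero mean vector, the exponential survival times with hazard equal to the exponential of the linear predictor, and the helper names are assumptions made for illustration.

```python
import math
import random


def simulate_cohort(n, beta, censor_rate, rng):
    """Return (genes, times, events) for one simulated training cohort."""
    p = len(beta)
    # Independent standard normal genes: identity covariance, zero mean assumed.
    genes = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    times, events = [], []
    for row in genes:
        hazard = math.exp(sum(b * x for b, x in zip(beta, row)))
        t_event = rng.expovariate(hazard)       # assumed exponential survival
        t_cens = rng.expovariate(censor_rate)   # exponential censoring
        times.append(min(t_event, t_cens))
        events.append(t_event <= t_cens)
    return genes, times, events


def null_censor_rate(fraction):
    """Censoring rate that gives the target censoring fraction when beta = 0.

    With unit event hazard, P(censored) = c / (c + 1), so c = f / (1 - f).
    """
    return fraction / (1.0 - fraction)


rng = random.Random(7)
genes, times, events = simulate_cohort(
    20000, [0.0] * 10, null_censor_rate(0.40), rng
)
censored = 1.0 - sum(events) / len(events)
print(f"observed censoring fraction: {censored:.3f}")
```

When the gene effects are not zero, the closed-form censoring rate no longer applies exactly, and the rate would have to be tuned, for example by trial simulation, to hit the target fraction.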

Because type I error rates are preserved for both the _{B}_{W }_{B}_{W }

Power comparison under two different scenarios

| n | cen. | Scenario 1 SRC_{B} | Scenario 1 SC_{B} | Scenario 1 SRC_{W} | Scenario 1 SC_{W} | Scenario 2 SRC_{B} | Scenario 2 SC_{B} | Scenario 2 SRC_{W} | Scenario 2 SC_{W} |
|---|---|---|---|---|---|---|---|---|---|
| Strong effect: **β** = [ | | | | | | | | | |
| 50 | 10% | 0.757 | 0.742 | 0.675 | 0.650 | 0.600 | 0.599 | 0.723 | 0.692 |
| | 30% | 0.624 | 0.580 | 0.536 | 0.490 | 0.480 | 0.422 | 0.546 | 0.505 |
| | 50% | 0.448 | 0.350 | 0.381 | 0.312 | 0.350 | 0.250 | 0.359 | 0.294 |
| 75 | 10% | 0.960 | 0.956 | 0.907 | 0.902 | 0.876 | 0.870 | 0.944 | 0.942 |
| | 30% | 0.883 | 0.864 | 0.828 | 0.766 | 0.783 | 0.771 | 0.875 | 0.822 |
| | 50% | 0.758 | 0.626 | 0.690 | 0.526 | 0.607 | 0.494 | 0.694 | 0.580 |
| 100 | 10% | 0.998 | 0.997 | 0.985 | 0.982 | 0.974 | 0.973 | 0.996 | 0.995 |
| | 30% | 0.982 | 0.974 | 0.948 | 0.917 | 0.940 | 0.905 | 0.966 | 0.955 |
| | 50% | 0.928 | 0.846 | 0.868 | 0.730 | 0.806 | 0.695 | 0.883 | 0.802 |
| Low effect: **β** = [ | | | | | | | | | |
| 50 | 10% | 0.666 | 0.625 | 0.594 | 0.576 | 0.266 | 0.242 | 0.326 | 0.305 |
| | 30% | 0.498 | 0.487 | 0.440 | 0.430 | 0.206 | 0.165 | 0.224 | 0.214 |
| | 50% | 0.362 | 0.285 | 0.328 | 0.244 | 0.144 | 0.122 | 0.162 | 0.124 |
| 75 | 10% | 0.930 | 0.924 | 0.859 | 0.850 | 0.492 | 0.466 | 0.574 | 0.570 |
| | 30% | 0.824 | 0.756 | 0.756 | 0.688 | 0.370 | 0.325 | 0.432 | 0.421 |
| | 50% | 0.642 | 0.553 | 0.571 | 0.469 | 0.263 | 0.206 | 0.312 | 0.224 |
| 100 | 10% | 0.992 | 0.990 | 0.964 | 0.950 | 0.662 | 0.654 | 0.796 | 0.792 |
| | 30% | 0.959 | 0.944 | 0.918 | 0.850 | 0.558 | 0.505 | 0.652 | 0.594 |
| | 50% | 0.852 | 0.760 | 0.802 | 0.654 | 0.412 | 0.319 | 0.472 | 0.370 |
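The average power gains quoted in the abstract for a sample size of 100 can be checked against the power-comparison table. The arithmetic below assumes that each quoted figure is the mean difference between the random-covariate and fixed-covariate tests over both weighting schemes, both scenarios, and both effect sizes (eight comparisons per censoring rate); this interpretation is ours.

```python
# Power at n = 100 copied from the power-comparison table, as
# (random covariate, fixed covariate) pairs: four strong-effect pairs
# followed by four low-effect pairs for each censoring rate.
power_n100 = {
    "10%": [(0.998, 0.997), (0.985, 0.982), (0.974, 0.973), (0.996, 0.995),
            (0.992, 0.990), (0.964, 0.950), (0.662, 0.654), (0.796, 0.792)],
    "30%": [(0.982, 0.974), (0.948, 0.917), (0.940, 0.905), (0.966, 0.955),
            (0.959, 0.944), (0.918, 0.850), (0.558, 0.505), (0.652, 0.594)],
    "50%": [(0.928, 0.846), (0.868, 0.730), (0.806, 0.695), (0.883, 0.802),
            (0.852, 0.760), (0.802, 0.654), (0.412, 0.319), (0.472, 0.370)],
}

# Mean gain in percentage points for each censoring rate.
mean_gain = {
    cen: 100.0 * sum(rand - fixed for rand, fixed in pairs) / len(pairs)
    for cen, pairs in power_n100.items()
}
for cen, gain in mean_gain.items():
    print(f"censoring {cen}: mean power gain {gain:.1f} percentage points")
```

Under this reading, the means come out at about 0.4, 3.5, and 10.6 percentage points for 10%, 30%, and 50% censoring, which matches the figures in the abstract and shows that the benefit of the random-covariate treatment grows with the censoring rate.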

We compared the power under each method _{B}_{W }_{B}_{W}

respectively. The second scenario considers 3 disease related genes, with the other 27 genes considered "noise" (i.e., no effect). Strong effect and low effect in this case were set as

Results are shown in Table

For scenario 1, all 30 genes have effects. As expected, the power of the tests increases with increase in total sample size and gene effect, but decreases as the censoring proportion grows. Under the first scenario, the power of the _{B }_{B}_{W }_{W}_{0}, _{j}

To further illustrate the effect of treating the compound score as a random covariate, in Figure _{B}, SC_{B}, SRC_{W}_{W }_{B }_{B }_{W }_{W }

Power curves with varying gene effect


For the first scenario, tests based on _{B }_{W}_{W }_{B}_{B }_{W}_{B }_{W }_{B }_{W}_{W }_{B}_{W }_{W }_{B}

Power curves with varying gene effect and number of noise genes (sample size, 50, censoring fraction, 10%)


Figure _{B }

Power curves with different numbers of genes and sample sizes


Examples

In this section, we demonstrate our methodology using two examples, an Amsterdam 70-gene breast cancer gene signature

Breast cancer data set

The well-known Amsterdam 70-gene breast cancer gene signature was published by Van't Veer

Kaplan-Meier curves for two data sets


Breast cancer data set analysis

| Method | Coef | RR | p-value |
|---|---|---|---|
| SRC_{B} | 0.052 | 1.12 | 1.9 × 10^{-8} |
| SRC_{W} | 0.022 | 1.12 | 1.8 × 10^{-8} |
| SC_{B} | 0.093 | 1.10 | 1.1 × 10^{-7} |
| SC_{W} | 0.040 | 1.04 | 1.3 × 10^{-7} |
| DC_{B} | 0.078 | 1.08 | 8.6 × 10^{-13} |
| DC_{W} | 0.015 | 1.02 | 1.1 × 10^{-13} |

Evaluation of the established 70-gene breast cancer signature published by Van't Veer using the proposed method.

Although all coefficients and relative risks are very close, the p-values are very different. When using DC_{B} and DC_{W}, the p-values are 8.6 × 10^{-13} and 1.1 × 10^{-13}, respectively. When treating the compound covariate as fixed, the p-values of SC_{B} and SC_{W} are 1.1 × 10^{-7} and 1.3 × 10^{-7}. When using our procedure, the p-values of SRC_{B} and SRC_{W} are 1.9 × 10^{-8} and 1.8 × 10^{-8}. Although the results remain significant regardless of method, we achieve appropriate p-values for the training cohort, showing that the 70-gene prognosis signature can be used to evaluate early events in breast cancer patients. This conclusion is consistent with other studies

Non-small cell lung cancer data set

We also tested our method by applying it to a publicly available non-small-cell lung cancer microarray data set downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (GSE14814). The data set contains 90 gene expression profiles obtained from mRNA isolated from frozen tumor samples. In this example, two well-known cancer-related pathways were used to test association with survival outcomes for demonstration purposes. The first signaling pathway is the p53 pathway, which is induced by a number of stress signals, including DNA damage, oxidative stress, and activated oncogenes. The other pathway is the NOD-like receptor signaling pathway, which has been associated with an increased risk for the development of different types of cancer

Non-small-cell lung cancer data set analysis

| Method | Pathway | Coef | RR | p-value | Overall p-value |
|---|---|---|---|---|---|
| SRC_{B} | NOD | 0.033 | 1.0013 | 0.59 | 0.236 |
| | P53 | 0.037 | 1.0044 | 0.67 | |
| SRC_{W} | NOD | 0.016 | 1.0063 | 0.37 | 0.358 |
| | P53 | 0.001 | 1.0002 | 0.99 | |
| SC_{B} | NOD | 0.077 | 1.08 | 0.36 | 0.432 |
| | P53 | 0.015 | 1.01 | 0.90 | |
| SC_{W} | NOD | 0.034 | 1.03 | 0.24 | 0.432 |
| | P53 | -0.01 | 0.99 | 0.74 | |
| DC_{B} | NOD | 0.072 | 1.07 | 0.37 | 2.29 × 10^{-6} |
| | P53 | 0.314 | 1.37 | 0.003 | |
| DC_{W} | NOD | 0.019 | 1.02 | 0.21 | 1.85 × 10^{-5} |
| | P53 | 0.055 | 1.06 | 0.006 | |

Evaluation of the 90 gene expression profiles from the National Center for Biotechnology Information Gene Expression Omnibus (GSE14814). The first signaling pathway is the p53 pathway; the other is the NOD-like receptor signaling pathway.

To summarize all the information, two compound covariates were used. As shown, conventional Cox regression yields overall p-values that are strongly statistically significant (2.29 × 10^{-6} for DC_{B} and 1.85 × 10^{-5} for DC_{W}), whereas the overall p-values for SC_{B}, SC_{W}, SRC_{B}, and SRC_{W} are not significant at the 0.05 level.

Concluding remarks

In this paper, we focused on survival outcomes and proposed a feasible and correct method for testing the compound covariate to evaluate its association with survival outcomes for training cohort data. We have described the use of a random covariate, _{B}_{W}_{B}_{W}_{B}_{W}_{B}_{W }_{W }_{B}

Our method can simultaneously test more than one gene set in training cohort data. More generally, this procedure can be applied not only to survival outcomes, but also to binary or continuous outcomes. The weighted flexible compound covariate method WFCCM

using the same analysis procedure. The chosen weight

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PFS developed the mathematical derivations, designed and performed the simulation, and drafted the manuscript. Xi and HC performed the microarray analysis experiments and gave valuable advice related to the compound score issues. YS supervised the research and managed this project. All authors read and approved the final manuscript.

Acknowledgements

The authors wish to thank editorial assistants, Lynne Berry and Yvonne Poindexter, for their editorial work on this manuscript. This work was supported by National Cancer Institute (grant numbers P30 CA068485, P50 CA095103, P50 CA090949, P50 CA098131).

This article has been published as part of