Email updates

Keep up to date with the latest news and content from BMC Medical Research Methodology and BioMed Central.

Open Access Research article

Multiple imputation for estimating hazard ratios and predictive abilities in case-cohort surveys

Helena Marti1*, Laure Carcaillon2 and Michel Chavance1

Author Affiliations

1 Inserm, CESP Centre for Research in Epidemiology and Population Health, U1018, Biostatistics team, F-94807 Villejuif, France

2 Inserm, CESP Centre for Research in Epidemiology and Population Health, U1018, Hormones and Cardiovascular Disease team, F-94807 Villejuif, France

For all author emails, please log on.

BMC Medical Research Methodology 2012, 12:24  doi:10.1186/1471-2288-12-24


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2288/12/24


Received:29 June 2011
Accepted:9 March 2012
Published:9 March 2012

© 2012 Marti et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The weighted estimators generally used for analyzing case-cohort studies are not fully efficient and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI).

Methods

We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks.

Results

When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model.

Conclusions

MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.

Background

Case-cohort surveys produce incomplete data by design. A subcohort is selected by simple or stratified random sampling, all subjects are followed up and the events of interest are recorded. The phase-1 variables are observed for the entire cohort, whilethe phase-2 variables are only known for the case-cohort sample, i.e., subjects belonging to the subcohort and all those presenting the event of interest [1]. Thus, in case-cohort studies, the non-cases who do not belong to the subcohort are incompletely observed by design, enabling cost reduction with a small loss of efficiency.

Various approaches have been described to estimate the proportional hazard model in case-cohort surveys: Weighted estimators [2-6] are classically used in these surveys, with analysis restricted to the completely observed subsample, so the information collected for incompletely observed non-cases is ignored and inefficient estimators for the effect of phase-1 variables are obtained. One of the most popular is the Borgan II estimator [4]. Scheike and Martinussen [7] proposed a maximum likelihood estimator based on proportional hazards assumption, using the EM algorithm [8], therebyincreasing efficiency as compared to weighted estimators when the relative risk and disease incidence are high. However, in general, the studied disease incidence in case-cohort surveys is low. Breslow et al. [9] suggested calibrating or estimating the weights a posteriori, using all the phase-1 information, to improve precision with respect to classical weighted estimators. Marti and Chavance [10] showed that multiple imputation (MI) is a good alternative to classical weighted methods for the analysis of case-cohort data. When the imputation model was correct, the MI approach provided unbiased estimators of the log hazard ratios and correctly estimated the variance of its estimators. As expected, the MI approach was more precise than the usual weighted estimators for the parameters associated with phase-1 variables. The former was also slightly more precise than the latter for the phase-2 variable. In Marti and Chavance [10] the imputations were performed according to a correctly specified imputation model. However, in practise, the distribution of the phase-2 variable is unknown and onemay wonder how MI compares to weighted estimators when the imputation model is misspecified.

No standard method exists for quantifying the usefulness or predictive ability of a model or an additional variable in the framework of case-cohort surveys. The predictive ability can be measured in terms of calibration, which refers to the ability of a model to match predicted and observed values, when we are interested in individual predictions; or in terms of discrimination, which refers to the ability of a model to distinguish between subjects with or without a binary event, when we are interested in identifying a group of high-risk subjects. In the present work, we focus on discrimination.

As shown below, a naive measurement of predictive ability from case-cohort data often leads to a biased estimate of the predictive ability because it varies with the censoring rate and thus depends on the subcohort size. Alternatively, because MI reconstitutes whole cohorts, any tool developed to estimate the predictive ability in the framework of cohort surveys can be applied to case-cohort data, so we propose using the MI approach to estimate the predictive ability ofa model or of an additional variable and their standard errors.

The objectives of this study were 1) to evaluate MI for estimating hazard ratios when the distribution of the phase-2 variable is misspecified; and 2) to present an adequate methodology for estimating the predictive ability of a model or of an additional variable in case-cohort surveys. We performed a simulation study to validate the MI approach for estimating the predictive ability of a model or of an additional variable and to assess its potential limits. As an illustration, we analyzed case-cohort data from the Three-City study [11] to estimate the predictive ability of the D-dimer plasma concentration, a marker of coagulation and fibrinolysis, on coronary heart disease (CHD) and on vascular dementia (VaD) risks.

Methods

Incomplete observations and multiple imputation

Case-cohort surveys are a particular type of incomplete observations, in which data are missing at random [12] by design, as the probability of being completely observed depends only on the case status, with simple random sampling, and on some phase-1 variables with stratified sampling. MI is a simple and efficient method for analyzing incomplete observations, while taking into account all the levels of uncertainty regarding missing values. This provides an approximation of the maximumlikelihood estimator and thus enables the potential selection bias to be corrected. This method relies on the generation of several plausibly completed data sets (M ≥ 2), accounting for all levels of uncertainty concerning the missing values. A prediction model must be built, taking into consideration the relationships between the incomplete variable and the other variables, as observed in the complete part of the data. The missing data are not replaced by their expectation but by a value drawnfrom the distribution posited by the model. To take into account the uncertainty concerning the parameters of the imputation model, several imputations are performed with parameters drawn from the asymptotic distribution of their estimator. An estimate of the parameter of interest, <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M1">View MathML</a> and an estimate of the variance of the estimator, <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M2">View MathML</a> are obtained from each completed data set. If the imputation model is correct, these estimators are not biased. The MI estimate, also unbiased, is the mean of the M estimates:

<a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M3">View MathML</a>

(1)

where M is the number of completed data sets and <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M4">View MathML</a> is the estimate of the parameter of interest provided by the mth completed data set. The multiplicity of imputations enables correct estimation of the variance of this single estimator, which is the sum of 2 components: the within-imputations component, WMI, and the between-imputations component, BMI:

<a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M5">View MathML</a>

(2)

where the factor 1 + M-1 is an adjustment for using a finite number of imputations [13].

MI requires a model correctly reflecting the relationship between the incomplete variable and the outcome of interest. In case-cohort surveys, we need to impute phase-2 variable values for the non-cases who do not belong to the subcohort. Under the rare disease assumption, we have shown that a simple generalized linear model, using all the complete data (cases and non-cases) and including the case indicator among the explanatory variables, has to be considered [10]. Practically, in addition to the case indicator and the stratification variables, when the subcohort was selected by stratified sampling, it is necessary to include in the imputation model all the variables appearing in the proportional hazard model. Because imputations are based on asymptotic distributions, caution is necessary, since if too few subjects present the event of interest, the distribution of the estimators can differ from the asymptotic one. As a consequence, the maximum likelihood estimator of the imputation model could be biased or not normally distributed.

Predictive ability of a model and of a supplementary variable

Harrell et al. [14] proposed the C index to measure the predictive ability of a model in cohort studies as the agreement between the order of the predicted and observed survival times in any pair of subjects (the event of interest is assumed to be death, leading to the use of survival terminology). That is, the concordance probability using all pairs of subjects in the population. However, with censored data, it is not possible to consider all the pairs of subjects because survival time is not observed for censored subjects. Let Ti be the survival time for subject i, i = 1,...,N, where N is the cohort size, and Ci the censoring time for subject i. We observe Xi = min(Ti, Ci). Usable pairs are those for which the order of the predicted survival times can be compared to the order of the true survival times, i.e., pairs formed by 2 uncensored subjects or an uncensored subject and a subject censored after the uncensored subject's death. A pair of censored subjects carries no information about its agreement with the expected survival provided by the model since the order of the survival times is not known. Similarly a pair formed by a subject whose survival time is observed and a subject censored before this survival time provides no information on this agreement since the unknown survival time could be anterior or posterior to the observed one. Harrell et al. [15] showed that, in the common models used for survival analysis, such as the proportional hazard model, the predicted survival times and the predicted survival probabilities at a fixed time t can be interchanged for the comparison. The Harrell's C index is defined as:

<a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M6">View MathML</a>

(3)

where πc is the probability of concordance for a pair (i,j) and πd is the probability of discordance. We assume continuous survival times and continuous predicted survival probabilities, so P(Xi = Xj) = P(Yi = Yj) = 0, thus πc + πd = 1. C is estimated by the proportion of concordant pairs among the usable pairs. The estimated variance was given by Kremers [16].

In practice, we are often interested in estimating the predictive ability of an additional phase-2 variable. Let M1 be a proportional hazard model including only phase-1 variables, and C1 and SEC1 respectively the C indexof M1 and its standard error. Let M2 be a proportional hazard model adding the phase-2 variable to M1, and C2 and SEC2, respectively, the C index of M2 and its standard error. Harrell's predictive ability ofthe added phase-2 variable is Δ = C2 - C1 . Complementary measures of predictive ability of a new variable, such as the net reclassification improvement (NRI) and the integrated discrimination index (IDI), were proposed by Pencina [17]. NRI needs some a priori meaningful risk categories. It quantifies the correct reclassification introduced by using a model with the added variable as compared to the classification obtained without this variable. The IDI can be viewed as a continuous version of the NRI with probabilities used instead of categories. It can be defined as the discrimination-slope difference between the models with and without a quantitative variable. To estimate the predictive ability of a model or of an additional variable, we reconstructed plausible whole cohorts using MI. For each reconstructed whole cohort, we could then directly obtain C1, SEC1, C2, SEC2, Δ, NRI, IDI and their respective variances. Using equations (1) and (2), we obtained the MI estimates of these quantities. Concerning the variance of Δ, the between-imputation component is estimated by the empirical variance of the M estimates of Δ provided by the M completed data sets. However, for the within-imputation component, the asymptotic variance of the estimator provided by a complete data set, does not have an analytical form. With a fully observed cohort, bootstrapping is a way to estimate the variance of the corresponding Δ. Therefore, each whole cohort reconstructed by MI has to be resampled. In the simulations as in the real data analysis, we used 100 bootstrap samples.

Simulation study

Two phase-1 variables were simulated: a binary variable, Z1, and a Gaussian variable, Z3, observed for the entire cohort. For the phase-2 variable, Z2, we considered three different distributions: normal, log-normal and uniform, all of them with unit variance, independent of Z1, but having a correlation coefficient of 0.2 with Z3. The survival time had an exponential distribution, with λ = exp (β1Z1 + β2Z2 + β3Z3 ). β1, β2 and β3 were fixed at the same value and set at 0 or log(2). The censoring time followed a uniform distribution over the interval [0,τ], where τ was chosen so that the probability of an event was approximately 0.03 (τ= 0.025). The cohort size was 10,000. We also simulated a phase-1 variable predictive of Z2, <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M7">View MathML</a>with ε ~ N (0, σ2) independent of Z2. The variance σ2 was fixed at 1 which corresponds to a correlation between <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M8">View MathML</a> of approximately 0.7. We wanted to estimate the effect of Z2 on survival time and its predictive ability. The cohort was divided into 9 strata based on the tertiles of <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M9">View MathML</a>, and the non-cases were chosen by stratified sampling. Case-cohort sampling was simulated with 1,000 subjects in each subcohort. The phase-2 variable was not available for non-cases not included in the subcohort, so MI was used to complete the data set. Thus, we built the same linear prediction model for each Z2 based on the stratification phase-1 variable and the case indicator. Z3 was not directly included in the imputation model to predict Z2, because it was a stratification variable included in the model and because of the weak correlation between Z2 and Z3. The imputation model was: Z2 = α0 + α1 Icase + α2 Strata + ε , where α0 and α1 are scalar, α2 is a vector coefficient, Icase is the case indicator, Strata is the vector of stratum indicators and є is the vector of errors independently and identically distributed ~ N (0, σ). Thus, the imputation model was correctly specified for Z2 normally distributed but misspecified for Z2 log-normally or uniformly distributed. One thousand cohorts were simulated for each scenario.

Five imputations were performed and 5 complete data sets were generated for each cohort. We estimated the log hazard ratios using MI and the "Borgan II" weighted estimator [4]. We used MI to estimate the predictive ability of models with and without the phase-2 variable, and the predictive ability of the phase-2 variable, NRI and IDI. We also studied the consistency of the naive estimator of Harrell's C index in case-cohort surveys by varying the subcohort size. Using the above simulation conditions and, exceptionally, a scenario with β1 = β3 = log(2) and β2 = log(1.5), we simulated case-cohort samples with the subcohort size set at 300 or 1,000 subjects. We estimated the predictive ability in the case-cohort samples and in the multiply imputed data sets.

Case-cohort survey from Three-City study

Briefly, the 3C-Study was designed to examine the relationship between vascular diseases and dementia in a community housing 9,294 persons aged 65 years and over between 1999 and 2001 in three French cities. The detailed methodology has been previously described [11]. A case-cohort substudy was conducted [18], to investigate the relationship between biomarkers, such as plasma levels of D-dimer (a marker of coagulation and fibrinolysis) and the 4-year incidence of coronary heart disease (CHD), stroke and all subtypes of dementia, including vascular dementia (VaD), in an elderly population. The phase-1 variables provided information on socio-demographic characteristics, education, medical history, diet, alcohol and tobacco use. Blood pressure, height and weight were also available. A subcohort of size n = 1,254, (13.5% of the full cohort) was randomly selected, stratifying on age, sex and recruitment center. Observed cumulated incidences of CHD and VaD were approximately 2% and 0.6%, respectively. Plasma D-dimer levels were only available for phase-2 subjects. Carcaillon et al. [18] treated quintiles of D-dimer level both qualitatively and linearly. They reported a linear increase in the risk of VaD according to D-dimer quintiles.

We re-assessed the relationship between plasma D-dimer levels and the risk of CHD and VaD, using MI and weighted estimators, and evaluated the predictive ability of D-dimer levels on both risks. We included the same explanatory variables as Carcaillon et al. [18] although we used tertiles of D-dimer rather than quintiles, to estimate CHD and VaD risks, due to the small number of events. Therefore, to estimate the risk of CHD, the proportional hazard model includedthe phase-1 variables: age, sex, center, body mass index, hypertension, hypercholesterolemia, diabetes, tobacco use, diabetes drugs, and as phase-2 variables, indicators of D-dimer tertiles. To estimate the risk of VaD, the proportional hazard model included the phase-1 variables: age, sex, centre, educational level, body mass index, the presence or absence of an apolipoprotein є4 allele and indicators of D-dimer tertiles.

For each outcome (CHD or VaD), it was necessary to reproduce the relationships among the incomplete variable, the outcomes and the confounder variables. For each outcome, we built an imputation model of tertiles of D-dimer levels, including the variables used in the proportional hazard model and the case-indicator. We estimated the predictive ability of proportional hazard models, without (C1) and with (C2) D-dimer levels, Δ = C2 - C1, and IDI for CHD and VaD risks. The NRI requires that some a priori meaningful risk categories be known. Based on the Third Adult Treatment Panel [ATP III] [19] risk classification for the 10-year risk of CHD, we adapted the cut-offs to 4-year risk. For VaD, we do not know a priori meaningful risk categories and did not compute NRI.

Results

Simulation study

The mean fraction of missing information about the effect of Z2 ranged from 5 to 14 percent when β2 = 0 and from 23 to 30 per cent when β2 = log(2) (data not shown). For each estimator (full cohort, case-cohort with MI and case-cohort with weights), we give the mean of the estimated coefficients, the mean of their standard error estimates, the observed standard error of the estimated coefficients and the mean squared errors of 1000 simulations (Table 1). Not surprisingly, the full cohort estimates and the case-cohort weighted estimates of the log hazard ratios were unbiased. Similarly, with a correctly specified normal imputation model, all MI estimates were unbiased. With a misspecified normal imputation model, MI estimate of the effect β2 = log(2) of Z2 was biased (-13%) when Z2 was log-normally distributed. When Z2 was uniformly distributed, MI estimate of the effect of Z2 was slightly biased (-5%). With a misspecified normal imputation model and β2= 0, no bias was observed. TheMI variance and the weighted estimator variance agreed with the observed dispersions of the estimates. The observed dispersion was always smaller with MI than with the weighted estimator. For the phase-1 variables, this dispersion was similar for the entire cohort and with MI, whatever the distribution of the phase-2 variable. For the estimated effect of the phase-2 variable, the observed standard deviations were smaller with MI than with the Borgan II weighted estimator but, as expected, slightly larger with MI than in the full cohort analyses. Altogether, the mean squared errors were smaller with MI than with the weighted estimator, except for the effect of the phase-2 variable with β2 = log(2) and Z2 was log-normally distributed.

Table 1. Mean of the log hazard ratio estimates (Est), mean of the standard error estimates <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M10">View MathML</a>, standard error of the estimates (SE) and mean of the mean square error (MSE). Results of 1,000 simulations.

The results concerning the consistency of the naive estimator of Harrell's C index are reported in Table 2. In the scenario β1 = β 2 = β3 = 0, the mean C index was nearly 0.5 for both models, without and with Z2, whatever the analysis performed. In the scenarios β1 = β3 = log(2) and β2 = log(1.5) or β2 = log(2), the naive computation of C with the case-cohort data led to lower predictive ability than with the full cohort, especially for the smaller subcohort. Bycontrast, the Harrell's C indexes estimated by MI were similar to those computed for the full cohort and did not depend on the subcohort size. The estimated dispersion of the C index was slightly greater than the observed dispersion of the estimates. The rejection percentage of the null hypothesis Δ = 0 was always similar to the full cohort analysis and to MI. As a consequence of the standard error overestimation, the observed first type error rate was slightly lower than 5%. Nevertheless, in the considered scenarios, the observed power was very high. As expected, the loss of power when comparing case-cohort with MI to full cohort analysis was small: with β2 = log(1.5), the observed power was 84.6% with a subsample size of 300, and 90.6% with a subsample size of 1000 versus 91.6% with the full cohort. MI estimates of NRI and IDI indexes were close to those obtained with the full cohort analysis and did not depend on the subcohort size. As compared to the full cohort results, the rejection percentage of the null hypothesis NRI = 0 was smaller with MI analysis when β2 = 0, larger when β2 = log(1.5) and similar when β2 = log(2). When the effect of the phase-2 variable was not null, the rejection percentage of the null hypothesis IDI = 0 was similar with MI and with full cohort analysis. By contrast, whatever the effect of the phase-2 variable, the estimation of NRI and IDI in the case-cohort sample provided larger measures of these indexes than the full cohort analysis.

Table 2. Mean of the predictive ability estimates (Est), mean of the standard error estimates <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M10">View MathML</a> and standard error of the estimates (SE).

Table 3 gives the results of the estimated predictive abilities for the correctly specified and the two misspecified normal imputation models. Full cohort analysis and MI provided similar predictive abilities estimates when the imputation model was correctly specified or when the phase-2 variable had no effect on the studied risk. In the scenario β1 = β2 = β3 = log(2), when Z2 was uniformly distributed, MI and full cohort analysis still provided similar estimates. However, when Z2 was log-normally distributed, the MI estimate was slightly smaller than the full cohort estimate -15%).

Table 3. Predictive ability of the two models and of the phase-2 variable.

Mean of the predictive ability estimates (Est), mean of the standard error estimates <a onClick="popup('http://www.biomedcentral.com/1471-2288/12/24/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2288/12/24/mathml/M10">View MathML</a> and standard error of the estimates (SE), with a correctly specified normal imputation model (Z2 normally distributed), and with two misspecified normal imputation models (Z2 log-normally and uniformly distributed)

Application to the Three-City study

The mean fraction of missing information about the effect of D-dimer was 4.9 and 3.7 per cent for CHD and VaD risks, respectively. Table 4 gives the estimated hazard ratios associated with D-dimer tertiles. The MI and the weighted approaches yielded similar estimates and precision. The CI of the hazard ratio associated with the linear effect of a one-tertile difference were respectively (0.94-1.38) versus (0.92-1.38) for CHD and (1.13-2.53) versus (1.13-2.67) for VaD. For phase-1 variables, both estimators provided similar results, but MI was always the more precise (data not shown).

Table 4. Estimates of hazard ratios (HR) and 95% confidence interval (CI) associated with D-dimer tertiles.

Harrell's C for the models including only phase-1 variables were above 0.69 for CHD risk and above 0.86 for VaD risk (Table 5). Hence, CHD and VaD risks were largely explained by standard risk factors, and the inclusion of plasma D-dimer levels did not significantly improve the predictive ability of the model, despite the fact that elevated D-dimer levels significantly increased the VaD risk. For CHD as for VaD, the index did not significantly differ from 0.

Table 5. Predictive ability and 95% confidence interval (CI) of D-Dimer tertiles on cardiovascular heart disease (CHD) and vascular dementia (VaD) risks.

Discussion

Use of a consistent estimator does not guarantee the absence of any bias for finite sample. We only showed that MI analysis of case-cohort data provides unbiased estimates of the log-hazard ratio when the imputation model and the proportional hazard model are correctly specified. The misspecification of the imputation model can originate from an erroneous choice of the distribution, or from wrongly assuming that the estimator of the imputation model is consistent and normal, or from the omission of some important explanatory variable. Imputations carried out using a misspecified distribution in the imputation model can provide biased estimates of hazard ratios, especially, if the specified distribution of the phase-2 variable differs from the true one in terms of symmetry (log-normal versus normal distribution). The negative bias on a log hazard ratio of 0.69 was noticeable but not large when a log-normal variable was imputed according to a normal distribution (-0.09 or -13%), but it is clearly a type of misspecification easily identified with diagnostic tools [20]. One can then transform the incomplete variable in order to obtain a symmetrical distribution, impute transformed values and apply the inverse transformation to the imputed values. Note that although a normal and a uniform distribution are quite different, both are symmetrical and the observed bias was quite smaller (only 5%). In the 3C study of the relationship between VaD and D-dimer, we observed slightly different estimates of the log hazard ratio when comparing the third to the first tertile (2.77 versus 2.93, i.e. a relative difference of 8% between the MI and the weighted estimates). This is probably because of the qualitative imputation of D-dimer, and thus, the use of a multinomial imputation model, which implied estimation of parameters in separate strata defined by D-dimer concentration tertiles, some of which had a small number of events. Due to these small numbers (only 51 VaD in total), asymptotic conditions might not have been fulfilled in at least some strata, and the estimated coefficients of the imputation model could have been biased and notnormally distributed. We give below some recommendations regarding the choice of explanatory variables in the imputation model. Since the potential bias of MI estimates can be detected by comparing them to weighted estimates, we suggest building the proportional hazard model by using only the case-cohort data and weighted estimators. MI can eventually be used to reanalyze the data with the selected model to improve the precision of the results, while verifying that no bias was introduced.

In simulated data, for the phase-1 variables, the precision of MI and full cohort estimates was similar and smaller than with the weighted estimator. For the phase-2 variable, MI estimates were slightly more precise than weighted estimates. Globally, the mean squared errors were smaller with MI than with the weighted estimator, with one exception implying a normal imputation model for a log-normally distributed phase-2 variable, an error which should easily be avoided.

There is no standard method for estimating the predictive ability of a model in the framework of case-cohort surveys. We showed that the naive application of the C index to case-cohort surveys yielded an underestimation of the predictive ability of the model that depended on the subcohort size when the phase-2 variable had an effect on the risk. Similarly, the naive estimates of the predictive ability of an added phase-2 variable differed notably from the full cohort values when the effect of the phase-2 variable was not null. Harrell's C index could theoretically be estimated with a weighted approach, but this can be computationally difficult because it requires weighting each pair by the pairwise sampling probabilities, i.e., using a square matrix of size N'(N'-1), where N' is the size of the case-cohort sample. Computing the variance of this Horvitz-Thompson estimator requires either weighting each quadruplet by the quadruple-wise sampling probabilities, i.e., working with a matrix of size N'(N'-1)(N'-2)(N'-3), or bootstrapping the case-cohort data. By contrast, MI easily allows estimation of the predictive ability of a model or of an additional phase-2 variable and their variances in the context of case-cohort data, only requiring bootstrapping to estimate the variance of the predictive ability of the phase-2 variable. MI provided estimates of Harrell C, NRI and IDI indexes similar to those obtained with the full cohort analysis. Note, however, that the predictive abilities were always overestimated because the same data were used to estimate the model and its predictive ability.

Analysis of the Three-City case-cohort study was in agreement with our previous work [10]. The weighted and the MI approaches yielded similar estimates of the hazard ratios and MI was slightly more precise, particularly for phase-1 variables. The relative differences between both estimates was always below 2% for the hazard ratios related to CHD and D-dimer, but as early discussed, they could be slightly higher (8%) for a hazard ratio related to VaD and D-dimer. The precision was similar for both analyses.

The imputation model must reflect the association between the incomplete variable, the outcome and the other explanatory variables. Therefore, variables included in the proportional hazard model as well as the stratification variables must be included in the imputation model. If a surrogate of the phase-2 variable is available, it should also be included in the imputation model. On the other hand, multiple imputation approach can provide unbiased and more efficient estimates than weighted analysis even when no strong predictor of the phase-2 variable is available [10]. The inclusion of additional variables other than strongly predictive variables can lead to an increased inter-imputation variance. This prompted the use of different imputation models for D-dimer levels in the CHD and VaD analyses. However, we verified that adding the variables only used in the CHD analysis to the model used for VaD, did not modify the results observed in the former (data not shown).

The number of requested imputations depends on the proportion of missing information which, in case-cohort studies, is considerably smaller than the percentage of incompletely observed subjects. Rubin showed that with as much as 40 per cent information missing, M = 5 imputations provides an asymptotic relative efficiency was 0.97, and, with 50 per cent missing information, M = 10 provides an asymptotic relative efficiency of 0.98. Thus, a small number of imputations, 5-10, should suffice [21]. In our analyses, we used 5 imputations to limit the computer time of the simulations, a reasonable choice since the proportion of missing information was always smaller than 30 per cent. However, a slightly larger number of imputations (e.g. 10) could have been performed on the 3C study data at a reasonable time cost; it would have provided a more precise estimate of the between imputation variance and of the percentage of missing information.

The VaD risk increased with D-dimer tertiles. However, D-dimer inclusion did not significantly improve the predictive ability of the model for VaD risk. Computations of the C and IDI index yielded the same conclusion. To our knowledge, no other results concerning the predictive ability of D-dimer on the risk of VaD have been published to date. The risk of CHD did not vary with D-dimer, so, not surprisingly, the predictive ability of this variable was negligible, regardless of the index used. Wang et al. [22] and Tzoulaki [23] reported that the use of 10 and 4 biomarkers respectively added only moderately to the overall risk prediction based on conventional cardiovascular risk factors.

Conclusions

MI is a simple alternative approach to weighted analysis for analyzing case-cohort surveys, obtaining correct estimates of the log hazard ratios and their standard errors, improving precision for the phase-1 variable estimates, and providing at least the same precision as weighted estimators for phase-2 variable estimates. It allows an easy evaluation of the predictive ability of the model and, more generally, any tool proposed in the framework of cohort studies can be applied to case-cohort data using MI.

Abbreviations

MI: Multiple imputation; CHD: Coronary heart disease; VaD: Vascular dementia; NRI: Net reclassification index; IDI: Integrated discrimination index.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HM conducted the literature review, simulations, data analyses and wrote the manuscript. LC conducted the analysis of the relationship between D-dimer levels and CHD and VaD risks and supervised the epidemiological aspects of the application to the Three-City study. MC conducted and supervised the writing of the manuscript. All authors have read the manuscript, are in agreement that the work is ready for submission to the journal, and accept responsibility for the manuscript's contents.

Acknowledgements

This study was supported by a grant from the Région Île-de-France. It used data from the Three-City study which is conducted under an agreement between the Institut National de la Santé et de la Recherche Médicale and the Université Victor Segalen-Bordeaux 2. This manuscript was not prepared in collaboration with the 3C study Steering Committee and does not necessarily reflect its opinions or views.

References

  1. Prentice R: A case-cohort design for epidemiologic cohort studies and disease prevention trials.

    Biometrika 1986, 73:1-11. Publisher Full Text OpenURL

  2. Chen K, Lo SH: Case-cohort and case-control analysis with cox's model.

    Biometrika 1999, 86(4):755-764. Publisher Full Text OpenURL

  3. Therneau TM, Li H: Computing the cox model for case cohort designs.

    Lifetime Data Anal 1999, 5(2):99-112. PubMed Abstract | Publisher Full Text OpenURL

  4. Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J: Exposure stratified case-cohort designs.

    Lifetime Data Anal 2000, 6:39-58. PubMed Abstract | Publisher Full Text OpenURL

  5. Kulich M, Lin D: Improving the efficiency of relative-risk estimation in case-Cohort studies.

    J Am Stat Assoc 2004, 99:832-844. Publisher Full Text OpenURL

  6. Langholz B, Jiao J: Computational methods for case-cohort studies.

    Comput Stat Data Anal 2007, 51(8):3737-3748. Publisher Full Text OpenURL

  7. Scheike TH, Martinussen T: Maximum likelihood estimation for Cox's regression model under case-Cohort sampling.

    Scand Stat Theory Appl 2004, 31(2):283-293. OpenURL

  8. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm.

    J R Stat Soc Series B Stat Methodol 1977, 39:1-38. OpenURL

  9. Breslow N, Lumley BCCLT, Kulich M: Using the whole cohort in the analysis of case-cohort data. [http://dx.doi.org/10.1093/aje/kwp055] webcite

    Am J Epidemiol 2009, 169(11):1398-1405. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Marti H, Chavance M: Multiple imputation analysis of case-cohort studies.

    Stat Med 2011, 30(13):1595-1607. PubMed Abstract | Publisher Full Text OpenURL

  11. Alperovitch A, 3C Study Grp: Vascular factors and risk of dementia: Design of the three-city study and baseline characteristics of the study population.

    Neuroepidemiology 2003, 22(6):316-325. PubMed Abstract OpenURL

  12. Little R, Rubin D: Statistical analysis with missing data. New York: Wiley; 1987. OpenURL

  13. Rubin DB, Schenker N: Multiple imputation in health-care databases: an overview and some applications.

    Stat Med 1991, 10(4):585-598. PubMed Abstract | Publisher Full Text OpenURL

  14. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests.

    J Am Med Assoc 1982, 247(18):2543-2546. Publisher Full Text OpenURL

  15. Harrell F, Lee K, Mark D: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.

    Stat Med 1996, 15(4):361-387. PubMed Abstract | Publisher Full Text OpenURL

  16. Kremers WK: Concordance for survival time data: fixed and time-dependent covariates and possible ties in predictor and time.

    Tech rep Mayo Fundation 2007. OpenURL

  17. Pencina M, D'Agostino R Sr, D'Agostino R Jr, Vasan R: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond.

    Stat Med 2008, 27(2):157-172. PubMed Abstract | Publisher Full Text OpenURL

  18. Carcaillon L, Gaussem P, Ducimetiere P, Giroud M, Ritchie K, Dartigues JF, Scarabin PY: Elevated plasma fibrin D-dimer as a risk factor for vascular dementia: the three-city cohort study.

    J Thromb Haemost 2009, 7(12):1972-1978. PubMed Abstract | Publisher Full Text OpenURL

  19. Executive Summary of The Third Report of The National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, And Treatment of High Blood Cholesterol In Adults (Adult Treatment Panel III)

    J Am Med Assoc 2001, 285(19):2486-2497. Publisher Full Text OpenURL

  20. D'Agostino RB, Stephens MA (Eds): Goodness-of-fit techniques. New York: Marcel Dekker; Inc; 1986. OpenURL

  21. Rubin DB: Multiple imputation for nonresponse in surveys. New York: Wiley; 1987. OpenURL

  22. Wang TJ, Gona P, Larson MG, Tofler GH, Levy D, Newton-Cheh C, Jacques PF, Rifai N, Selhub J, Robins SJ, Benjamin EJ, D'Agostino RB, Vasan RS: Multiple biomarkers for the prediction of first major cardiovascular events and death.

    N Engl J Med 2006, 355(25):2631-2639. PubMed Abstract | Publisher Full Text OpenURL

  23. Tzoulaki I, Murray GD, Lee AJ, Rumley A, Lowe GDO, Fowkes FGR: Relative value of inflammatory, hemostatic, and rheological factors for incident myocardial infarction and stroke - The Edinburgh artery study.

    Circulation 2007, 115(16):2119-2127. PubMed Abstract | Publisher Full Text OpenURL

Pre-publication history

The pre-publication history for this paper can be accessed here:

http://www.biomedcentral.com/1471-2288/12/24/prepub