Abstract
Background
The weighted estimators generally used for analyzing casecohort studies are not fully efficient and naive estimates of the predictive ability of a model from casecohort data depend on the subcohort size. However, casecohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI).
Methods
We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in casecohort surveys. As an illustration, we analyzed a casecohort survey from the ThreeCity study to estimate the predictive ability of Ddimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks.
Results
When the imputation model of the phase2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the ThreeCity casecohort study, elevated Ddimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.631.74). However, Ddimer levels did not improve the predictive ability of the model.
Conclusions
MI is a simple approach for analyzing casecohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.
Background
Casecohort surveys produce incomplete data by design. A subcohort is selected by simple or stratified random sampling, all subjects are followed up and the events of interest are recorded. The phase1 variables are observed for the entire cohort, whilethe phase2 variables are only known for the casecohort sample, i.e., subjects belonging to the subcohort and all those presenting the event of interest [1]. Thus, in casecohort studies, the noncases who do not belong to the subcohort are incompletely observed by design, enabling cost reduction with a small loss of efficiency.
Various approaches have been described to estimate the proportional hazard model in casecohort surveys: Weighted estimators [26] are classically used in these surveys, with analysis restricted to the completely observed subsample, so the information collected for incompletely observed noncases is ignored and inefficient estimators for the effect of phase1 variables are obtained. One of the most popular is the Borgan II estimator [4]. Scheike and Martinussen [7] proposed a maximum likelihood estimator based on proportional hazards assumption, using the EM algorithm [8], therebyincreasing efficiency as compared to weighted estimators when the relative risk and disease incidence are high. However, in general, the studied disease incidence in casecohort surveys is low. Breslow et al. [9] suggested calibrating or estimating the weights a posteriori, using all the phase1 information, to improve precision with respect to classical weighted estimators. Marti and Chavance [10] showed that multiple imputation (MI) is a good alternative to classical weighted methods for the analysis of casecohort data. When the imputation model was correct, the MI approach provided unbiased estimators of the log hazard ratios and correctly estimated the variance of its estimators. As expected, the MI approach was more precise than the usual weighted estimators for the parameters associated with phase1 variables. The former was also slightly more precise than the latter for the phase2 variable. In Marti and Chavance [10] the imputations were performed according to a correctly specified imputation model. However, in practise, the distribution of the phase2 variable is unknown and onemay wonder how MI compares to weighted estimators when the imputation model is misspecified.
No standard method exists for quantifying the usefulness or predictive ability of a model or an additional variable in the framework of casecohort surveys. The predictive ability can be measured in terms of calibration, which refers to the ability of a model to match predicted and observed values, when we are interested in individual predictions; or in terms of discrimination, which refers to the ability of a model to distinguish between subjects with or without a binary event, when we are interested in identifying a group of highrisk subjects. In the present work, we focus on discrimination.
As shown below, a naive measurement of predictive ability from casecohort data often leads to a biased estimate of the predictive ability because it varies with the censoring rate and thus depends on the subcohort size. Alternatively, because MI reconstitutes whole cohorts, any tool developed to estimate the predictive ability in the framework of cohort surveys can be applied to casecohort data, so we propose using the MI approach to estimate the predictive ability ofa model or of an additional variable and their standard errors.
The objectives of this study were 1) to evaluate MI for estimating hazard ratios when the distribution of the phase2 variable is misspecified; and 2) to present an adequate methodology for estimating the predictive ability of a model or of an additional variable in casecohort surveys. We performed a simulation study to validate the MI approach for estimating the predictive ability of a model or of an additional variable and to assess its potential limits. As an illustration, we analyzed casecohort data from the ThreeCity study [11] to estimate the predictive ability of the Ddimer plasma concentration, a marker of coagulation and fibrinolysis, on coronary heart disease (CHD) and on vascular dementia (VaD) risks.
Methods
Incomplete observations and multiple imputation
Casecohort surveys are a particular type of incomplete observations, in which data are missing at random [12] by design, as the probability of being completely observed depends only on the case status, with simple random sampling, and on some phase1 variables with stratified sampling. MI is a simple and efficient method for analyzing incomplete observations, while taking into account all the levels of uncertainty regarding missing values. This provides an approximation of the maximumlikelihood estimator and thus enables the potential selection bias to be corrected. This method relies on the generation of several plausibly completed data sets (M ≥ 2), accounting for all levels of uncertainty concerning the missing values. A prediction model must be built, taking into consideration the relationships between the incomplete variable and the other variables, as observed in the complete part of the data. The missing data are not replaced by their expectation but by a value drawnfrom the distribution posited by the model. To take into account the uncertainty concerning the parameters of the imputation model, several imputations are performed with parameters drawn from the asymptotic distribution of their estimator. An estimate of the parameter of interest, and an estimate of the variance of the estimator, are obtained from each completed data set. If the imputation model is correct, these estimators are not biased. The MI estimate, also unbiased, is the mean of the M estimates:
where M is the number of completed data sets and is the estimate of the parameter of interest provided by the m^{th }completed data set. The multiplicity of imputations enables correct estimation of the variance of this single estimator, which is the sum of 2 components: the withinimputations component, W_{MI}, and the betweenimputations component, B_{MI}:
where the factor 1 + M^{1 }is an adjustment for using a finite number of imputations [13].
MI requires a model correctly reflecting the relationship between the incomplete variable and the outcome of interest. In casecohort surveys, we need to impute phase2 variable values for the noncases who do not belong to the subcohort. Under the rare disease assumption, we have shown that a simple generalized linear model, using all the complete data (cases and noncases) and including the case indicator among the explanatory variables, has to be considered [10]. Practically, in addition to the case indicator and the stratification variables, when the subcohort was selected by stratified sampling, it is necessary to include in the imputation model all the variables appearing in the proportional hazard model. Because imputations are based on asymptotic distributions, caution is necessary, since if too few subjects present the event of interest, the distribution of the estimators can differ from the asymptotic one. As a consequence, the maximum likelihood estimator of the imputation model could be biased or not normally distributed.
Predictive ability of a model and of a supplementary variable
Harrell et al. [14] proposed the C index to measure the predictive ability of a model in cohort studies as the agreement between the order of the predicted and observed survival times in any pair of subjects (the event of interest is assumed to be death, leading to the use of survival terminology). That is, the concordance probability using all pairs of subjects in the population. However, with censored data, it is not possible to consider all the pairs of subjects because survival time is not observed for censored subjects. Let T_{i }be the survival time for subject i, i = 1,...,N, where N is the cohort size, and C_{i }the censoring time for subject i. We observe X_{i }= min(T_{i}, C_{i}). Usable pairs are those for which the order of the predicted survival times can be compared to the order of the true survival times, i.e., pairs formed by 2 uncensored subjects or an uncensored subject and a subject censored after the uncensored subject's death. A pair of censored subjects carries no information about its agreement with the expected survival provided by the model since the order of the survival times is not known. Similarly a pair formed by a subject whose survival time is observed and a subject censored before this survival time provides no information on this agreement since the unknown survival time could be anterior or posterior to the observed one. Harrell et al. [15] showed that, in the common models used for survival analysis, such as the proportional hazard model, the predicted survival times and the predicted survival probabilities at a fixed time t can be interchanged for the comparison. The Harrell's C index is defined as:
where π_{c }is the probability of concordance for a pair (i,j) and π_{d }is the probability of discordance. We assume continuous survival times and continuous predicted survival probabilities, so P(X_{i }= X_{j}) = P(Y_{i }= Y_{j}) = 0, thus π_{c }+ π_{d }= 1. C is estimated by the proportion of concordant pairs among the usable pairs. The estimated variance was given by Kremers [16].
In practice, we are often interested in estimating the predictive ability of an additional phase2 variable. Let M_{1 }be a proportional hazard model including only phase1 variables, and C_{1 }and SE_{C1 }respectively the C indexof M_{1 }and its standard error. Let M_{2 }be a proportional hazard model adding the phase2 variable to M_{1}, and C_{2 }and SE_{C2}, respectively, the C index of M_{2 }and its standard error. Harrell's predictive ability ofthe added phase2 variable is Δ = C_{2 } C_{1 }. Complementary measures of predictive ability of a new variable, such as the net reclassification improvement (NRI) and the integrated discrimination index (IDI), were proposed by Pencina [17]. NRI needs some a priori meaningful risk categories. It quantifies the correct reclassification introduced by using a model with the added variable as compared to the classification obtained without this variable. The IDI can be viewed as a continuous version of the NRI with probabilities used instead of categories. It can be defined as the discriminationslope difference between the models with and without a quantitative variable. To estimate the predictive ability of a model or of an additional variable, we reconstructed plausible whole cohorts using MI. For each reconstructed whole cohort, we could then directly obtain C_{1}, SE_{C1}, C_{2}, SE_{C2}, Δ, NRI, IDI and their respective variances. Using equations (1) and (2), we obtained the MI estimates of these quantities. Concerning the variance of Δ, the betweenimputation component is estimated by the empirical variance of the M estimates of Δ provided by the M completed data sets. However, for the withinimputation component, the asymptotic variance of the estimator provided by a complete data set, does not have an analytical form. With a fully observed cohort, bootstrapping is a way to estimate the variance of the corresponding Δ. Therefore, each whole cohort reconstructed by MI has to be resampled. In the simulations as in the real data analysis, we used 100 bootstrap samples.
Simulation study
Two phase1 variables were simulated: a binary variable, Z_{1}, and a Gaussian variable, Z_{3}, observed for the entire cohort. For the phase2 variable, Z_{2}, we considered three different distributions: normal, lognormal and uniform, all of them with unit variance, independent of Z_{1}, but having a correlation coefficient of 0.2 with Z_{3}. The survival time had an exponential distribution, with λ = exp (β_{1}Z_{1 }+ β_{2}Z_{2 }+ β_{3}Z_{3 }). β_{1}, β_{2 }and β_{3 }were fixed at the same value and set at 0 or log(2). The censoring time followed a uniform distribution over the interval [0,τ], where τ was chosen so that the probability of an event was approximately 0.03 (τ= 0.025). The cohort size was 10,000. We also simulated a phase1 variable predictive of Z_{2}, with ε ~ N (0, σ^{2}) independent of Z_{2}. The variance σ^{2 }was fixed at 1 which corresponds to a correlation between of approximately 0.7. We wanted to estimate the effect of Z_{2 }on survival time and its predictive ability. The cohort was divided into 9 strata based on the tertiles of , and the noncases were chosen by stratified sampling. Casecohort sampling was simulated with 1,000 subjects in each subcohort. The phase2 variable was not available for noncases not included in the subcohort, so MI was used to complete the data set. Thus, we built the same linear prediction model for each Z_{2 }based on the stratification phase1 variable and the case indicator. Z_{3 }was not directly included in the imputation model to predict Z_{2}, because it was a stratification variable included in the model and because of the weak correlation between Z_{2 }and Z_{3}. The imputation model was: Z_{2 }= α_{0 }+ α_{1 }I_{case }+ α_{2 }Strata + ε , where α_{0 }and α_{1 }are scalar, α_{2 }is a vector coefficient, I_{case }is the case indicator, Strata is the vector of stratum indicators and є is the vector of errors independently and identically distributed ~ N (0, σ). Thus, the imputation model was correctly specified for Z_{2 }normally distributed but misspecified for Z_{2 }lognormally or uniformly distributed. One thousand cohorts were simulated for each scenario.
Five imputations were performed and 5 complete data sets were generated for each cohort. We estimated the log hazard ratios using MI and the "Borgan II" weighted estimator [4]. We used MI to estimate the predictive ability of models with and without the phase2 variable, and the predictive ability of the phase2 variable, NRI and IDI. We also studied the consistency of the naive estimator of Harrell's C index in casecohort surveys by varying the subcohort size. Using the above simulation conditions and, exceptionally, a scenario with β_{1 }= β_{3 }= log(2) and β_{2 }= log(1.5), we simulated casecohort samples with the subcohort size set at 300 or 1,000 subjects. We estimated the predictive ability in the casecohort samples and in the multiply imputed data sets.
Casecohort survey from ThreeCity study
Briefly, the 3CStudy was designed to examine the relationship between vascular diseases and dementia in a community housing 9,294 persons aged 65 years and over between 1999 and 2001 in three French cities. The detailed methodology has been previously described [11]. A casecohort substudy was conducted [18], to investigate the relationship between biomarkers, such as plasma levels of Ddimer (a marker of coagulation and fibrinolysis) and the 4year incidence of coronary heart disease (CHD), stroke and all subtypes of dementia, including vascular dementia (VaD), in an elderly population. The phase1 variables provided information on sociodemographic characteristics, education, medical history, diet, alcohol and tobacco use. Blood pressure, height and weight were also available. A subcohort of size n = 1,254, (13.5% of the full cohort) was randomly selected, stratifying on age, sex and recruitment center. Observed cumulated incidences of CHD and VaD were approximately 2% and 0.6%, respectively. Plasma Ddimer levels were only available for phase2 subjects. Carcaillon et al. [18] treated quintiles of Ddimer level both qualitatively and linearly. They reported a linear increase in the risk of VaD according to Ddimer quintiles.
We reassessed the relationship between plasma Ddimer levels and the risk of CHD and VaD, using MI and weighted estimators, and evaluated the predictive ability of Ddimer levels on both risks. We included the same explanatory variables as Carcaillon et al. [18] although we used tertiles of Ddimer rather than quintiles, to estimate CHD and VaD risks, due to the small number of events. Therefore, to estimate the risk of CHD, the proportional hazard model includedthe phase1 variables: age, sex, center, body mass index, hypertension, hypercholesterolemia, diabetes, tobacco use, diabetes drugs, and as phase2 variables, indicators of Ddimer tertiles. To estimate the risk of VaD, the proportional hazard model included the phase1 variables: age, sex, centre, educational level, body mass index, the presence or absence of an apolipoprotein є4 allele and indicators of Ddimer tertiles.
For each outcome (CHD or VaD), it was necessary to reproduce the relationships among the incomplete variable, the outcomes and the confounder variables. For each outcome, we built an imputation model of tertiles of Ddimer levels, including the variables used in the proportional hazard model and the caseindicator. We estimated the predictive ability of proportional hazard models, without (C_{1}) and with (C_{2}) Ddimer levels, Δ = C2  C1, and IDI for CHD and VaD risks. The NRI requires that some a priori meaningful risk categories be known. Based on the Third Adult Treatment Panel [ATP III] [19] risk classification for the 10year risk of CHD, we adapted the cutoffs to 4year risk. For VaD, we do not know a priori meaningful risk categories and did not compute NRI.
Results
Simulation study
The mean fraction of missing information about the effect of Z_{2 }ranged from 5 to 14 percent when β_{2 }= 0 and from 23 to 30 per cent when β_{2 }= log(2) (data not shown). For each estimator (full cohort, casecohort with MI and casecohort with weights), we give the mean of the estimated coefficients, the mean of their standard error estimates, the observed standard error of the estimated coefficients and the mean squared errors of 1000 simulations (Table 1). Not surprisingly, the full cohort estimates and the casecohort weighted estimates of the log hazard ratios were unbiased. Similarly, with a correctly specified normal imputation model, all MI estimates were unbiased. With a misspecified normal imputation model, MI estimate of the effect β_{2 }= log(2) of Z_{2 }was biased (13%) when Z_{2 }was lognormally distributed. When Z_{2 }was uniformly distributed, MI estimate of the effect of Z_{2 }was slightly biased (5%). With a misspecified normal imputation model and β_{2}= 0, no bias was observed. TheMI variance and the weighted estimator variance agreed with the observed dispersions of the estimates. The observed dispersion was always smaller with MI than with the weighted estimator. For the phase1 variables, this dispersion was similar for the entire cohort and with MI, whatever the distribution of the phase2 variable. For the estimated effect of the phase2 variable, the observed standard deviations were smaller with MI than with the Borgan II weighted estimator but, as expected, slightly larger with MI than in the full cohort analyses. Altogether, the mean squared errors were smaller with MI than with the weighted estimator, except for the effect of the phase2 variable with β_{2 }= log(2) and Z_{2 }was lognormally distributed.
Table 1. Mean of the log hazard ratio estimates (Est), mean of the standard error estimates , standard error of the estimates (SE) and mean of the mean square error (MSE). Results of 1,000 simulations.
The results concerning the consistency of the naive estimator of Harrell's C index are reported in Table 2. In the scenario β_{1 }= β _{2 }= β_{3 }= 0, the mean C index was nearly 0.5 for both models, without and with Z_{2}, whatever the analysis performed. In the scenarios β_{1 }= β_{3 }= log(2) and β_{2 }= log(1.5) or β_{2 }= log(2), the naive computation of C with the casecohort data led to lower predictive ability than with the full cohort, especially for the smaller subcohort. Bycontrast, the Harrell's C indexes estimated by MI were similar to those computed for the full cohort and did not depend on the subcohort size. The estimated dispersion of the C index was slightly greater than the observed dispersion of the estimates. The rejection percentage of the null hypothesis Δ = 0 was always similar to the full cohort analysis and to MI. As a consequence of the standard error overestimation, the observed first type error rate was slightly lower than 5%. Nevertheless, in the considered scenarios, the observed power was very high. As expected, the loss of power when comparing casecohort with MI to full cohort analysis was small: with β_{2 }= log(1.5), the observed power was 84.6% with a subsample size of 300, and 90.6% with a subsample size of 1000 versus 91.6% with the full cohort. MI estimates of NRI and IDI indexes were close to those obtained with the full cohort analysis and did not depend on the subcohort size. As compared to the full cohort results, the rejection percentage of the null hypothesis NRI = 0 was smaller with MI analysis when β_{2 }= 0, larger when β_{2 }= log(1.5) and similar when β_{2 }= log(2). When the effect of the phase2 variable was not null, the rejection percentage of the null hypothesis IDI = 0 was similar with MI and with full cohort analysis. By contrast, whatever the effect of the phase2 variable, the estimation of NRI and IDI in the casecohort sample provided larger measures of these indexes than the full cohort analysis.
Table 2. Mean of the predictive ability estimates (Est), mean of the standard error estimates and standard error of the estimates (SE).
Table 3 gives the results of the estimated predictive abilities for the correctly specified and the two misspecified normal imputation models. Full cohort analysis and MI provided similar predictive abilities estimates when the imputation model was correctly specified or when the phase2 variable had no effect on the studied risk. In the scenario β_{1 }= β_{2 }= β_{3 }= log(2), when Z_{2 }was uniformly distributed, MI and full cohort analysis still provided similar estimates. However, when Z_{2 }was lognormally distributed, the MI estimate was slightly smaller than the full cohort estimate 15%).
Table 3. Predictive ability of the two models and of the phase2 variable.
Mean of the predictive ability estimates (Est), mean of the standard error estimates and standard error of the estimates (SE), with a correctly specified normal imputation model (Z_{2 }normally distributed), and with two misspecified normal imputation models (Z_{2 }lognormally and uniformly distributed)
Application to the ThreeCity study
The mean fraction of missing information about the effect of Ddimer was 4.9 and 3.7 per cent for CHD and VaD risks, respectively. Table 4 gives the estimated hazard ratios associated with Ddimer tertiles. The MI and the weighted approaches yielded similar estimates and precision. The CI of the hazard ratio associated with the linear effect of a onetertile difference were respectively (0.941.38) versus (0.921.38) for CHD and (1.132.53) versus (1.132.67) for VaD. For phase1 variables, both estimators provided similar results, but MI was always the more precise (data not shown).
Table 4. Estimates of hazard ratios (HR) and 95% confidence interval (CI) associated with Ddimer tertiles.
Harrell's C for the models including only phase1 variables were above 0.69 for CHD risk and above 0.86 for VaD risk (Table 5). Hence, CHD and VaD risks were largely explained by standard risk factors, and the inclusion of plasma Ddimer levels did not significantly improve the predictive ability of the model, despite the fact that elevated Ddimer levels significantly increased the VaD risk. For CHD as for VaD, the index did not significantly differ from 0.
Table 5. Predictive ability and 95% confidence interval (CI) of DDimer tertiles on cardiovascular heart disease (CHD) and vascular dementia (VaD) risks.
Discussion
Use of a consistent estimator does not guarantee the absence of any bias for finite sample. We only showed that MI analysis of casecohort data provides unbiased estimates of the loghazard ratio when the imputation model and the proportional hazard model are correctly specified. The misspecification of the imputation model can originate from an erroneous choice of the distribution, or from wrongly assuming that the estimator of the imputation model is consistent and normal, or from the omission of some important explanatory variable. Imputations carried out using a misspecified distribution in the imputation model can provide biased estimates of hazard ratios, especially, if the specified distribution of the phase2 variable differs from the true one in terms of symmetry (lognormal versus normal distribution). The negative bias on a log hazard ratio of 0.69 was noticeable but not large when a lognormal variable was imputed according to a normal distribution (0.09 or 13%), but it is clearly a type of misspecification easily identified with diagnostic tools [20]. One can then transform the incomplete variable in order to obtain a symmetrical distribution, impute transformed values and apply the inverse transformation to the imputed values. Note that although a normal and a uniform distribution are quite different, both are symmetrical and the observed bias was quite smaller (only 5%). In the 3C study of the relationship between VaD and Ddimer, we observed slightly different estimates of the log hazard ratio when comparing the third to the first tertile (2.77 versus 2.93, i.e. a relative difference of 8% between the MI and the weighted estimates). This is probably because of the qualitative imputation of Ddimer, and thus, the use of a multinomial imputation model, which implied estimation of parameters in separate strata defined by Ddimer concentration tertiles, some of which had a small number of events. Due to these small numbers (only 51 VaD in total), asymptotic conditions might not have been fulfilled in at least some strata, and the estimated coefficients of the imputation model could have been biased and notnormally distributed. We give below some recommendations regarding the choice of explanatory variables in the imputation model. Since the potential bias of MI estimates can be detected by comparing them to weighted estimates, we suggest building the proportional hazard model by using only the casecohort data and weighted estimators. MI can eventually be used to reanalyze the data with the selected model to improve the precision of the results, while verifying that no bias was introduced.
In simulated data, for the phase1 variables, the precision of MI and full cohort estimates was similar and smaller than with the weighted estimator. For the phase2 variable, MI estimates were slightly more precise than weighted estimates. Globally, the mean squared errors were smaller with MI than with the weighted estimator, with one exception implying a normal imputation model for a lognormally distributed phase2 variable, an error which should easily be avoided.
There is no standard method for estimating the predictive ability of a model in the framework of casecohort surveys. We showed that the naive application of the C index to casecohort surveys yielded an underestimation of the predictive ability of the model that depended on the subcohort size when the phase2 variable had an effect on the risk. Similarly, the naive estimates of the predictive ability of an added phase2 variable differed notably from the full cohort values when the effect of the phase2 variable was not null. Harrell's C index could theoretically be estimated with a weighted approach, but this can be computationally difficult because it requires weighting each pair by the pairwise sampling probabilities, i.e., using a square matrix of size N'(N'1), where N' is the size of the casecohort sample. Computing the variance of this HorvitzThompson estimator requires either weighting each quadruplet by the quadruplewise sampling probabilities, i.e., working with a matrix of size N'(N'1)(N'2)(N'3), or bootstrapping the casecohort data. By contrast, MI easily allows estimation of the predictive ability of a model or of an additional phase2 variable and their variances in the context of casecohort data, only requiring bootstrapping to estimate the variance of the predictive ability of the phase2 variable. MI provided estimates of Harrell C, NRI and IDI indexes similar to those obtained with the full cohort analysis. Note, however, that the predictive abilities were always overestimated because the same data were used to estimate the model and its predictive ability.
Analysis of the ThreeCity casecohort study was in agreement with our previous work [10]. The weighted and the MI approaches yielded similar estimates of the hazard ratios and MI was slightly more precise, particularly for phase1 variables. The relative differences between both estimates was always below 2% for the hazard ratios related to CHD and Ddimer, but as early discussed, they could be slightly higher (8%) for a hazard ratio related to VaD and Ddimer. The precision was similar for both analyses.
The imputation model must reflect the association between the incomplete variable, the outcome and the other explanatory variables. Therefore, variables included in the proportional hazard model as well as the stratification variables must be included in the imputation model. If a surrogate of the phase2 variable is available, it should also be included in the imputation model. On the other hand, multiple imputation approach can provide unbiased and more efficient estimates than weighted analysis even when no strong predictor of the phase2 variable is available [10]. The inclusion of additional variables other than strongly predictive variables can lead to an increased interimputation variance. This prompted the use of different imputation models for Ddimer levels in the CHD and VaD analyses. However, we verified that adding the variables only used in the CHD analysis to the model used for VaD, did not modify the results observed in the former (data not shown).
The number of requested imputations depends on the proportion of missing information which, in casecohort studies, is considerably smaller than the percentage of incompletely observed subjects. Rubin showed that with as much as 40 per cent information missing, M = 5 imputations provides an asymptotic relative efficiency was 0.97, and, with 50 per cent missing information, M = 10 provides an asymptotic relative efficiency of 0.98. Thus, a small number of imputations, 510, should suffice [21]. In our analyses, we used 5 imputations to limit the computer time of the simulations, a reasonable choice since the proportion of missing information was always smaller than 30 per cent. However, a slightly larger number of imputations (e.g. 10) could have been performed on the 3C study data at a reasonable time cost; it would have provided a more precise estimate of the between imputation variance and of the percentage of missing information.
The VaD risk increased with Ddimer tertiles. However, Ddimer inclusion did not significantly improve the predictive ability of the model for VaD risk. Computations of the C and IDI index yielded the same conclusion. To our knowledge, no other results concerning the predictive ability of Ddimer on the risk of VaD have been published to date. The risk of CHD did not vary with Ddimer, so, not surprisingly, the predictive ability of this variable was negligible, regardless of the index used. Wang et al. [22] and Tzoulaki [23] reported that the use of 10 and 4 biomarkers respectively added only moderately to the overall risk prediction based on conventional cardiovascular risk factors.
Conclusions
MI is a simple alternative approach to weighted analysis for analyzing casecohort surveys, obtaining correct estimates of the log hazard ratios and their standard errors, improving precision for the phase1 variable estimates, and providing at least the same precision as weighted estimators for phase2 variable estimates. It allows an easy evaluation of the predictive ability of the model and, more generally, any tool proposed in the framework of cohort studies can be applied to casecohort data using MI.
Abbreviations
MI: Multiple imputation; CHD: Coronary heart disease; VaD: Vascular dementia; NRI: Net reclassification index; IDI: Integrated discrimination index.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
HM conducted the literature review, simulations, data analyses and wrote the manuscript. LC conducted the analysis of the relationship between Ddimer levels and CHD and VaD risks and supervised the epidemiological aspects of the application to the ThreeCity study. MC conducted and supervised the writing of the manuscript. All authors have read the manuscript, are in agreement that the work is ready for submission to the journal, and accept responsibility for the manuscript's contents.
Acknowledgements
This study was supported by a grant from the Région ÎledeFrance. It used data from the ThreeCity study which is conducted under an agreement between the Institut National de la Santé et de la Recherche Médicale and the Université Victor SegalenBordeaux 2. This manuscript was not prepared in collaboration with the 3C study Steering Committee and does not necessarily reflect its opinions or views.
References

Prentice R: A casecohort design for epidemiologic cohort studies and disease prevention trials.
Biometrika 1986, 73:111. Publisher Full Text

Chen K, Lo SH: Casecohort and casecontrol analysis with cox's model.
Biometrika 1999, 86(4):755764. Publisher Full Text

Therneau TM, Li H: Computing the cox model for case cohort designs.
Lifetime Data Anal 1999, 5(2):99112. PubMed Abstract  Publisher Full Text

Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J: Exposure stratified casecohort designs.
Lifetime Data Anal 2000, 6:3958. PubMed Abstract  Publisher Full Text

Kulich M, Lin D: Improving the efficiency of relativerisk estimation in caseCohort studies.
J Am Stat Assoc 2004, 99:832844. Publisher Full Text

Langholz B, Jiao J: Computational methods for casecohort studies.
Comput Stat Data Anal 2007, 51(8):37373748. Publisher Full Text

Scheike TH, Martinussen T: Maximum likelihood estimation for Cox's regression model under caseCohort sampling.

Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm.

Breslow N, Lumley BCCLT, Kulich M: Using the whole cohort in the analysis of casecohort data. [http://dx.doi.org/10.1093/aje/kwp055] webcite
Am J Epidemiol 2009, 169(11):13981405. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marti H, Chavance M: Multiple imputation analysis of casecohort studies.
Stat Med 2011, 30(13):15951607. PubMed Abstract  Publisher Full Text

Alperovitch A, 3C Study Grp: Vascular factors and risk of dementia: Design of the threecity study and baseline characteristics of the study population.
Neuroepidemiology 2003, 22(6):316325. PubMed Abstract

Little R, Rubin D: Statistical analysis with missing data. New York: Wiley; 1987.

Rubin DB, Schenker N: Multiple imputation in healthcare databases: an overview and some applications.
Stat Med 1991, 10(4):585598. PubMed Abstract  Publisher Full Text

Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests.
J Am Med Assoc 1982, 247(18):25432546. Publisher Full Text

Harrell F, Lee K, Mark D: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.
Stat Med 1996, 15(4):361387. PubMed Abstract  Publisher Full Text

Kremers WK: Concordance for survival time data: fixed and timedependent covariates and possible ties in predictor and time.

Pencina M, D'Agostino R Sr, D'Agostino R Jr, Vasan R: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond.
Stat Med 2008, 27(2):157172. PubMed Abstract  Publisher Full Text

Carcaillon L, Gaussem P, Ducimetiere P, Giroud M, Ritchie K, Dartigues JF, Scarabin PY: Elevated plasma fibrin Ddimer as a risk factor for vascular dementia: the threecity cohort study.
J Thromb Haemost 2009, 7(12):19721978. PubMed Abstract  Publisher Full Text

Executive Summary of The Third Report of The National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, And Treatment of High Blood Cholesterol In Adults (Adult Treatment Panel III)
J Am Med Assoc 2001, 285(19):24862497. Publisher Full Text

D'Agostino RB, Stephens MA (Eds): Goodnessoffit techniques. New York: Marcel Dekker; Inc; 1986.

Rubin DB: Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.

Wang TJ, Gona P, Larson MG, Tofler GH, Levy D, NewtonCheh C, Jacques PF, Rifai N, Selhub J, Robins SJ, Benjamin EJ, D'Agostino RB, Vasan RS: Multiple biomarkers for the prediction of first major cardiovascular events and death.
N Engl J Med 2006, 355(25):26312639. PubMed Abstract  Publisher Full Text

Tzoulaki I, Murray GD, Lee AJ, Rumley A, Lowe GDO, Fowkes FGR: Relative value of inflammatory, hemostatic, and rheological factors for incident myocardial infarction and stroke  The Edinburgh artery study.
Circulation 2007, 115(16):21192127. PubMed Abstract  Publisher Full Text
Prepublication history
The prepublication history for this paper can be accessed here: