Department of Statistics, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
National School of Public Health, Oswaldo Cruz Foundation, Rio de Janeiro, Brazil
Abstract
Background
Longitudinal studies often employ complex sample designs to optimize sample size, overrepresenting population groups of interest. The effect of sample design on parameter estimates is quite often ignored, particularly when fitting survival models. Another major problem in long-term cohort studies is the potential bias due to loss to follow-up.
Methods
In this paper we simulated a dataset with approximately 50,000 individuals as the target population and 15,000 participants to be followed up for 40 years, both based on real cohort studies of cardiovascular diseases. Two sampling strategies (simple random sampling, our gold standard, and sampling stratified by professional group with non-proportional allocation) and two loss to follow-up scenarios (non-informative censoring and losses related to the professional group) were analyzed.
Results
Two modeling approaches were evaluated: weighted and non-weighted fits. Our results indicate that under the correctly specified model, ignoring the sample weights does not affect the results. However, the model ignoring the interaction of sample strata with the variable of interest and the crude estimates were highly biased.
Conclusions
In epidemiological studies misspecification should always be considered, as different sources of variability, related to the individuals and not captured by the covariates, are always present. Therefore, allowance must be made for the possibility of unknown confounders and interactions with the main variable of interest in our data. It is strongly recommended always to correct by sample weights.
Background
It is widely acknowledged, both theoretically and in practice, that incorporating design features into the estimation of descriptive parameters, such as prevalence, can help avoid bias and reduce standard errors.
This paper was motivated by discussion of the sampling strategy used in a recent large multicenter cohort study, with approximately 50,000 people as the target population and 15,000 participants to be followed up for at least 20 years.
Stratified random sampling involves dividing the population members into non-overlapping groups called strata, defined by selected characteristics and each sampled separately. Varying sampling fractions by stratum improves the efficiency of the sample design and of estimators for relatively small but important population subgroups. As the proportion of the sample in each stratum varies, the weight of each individual will be proportional to the inverse of the sampling fraction in the respective group, as described in Kish (1965).
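As an illustration of the inverse-sampling-fraction weights described above, the following minimal sketch computes the weight for each stratum. The stratum names follow the occupational groups used later in the paper, but the split of the population across strata and the equal allocation of the sample are hypothetical; only the totals (52,750 and 15,000) match figures reported in this paper.

```python
# Hypothetical stratum sizes: only the totals (52,750 and 15,000) come from the paper.
population_size = {"professionals": 7750, "technicians": 15000, "administrative": 30000}
sample_size = {"professionals": 5000, "technicians": 5000, "administrative": 5000}


def sampling_weights(population_size, sample_size):
    """Weight of each stratum h: inverse of the sampling fraction, N_h / n_h."""
    return {h: population_size[h] / sample_size[h] for h in population_size}


weights = sampling_weights(population_size, sample_size)
for h, w in weights.items():
    fraction = sample_size[h] / population_size[h]
    print(f"{h:15s} sampling fraction={fraction:.3f}  weight={w:.2f}")

# Each sampled individual "represents" w people, so the weighted sample
# size adds up to the population size.
weighted_total = sum(weights[h] * sample_size[h] for h in weights)
print("weighted total:", weighted_total)
```

With non-proportional allocation of this kind, the largest stratum receives the largest weight, which is the situation examined in the simulation.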
Varying sample weights across the strata may induce a difference between the probability distributions for the outcome in the sample and in the population, because of the covariates included in the model. In such cases the design carries information about the outcome, and is therefore considered informative or non-ignorable.
In a survival model, where time-to-event
Another major problem in long-term cohort studies is potential bias due to loss to follow-up. This problem is widely recognized and several approaches deal with it
However, this is an unwarranted assumption in long-term cohort studies, and differential losses related to the sampling strata may increase the bias. Lawless (2003)
The next section presents the case study, describing the simulated population and two different scenarios of loss to follow-up. Next, the sampling plans and model fitting are presented. The results section uses a graphical representation to make the discussion of the impact of ignoring sample design more accessible to non-mathematical readers.
Methods: Simulation exercise
The population
A population of 52,750 individuals belonging to three sampling strata was generated. As the focus of our motivating example was a study in a working population, we defined the strata by occupational category, which relates to socioeconomic status. The groups, in descending order of occupational category, were:
Smoking affected survival in interaction with the occupational position: hazard ratios of 1.5 among
The Weibull density equation and curves for time-to-event using the parameters above are presented in Figure
Weibull distribution. Effect of changing the scale parameter on the time-to-event curve based on the simulated scenario.
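To make the role of the Weibull scale parameter concrete, here is a minimal sketch of a Weibull survival function under proportional hazards, together with inverse-transform sampling of event times. The shape and scale values below are hypothetical and are not the parameters used to generate the simulated population; the function names are introduced only for this illustration.

```python
import math
import random


def weibull_survival(t, shape, scale, hazard_ratio=1.0):
    """S(t) = exp(-HR * (t / scale) ** shape) under proportional hazards."""
    return math.exp(-hazard_ratio * (t / scale) ** shape)


def draw_event_time(rng, shape, scale, hazard_ratio=1.0):
    """Inverse-transform draw: solve S(T) = U for T, with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return scale * (-math.log(u) / hazard_ratio) ** (1.0 / shape)


rng = random.Random(2024)
shape = 1.5  # hypothetical shape parameter
for scale in (40.0, 60.0, 80.0):  # hypothetical scale parameters, in years
    times = sorted(draw_event_time(rng, shape, scale) for _ in range(20000))
    simulated_median = times[len(times) // 2]
    analytic_median = scale * math.log(2.0) ** (1.0 / shape)
    print(
        f"scale={scale:5.1f}  S(40)={weibull_survival(40.0, shape, scale):.3f}  "
        f"median: simulated={simulated_median:.1f} analytic={analytic_median:.1f}"
    )
```

A larger scale stretches the curve to the right, so a larger share of the cohort is still event-free at any fixed follow-up time; a covariate with a hazard ratio above 1 has the opposite effect.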
The sample plans
The sample size estimated in our motivating example
We generated 2,000 samples, with 15,000 individuals each, for both random and stratified sample plans. To evaluate the impact of loss to follow-up we used the same samples as already simulated, censoring individuals that had experienced the event. Two different scenarios were defined: a 15% random loss and a differential loss by sample strata (
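The two loss to follow-up scenarios can be sketched as follows. This is only an illustration under stated assumptions: the uniform distribution of dropout times, the stratum-specific loss probabilities and the function name `apply_loss` are hypothetical; the 15% random loss and the 40-year follow-up are taken from the text.

```python
import random

FOLLOW_UP_YEARS = 40.0  # length of follow-up stated in the paper


def apply_loss(event_time, loss_probability, rng):
    """Return (observed_time, event_observed) after dropout and end-of-study censoring."""
    end = min(event_time, FOLLOW_UP_YEARS)
    event_observed = event_time <= FOLLOW_UP_YEARS
    if rng.random() < loss_probability:
        dropout = rng.uniform(0.0, FOLLOW_UP_YEARS)  # assumed uniform dropout time
        if dropout < end:
            return dropout, False  # lost before the event or the end of the study
    return end, event_observed


rng = random.Random(99)
random_loss = {"professionals": 0.15, "technicians": 0.15, "administrative": 0.15}
# Hypothetical differential loss: probabilities vary with the sampling stratum.
differential_loss = {"professionals": 0.05, "technicians": 0.15, "administrative": 0.30}

true_event_time = 25.0  # hypothetical event time, in years
for stratum in random_loss:
    print(
        stratum,
        apply_loss(true_event_time, random_loss[stratum], rng),
        apply_loss(true_event_time, differential_loss[stratum], rng),
    )
```

In the first scenario the chance of being lost is unrelated to the stratum, so censoring is non-informative; in the second it depends on the stratum, which is also what determines the sample weight.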
Model fitting
Each sample was fitted using a Cox proportional hazards model. The first model, Full (Eq. 1), used the same information that generated the population, except for the parametric Weibull curve. The second model, Marginal (Eq. 2), included the strata as independent terms, but not interacting with our variable of interest
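The contrast between weighted and non-weighted fits of a model that omits the stratum interaction can be sketched with a small simulation. The code below is not the software used in this study: it is a pure-Python Newton-Raphson solver for a one-covariate weighted Cox partial likelihood (Breslow handling of ties), applied to a simplified population with two strata instead of three. All numerical values (stratum sizes, hazard ratios, smoking prevalence, Weibull parameters) are hypothetical.

```python
import math
import random

SHAPE, SCALE, FOLLOW_UP = 1.5, 40.0, 40.0                   # hypothetical Weibull baseline
POP_SIZE = {"administrative": 24000, "professionals": 6000}  # hypothetical
SMOKING_HR = {"administrative": 4.0, "professionals": 1.2}   # hypothetical interaction
SAMPLE_PER_STRATUM = 3000                                    # non-proportional allocation


def cox_log_hr(times, events, x, weights, max_iter=50):
    """Weighted Cox partial-likelihood estimate of one log hazard ratio."""
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    beta = 0.0
    for _ in range(max_iter):
        s0 = s1 = s2 = score = info = 0.0
        j = 0
        for i in order:
            # Add everyone still at risk at times[i] to the running risk-set sums.
            while j < len(order) and times[order[j]] >= times[i]:
                k = order[j]
                r = weights[k] * math.exp(beta * x[k])
                s0 += r
                s1 += r * x[k]
                s2 += r * x[k] * x[k]
                j += 1
            if events[i]:
                mean = s1 / s0
                score += weights[i] * (x[i] - mean)
                info += weights[i] * (s2 / s0 - mean * mean)
        step = score / info
        beta += step
        if abs(step) < 1e-9:
            break
    return beta


def simulate_person(rng, stratum):
    """Return (observed time, event indicator, smoker) for one individual."""
    smoker = 1 if rng.random() < 0.3 else 0  # hypothetical prevalence
    hr = SMOKING_HR[stratum] if smoker else 1.0
    u = 1.0 - rng.random()
    t = SCALE * (-math.log(u) / hr) ** (1.0 / SHAPE)
    return min(t, FOLLOW_UP), t <= FOLLOW_UP, smoker


def fit(records, weights):
    t, e, x = zip(*records)
    return cox_log_hr(t, e, x, weights)


rng = random.Random(12345)
population = {s: [simulate_person(rng, s) for _ in range(n)] for s, n in POP_SIZE.items()}
everyone = [p for people in population.values() for p in people]
beta_pop = fit(everyone, [1.0] * len(everyone))  # smoking-only model in the population

sample, w = [], []
for s, people in population.items():
    sample += rng.sample(people, SAMPLE_PER_STRATUM)
    w += [len(people) / SAMPLE_PER_STRATUM] * SAMPLE_PER_STRATUM  # inverse sampling fraction

beta_unweighted = fit(sample, [1.0] * len(sample))
beta_weighted = fit(sample, w)
print(f"population HR    : {math.exp(beta_pop):.2f}")
print(f"non-weighted HR  : {math.exp(beta_unweighted):.2f}")
print(f"weighted HR      : {math.exp(beta_weighted):.2f}")
```

Because the smoking effect differs by stratum and the strata are sampled at different rates, the non-weighted smoking-only fit targets a mixture that reflects the sample composition rather than the population, whereas the weighted fit targets the population value. In practice one would use an established survival package and design-based standard errors rather than this hand-written solver.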
The population parameters for each model are given in Table
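Estimated Population Hazard Ratios for each fitted model

| Variables | Full HR | Full SE | Marginal HR | Marginal SE | Misspecified HR | Misspecified SE |
|---|---|---|---|---|---|---|
| Administrative | 2.93 | 0.05316 | 3.71 | 0.04345 | | |
| Technicians | 2.55 | 0.05727 | 1.66 | 0.04872 | | |
| Smoking | | | 2.21 | 0.03758 | 2.47 | 0.0374 |
| Prof*Smoking | 1.41 | 0.08341 | | | | |
| Admin*Smoking | 2.93 | 0.05354 | | | | |
| Tech*Smoking | 1.93 | 0.07386 | | | | |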
Results and Discussion
Comparison of sampling schemes under different models
Considering the Full model, both sample designs and fitting strategies give unbiased estimates. For the design-related variables, variance in parameter estimates is slightly smaller with simple random sampling than with weighted sampling. On the other hand, the variance in the samples for interaction of
Simulated Hazard Ratios under the Full Model. The correctly specified model returns the same results whether or not the sample weights are considered.
The Marginal model, with just a common effect of
Simulated Hazard Ratios under Marginal Model. A large difference is observed for the hazards associated with smoking when fitting without sample weights, if the model does not include the interaction with professional category.
The Smoke-only model returned very similar results (Figure
Simulated Hazard Ratios under Smoke-only Model. The pattern is similar to the Marginal model, with similar bias.
Comparison of modeling strategies in terms of loss to follow-up
Random loss is a non-informative censoring mechanism. Therefore it affects only precision, with results similar to those presented in the previous section (Figures
Simulated Hazard Ratios with loss to follow-up under Full Model. The upper frames show the random loss to follow-up and the lower ones the non-random censoring.
Simulated Hazard Ratios with loss to follow-up under Marginal Model. The upper frames, with the random loss to follow-up, show the bias for the smoking hazard ratio for the non-weighted model. The lower frames with non-random censoring show the bias for all models.
Simulated Hazard Ratios with loss to follow-up under Smoke-only Model. The upper frame shows the bias for the non-weighted smoke-only model and the lower one the bias for all approaches due to non-random loss.
The Marginal model (Figure
Overall comparison
The average variance of the estimates for each covariate (Figure
Average Variance of Estimates according to two Scenarios: without loss and with non-random loss to follow-up. The upper frame, without loss, shows smaller variance than the lower one and a similar pattern.
Mean square error (MSE) is the sum of the variance and the squared bias of the estimates. This statistic is a good summary of the quality of a point estimate, as it combines the random and systematic errors
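The decomposition of the MSE into variance plus squared bias can be checked numerically on a set of replicate estimates. In the sketch below the estimates and the true value are hypothetical numbers chosen for illustration, and the variance is computed with the number of replicates (not one less) as divisor, so that the identity holds exactly.

```python
from statistics import fmean, pvariance


def mse_decomposition(estimates, true_value):
    """Return (bias, variance, mse) of replicate estimates around a known true value."""
    mean_estimate = fmean(estimates)
    bias = mean_estimate - true_value
    variance = pvariance(estimates, mu=mean_estimate)  # divisor is the number of replicates
    mse = fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


# Hypothetical log hazard ratio estimates from five replicate samples.
estimates = [0.78, 0.85, 0.81, 0.90, 0.76]
bias, variance, mse = mse_decomposition(estimates, true_value=0.80)
print(f"bias={bias:.4f}  variance={variance:.5f}  mse={mse:.5f}")
print(f"variance + bias^2 = {variance + bias ** 2:.5f}")
```

An estimator can therefore have a small variance and still a large MSE when its bias is substantial, which is the pattern to look for when comparing weighted and non-weighted fits.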
Mean Square Error according to two Scenarios: without loss and with non-random loss to follow-up. Both simulations, without loss (upper frame) and with loss (lower one), display a similar pattern, with the non-weighted model performing much worse.
The simulation exercise was restricted to Cox regression, with only a few scenarios. We tested many different scenarios with other covariates, omitted risk factors, and so on, but decided to present only these simpler models, so as to highlight the impact of ignoring the sample weights. Evidently, the large disparity in sample weights favored clear demonstration of the bias. However, these sample weights reflect our experience. Other modeling approaches, such as repeated measures analysis, were not implemented, and different results could be obtained.
If non-administrative censoring is considerable, then a valuable tool is to take a subdistribution hazard approach, reweighting individuals in the risk set. The sample weighting itself could be recalculated at each dropout
Conclusions
Quite often researchers do not include either sample weights or strata indicators in statistical models. Yeboah et al (2010)
Our results confirmed that, in a correctly specified model, ignoring the weights does not change the estimated parameters, and precision may improve (a result theoretically proven for inference based on ordinary least squares)
The primary objective of analyzing survey data is to make inferences about the population of interest
The stratification by professional categories, which assigns much larger weight to the lower social stratum, was guided by the need to increase the power to detect social-related risk factors. Nevertheless, almost any covariate displays different prevalence in different socioeconomic groups. Also, almost all covariates interact, positively or negatively, changing the risk. Smoking itself presents similar physiological risk across socioeconomic strata. However, belonging to the most deprived stratum implies differences in other risk factors such as larger body mass index, worse diet and inadequate exercise, all associated with cardiovascular diseases, and these are the known and easily measured risk factors. Unknown or unreliably measured factors, such as stress or mental health, will always exist. Therefore allowance has to be made for the possibility of unknown confounders and interactions in our data associated with the sample strata. Rubin
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Both authors designed, analyzed and wrote the paper.
Acknowledgements
CCC and MSC received support from the Brazilian Research Council (CNPq); MSC was also funded by the Rio de Janeiro State Research Foundation (FAPERJ).