Department of Social and Preventive Medicine, University of Montreal, PO Box 6128, Downtown Station, Montreal, Quebec H3C 3J7, Canada
CHUM Research Centre, 3875 rue Saint-Urbain, Montreal, Quebec H2W 1V1, Canada
University of Bordeaux, ISPED, Centre INSERM U897-Epidemiology-Biostatistics, 146 rue Leo Saignat, Bordeaux, F-33000, France
Abstract
Background
Case-control studies are generally designed to investigate the effect of exposures on the risk of a disease. Detailed information on past exposures is collected at the time of study. However, only the cumulated value of the exposure at the index date is usually used in logistic regression. A weighted Cox (WC) model has been proposed to estimate the effects of time-dependent exposures. The weights depend on the age-conditional probabilities of developing the disease in the source population. While the WC model provided more accurate estimates of the effect of time-dependent covariates than standard logistic regression, the robust sandwich variance estimates were lower than the empirical variance, resulting in a low coverage probability of confidence intervals. The objectives of the present study were to investigate through simulations a new variance estimator and to compare the estimates from the WC model and standard logistic regression for estimating the effects of correlated temporal aspects of exposure with detailed information on exposure history.
Method
We proposed a new variance estimator using a superpopulation approach, and compared its accuracy to the robust sandwich variance estimator. The full exposure histories of source populations were generated and case-control studies were simulated within each source population. Different models with selected time-dependent aspects of exposure such as intensity, duration, and time since cessation were considered. The performances of the WC model using the two variance estimators were compared to standard logistic regression. The results of the different models were finally compared for estimating the effects of correlated aspects of occupational exposure to asbestos on the risk of mesothelioma, using population-based case-control data.
Results
The superpopulation variance estimator provided better estimates than the robust sandwich variance estimator and the WC model provided accurate estimates of the effects of correlated aspects of temporal patterns of exposure.
Conclusion
The WC model with the superpopulation variance estimator provides an alternative analytical approach for estimating the effects of time-varying exposures with detailed exposure history information in case-control studies, especially if many subjects have time-varying exposure intensity over lifetime, and if only one control is available for each case.
Background
Population-based case-control studies are widely used in epidemiology to investigate the association between environmental or occupational exposures over lifetime and the risk of cancer or other chronic diseases. Many of the exposures of interest are protracted and a huge amount of information is often retrospectively collected for each subject about his/her potential past exposure over lifetime. For example, for occupational exposures, the whole occupational history is usually investigated for each subject, and different methods exist to estimate the average dose of exposure at each past job
A time-dependent weighted Cox (WC) model has recently been proposed to incorporate this dynamic information on exposure, in order to more accurately estimate the effect of time-dependent exposures in population-based case-control studies
There is an extensive statistical literature on the weighted analyses of cohort sampling designs (see among many others
The asymptotic properties of the Lin variance estimator have been investigated and a small simulation study has been conducted to investigate these properties in finite samples
The first objective of the present study is to investigate through extensive simulations the accuracy of the Lin variance estimator for estimating the effects of time-varying covariates in case-control data, using the weights proposed in the WC model
The regression model and the variance estimators
The WC model
The Cox proportional hazards model specifies the hazard function as
where
where
with
In the WC model proposed for case-control data
where
The weights defined in Equation (2) can be implemented in any statistical software that handles time-dependent weights in the Cox model, such as the coxph function in R or the PHREG procedure in SAS.
The variance estimators
The robust sandwich variance estimator for
where
The robust variance estimator in Equation (3) can be rewritten as
M1 <- coxph(Surv(start, stop, event) ~ x + cluster(id), weights = weight)
V1 <- M1$var
with the vector of weights derived from Equation (2) for the WC model.
The robust variance estimator
With R, the superpopulation variance estimate from Equation (5) can simply be obtained using the command
V2 <- V1 + M1$naive.var
Throughout this paper, the WC model using the robust variance estimator
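As a minimal sketch, the two R commands given above can be run end to end on toy data. Everything in the data frame below is hypothetical and only illustrates the mechanics: the ages, the covariate x, and in particular the weights (1 for cases, an arbitrary value of 5 for controls) are placeholders, not the weights of Equation (2). The sketch assumes the survival package is installed.

```r
library(survival)

# Hypothetical toy data in counting-process format: two age intervals per
# subject, (0, 40] and (40, end], with a covariate x that changes at age 40.
set.seed(1)
n    <- 60
end  <- round(runif(n, 45, 90))   # age at event or at end of follow-up
case <- rbinom(n, 1, 0.5)         # 1 = case, 0 = control

d <- data.frame(
  id     = rep(seq_len(n), each = 2),
  start  = rep(c(0, 40), times = n),
  stop   = as.vector(rbind(40, end)),
  event  = as.vector(rbind(0, case)),   # an event can only close the last interval
  x      = rnorm(2 * n),
  # Placeholder weights (assumption): NOT derived from Equation (2).
  weight = rep(ifelse(case == 1, 1, 5), each = 2)
)

# Weighted Cox model; cluster(id) requests the robust sandwich variance.
M1 <- coxph(Surv(start, stop, event) ~ x + cluster(id),
            data = d, weights = weight)

V1 <- M1$var               # robust sandwich variance
V2 <- V1 + M1$naive.var    # robust variance plus the naive (model-based) variance

# Standard errors under the two estimators
sqrt(diag(V1))
sqrt(diag(V2))
```

Because the naive variance is positive, the standard error based on V2 is always larger than the one based on V1 for the same fitted coefficient.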
Simulations
Overview of the simulation design
The main objective of the simulation study was to evaluate the performance of Lin’s superpopulation variance estimator
We generated 1000 source populations of 1000 or 5000 individuals each, and within each source population, we simulated a case-control study. The age at event for each subject in each source population was generated from a standard Cox model with time-dependent covariates, using a permutation algorithm described elsewhere and assuming a Weibull marginal distribution of age at event
The distributions of the exposure variables were chosen to be close to the observed distributions of occupational asbestos exposure variables in our case-control data on PM
Censoring for age at event in the source population was independently generated from a uniform distribution such that the event rate was about 10% in each source population of 1000 subjects, and 2% in each source population of 5000 subjects. Each subject of the source population who had the event of interest was selected as a case in the case-control dataset. The event rates in the source population thus implied that we had about 100 cases in each case-control dataset. For each case, 1, 2, or 4 controls were randomly selected with replacement among subjects at risk at the case’s event age, which corresponds to 1:1, 1:2, or 1:4 individual matching on age, respectively. On average, each case-control dataset was therefore made of about 100 cases and 100, 200, or 400 controls.
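The control selection described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the authors' simulation code: the source population is a hypothetical one with uniform ages and a 10% event rate, the function name sample_case_control is invented for the example, the case itself is excluded from its own risk set, and sampling with replacement is applied within each risk set.

```r
set.seed(42)

# Hypothetical source population: age at event or censoring, and event indicator.
n_pop   <- 1000
age_end <- runif(n_pop, 40, 90)
event   <- rbinom(n_pop, 1, 0.10)

# For each case, draw n_controls subjects at risk at the case's event age.
sample_case_control <- function(age_end, event, n_controls = 1) {
  case_ids <- which(event == 1)
  sets <- lapply(case_ids, function(i) {
    # Risk set: subjects still under observation at the case's event age
    # (assumption: the case itself is not eligible as its own control).
    at_risk <- setdiff(which(age_end >= age_end[i]), i)
    if (length(at_risk) == 0) return(NULL)   # no eligible control: drop the set
    controls <- at_risk[sample.int(length(at_risk), n_controls, replace = TRUE)]
    data.frame(
      set       = i,                         # matched set, labelled by its case
      id        = c(i, controls),
      case      = c(1L, rep(0L, n_controls)),
      index_age = age_end[i]                 # event age of the case
    )
  })
  do.call(rbind, Filter(Negate(is.null), sets))
}

cc <- sample_case_control(age_end, event, n_controls = 2)   # 1:2 matching
table(cc$case)
```

A subject who later becomes a case can be drawn as a control for an earlier case, which is the usual behaviour of sampling from the risk set.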
Analytical methods used to analyze the simulated data
Each case-control sample was analyzed using four regression models (WC1 and WC2 models and two standard logistic regression models) that were correctly specified in terms of the exposure variables included. In the WC1 and WC2 models, the exposure variables were time-dependent, and the probability
Statistical criteria used to compare the performance of the different estimators
For each of the four regression models WC1, WC2, CLR, and ULR, we calculated the relative bias of the regression parameter estimator
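The criteria reported in the simulation tables (relative bias, RMSE, ASE/SDE, and coverage rate) can be computed from the replicate estimates along the following lines. This is a sketch using the usual definitions of these quantities, which are assumed here rather than taken from the paper's formulas; the function name performance, the Wald-type confidence intervals, and the simulated inputs are all hypothetical.

```r
# beta_hat: estimates over simulated samples; se_hat: their estimated standard
# errors; beta_true: the true coefficient used to generate the data.
performance <- function(beta_hat, se_hat, beta_true, level = 0.95) {
  z     <- qnorm(1 - (1 - level) / 2)
  lower <- beta_hat - z * se_hat     # Wald-type interval (assumption)
  upper <- beta_hat + z * se_hat
  c(
    rel_bias_pct = 100 * (mean(beta_hat) - beta_true) / beta_true,
    rmse         = sqrt(mean((beta_hat - beta_true)^2)),
    ase_over_sde = mean(se_hat) / sd(beta_hat),   # < 1: variance underestimated
    coverage_pct = 100 * mean(lower <= beta_true & beta_true <= upper)
  )
}

# Hypothetical replicate results, only to exercise the function.
set.seed(2024)
beta_true <- 1.39
beta_hat  <- rnorm(1000, mean = beta_true, sd = 0.15)
se_hat    <- rep(0.15, 1000)

res <- performance(beta_hat, se_hat, beta_true)
round(res, 3)
```

An ASE/SDE ratio below 1 goes together with a coverage rate below the nominal level, which is the pattern the tables show for WC1.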
Simulation results
Table
| Population size | Case:control ratio | Intensity patterns (a) | Exposure variables (true effect) | Method (b) | Relative bias (%) (c) | Relative bias / Cox pop (%) (c) | Relative efficiency (d) | RMSE × 10^3 (e) | ASE/SDE (e) | Cov. rate (e) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 1:1 | A | Intensity (1.39) | WC1 | 2.9 | 2.4 | 0.61 | 158 | 0.87 | 89.1 |
| | | | | WC2 | | | | | 1.17 | 97.5 |
| | | | | CLR | 5.9 | 5.5 | 0.14 | 327 | 0.95 | 96.5 |
| | | | | ULR | −2.6 | −3.0 | 0.31 | 218 | 0.97 | 93.1 |
| | | | Duration (0.05) | WC1 | 3.3 | 2.0 | 0.41 | 14 | 0.82 | 88.3 |
| | | | | WC2 | | | | | 1.08 | 97.1 |
| | | | | CLR | 6.2 | 4.6 | 0.19 | 20 | 0.96 | 95.7 |
| | | | | ULR | −5.3 | −6.6 | 0.35 | 15 | 1.03 | 95.1 |
| | 1:1 | B | Intensity (1.39) | WC1 | 2.6 | 2.7 | 0.59 | 158 | 0.88 | 89.9 |
| | | | | WC2 | | | | | 1.18 | 98.7 |
| | | | | CLR | 3.4 | 3.4 | 0.14 | 315 | 0.94 | 94.9 |
| | | | | ULR | −3.8 | −3.7 | 0.31 | 219 | 0.98 | 92.0 |
| | | | Duration (0.05) | WC1 | 2.0 | 2.4 | 0.45 | 14 | 0.79 | 88.3 |
| | | | | WC2 | | | | | 1.04 | 96.1 |
| | | | | CLR | 1.9 | 2.2 | 0.21 | 21 | 0.94 | 94.2 |
| | | | | ULR | −8.6 | −8.3 | 0.39 | 16 | 0.99 | 93.4 |
| 5000 | 1:1 | B | Intensity (1.39) | WC1 | 7.3 | 9.3 | 0.20 | 254 | 0.72 | 76.1 |
| | | | | WC2 | | | | | 0.85 | 85.6 |
| | | | | CLR | −0.9 | 0.7 | 0.10 | 325 | 0.87 | 89.8 |
| | | | | ULR | −3.3 | −1.7 | 0.23 | 219 | 0.92 | 91.8 |
| | | | Duration (0.05) | WC1 | 1.6 | 7.0 | 0.17 | 28 | 0.79 | 89.0 |
| | | | | WC2 | | | | | 0.90 | 93.0 |
| | | | | CLR | −15.7 | −12.5 | 0.19 | 27 | 0.92 | 90.8 |
| | | | | ULR | −15.4 | −11.9 | 0.32 | 22 | 0.94 | 90.4 |
| | 1:2 | B | Intensity (1.39) | WC1 | −0.3 | 1.6 | 0.25 | 203 | 0.78 | 86.7 |
| | | | | WC2 | | | | | 0.93 | 92.8 |
| | | | | CLR | −3.0 | −1.2 | 0.22 | 218 | 0.98 | 93.1 |
| | | | | ULR | −3.5 | −1.7 | 0.34 | 181 | 0.96 | 91.8 |
| | | | Duration (0.05) | WC1 | −3.4 | 1.3 | 0.27 | 22 | 0.85 | 89.2 |
| | | | | WC2 | | | | | 1.00 | 94.5 |
| | | | | CLR | −10.0 | −6.5 | 0.33 | 21 | 0.95 | 92.3 |
| | | | | ULR | −10.2 | −6.5 | 0.45 | 18 | 0.96 | 93.3 |
| | 1:4 | B | Intensity (1.39) | WC1 | −5.3 | −3.6 | 0.37 | 191 | 0.80 | 85.0 |
| | | | | WC2 | | | | | 0.99 | 91.5 |
| | | | | CLR | −3.9 | −2.2 | 0.36 | 187 | 0.93 | 89.8 |
| | | | | ULR | −3.6 | −1.8 | 0.47 | 164 | 0.93 | 91.0 |
| | | | Duration (0.05) | WC1 | −10.6 | −6.5 | 0.39 | 19 | 0.86 | 88.9 |
| | | | | WC2 | | | | | 1.06 | 94.6 |
| | | | | CLR | −11.1 | −7.3 | 0.49 | 17 | 0.95 | 91.7 |
| | | | | ULR | −10.9 | −6.9 | 0.58 | 16 | 0.95 | 92.6 |

(a) Exposure intensity was either constant over lifetime for 85% of the subjects, highly increasing for 6%, moderately decreasing for 6%, and moderately increasing for 3% (Scenario A); or was highly increasing for 50% and moderately decreasing for 50% (Scenario B).
(b) WC1, weighted Cox model with robust sandwich variance; WC2, weighted Cox model with superpopulation variance; CLR, conditional logistic regression on age; ULR, unconditional logistic regression adjusted for age as a continuous covariate.
(c) Relative bias as compared to the true effect and as compared to the estimated effect of the Cox model using the full source population data. Each of these two biases was the same for WC1 and WC2 since these models used the same regression parameter estimator.
(d) Relative efficiency as compared to the Cox model estimated on the full source population. This quantity was the same for WC1 and WC2 since these models used the same regression parameter estimator.
(e) RMSE, root mean squared error (same for WC1 and WC2, which used the same regression parameter estimator); ASE, average standard errors; SDE, standard deviation of the estimates; Cov. rate, coverage rate of the confidence intervals.
| Model | Intensity patterns (a) | Exposure variables (true effect) | Method (b) | Relative bias (%) (c) | Relative bias / Cox pop (%) (c) | Relative efficiency (d) | RMSE × 10^3 (e) | ASE/SDE (e) | Cov. rate (e) |
|---|---|---|---|---|---|---|---|---|---|
| 2 | A | Intensity (1.39) | WC1 | 3.7 | 2.7 | 0.60 | 164 | 0.86 | 91.1 |
| | | | WC2 | | | | | 1.18 | 98.3 |
| | | | CLR | 9.5 | 8.3 | 0.09 | 435 | 0.82 | 96.3 |
| | | | ULR | −2.9 | −3.9 | 0.28 | 230 | 0.97 | 94.1 |
| | | Duration (0.05) | WC1 | 1.5 | 1.9 | 0.44 | 16 | 0.80 | 88.5 |
| | | | WC2 | | | | | 1.05 | 95.7 |
| | | | CLR | 4.9 | 4.7 | 0.13 | 29 | 0.84 | 95.5 |
| | | | ULR | −11.9 | −11.8 | 0.36 | 19 | 1.01 | 93.6 |
| | | Age at first exposure (−0.11) | WC1 | 4.7 | 3.1 | 0.44 | 32 | 0.79 | 86.3 |
| | | | WC2 | | | | | 1.04 | 95.3 |
| | | | CLR | 9.9 | 7.7 | 0.18 | 50 | 0.92 | 95.1 |
| | | | ULR | 0.4 | −1.2 | 0.39 | 33 | 1.00 | 95.1 |
| | B | Intensity (1.39) | WC1 | 3.1 | 2.8 | 0.64 | 161 | 0.88 | 90.1 |
| | | | WC2 | | | | | 1.19 | 98.4 |
| | | | CLR | 6.9 | 6.5 | 0.10 | 405 | 0.84 | 94.1 |
| | | | ULR | −4.4 | −4.7 | 0.32 | 229 | 0.98 | 93.3 |
| | | Duration (0.05) | WC1 | 1.3 | 1.3 | 0.49 | 16 | 0.82 | 90.4 |
| | | | WC2 | | | | | 1.09 | 96.5 |
| | | | CLR | 5.6 | 5.0 | 0.17 | 27 | 0.91 | 95.0 |
| | | | ULR | −12.7 | −12.8 | 0.37 | 19 | 1.00 | 92.9 |
| | | Age at first exposure (−0.11) | WC1 | 3.6 | 2.8 | 0.51 | 30 | 0.83 | 89.8 |
| | | | WC2 | | | | | 1.09 | 96.5 |
| | | | CLR | 7.7 | 6.1 | 0.17 | 53 | 0.87 | 94.7 |
| | | | ULR | −1.7 | −2.6 | 0.40 | 34 | 0.99 | 95.5 |
| 3 | A | Intensity (1.39) | WC1 | 3.4 | 3.0 | 0.58 | 165 | 0.84 | 90.3 |
| | | | WC2 | | | | | 1.13 | 97.0 |
| | | | CLR | 6.0 | 5.5 | 0.14 | 333 | 0.92 | 95.9 |
| | | | ULR | −1.5 | −1.9 | 0.33 | 213 | 0.99 | 94.2 |
| | | Duration (0.05) | WC1 | 0.3 | 0.0 | 0.47 | 23 | 0.80 | 88.7 |
| | | | WC2 | | | | | 1.05 | 96.1 |
| | | | CLR | 5.2 | 5.3 | 0.24 | 32 | 0.93 | 95.1 |
| | | | ULR | −2.7 | −2.7 | 0.40 | 24 | 0.98 | 95.1 |
| | | Time since cessation (0.04) | WC1 | 0.8 | 2.3 | 0.43 | 27 | 0.78 | 87.3 |
| | | | WC2 | | | | | 1.02 | 95.9 |
| | | | CLR | 8.0 | 4.2 | 0.24 | 36 | 0.93 | 95.4 |
| | | | ULR | 2.9 | −6.0 | 0.38 | 28 | 0.97 | 95.2 |
| | B | Intensity (1.39) | WC1 | 2.9 | 3.0 | 0.63 | 160 | 0.88 | 90.4 |
| | | | WC2 | | | | | 1.18 | 98.8 |
| | | | CLR | 4.6 | 4.6 | 0.15 | 326 | 0.92 | 95.9 |
| | | | ULR | −2.8 | −2.7 | 0.36 | 208 | 1.02 | 93.7 |
| | | Duration (0.05) | WC1 | −0.7 | 1.1 | 0.44 | 23 | 0.79 | 86.9 |
| | | | WC2 | | | | | 1.04 | 95.9 |
| | | | CLR | −1.8 | 0.6 | 0.24 | 31 | 0.94 | 95.4 |
| | | | ULR | −7.7 | −6.2 | 0.39 | 25 | 0.98 | 94.5 |
| | | Time since cessation (0.04) | WC1 | −1.2 | 11.2 | 0.46 | 26 | 0.82 | 88.7 |
| | | | WC2 | | | | | 1.07 | 95.4 |
| | | | CLR | −0.3 | 9.5 | 0.25 | 35 | 0.97 | 95.6 |
| | | | ULR | −2.3 | −13.2 | 0.40 | 27 | 1.01 | 95.6 |

(a) Exposure intensity was either constant over lifetime for 85% of the subjects, highly increasing for 6%, moderately decreasing for 6%, and moderately increasing for 3% (Scenario A); or was highly increasing for 50% and moderately decreasing for 50% (Scenario B).
(b) WC1, weighted Cox model with robust sandwich variance; WC2, weighted Cox model with superpopulation variance; CLR, conditional logistic regression on age; ULR, unconditional logistic regression adjusted for age as a continuous covariate.
(c) Relative bias as compared to the true effect and as compared to the estimated effect of the Cox model using the full source population data. Each of these two biases was the same for WC1 and WC2 since these models used the same regression parameter estimator.
(d) Relative efficiency as compared to the Cox model estimated on the full source population. This quantity was the same for WC1 and WC2 since these models used the same regression parameter estimator.
(e) RMSE, root mean squared error (same for WC1 and WC2, which used the same regression parameter estimator); ASE, average standard errors; SDE, standard deviation of the estimates; Cov. rate, coverage rate of the confidence intervals.
As suggested by the ratio ASE/SDE, the superpopulation variance estimator (WC2) tended to give estimates that were closer to the true variance than the robust variance estimator (WC1), which systematically underestimated the true variance. Although the superpopulation variance estimator tended to overestimate the true variance for the effect of exposure intensity when the population was made of 1000 subjects only (Tables
While the relative biases from all analytical models (WC, ULR and CLR) tended to be low and of the same magnitude in all scenarios, the relative efficiency as compared to the Cox model estimated on the full source population, as well as the accuracy in terms of RMSE, tended to be different. Indeed, in all scenarios with a 1:1 case:control ratio within a source population of 1000 subjects, the regression coefficient estimator from the WC models was much more efficient and thus also more accurate than that from CLR and ULR (Tables
Interestingly, CLR did not perform better in terms of both bias and RMSE than ULR, despite individual matching of cases and controls. ULR was actually systematically more efficient than CLR. This result may be consistent with our previous results where we found that CLR might have difficulty in separating the effects of correlated time-dependent variables
Application to occupational exposure to asbestos and pleural mesothelioma
Mesothelioma is a rare tumor mostly located in the pleura and usually caused by exposure to asbestos. The role of the different temporal patterns of occupational exposure to this substance has still to be explored using appropriate statistical methods accounting for individual changes over time in the exposure intensity
Data source
The data came from a large French population-based case-control study described in Lacourt et al.
where
Because our objective was to accurately investigate the effects of the quantitative time-related aspects of occupational exposure, all our analyses were restricted to subjects ever exposed to asbestos (68.9% in males and 20.9% in females). In addition, because the sample size for females was too small to ensure adequate statistical power and accurate estimates in separate multiple regression analyses of this group
| Characteristics | Cases | Controls |
|---|---|---|
| Age at diagnosis / interview (years) | 67.0 (10.0) | 65.9 (6.3) |
| Year of birth | 1931.1 (10.0) | 1931.0 (9.3) |
| Age at first exposure (years) | 21.0 (7.1) | 22.6 (8.1) |
| Mean exposure intensity over lifetime (fibers/ml) (a) | 0.62 (1.43) | 0.21 (0.44) |
| Total exposure duration (years) | 27.8 (12.9) | 25.0 (14.1) |
| Time since last exposure (years) | 16.9 (13.4) | 17.4 (14.6) |

Results from the French case-control study on mesothelioma, 1987–2006.
(a) Measured by the mean index of exposure (MIE).
Analytical methods used to analyze the casecontrol data on pleural mesothelioma
To derive the weights proposed in the WC models (Equation 2), we first estimated the age-conditional probabilities
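| Age | (a) | (b) |
|---|---|---|
| 0–44 | 0.1 | 0.000942 |
| 45–49 | 0.4 | 0.000941 |
| 50–54 | 1.2 | 0.000937 |
| 55–59 | 2.8 | 0.000925 |
| 60–64 | 5.2 | 0.000897 |
| 65–69 | 8.0 | 0.000845 |
| 70–74 | 10.5 | 0.000765 |
| 75–79 | 13.2 | 0.000660 |
| 80–84 | 15.2 | 0.000528 |
| 85–89 | 14.5 | 0.000376 |
| 90–94 | 11.6 | 0.000231 |
| 95 or more | 11.5 | 0.000115 |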
For comparison purposes, the data were further analyzed with ULR, which is the standard method to analyze frequency-matched case-control data, as well as with CLR. Age was the time axis in WC1 and WC2 models, and a continuous covariate in ULR and CLR. We did not perform left-truncation in WC1 and WC2 models, thus assuming that all subjects of the source population were passively followed up for PM since birth. The matching factor, birth year, was a quantitative covariate in WC1, WC2, and ULR, and was the stratification variable (in 5-year groups) in CLR. Using each of the four approaches (WC1, WC2, CLR, and ULR), we estimated the effects of intensity and duration of occupational asbestos exposure, the age at first exposure, and time since last exposure, using the same combination of quantitative exposure variables as in Models 1–3 of the simulation study. All the effects of these variables were therefore assumed to be linear. Despite our recent results that suggested that these effects were not linear on the logit of PM
Results
Table
| Model | Exposure variables (a) | Unit | Method (b) | Estimate (c) | 95% CI |
|---|---|---|---|---|---|
| 1 | Intensity | 1.0 fiber/ml | WC1 | 1.75 | 1.66–1.84 |
| | | | WC2 | | 1.65–1.85 |
| | | | CLR | 2.55 | 2.29–2.83 |
| | | | ULR | 2.33 | 2.14–2.54 |
| | Duration | 14 years | WC1 | 1.32 | 1.24–1.40 |
| | | | WC2 | | 1.23–1.41 |
| | | | CLR | 1.18 | 1.12–1.24 |
| | | | ULR | 1.17 | 1.12–1.23 |
| 2 | Intensity | 1.0 fiber/ml | WC1 | 1.73 | 1.64–1.82 |
| | | | WC2 | | 1.63–1.83 |
| | | | CLR | 2.49 | 2.24–2.76 |
| | | | ULR | 2.31 | 2.12–2.52 |
| | Duration | 14 years | WC1 | 1.19 | 1.12–1.27 |
| | | | WC2 | | 1.11–1.28 |
| | | | CLR | 1.08 | 1.02–1.14 |
| | | | ULR | 1.10 | 1.05–1.15 |
| | Age at first exposure | 8 years | WC1 | 0.63 | 0.58–0.68 |
| | | | WC2 | | 0.57–0.70 |
| | | | CLR | 0.66 | 0.61–0.72 |
| | | | ULR | 0.77 | 0.73–0.82 |
| 3 | Intensity | 1.0 fiber/ml | WC1 | 1.74 | 1.65–1.83 |
| | | | WC2 | | 1.64–1.84 |
| | | | CLR | 2.53 | 2.28–2.82 |
| | | | ULR | 2.33 | 2.14–2.53 |
| | Duration | 14 years | WC1 | 1.90 | 1.68–2.14 |
| | | | WC2 | | 1.64–2.19 |
| | | | CLR | 1.41 | 1.27–1.57 |
| | | | ULR | 1.41 | 1.29–1.53 |
| | Time since last exposure | 14 years | WC1 | 1.55 | 1.37–1.75 |
| | | | WC2 | | 1.34–1.79 |
| | | | CLR | 1.24 | 1.11–1.39 |
| | | | ULR | 1.25 | 1.14–1.37 |

Results from the French case-control study on mesothelioma, 1987–2006.
(a) All the exposure variables were time-dependent in WC1 and WC2 models, and fixed at their value at diagnosis/interview in CLR and ULR. Intensity was measured by the mean index of exposure (MIE).
(b) WC1, weighted Cox model with robust sandwich variance; WC2, weighted Cox model with superpopulation variance; both WC1 and WC2 used age as the time axis and included birth year as a quantitative covariate; ULR, unconditional logistic regression including age at diagnosis/interview and birth year as quantitative covariates; CLR, conditional logistic regression stratified on birth year group (5-year groups), and including age at diagnosis/interview as a quantitative covariate.
(c) Hazard ratio estimates for WC1 and WC2 (same value for WC1 and WC2) and odds ratio estimates for CLR and ULR, adjusted for age and birth year, and corresponding 95% confidence interval (CI).
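The table reports each effect per chosen unit (for example per 14 years of duration), and WC1 and WC2 share the same point estimate while differing in interval width. A minimal sketch of how such entries can be derived from a fitted coefficient is given below. The coefficient and the two standard errors are hypothetical values chosen for illustration, not results from the study, and the function name hr_with_ci and the Wald-type interval are assumptions.

```r
# Hazard (or odds) ratio for a `unit` increase in the covariate, with a
# Wald-type confidence interval computed on the log scale.
hr_with_ci <- function(beta, se, unit = 1, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(unit * c(estimate = beta,
               lower    = beta - z * se,
               upper    = beta + z * se))
}

# Hypothetical coefficient per year of exposure, with two standard errors:
# one from the robust variance (WC1), one from the superpopulation variance (WC2).
beta      <- 0.02
se_robust <- 0.0022
se_super  <- 0.0025

wc1 <- hr_with_ci(beta, se_robust, unit = 14)
wc2 <- hr_with_ci(beta, se_super,  unit = 14)
round(rbind(WC1 = wc1, WC2 = wc2), 2)
```

Only the standard error changes between the two rows, so the estimate is identical and the WC2 interval is the wider of the two.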
As expected, the associations between all asbestos exposure variables and PM were significant with each of the four analytical approaches (Table
The 95% CI from WC1 and WC2 were almost identical (Table
The strongest contrasts between the estimates from the WC models and ULR or CLR were for the effect of exposure intensity. Indeed, the estimated effect of intensity was systematically weaker with the WC models than with ULR or CLR, with even non-overlapping 95% CI. Note that, as for Scenario A in our simulation study, CLR provided the strongest estimates for the strong effect of intensity. By contrast, for the effects of duration, age at initiation, and time since last exposure, the strongest estimates were provided by the WC models, but the discrepancies with ULR and CLR were weaker than for intensity.
There are different potential explanations for the discrepancies between the results from the Cox (WC1 and WC2) and logistic (CLR and ULR) models. First, the adjustment for age was largely different in the two series of models. While age was the time axis in the Cox models, and was therefore adequately adjusted for in both WC1 and WC2, it was included as a continuous covariate in both logistic models. This assumed that its effect was linear on the logit, which is actually not true
Discussion
Our simulation results suggest that the superpopulation variance estimator
Our simulation results also confirmed that the WC model is an alternative method for estimating the effects of time-varying exposure variables in case-control studies. In particular, when compared to standard logistic regression that did not dynamically account for the different values of covariates over lifetime, the WC model tended to provide more accurate estimates of the effects of variables for which an important percentage of subjects had time-varying values over lifetime, such as intensity. However, the superiority of the WC model did not persist when more than one control was selected from the risk set. Our results also suggest that the estimates from the WC model are not more affected by correlations between time-dependent covariates included in the model than logistic regression with fixed-in-time covariates. Note that the modelling of the exposure in the WC model could further be improved by incorporating more complex functions of the trajectory of the exposure over time that have recently been proposed
The application of the WC model requires estimating the age-conditional probabilities in the source population for population-based case-control studies, or in the full cohort for nested case-control studies. In our application to population-based case-control data on PM, these probabilities were estimated from health statistics on the general French male population. Yet, our analyses were restricted to ever-exposed males, who have a much higher probability of developing PM than the general French male population. Further studies are needed to investigate the impact of biased estimates of the age-conditional probabilities on the WC estimates. Accounting for uncertainty in the weight estimates could further improve the variance estimator
The WC model with time-dependent variables also requires information on the values of the covariates at each event time, that is, at each age at diagnosis among cases. Such information may be missing, and different approaches could be considered to impute these values. However, further studies are needed to assess the impact of measurement errors of the time-dependent covariate values. Indeed, mismodeling the covariates has already been shown to induce bias in the sandwich variance estimator based on dfbetas of the unweighted Cox model for nested case-control analysis
Conclusion
We believe that the WC model using the superpopulation variance estimator may provide an alternative analytical method for case-control analyses with detailed information on the history of the exposure of interest, especially if a large proportion of the subjects have a time-varying exposure intensity over lifetime, and if only one control is available for each case.
Abbreviations
ASE: Average standard errors; CI: Confidence interval; CLR: Conditional logistic regression; JEM: Job-exposure matrix; MIE: Mean index of exposure; PM: Pleural mesothelioma; RMSE: Root mean squared error; SDE: Standard deviation of the estimates; ULR: Unconditional logistic regression; WC: Weighted Cox model.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HG drafted the manuscript, programmed and ran the simulation study, analyzed the case-control data on mesothelioma, and contributed to the interpretation of all the results. AL provided the case-control data on mesothelioma and revised the manuscript. KL drafted and revised the manuscript, designed the simulation study, and supervised HG in all stages. All authors read and approved the final manuscript.
Acknowledgements
This research was mostly supported by grants from the CHUM research center awarded to Dr. Karen Leffondre. Dr. Aude Lacourt was supported by a postdoctoral fellowship from the Fondation de France. The collection of data was partly supported by the French Institute for Public Health Surveillance.