
Open Access Highly Accessed Research article

Variable selection under multiple imputation using the bootstrap in a prognostic study

Martijn W Heymans1,2,3,4,8*, Stef van Buuren5,6, Dirk L Knol7, Willem van Mechelen2,3,4 and Henrica CW de Vet4

Author Affiliations

1 Vrije Universiteit, Institute for Health Sciences, Department of Methodology and Applied Biostatistics, Amsterdam, The Netherlands

2 Body@Work, Research Center Physical Activity, Work and Health, TNO-VUmc, Amsterdam, The Netherlands

3 Department of Public and Occupational health, VU University Medical Center, Amsterdam, The Netherlands

4 Institute for Research in Extramural Medicine, VU University Medical Center, Amsterdam, The Netherlands

5 TNO Quality of Life, Leiden, The Netherlands

6 Department of Methodology and Statistics, University of Utrecht, The Netherlands

7 Clinical Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands

8 EMGO-Institute (Metropolitan building), VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands


BMC Medical Research Methodology 2007, 7:33  doi:10.1186/1471-2288-7-33

Published: 13 July 2007

Abstract

Background

Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology that combines MI with bootstrapping techniques for studying prognostic variable selection.

Method

In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Across the outcome and prognostic variables, the proportion of missing data ranged from 0% to 48.1%. We used four methods to investigate the influence of sampling variation and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative ability and calibration of the prognostic models developed by the four methods were assessed at different inclusion levels.
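As an illustration of the inclusion frequency described above, the following minimal sketch computes, for each candidate variable, the proportion of fitted models in which it was selected, and then keeps the variables that reach a given inclusion level. It is not the authors' implementation: the function names, the variable names and the toy selection results are hypothetical, and the variable selection step that would produce one set of selected variables per bootstrap sample and imputed data set is assumed to have been run already.

```python
from collections import Counter


def inclusion_frequencies(selected_sets, candidates):
    """Proportion of fitted models in which each candidate variable appears.

    selected_sets: one collection of selected variable names per fitted model
    (for example, one per combination of bootstrap sample and imputed data set).
    """
    n_models = len(selected_sets)
    if n_models == 0:
        raise ValueError("need at least one fitted model")
    # Count each variable at most once per model.
    counts = Counter(v for s in selected_sets for v in set(s))
    return {v: counts[v] / n_models for v in candidates}


def select_at_level(freqs, level):
    """Keep variables whose inclusion frequency is at least `level` (0 to 1)."""
    return [v for v, f in freqs.items() if f >= level]


# Hypothetical candidate variables and selection results from four models.
candidates = ["age", "pain_intensity", "sick_leave"]
selected_sets = [
    {"age", "pain_intensity"},
    {"pain_intensity"},
    {"pain_intensity", "sick_leave"},
    {"age", "pain_intensity"},
]

freqs = inclusion_frequencies(selected_sets, candidates)
print(freqs)
print(select_at_level(freqs, 0.5))
```

In this sketch an inclusion level of 0 keeps every candidate variable, which corresponds to the full model.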

Results

We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, at inclusion levels ranging from 0% (full model) to 90%, bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found.
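For readers unfamiliar with the c-index, the sketch below shows how an apparent (uncorrected) c-index can be computed, assuming a binary outcome coded 0/1 and a predicted risk per subject. The function name and the toy data are hypothetical, and the bootstrap correction for optimism that underlies the values reported above is not shown.

```python
def c_index(y_true, risk):
    """Apparent c-index for a binary outcome.

    Proportion of (event, non-event) pairs in which the subject with the event
    has the higher predicted risk; tied predictions count as one half.
    """
    events = [r for y, r in zip(y_true, risk) if y == 1]
    nonevents = [r for y, r in zip(y_true, risk) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))


# Toy data: two subjects with the outcome, two without.
y = [1, 0, 1, 0]
predicted_risk = [0.8, 0.3, 0.4, 0.4]
print(c_index(y, predicted_risk))  # 0.875
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination.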

Conclusion

We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values.