An exploration of the missing data mechanism in an Internet based smoking cessation trial
1 MRC Biostatistics Unit, Cambridge, UK
2 Behavioural Science Group, Institute of Public Health, University of Cambridge, Cambridge, UK
BMC Medical Research Methodology 2012, 12:157 doi:10.1186/1471-2288-12-157
Published: 15 October 2012
Missing outcome data are very common in smoking cessation trials. It is often assumed that all such missing data are from participants who have been unsuccessful in giving up smoking (“missing=smoking”). Here we use data from a recent Internet-based smoking cessation trial to investigate which of a set of baseline variables, chosen a priori, are predictive of missingness, and to assess the evidence for and against the “missing=smoking” assumption.
We use a selection model, which models the probability that the outcome is observed given the outcome and other variables. The selection model includes a parameter for which zero indicates that the data are Missing at Random (MAR) and large values indicate “missing=smoking”. We examine the evidence for the predictive power of baseline variables in the context of a sensitivity analysis. We use data on the number and type of attempts made to obtain outcome data in order to estimate the association between smoking status and the missing data indicator.
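As a minimal sketch of how such a selection model behaves, assume for illustration only a logistic model with no baseline covariates, logit P(observed | quit) = alpha + delta * quit. The names `alpha`, `delta` and `p_quit`, and all numeric values, are hypothetical and are not taken from the trial. Setting `delta` to zero makes missingness independent of the outcome, while a large `delta` means that nearly all quitters are observed, so missing participants are almost all smokers.

```python
import math


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_quit_given_missing(p_quit: float, alpha: float, delta: float) -> float:
    """P(quit | outcome missing) under an assumed logistic selection model.

    Assumed model (illustrative, no covariates):
        logit P(observed | quit) = alpha + delta * quit
    """
    p_obs_if_quit = expit(alpha + delta)
    p_obs_if_smoking = expit(alpha)
    # Bayes' rule: joint probabilities of (missing, quit) and (missing, smoking)
    missing_and_quit = (1.0 - p_obs_if_quit) * p_quit
    missing_and_smoking = (1.0 - p_obs_if_smoking) * (1.0 - p_quit)
    return missing_and_quit / (missing_and_quit + missing_and_smoking)


if __name__ == "__main__":
    # Hypothetical marginal quit probability and intercept
    for delta in (0.0, 1.0, 3.0, 20.0):
        p = prob_quit_given_missing(p_quit=0.2, alpha=0.5, delta=delta)
        print(f"delta={delta:5.1f}  P(quit | missing)={p:.4f}")
```

In this sketch, intermediate values of `delta` correspond to MNAR assumptions that are less extreme than “missing=smoking”.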
We apply our methods to the iQuit smoking cessation trial data. From the sensitivity analysis, we obtain strong evidence that older participants are more likely to provide outcome data. The model for the number and type of attempts to obtain outcome data confirms that age is a good predictor of missing data. There is weak evidence from this model that participants who have successfully given up smoking are more likely to provide outcome data, but this evidence does not support the “missing=smoking” assumption. The probability that participants with missing outcome data are not smoking at the end of the trial is estimated to be between 0.14 and 0.19.
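To illustrate what an estimate in this range could mean for a trial arm, the following sketch computes the expected overall quit rate when each missing participant is assigned a given probability of not smoking. The counts (`n_observed`, `n_quit_observed`, `n_missing`) are hypothetical and are not the iQuit data; only the bounds 0.14 and 0.19 come from the text above.

```python
def overall_quit_rate(n_observed: int, n_quit_observed: int,
                      n_missing: int, p_quit_missing: float) -> float:
    """Expected quit rate over all randomised participants, treating each
    missing participant as a quitter with probability p_quit_missing."""
    expected_quitters = n_quit_observed + n_missing * p_quit_missing
    return expected_quitters / (n_observed + n_missing)


if __name__ == "__main__":
    # Hypothetical arm: 600 with outcome data (120 quit), 400 missing
    n_observed, n_quit_observed, n_missing = 600, 120, 400
    scenarios = [
        ("missing=smoking", 0.0),
        ("lower estimate", 0.14),
        ("upper estimate", 0.19),
    ]
    for label, p in scenarios:
        rate = overall_quit_rate(n_observed, n_quit_observed, n_missing, p)
        print(f"{label:16s} p={p:.2f}  overall quit rate={rate:.3f}")
```

With these made-up counts, “missing=smoking” gives a noticeably lower overall quit rate than either bound, which is why the choice of assumption can matter for the reported outcome.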
Those conducting smoking cessation trials who wish to perform an analysis that assumes the data are MAR should collect baseline variables thought to be good predictors of missing data and incorporate them into their models, in order to make this assumption more plausible. However, they should also consider Missing Not at Random (MNAR) models that make or allow for less extreme assumptions than “missing=smoking”.