Research article

# Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

Jochen Hardt*, Max Herke and Rainer Leonhart

BMC Medical Research Methodology 2012, 12:184  doi:10.1186/1471-2288-12-184

### The meaning of MAR, is MAR(X) really MAR, and bias of complete case analysis

Jonathan Bartlett   (2014-11-04 16:18)  London School of Hygiene and Tropical Medicine

The paper by Hardt, Herke and Leonhart is a very useful investigation into the performance of multiple imputation in small samples and how it varies with the inclusion of auxiliary variables. Their recommendations and conclusions are welcome, particularly as researchers increasingly face datasets with larger and larger p relative to n.

I have comments on the definition of MAR, on whether the authors' MAR(X) mechanism is really MAR, and on the bias of complete case analysis.

1. The meaning of MAR and terminology

On page 3, the authors write "No full MAR mechanism was applied here, because in real data, finding a variable that completely explains the process of missingness is unlikely.", and then describe their MAR(Y) mechanism, in which missingness is determined by the value of Y+c, where c~N(0,1), as 50% MCAR and 50% MAR, following an earlier paper by Allison. MAR does not mean that missingness follows a deterministic mechanism that is a function of observed values. Rather, it means that, conditional on the observed values, missingness in a variable no longer depends on the value of that variable. Thus I would label the authors' MAR(Y) mechanism simply as MAR, rather than as 50% MCAR and 50% MAR, a description which I fear might lead to confusion.
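As a small illustration of why such a mechanism is simply MAR, here is a minimal Python sketch (my own construction, not the authors' code, and the 50% cut-off is an assumed threshold): missingness in X is driven by Y + c with c ~ N(0,1), so conditional on the observed Y, whether X is missing depends only on the independent noise c and not on X itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Simple regression setup: Y depends on X.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# MAR(Y)-style mechanism: missingness in X is driven by Y + c,
# c ~ N(0, 1); here X is deleted in the half of cases with the
# largest Y + c (assumed thresholding; the paper's exact cut-off
# may differ).
c = rng.normal(size=n)
miss = (y + c) > np.median(y + c)
x_obs = np.where(miss, np.nan, x)

# Given Y (always observed), missingness depends only on c,
# which is independent of X: this is MAR, not partly MNAR.
print(np.isnan(x_obs).mean())  # roughly 0.5 missing
```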

2. Is MAR(X) really MAR?

If I understand the authors' data generation scheme correctly, I do not believe their MAR(X) mechanism is truly MAR. I understand the scheme used to be that missingness in X1 and Z_a was determined by the value of d1=X2+c, and missingness in X2 depended on X1+c, where these steps were "carried out separately for each variable", which I presume means the c~N(0,1) variable was newly generated for determining missingness in X2. If I have understood the data generation correctly, the missing data are not MAR. For example, missingness in X2 depends on X1, but X1 is sometimes itself missing. More formally, in non-monotone missingness settings, MAR means that the probability of a pattern being realised depends only on the observed values in that particular pattern - see Robins and Gill, Statistics in Medicine, 1997, 16:39-56. As a result, MI would be expected to give some bias under the authors' MAR(X) mechanism. I have tried simulating a large (n=100,000) dataset, with no auxiliary variables, making missingness in X1 and X2 according to what I understand the MAR(X) mechanism to be. MI then gave estimates with only a slight downward bias for the coefficients of X1 and X2, so perhaps in this particular setup the problem (that the data are not really MAR) is not such a big deal.
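For concreteness, the following Python sketch implements my reading of the MAR(X) scheme (the correlation structure, thresholds and missingness rates are my own assumptions, not taken from the paper). It shows that rows arise in which X2 is missing while the X1 value that drove that missingness is itself unobserved - exactly the situation that violates the Robins-Gill pattern condition.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Correlated covariates and outcome (illustrative strengths only).
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)

# My reading of MAR(X): missingness in X1 is driven by X2 + c1,
# missingness in X2 by X1 + c2, with the noise freshly generated
# for each variable (assumed 30% missingness per variable).
c1 = rng.normal(size=n)
c2 = rng.normal(size=n)
miss1 = (x2 + c1) > np.quantile(x2 + c1, 0.7)
miss2 = (x1 + c2) > np.quantile(x1 + c2, 0.7)

# Non-monotone patterns: rows where X2 is missing although the X1
# value that determined that missingness is itself missing too.
both = (miss1 & miss2).mean()
print(both)  # a non-negligible fraction of rows
```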

3. Why is CC biased under MAR(X)?

As the authors note in their introduction, complete case analysis (CC) can be unbiased in certain situations - specifically when missingness is independent of the outcome variable, conditional on the covariates. In the MAR(X) mechanism, missingness is generated in a way which depends only on X1 and X2, and not on Y, so CC ought to be unbiased here. Yet Table 2 shows that CC is biased under MAR(X) - can the authors shed any light on why this is the case? (In my large simulated dataset under MAR(X), I got no bias in CC.)
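A quick numeric sketch of the general point (my own toy setup, not the authors' design): when missingness depends only on the covariates and not on Y, ordinary least squares on the complete cases recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# True model: y = x1 + x2 + error (assumed coefficients of 1).
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)

# Missingness depends only on the covariates, never on Y.
c1, c2 = rng.normal(size=n), rng.normal(size=n)
keep = ~(((x2 + c1) > 1.0) | ((x1 + c2) > 1.0))

# Complete-case OLS fit.
X = np.column_stack([np.ones(keep.sum()), x1[keep], x2[keep]])
beta = np.linalg.lstsq(X, y[keep], rcond=None)[0]
print(beta)  # approximately [0, 1, 1]
```

The reason is that the conditional distribution of Y given (X1, X2) is unchanged by selection on functions of the covariates alone, so the regression fitted to the complete cases targets the same parameters as in the full data.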

Competing interests

None


### Ridge parameter

Stef van Buuren   (2013-02-15 14:56)  Netherlands Organization for Applied Scientific Research TNO

The paper by Hardt, Herke and Leonhart is a welcome addition to the literature. It warns against simplistic approaches that throw just anything into the imputation model. While the imputation model is generally robust against the inclusion of junk variables, the paper clearly demonstrates that we should not push this to the edge. In general, building the imputation model requires appropriate care. My personal experience is that it is not beneficial to include more than - say - 25 well-chosen variables in the imputation model.

In their simulations the authors investigate cases where the number of variables specified in the imputation model exceeds the number of cases. Many programs break down in this situation, but MICE will run because it uses ridge regression instead of the usual OLS estimate. The price for this increased computational stability is - as confirmed by Hardt et al. - that the parameter estimates will be biased towards zero. It is therefore likely that some of the bias observed by the authors is not intrinsic to PMM, but rather due to the setting of the ridge parameter (the default value 1E-5 may easily be changed, e.g. as mice(..., ridge = 1E-6)). Would a tighter ridge setting (e.g., 1E-6 or 1E-7) appreciably reduce the bias?
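The shrinkage effect can be illustrated outside of mice with a small Python sketch (plain ridge regression via the penalised normal equations; mice scales its ridge parameter internally, so the numbers are only indicative of the direction of the effect): the larger the ridge parameter, the more the coefficients are pulled towards zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 10  # illustrative dimensions
X = rng.normal(size=(n, p))
beta_true = np.ones(p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    # Penalised normal equations: (X'X + lam * I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# A tiny penalty leaves the OLS solution essentially untouched;
# a large one visibly shrinks the coefficients towards zero.
for lam in (1e-5, 1.0, 10.0):
    b = ridge(X, y, lam)
    print(lam, np.round(np.linalg.norm(b), 3))
```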

The '1 out of 3 of the complete cases' rule is interesting and easily remembered. However, a complication in practice is that there are often no complete cases in real data, especially in merged datasets. What would the authors think of the slightly more liberal rule 'n/3 variables'?

Stef van Buuren

Competing interests

Author of the mice package.
