MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK

Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK

Abstract

Background

Multiple imputation is often used for missing data. When a model contains as covariates more than one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a regression with outcome Y and covariates X and X². In 'passive imputation' a value X* is imputed for X and then X² is imputed as (X*)². A recent proposal is to treat X² as 'just another variable' (JAV) and impute X and X² under multivariate normality.

Methods

We use simulation to investigate the performance of three methods that can easily be implemented in standard software: 1) linear regression of X on Y, with passive imputation of X²; 2) the same regression but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study.

Results

JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term and the covariate is missing completely at random. When the covariate is missing at random, JAV may be biased, and it can be badly biased when the analysis model is logistic regression. Passive imputation is biased in many of the settings considered; PMM reduces, but does not always remove, this bias.

Conclusions

Given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.

Background

In most medical and epidemiological studies some of the data that should have been collected are missing. This presents problems for the analysis of such data. One approach is to restrict the analysis to complete cases, i.e. those subjects for whom none of the variables in the analysis model are missing. Data are said to be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR), according to whether the probability of missingness does not depend on the data, depends only on observed data, or depends also on unobserved data.

A method for handling missing data that gives valid inference under MAR and which is more efficient than just using complete cases is multiple imputation (MI).

This article is concerned with the use of MI when the analysis model includes as covariates more than one function of the same variable and this variable can be missing. Such situations arise when the analysis model includes both linear and higher-order terms of the same variable, or when the model includes an interaction term. This is the case, for example, when non-linear associations are explored using fractional polynomials or splines. We consider two settings in detail: 1) linear regression of Y on X and X², and 2) linear regression of Y on X₁, X₂ and their interaction X₁X₂.

In 'passive imputation', an imputation model is specified for the distribution of X given the other variables; a value X* is drawn from this model for each missing X, and the corresponding X² or X₁X₂ is then computed deterministically from the imputed values, e.g. X² is imputed as (X*)².

It is possible that passive imputation might be improved by using predictive mean matching (PMM), in which each value generated by the imputation model is replaced by the closest observed value.

Passive imputation and PMM ensure that the imputed values conform to the known functional relation between the covariates, e.g. that the imputed value of X² is equal to the square of the imputed value of X (or that the imputed value of X₁X₂ equals the product of the imputed values of X₁ and X₂). Under JAV, the imputed values of X and X² will not, in general, be consistent with one another, e.g. X might be imputed as 2 while X² is imputed as 5. However, Von Hippel argued that this does not matter for estimation of the parameters of the analysis model. We shall examine Von Hippel's argument in detail in the Results section.

In the present article we investigate, using simulation, the performance of three methods easily implemented in standard software -- passive imputation, PMM and JAV -- in the two settings described above. We look at bias of parameter estimators and coverage of confidence intervals. In addition to considering linear regression analysis models, we also look at the logistic regression of binary Y on X and X². Von Hippel justified the use of JAV for a linear regression analysis model, but suggested that it might also work well in the setting of logistic regression, because the logistic link function is fairly linear except in regions where the fitted probability is near zero or one. In the Methods section, we formally describe the three approaches and the simulations we performed to assess their performance. We also describe a dataset from the EPIC study on which we illustrate the methods. In the Results, we present a theoretical investigation of the properties of JAV, showing that although JAV gives consistent estimation for linear regression under MCAR, it will not, in general, under MAR. Results from the simulations and from applying the methods to the EPIC dataset are also described there. These results are followed by a discussion and conclusions.

Methods

Three imputation methods

We begin by describing passive imputation, PMM and JAV for the setting of linear regression of Y on X and X². We then describe the modifications necessary for regression of Y on X₁, X₂ and X₁X₂, and for logistic regression.

Let _{i }
_{i }
_{1}, _{1}),...,(_{n}, Y_{n}
_{i }
_{i }
_{i }
_{1 }denote the number of complete cases, and **W**
_{i}
_{i }
_{i}
^{T }
**W **
^{T }

Passive imputation

In the approach we call 'linear imputation model with passive imputation of X²' (or just 'passive imputation'), the linear regression model X_i = α₀ + α₁Y_i + ε_i, with ε_i ~ N(0, σ²), is fitted to the n₁ complete cases. If the parameters (α₀, α₁) and σ² are treated as a priori independent with joint density proportional to σ⁻², then the posterior distribution of σ² is a scaled inverse chi-squared distribution and that of (α₀, α₁) given σ² is normal. A value of the parameters is drawn from this posterior, each missing X_i is then imputed from the fitted regression on Y_i with added normal noise, and X_i² is imputed passively as the square of the imputed X_i.
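As a concrete illustration, a single passive-imputation draw can be sketched in Python as follows (our own sketch, not the authors' code; the function and variable names are illustrative):

```python
import numpy as np

def passive_impute(x_obs, y_obs, y_mis, rng):
    """One passive-imputation draw: fit X on Y to the complete cases,
    draw (alpha, sigma^2) from their posterior under the prior
    proportional to sigma^{-2}, impute missing X with noise, and
    impute X^2 passively as the square of the imputed X."""
    n1 = len(x_obs)
    W = np.column_stack([np.ones(n1), y_obs])          # design matrix (1, Y)
    alpha_hat, *_ = np.linalg.lstsq(W, x_obs, rcond=None)
    resid = x_obs - W @ alpha_hat
    # sigma^2 drawn from its scaled inverse chi-squared posterior
    sigma2 = resid @ resid / rng.chisquare(n1 - 2)
    # (alpha0, alpha1) drawn from their conditional normal posterior
    alpha = rng.multivariate_normal(alpha_hat, sigma2 * np.linalg.inv(W.T @ W))
    x_imp = alpha[0] + alpha[1] * y_mis + rng.normal(0.0, np.sqrt(sigma2), len(y_mis))
    return x_imp, x_imp ** 2                           # X^2 imputed passively
```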

PMM

The approach we call 'linear imputation model with predictive mean matching' (or just 'PMM') is the same as passive imputation up to the generation of a predicted value for each missing X_i. Rather than using this predicted value (plus noise) directly, the imputed value of X_i is taken to be the observed value X_j of the complete case whose own prediction is closest to that of subject i; X_i² is then imputed as X_j². Because every imputed value is an observed value, imputations cannot fall outside the range of the observed data.
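A matching sketch of PMM (ours, illustrative), under the simplifying assumption that donors are matched on predicted means; published PMM variants differ in exactly how the matching is done:

```python
import numpy as np

def pmm_impute(x_obs, y_obs, y_mis, rng):
    """PMM: same posterior draw as passive imputation, but each missing X
    is imputed by the observed X whose prediction is closest to its own."""
    n1 = len(x_obs)
    W = np.column_stack([np.ones(n1), y_obs])
    alpha_hat, *_ = np.linalg.lstsq(W, x_obs, rcond=None)
    resid = x_obs - W @ alpha_hat
    sigma2 = resid @ resid / rng.chisquare(n1 - 2)
    alpha = rng.multivariate_normal(alpha_hat, sigma2 * np.linalg.inv(W.T @ W))
    pred_mis = alpha[0] + alpha[1] * y_mis              # predictions, drawn params
    pred_obs = alpha_hat[0] + alpha_hat[1] * y_obs      # predictions, fitted params
    # donor = complete case whose prediction is nearest to each missing case's
    idx = np.abs(pred_obs[None, :] - pred_mis[:, None]).argmin(axis=1)
    x_imp = x_obs[idx]
    return x_imp, x_imp ** 2                            # X^2 still passive
```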

JAV

In the JAV approach, (Y, X, X²) is assumed to be jointly normally distributed:

(Y, X, X²)ᵀ ~ N(μ, Σ),  μ = (μ₁, μ₂, μ₃)ᵀ,  Σ = (σ_jk), j, k = 1, 2, 3.  (1)

Expression (1) can equivalently be written as

Y ~ N(μ₁, σ₁₁)  (2)

(X, X²)ᵀ | Y ~ N((β₂₀ + β₂₁Y, β₃₀ + β₃₁Y)ᵀ, Ω),  (3)

where (for j = 2, 3) β_j1 = σ_j1/σ₁₁ and β_j0 = μ_j − β_j1μ₁, and Ω = (ω_jk), j, k = 2, 3, is the residual covariance matrix of (X, X²) given Y. Model (1) is fitted to the observed data by maximum likelihood, the parameter estimates are perturbed by drawing from an approximation to their posterior distribution, and missing values of X and X² are then generated from distribution (3) using the perturbed values of the parameters. As X² is imputed as a separate variable, the imputed value of X² will not, in general, equal the square of the imputed value of X.
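A sketch of a JAV draw (ours, illustrative), simplified to plug in complete-case moments for the parameters rather than drawing them from a posterior:

```python
import numpy as np

def jav_impute(x_obs, y_obs, y_mis, rng):
    """JAV: treat X and X^2 as separate variables, fit the trivariate
    normal for (Y, X, X^2) by its sample moments, and impute (X, X^2)
    jointly from the conditional bivariate normal given Y (eq. (3))."""
    Z = np.column_stack([y_obs, x_obs, x_obs ** 2])     # (Y, X, X^2)
    mu = Z.mean(axis=0)
    S = np.cov(Z, rowvar=False)
    b = S[1:, 0] / S[0, 0]                              # slopes on Y
    Omega = S[1:, 1:] - np.outer(b, b) * S[0, 0]        # residual covariance
    cond_mean = mu[1:] + np.outer(y_mis - mu[0], b)
    imp = cond_mean + rng.multivariate_normal(np.zeros(2), Omega, size=len(y_mis))
    return imp[:, 0], imp[:, 1]                         # imputed X and X^2
```

Note that the returned imputed X² is drawn as a separate variable, so it will generally differ from the square of the imputed X.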

The methods described above need only minor adaptation for the setting of linear regression of Y on X₁, X₂ and their interaction X₁X₂, and for logistic regression. For JAV in the interaction setting, (Y, X₁, X₂, X₁X₂) is assumed to be multivariate normally distributed.

Simulation studies

Linear regression with quadratic term

In all our linear regression simulation studies, a sample size of 200 was assumed and 1000 simulated datasets were created. For each simulated dataset, we generated 200 values of X from a normal distribution with mean 2 and variance 1, and then generated Y from a normal distribution with mean 2X + X² and with variance chosen to make R² equal to 0.1, 0.5 or 0.8. Although R² values greater than 0.5 are uncommon in medical studies, we wanted also to investigate the performance of methods in extreme situations. The top two rows of Figure 1 show typical datasets.
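The residual variance needed to hit a target R² follows directly from the variance of the mean function 2X + X²; a small helper (ours, illustrative) makes the calculation explicit:

```python
import numpy as np

def residual_sd_for_r2(x, r2):
    """Return the residual SD so that Y = 2X + X^2 + N(0, sd^2)
    has (approximately) the target R^2 for the given draws of X."""
    signal = 2 * x + x ** 2
    return np.sqrt(signal.var() * (1 - r2) / r2)
```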

**Figure 1. Typical datasets for normally or log-normally distributed X (each with mean 2 and variance 1), normally distributed Y with mean 2X + X², and R² = 0.1, 0.5 or 0.8.** Dotted line shows expected value of Y given X.

Missingness was then imposed on these data. Let expit(a) denote {1 + exp(−a)}⁻¹. Y was always observed; X was made observed with probability expit(θ₀ + θ₁Y). For MCAR, θ₁ = 0; for MAR, θ₁ = −1/SD(Y), so that individuals with larger Y were less likely to have X observed. In both cases θ₀ was chosen to fix the marginal probability of observing X.
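This missingness mechanism is straightforward to reproduce; in the sketch below (ours, illustrative), θ₀ is found by bisection so that the marginal probability of observing X equals a chosen value:

```python
import numpy as np

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def impose_missingness(y, theta1, p_obs, rng):
    """Return a boolean 'observed' indicator for X, with
    P(observe X) = expit(theta0 + theta1 * Y); theta0 is tuned by
    bisection so the marginal observation probability equals p_obs.
    theta1 = 0 gives MCAR; theta1 != 0 gives MAR depending on Y."""
    lo, hi = -20.0, 20.0
    for _ in range(60):                 # bisection on theta0
        mid = 0.5 * (lo + hi)
        if expit(mid + theta1 * y).mean() < p_obs:
            lo = mid
        else:
            hi = mid
    theta0 = 0.5 * (lo + hi)
    return rng.uniform(size=len(y)) < expit(theta0 + theta1 * y)
```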

For each simulated dataset we applied the three imputation methods, passive imputation ('Passive'), PMM and JAV, and compared them with the analysis of the full data before any values were deleted ('CData') and with the complete-case analysis ('CCase').

Finally, we instead generated X from a log-normal distribution with mean 2 and variance 1, again choosing the variance of Y to give R² = 0.1, 0.5 or 0.8; the bottom two rows of Figure 1 show typical datasets.

Linear regression with interaction

We focussed on normally and log-normally distributed covariates. Four bivariate distributions were assumed for the two covariates X₁ and X₂, with each margin either normal or log-normal. Y was generated from a normal distribution whose mean included main effects and an interaction of X₁ and X₂ (true interaction coefficient 1), with variance chosen to give R² = 0.1, 0.5 or 0.8.

Logistic regression with quadratic term

A sample size of 2000 was assumed and 1000 simulated datasets were created. This larger sample size was used because binary outcomes provide less information for estimating parameter values than do continuous outcomes. We used the same normal and log-normal distributions for X as in the linear regression studies. Binary Y was generated with P(Y = 1 | X) = expit(β₀ + 2β₂X + β₂X²). The value of β₂ was chosen so that a given contrast in X corresponded to a log odds ratio of 1 (β₂ = 1/12) or 2 (β₂ = 1/6), and β₀ was chosen so that the marginal probability p of Y = 1 equalled 0.5 or 0.1.
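Assuming the linear predictor β₀ + β₂(2X + X²) reconstructed above, the outcome generation can be sketched as follows (our own illustration), with β₀ tuned by bisection to fix the marginal probability p:

```python
import numpy as np

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate_binary_y(x, beta2, p, rng):
    """Generate binary Y with P(Y=1|X) = expit(beta0 + beta2*(2X + X^2)),
    where beta0 is found by bisection so that the marginal P(Y=1) is p."""
    eta = beta2 * (2 * x + x ** 2)
    lo, hi = -30.0, 30.0
    for _ in range(60):                 # bisection on beta0
        mid = 0.5 * (lo + hi)
        if expit(mid + eta).mean() < p:
            lo = mid
        else:
            hi = mid
    beta0 = 0.5 * (lo + hi)
    return (rng.uniform(size=x.size) < expit(beta0 + eta)).astype(int)
```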

X was made observed with probability expit(θ₀ + θ₁Y). For MCAR, θ₁ = 0; for MAR, θ₁ = −2. In both cases θ₀ was chosen to give a fixed marginal probability of observing X.

Analysis of vitamin C data from EPIC Study

EPIC-Norfolk is a cohort of 25,639 men and women recruited during 1993-97 from the population of individuals aged 45-75 in Norfolk, UK.

There is evidence of a non-linear association between vitamin C intake and plasma vitamin C.

Of the 25,639 subjects, 10,224 had incomplete data: 3165 had missing plasma vitamin C; 8100 missing dietary vitamin C; 32 missing weight; and 220 missing smoking status (age and sex were fully observed). If the data are not MCAR, estimators from a complete-case analysis may be biased. When logistic regression was used on the set of subjects with observed plasma vitamin C, higher values of log plasma vitamin C were associated with a lower probability of being a complete case, indicating that the data are not MCAR.

Three forms of MI were used. In the first and second forms, the full-conditional specification (FCS; also known as 'chained equations') approach was used, with dietary vitamin C imputed by linear regression and its square imputed passively, analogous to the method we call 'passive imputation of X²' in the simulations. The second form of MI was identical to the first except that PMM was used to impute dietary vitamin C. The third form of MI was JAV, i.e. imputation under a multivariate normal distribution. For JAV, smoking status was represented by two binary indicator variables, one of which was equal to 1 for former smokers, the other of which equalled 1 for never smokers. Imputed values for these binary variables were not rounded; this method of handling categorical variables was advocated by Ake.

We also applied a variant of JAV in which FCS was used. This variant was identical to the first form of MI except that the dietary vitamin C-squared variable was imputed using a linear regression model involving all the other variables (including dietary vitamin C) as covariates. This is equivalent to imputing dietary vitamin C and dietary vitamin C-squared from a bivariate normal distribution conditional on all the other variables.

In all our analyses smoking status was categorised as current smoker (baseline), former smoker or never smoker. The first and second forms of MI and the variant JAV method were implemented using ice in STATA. The original JAV method was implemented using mi impute mvn in STATA.

Results

Properties of JAV under MCAR and MAR

In this section, we summarise the argument of Von Hippel (2009) for why the JAV approach will give consistent estimation of the parameters of the analysis model when the data are MAR, and then explain why JAV actually requires the stronger condition of MCAR for consistency.

Assume that the analysis model is the linear regression of Y on X and X² (the argument is analogous for the interaction model). Model (1), or equivalently (2)-(3), is misspecified, since (Y, X, X²) is not jointly normally distributed. The values of μ₁, σ₁₁, etc. that minimise the Kullback-Leibler distance between the true distribution of (Y, X, X²) and the multivariate normal distribution are called their 'least false' values. The least false values of μ₁, μ₂ and μ₃ are just the population means of Y, X and X²; those of σ₁₁, ..., σ₃₃ are the population variances and covariances. If the missing X and X² values are imputed from distribution (3) with **Δ** = (β₂₀, β₂₁, β₃₀, β₃₁, ω₂₂, ω₂₃, ω₃₃) equal to its least false value, then the mean and variance of (Y, X, X²) in the imputed dataset will consistently estimate the mean and variance in the population. The true parameter values of the analysis model are functions of this population mean and variance: they are the least-squares coefficients determined by these first and second moments. Hence, provided the missing X and X² values are imputed using the least false values, the parameters of the analysis model will be consistently estimated.

Von Hippel argues that, when the data are MAR, the least false value of **Δ** can be consistently estimated by fitting, by maximum likelihood, the model in which (Y, X, X²) is jointly normally distributed (model (1)). He does this using PROC MI in SAS.

When the data are MCAR, the above argument is valid. However, when the data are MAR, the observed-data MLEs of **Δ** are not necessarily consistent for the least false value, because model (1) is misspecified. Consequently, the analysis model parameters will not, in general, be consistently estimated unless the data are MCAR.

It can be seen from expressions (2) and (3) that the MLEs of μ₁ and σ₁₁ are functions only of the Y values, whereas the MLE of **Δ** is a function of the Y, X and X² values of the complete cases, i.e. of subjects for whom X is observed. As Y is always observed, the MLEs of μ₁ and σ₁₁ consistently estimate their least false values. If the data are MCAR, the complete cases are a random sample of the population, and so **Δ** is also consistently estimated by its MLE; the rest of Von Hippel's argument then applies. On the other hand, if the data are MAR with missingness depending on Y, the distribution of (Y, X, X²) among the complete cases differs from that in the population; because model (3) is misspecified, the MLE of **Δ** then converges to the least false value for the complete-case distribution, which in general differs from the least false value for the whole population.
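A simplified numerical illustration of this point (our own; it uses raw complete-case moments rather than the full MLE of **Δ**): under MCAR the complete-case mean of X² recovers the population value, but under MAR with missingness depending on Y it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(2, 1, n)                     # population E[X^2] = 2^2 + 1 = 5
y = 2 * x + x ** 2 + rng.normal(0, 6.2, n)  # R^2 roughly 0.5

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def complete_case_mean_x2(theta1):
    """Mean of X^2 among complete cases when P(observe X) depends
    on standardised Y through expit(theta1 * z)."""
    obs = rng.uniform(size=n) < expit(theta1 * (y - y.mean()) / y.std())
    return (x[obs] ** 2).mean()

m_mcar = complete_case_mean_x2(0.0)    # close to the population value 5
m_mar = complete_case_mean_x2(-1.0)    # pulled well below 5
```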

The argument that JAV will give consistent estimation under MCAR extends to the situation where the analysis model contains further covariates, **S** say. In this case, the normal distribution for (Y, X, X², **S**) is again misspecified, but the parameters of the linear regression of Y on X, X² and **S** remain functions only of the mean and variance of (Y, X, X², **S**), which are consistently estimated under MCAR by the same reasoning as before.

So far, we have been concerned with parameter estimation. Now consider variance estimation. Von Hippel uses Rubin's Rules to estimate variances and hence confidence intervals. However, there is no particular reason to assume that Rubin's Rules will give a consistent variance estimator, because derivations of Rubin's Rules assume a correctly specified parametric imputation model.

Simulation studies

Linear regression with quadratic term

We focus on the quadratic term, whose true value is 1. The first block of five rows of Table 1 shows results when the data are MCAR and X is normally distributed: Passive is biased, whereas PMM and JAV are approximately unbiased with near-nominal coverage. If Y explained more of the variation in X and X², JAV might well be more efficient than CCase, because CCase would not use data on individuals with observed Y but missing X.

Linear regression with quadratic term, normally distributed X

| Scenario | Method | bias | cover | r.prec. | bias | cover | r.prec. | bias | cover | r.prec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MCAR | CData | -3 | 95 | 100 | -1 | 95 | 100 | 0 | 95 | 100 |
| | CCase | -2 | 95 | 64 | -1 | 95 | 64 | 0 | 95 | 64 |
| | Passive | -32 | 99 | 124 | -21 | 95 | 104 | -20 | 87 | 86 |
| | PMM | -3 | 92 | 59 | 0 | 93 | 65 | 2 | 92 | 64 |
| | JAV | -4 | 94 | 61 | -1 | 95 | 61 | 0 | 95 | 62 |
| MAR (a) | CData | -6 | 95 | 100 | -1 | 96 | 100 | -2 | 95 | 100 |
| | CCase | -23 | 95 | 72 | -13 | 95 | 59 | -8 | 94 | 48 |
| | Passive | -45 | 99 | 144 | -27 | 95 | 120 | -42 | 50 | 122 |
| | PMM | -36 | 89 | 50 | -13 | 93 | 49 | 8 | 91 | 36 |
| | JAV | -12 | 94 | 52 | -1 | 95 | 42 | 0 | 93 | 38 |
| MAR (b) | CData | -6 | 96 | 100 | 0 | 95 | 100 | -1 | 95 | 100 |
| | CCase | -21 | 94 | 42 | -19 | 94 | 24 | -7 | 94 | 20 |
| | Passive | -72 | 98 | 70 | 24 | 93 | 21 | -3 | 88 | 31 |
| | PMM | -46 | 88 | 29 | -19 | 90 | 15 | 47 | 86 | 6 |
| | JAV | -7 | 92 | 28 | 7 | 91 | 12 | 18 | 91 | 10 |

Table 1 Percentage bias, coverage and relative precision for the quadratic term in linear regression with normally distributed X; the true value of the quadratic term is 1. Within each row, the three groups of (bias, cover, r.prec.) columns correspond to R² = 0.1, 0.5 and 0.8, respectively. MAR (a) and MAR (b) denote the two MAR mechanisms.

The above observations remained broadly true in the other MCAR settings we examined.

The second block of five rows of Table 1 shows results for the first MAR mechanism. CCase, Passive and PMM are all biased, particularly when R² = 0.1, whereas JAV has much smaller bias and coverage close to 95%.

The third block of five rows of Table 1 shows results for the second MAR mechanism. When R² = 0.1, CCase is biased and JAV approximately unbiased. However, when R² = 0.8, JAV is considerably more biased than CCase, and there is some evidence of slight undercoverage of JAV. PMM is biased throughout this block, most severely (47%) when R² = 0.8.

Increasing the sample size did not resolve these problems: the bias of PMM was reduced when R² = 0.1, stayed the same when R² = 0.8, and increased from -19% to 38% when R² = 0.5. The biases of Passive and JAV were not improved, and the coverages of PMM, Passive and JAV worsened, especially when R² = 0.8, where they were 0%, 67% and 88%, respectively. We also investigated whether the bias of PMM would disappear if the sample size were increased further; it did not, being, for example, 26% when R² = 0.8. This is probably because the difference between the largest missing values of X and the largest observed values of X does not vanish: PMM can never impute a value of X more extreme than the largest observed value.

Table 2 shows corresponding results when X is log-normally distributed. Under MCAR, Passive is badly biased with poor coverage, while PMM and JAV are approximately unbiased. When the data are MAR, JAV becomes biased: under the first MAR mechanism its bias is 18% (coverage 68%) when R² = 0.5 and 22% (coverage 19%) when R² = 0.8; under the second, the bias is 41% or 71% when R² = 0.5 or 0.8, respectively. These biases do not improve as sample size increases. The performances of JAV and PMM are similar when R² = 0.1.

Linear regression with quadratic term, log-normally distributed X

| Scenario | Method | bias | cover | r.prec. | bias | cover | r.prec. | bias | cover | r.prec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MCAR | CData | -1 | 95 | 100 | 0 | 95 | 100 | 0 | 95 | 100 |
| | CCase | 0 | 95 | 64 | 0 | 95 | 64 | 0 | 95 | 64 |
| | Passive | -31 | 86 | 110 | -31 | 48 | 55 | -30 | 32 | 21 |
| | PMM | -2 | 93 | 62 | 1 | 93 | 61 | 2 | 90 | 47 |
| | JAV | -1 | 94 | 64 | 0 | 93 | 63 | 0 | 92 | 52 |
| MAR (a) | CData | 0 | 94 | 100 | 0 | 95 | 100 | 0 | 94 | 100 |
| | CCase | -14 | 92 | 54 | -9 | 88 | 37 | -4 | 91 | 30 |
| | Passive | -41 | 80 | 108 | -38 | 45 | 52 | -32 | 48 | 18 |
| | PMM | -10 | 88 | 42 | 4 | 88 | 26 | 16 | 51 | 10 |
| | JAV | 0 | 93 | 41 | 18 | 68 | 21 | 22 | 19 | 10 |
| MAR (b) | CData | 2 | 94 | 100 | 0 | 95 | 100 | 0 | 95 | 100 |
| | CCase | -12 | 94 | 44 | -8 | 94 | 27 | -4 | 94 | 20 |
| | Passive | -41 | 96 | 81 | -25 | 87 | 18 | -9 | 90 | 4 |
| | PMM | -10 | 88 | 29 | 8 | 91 | 12 | 35 | 70 | 3 |
| | JAV | 7 | 92 | 27 | 41 | 70 | 6 | 71 | 20 | 2 |

Table 2 Percentage bias, coverage and relative precision for the quadratic term in linear regression with log-normally distributed X; the true value of the quadratic term is 1. Within each row, the three groups of (bias, cover, r.prec.) columns correspond to R² = 0.1, 0.5 and 0.8, respectively. MAR (a) and MAR (b) denote the two MAR mechanisms.

Linear regression with interaction

We focus on the interaction term, whose true value is 1. Table 3 shows the results.

Linear regression with interaction

| Scenario | Method | bias | cover | r.prec. | bias | cover | r.prec. | bias | cover | r.prec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MCAR | CData | 3 | 93 | 100 | 1 | 93 | 100 | 0 | 93 | 100 |
| | CCase | -3 | 95 | 71 | -1 | 95 | 71 | 0 | 95 | 71 |
| | Passive1 | -31 | 97 | 136 | -19 | 94 | 116 | -18 | 88 | 106 |
| | Passive2 | -11 | 95 | 86 | -17 | 94 | 115 | -17 | 89 | 103 |
| | PMM | -12 | 96 | 86 | -15 | 96 | 106 | -13 | 91 | 93 |
| | JAV | -2 | 93 | 66 | -1 | 94 | 65 | 0 | 94 | 65 |
| MAR (a) | CData | -1 | 94 | 100 | -2 | 95 | 100 | 0 | 95 | 100 |
| | CCase | -15 | 96 | 82 | -12 | 94 | 69 | -5 | 95 | 62 |
| | Passive1 | -36 | 99 | 147 | -24 | 94 | 112 | -25 | 79 | 110 |
| | Passive2 | -14 | 96 | 75 | -26 | 94 | 111 | -25 | 82 | 89 |
| | PMM | -19 | 97 | 84 | -23 | 94 | 94 | -17 | 90 | 85 |
| | JAV | -3 | 94 | 60 | -4 | 92 | 54 | 1 | 94 | 53 |
| MAR (b) | CData | -1 | 96 | 100 | 2 | 95 | 100 | 1 | 96 | 100 |
| | CCase | -17 | 94 | 57 | -9 | 95 | 38 | -4 | 96 | 30 |
| | Passive1 | -43 | 98 | 129 | -20 | 96 | 76 | -34 | 68 | 65 |
| | Passive2 | -40 | 96 | 71 | -42 | 89 | 58 | -45 | 73 | 27 |
| | PMM | -40 | 96 | 79 | -38 | 92 | 66 | -27 | 85 | 30 |
| | JAV | -3 | 93 | 41 | 8 | 92 | 26 | 14 | 92 | 20 |

Table 3 Percentage bias, coverage and relative precision for the interaction term in linear regression; the true value of the interaction term is 1. Within each row, the three groups of (bias, cover, r.prec.) columns correspond to R² = 0.1, 0.5 and 0.8, respectively. MAR (a) and MAR (b) denote the two MAR mechanisms.

For the other bivariate distributions of (X₁, X₂) with R² = 0.1, the biases of Passive1, Passive2 and PMM are 24%, -16% and 24%, respectively, in one case and -29%, -37% and -64% in the other.

Logistic regression with quadratic term

We focus on the quadratic term, whose true value is β₂ = 1/12 or 1/6. Table 4 shows the results for (p, β₂) = (0.5, 1/12), (0.5, 1/6) and (0.1, 1/12).

Logistic regression with quadratic term

| Scenario | Method | bias | cover | r.prec. | bias | cover | r.prec. | bias | cover | r.prec. |
|---|---|---|---|---|---|---|---|---|---|---|
| MCAR, X normal | CData | 1 | 95 | 100 | -1 | 95 | 100 | -6 | 94 | 100 |
| | CCase | 1 | 96 | 70 | -1 | 95 | 73 | -8 | 95 | 67 |
| | Passive | -30 | 97 | 137 | -30 | 92 | 136 | -34 | 99 | 119 |
| | PMM | 0 | 94 | 67 | -1 | 94 | 70 | -10 | 93 | 63 |
| | JAV | -7 | 96 | 76 | -23 | 92 | 102 | 27 | 91 | 72 |
| MCAR, X log-normal | CData | 6 | 95 | 100 | 4 | 94 | 100 | 4 | 94 | 100 |
| | CCase | 7 | 94 | 69 | 4 | 95 | 73 | 4 | 96 | 71 |
| | Passive | -36 | 96 | 222 | -45 | 90 | 308 | -40 | 93 | 127 |
| | PMM | 8 | 93 | 67 | 6 | 92 | 68 | 5 | 95 | 68 |
| | JAV | -66 | 71 | 178 | -118 | 3 | 398 | 55 | 85 | 56 |
| MAR, X normal | CData | 0 | 96 | 100 | 1 | 95 | 100 | -8 | 96 | 100 |
| | CCase | -1 | 97 | 67 | 0 | 95 | 63 | -28 | 96 | 28 |
| | Passive | 33 | 97 | 125 | -30 | 92 | 115 | -71 | 99 | 171 |
| | PMM | -2 | 94 | 65 | -1 | 92 | 59 | -33 | 85 | 27 |
| | JAV | 37 | 89 | 59 | 56 | 62 | 79 | 51 | 82 | 26 |
| MAR, X log-normal | CData | 5 | 93 | 100 | 5 | 96 | 100 | 5 | 94 | 100 |
| | CCase | 7 | 93 | 70 | 7 | 95 | 69 | 7 | 95 | 38 |
| | Passive | -8 | 98 | 100 | 2 | 99 | 106 | -202 | 16 | 81 |
| | PMM | 8 | 91 | 67 | 7 | 93 | 64 | 5 | 84 | 34 |
| | JAV | 22 | 92 | 81 | -30 | 80 | 105 | 333 | 25 | 7 |

Table 4 Percentage bias, coverage and relative precision for the quadratic term in logistic regression. Within each row, the three groups of (bias, cover, r.prec.) columns correspond to (p, β₂) = (0.5, 1/12), (0.5, 1/6) and (0.1, 1/12), respectively, where p is the marginal probability that Y = 1.

Now consider the results when β₂ = 1/12 and p = 0.1 (the third group of columns). Passive and JAV can be extremely biased (for example, Passive has bias -202% and JAV 333% in the final MAR block), with very poor coverage. PMM is the least bad of the imputation methods, although it too is biased when the data are MAR.

Analysis of vitamin C data from EPIC study

Figure 2 shows the association between the two measures of vitamin C.

**Figure 2. Log plasma vitamin C and log dietary vitamin C in 15,415 individuals for whom both variables are observed.**

Analysis of vitamin-C data

| Covariate | Est (Complete Cases) | SE | Est (FCS with Passive) | SE | Est (FCS with PMM) | SE | Est (JAV) | SE |
|---|---|---|---|---|---|---|---|---|
| intercept | 0.990 | 0.201 | 0.570 | 0.177 | 0.903 | 0.181 | 1.030 | 0.163 |
| log diet C | 1.141 | 0.090 | 1.322 | 0.079 | 1.163 | 0.081 | 1.106 | 0.075 |
| log diet C sqrd | -0.090 | 0.010 | -0.113 | 0.009 | -0.094 | 0.009 | -0.088 | 0.008 |
| sex | 0.169 | 0.008 | 0.173 | 0.007 | 0.172 | 0.007 | 0.172 | 0.007 |
| weight (per 10 kg) | -0.042 | 0.003 | -0.041 | 0.003 | -0.040 | 0.003 | -0.041 | 0.003 |
| age (per 10 yrs) | -0.052 | 0.004 | -0.043 | 0.003 | -0.043 | 0.003 | -0.043 | 0.003 |
| former smoker | 0.212 | 0.015 | 0.213 | 0.012 | 0.213 | 0.012 | 0.212 | 0.012 |
| never smoker | 0.216 | 0.014 | 0.218 | 0.012 | 0.218 | 0.012 | 0.219 | 0.012 |

Table 5 Point estimates and SEs from complete-case analysis and three MI methods (full conditional specification with and without predictive mean matching, and JAV) for the regression of log plasma vitamin C on log dietary vitamin C ('log diet C'), its square ('log diet C sqrd') and a set of confounders

Table 5 shows the results. The point estimates from FCS with PMM and from JAV are close to those from the complete-case analysis, whereas FCS with passive imputation yields noticeably larger estimates (in absolute value) for log dietary vitamin C and its square. Estimates for the confounders are very similar across all four analyses.

The SEs when using MI are somewhat smaller than those from the complete-case analysis, indicating that MI has made use of information from the subjects with missing data. This gain in efficiency is greater for JAV than for PMM. Had there been more missing values in the confounders, we would have expected a greater efficiency gain from using MI.

Discussion

In this article, we have investigated imputation of an incomplete variable when the model of interest includes as covariates more than one function of that variable. We have focused on linear regression with a quadratic or interaction term, and have examined three imputation methods that can be easily implemented in standard software. In STATA, for example, the ice command can be used for passive imputation and PMM, and the mi impute mvn command for JAV; in R the mice function can be used for passive imputation and PMM, and the mix library for JAV. Note that although ice and mice use chained equations and hence, in general, involve iteration, when the data are monotone missing, as is the case in our simulation studies, no iteration is required.

In the JAV approach, each function of the incomplete variable is treated as an unrelated variable and a multivariate normal imputation model is used. Von Hippel (2009) claimed that this would give consistent estimation for linear regression when the data were MAR. In this paper we have shown that the consistency actually requires MCAR; when data are MAR, bias is to be expected. None of the three MI methods we investigated worked well in all the MAR scenarios considered. In general, JAV performed better than passive imputation or PMM for linear regression with a quadratic or interaction effect. We have shown, however, that there are circumstances in which JAV can have large bias for the quadratic effect of a linear regression model. JAV was found to perform very badly when the analysis model is a logistic regression, unless the outcome is common and covariates only have small effects on its probability. In view of this, we recommend that, given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression. For logistic regression, the best performing imputation method was PMM. However, when X and X² are the only covariates in the model and X is MAR, the complete-case analysis is unbiased, and hence we recommend its use in that case.

In our simulations, we found that using PMM was nearly always better than using passive imputation without PMM. However, for linear regression analysis models, its performance was usually worse than JAV.

In the scenarios we considered in our simulations, the analysis model only involves one variable (X, or the pair X₁ and X₂) with missing values. In practice, as in our analysis of the EPIC data, the analysis model will often involve further covariates, possibly also with missing values; the three imputation methods extend to this setting, with the additional covariates included in the imputation models.

We (like Von Hippel, 2009) have presented JAV as a method using a multivariate normal imputation model. If the data are MCAR, JAV with this imputation model will give consistent point estimation in linear regression. The principle of JAV, i.e. that functions of the same variable are treated as separate and the functional relation between them ignored, is not tied to the normal distribution. However, the properties of a method using the JAV principle with another imputation model are thus far unknown.

Conclusions

JAV gives consistent estimation for linear regression with a quadratic or interaction term when data are MCAR, but may be biased when data are MAR. The bias of JAV can be severe when used for logistic regression. JAV is the best of a set of imperfect methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

IRW proposed the study. All authors made substantial contributions to the direction of the study, the design of the simulation studies, the interpretation of the results, and the writing of the manuscript. SRS carried out the simulations and data analysis and drafted the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

SRS and IRW were funded by the MRC Unit Programme Number U105260558 and Grant Number MC_US_AO30_0015. JWB was supported by the ESRC Follow-On Funding scheme (RES-189-25-0103) and MRC grant G0900724. We thank Patrick Royston and James Carpenter for valuable discussions, and Ruth Keogh for help with acquiring and understanding the EPIC data. We would like to acknowledge the contribution of the staff and participants of the EPIC-Norfolk Study. EPIC-Norfolk is supported by the MRC programme grant (G0401527).

Pre-publication history

The pre-publication history for this paper can be accessed here: