Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, USA

Department of Mathematics, Statistics, and Computer Science, Bar Ilan University, Ramat Gan 52900, Israel

Abstract

Background

Many randomized trials involve missing binary outcomes. Although many previous adjustments for missing binary outcomes have been proposed, none of these makes explicit use of randomization to bound the bias when the data are not missing at random.

Methods

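We propose a novel approach that uses the randomization distribution to compute the anticipated maximum bias when missing at random does not hold due to an unobserved binary covariate (implying that missingness depends on outcome and treatment group). The anticipated maximum bias equals the product of two factors: (i) the anticipated maximum bias if there were complete confounding of the unobserved covariate with treatment group among subjects with observed outcomes, and (ii) an upper bound factor that depends only on the fraction missing in each randomization group.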

Results

We illustrated the methodology using data from the Polyp Prevention Trial. We anticipated a maximum bias under complete confounding of .25. With only 7% and 9% missing in each arm, the upper bound factor, after adjusting for age and sex, was .10. The anticipated maximum bias of .25 × .10 =.025 would not have affected the conclusion of no treatment effect.

Conclusion

This approach is easy to implement and is particularly informative when less than 15% of subjects are missing in each arm.

Background

Missing outcome data are common in clinical studies

We illustrate the methodology using data from the Polyp Prevention Trial (PPT) in which 2079 men and women with recently removed colorectal adenoma were randomized to receive either intensive counseling to adopt a low-fat diet (intervention) or a standard brochure on healthy eating (control)

Methods

Adjusting for Observed Covariates

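As a starting point, we assume the data are missing at random (MAR). Let Y denote the binary outcome (Y = 1 for adenoma recurrence and 0 otherwise), Z the randomization group (z = 0 for control and z = 1 for study), S the baseline stratum, and R the indicator of an observed outcome (R = 1 if the outcome is observed and 0 if it is missing). The MAR assumption is

pr(R = 1 | Y = y, Z = z, S = s) = pr(R = 1 | Z = z, S = s). (1)

Because (1) implies that outcome and missingness are independent given randomization group and stratum,

pr(Y = 1 | Z = z, S = s) = pr(Y = 1 | R = 1, Z = z, S = s). (2)

In other words, under the MAR assumption in (1), the probability of adenoma recurrence conditional on treatment assignment and baseline covariates is the same in all subjects as in subjects not missing outcome. Baker and Laird

With binary outcomes, the overall measure of treatment effect is typically a difference, a relative risk, or an odds ratio. We focus on the difference because it is easy to interpret. Let Δ_{s} denote the treatment effect for stratum s, namely

Δ_{s} = pr(Y = 1 | Z = 1, S = s) - pr(Y = 1 | Z = 0, S = s). (3)

By virtue of the randomization, the distribution of strata is the same in both groups, so the overall treatment effect is

Δ = Σ_{s} Δ_{s} w_{s}, (4)

where w_{s} is the fraction of subjects in stratum s. If the missing-data mechanism is given in (1), then from (2), the treatment effect in stratum s is

Δ_{s} = pr(Y = 1 | R = 1, Z = 1, S = s) - pr(Y = 1 | R = 1, Z = 0, S = s). (5)

Let n_{zsy} denote the number of subjects in treatment group z and stratum s with observed outcome y. We estimate Δ_{s} by

d_{s} = p_{s1} - p_{s0}, where p_{sz} = n_{zs1}/n_{zs+}, (6)

where "+" denotes summation over the indicated subscript. Let N_{zs} denote the number of subjects (with either observed or missing outcomes) in treatment group z and stratum s. We estimate the weight for stratum s by w_{s} = N_{+s}/N_{++}, giving an overall estimate of treatment effect,

d = Σ_{s} d_{s} w_{s}. (7)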
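As an illustration of (6) and (7), the following Python sketch computes the difference in observed outcome proportions within each stratum and then averages those differences, weighting each stratum by its share of all randomized subjects (observed or missing). The counts are hypothetical and the function and variable names are ours, chosen only for this example.

```python
# Sketch of the stratified estimate in (6)-(7); all counts are hypothetical.

def stratum_difference(yes1, no1, yes0, no0):
    """Difference in observed outcome proportions, study minus control."""
    return yes1 / (yes1 + no1) - yes0 / (yes0 + no0)

# Each entry holds (outcome yes, outcome no, missing) for one stratum.
strata = [
    {"study": (20, 60, 20), "control": (30, 60, 10)},
    {"study": (50, 40, 10), "control": (45, 45, 10)},
]

# Total number randomized, counting subjects with missing outcomes.
total = sum(sum(st["study"]) + sum(st["control"]) for st in strata)

differences = []
weights = []
for st in strata:
    yes1, no1, _ = st["study"]
    yes0, no0, _ = st["control"]
    differences.append(stratum_difference(yes1, no1, yes0, no0))
    weights.append((sum(st["study"]) + sum(st["control"])) / total)

# Weighted average of the stratum-specific differences, as in (7).
overall = sum(d * w for d, w in zip(differences, weights))
print(round(overall, 4))
```

Subjects with missing outcomes contribute to the weights but not to the stratum-specific proportions, which is what the MAR assumption within strata permits.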

The estimate in (7) is closely related to the estimate proposed by Horvitz and Thompson _{1}_{1 }+ _{2}_{2 }+ .... _{h-1}_{h-1 }+ _{h }(1 -

where _{h }= 1 -

Bias from an omitted binary covariate

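Suppose that instead of (1), the probability of missingness depends on treatment assignment, baseline strata, and an unobserved binary covariate X (x = 0 or 1), but not otherwise on outcome,

pr(R = 1 | Y = y, Z = z, S = s, X = x) = pr(R = 1 | Z = z, S = s, X = x),

where R = 1 indicates an observed outcome. In other words the data would be MAR if x were observed, so that pr(Y = 1 | Z = z, S = s, X = x) = pr(Y = 1 | R = 1, Z = z, S = s, X = x).

We assume that for each level of x the treatment effect in stratum s is the same,

Δ_{s} = pr(Y = 1 | Z = 1, S = s, X = 0) - pr(Y = 1 | Z = 0, S = s, X = 0)

= pr(Y = 1 | Z = 1, S = s, X = 1) - pr(Y = 1 | Z = 0, S = s, X = 1). (11)

Importantly Δ_{s} in (11) does not depend on x. When x is omitted, the apparent treatment effect in stratum s among subjects not missing outcome is

Δ*_{s} = pr(Y = 1 | R = 1, Z = 1, S = s) - pr(Y = 1 | R = 1, Z = 0, S = s). (12)

To formalize the relationship between Δ*_{s} and Δ_{s} let

α_{xs} = pr(Y = 1 | R = 1, Z = 0, S = s, X = x), (13)

ψ_{s} = α_{1s} - α_{0s}, (14)

φ_{zs} = pr(X = 1 | R = 1, Z = z, S = s), (15)

ε_{s} = φ_{1s} - φ_{0s}. (16)

Combining (11) and (13), we can write

pr(Y = 1 | R = 1, Z = 1, S = s, X = x) = α_{xs} + Δ_{s}. (17)

Substituting (13)-(17) into (12) gives

Δ*_{s} = Δ_{s} + ψ_{s} ε_{s}. (18)

For a tabular display of these calculations see the table of cell probabilities in a generic stratum.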

BK-plot of bias from an unobserved binary covariate among subjects not missing outcome. The diagonal lines give the probability of outcome among subjects not missing outcome in each randomization group as a function of the fraction with x = 1. For subjects in group 0, the fraction with x = 1 is φ_{0s} and the probability of outcome is indicated by point A. For subjects in group 1, the fraction with x = 1 is φ_{1s} and the probability of outcome is indicated by point B. The true treatment effect Δ_{s} is the difference between the diagonal lines. The apparent treatment effect is the vertical distance between points A and B, which equals Δ_{s} + ψ_{s}ε_{s}, where ε_{s} = φ_{1s} - φ_{0s} and ψ_{s} = α_{1s} - α_{0s} = the slope of each diagonal line. To bound the overall bias Σ_{s} ψ_{s} ε_{s} w_{s}, with stratum weights w_{s}, we use an upper bound on ε_{s} based only on the fraction missing and a plausible value for the maximum of ψ_{s} based on the estimates of ψ_{s} if an observed covariate were missing.

Cell probabilities in a generic stratum

| randomization group | unobserved covariate | probability of outcome given group, unobserved covariate, and stratum | probability of unobserved covariate given group and stratum | probability of outcome given group and stratum |
|---|---|---|---|---|
| 1 | 0 | α_{0s} + Δ_{s} | 1 - φ_{1s} | (α_{0s} + Δ_{s})(1 - φ_{1s}) + (α_{1s} + Δ_{s})φ_{1s} |
| 1 | 1 | α_{1s} + Δ_{s} | φ_{1s} | |
| 0 | 0 | α_{0s} | 1 - φ_{0s} | α_{0s}(1 - φ_{0s}) + α_{1s}φ_{0s} |
| 0 | 1 | α_{1s} | φ_{0s} | |

Difference between randomization groups: Δ_{s} + ψ_{s}ε_{s}, where ε_{s} = φ_{1s} - φ_{0s} and ψ_{s} = α_{1s} - α_{0s}.

Under missing at random (MAR), the probabilities in the third column are the same for subjects not missing outcome as for all subjects, so Δ_{s} represents the true treatment effect, which is the same for both levels of the unobserved covariate. The apparent treatment effect is Δ_{s} + ψ_{s}ε_{s}. To bound the overall bias Σ_{s} ψ_{s} ε_{s} w_{s}, with stratum weights w_{s}, we use an upper bound on ε_{s} based only on the fraction missing and a plausible value for the maximum of ψ_{s} based on the estimates of ψ_{s} if an observed covariate were missing.
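The algebra behind the last row of the table can be checked numerically. The sketch below builds the marginal probability of outcome in each randomization group from the cell probabilities in the table and confirms that their difference equals the true effect plus the product of the two factors. The parameter values are hypothetical and the function name is ours.

```python
# Numerical check of the table's bottom line: apparent effect = delta + psi * eps.
# All parameter values below are hypothetical.

def apparent_effect(alpha0, alpha1, delta, phi0, phi1):
    """Difference in outcome probability between groups when x is omitted."""
    # Group 1: mix the two x-specific probabilities with weights (1 - phi1, phi1).
    p_group1 = (alpha0 + delta) * (1 - phi1) + (alpha1 + delta) * phi1
    # Group 0: mix with weights (1 - phi0, phi0).
    p_group0 = alpha0 * (1 - phi0) + alpha1 * phi0
    return p_group1 - p_group0

alpha0, alpha1 = 0.30, 0.50   # outcome probability in group 0 for x = 0, 1
delta = 0.05                  # true treatment effect in the stratum
phi0, phi1 = 0.40, 0.55       # fraction with x = 1 in groups 0 and 1

psi = alpha1 - alpha0         # effect of x on outcome
eps = phi1 - phi0             # degree of confounding of x with group

lhs = apparent_effect(alpha0, alpha1, delta, phi0, phi1)
rhs = delta + psi * eps
print(round(lhs, 6), round(rhs, 6))
```

Setting phi0 equal to phi1 in this sketch makes the bias term vanish, which corresponds to the case of no confounding.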

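From (18) the bias from omitting the unobserved covariate x in stratum s is ψ_{s} ε_{s}. The first factor

ψ_{s} = pr(Y = 1 | R = 1, Z = 0, S = s, X = 1) - pr(Y = 1 | R = 1, Z = 0, S = s, X = 0) (19)

is the effect of x on the probability of outcome among subjects not missing outcome (R = 1). If ψ_{s} = 0, the covariate has no effect on outcome and there is no bias. The second factor

ε_{s} = pr(X = 1 | R = 1, Z = 1, S = s) - pr(X = 1 | R = 1, Z = 0, S = s) (20)

ranges from -1 to 1 and measures the degree of confounding between x and randomization group among subjects not missing outcome. If ε_{s} = 0, there is no confounding and no bias because the distribution of x is the same in both randomization groups. If ε_{s} = ± 1 there is complete confounding and the bias reaches the maximum value of ± ψ_{s}. Taking a weighted average over all strata, the overall apparent treatment effect is

Σ_{s} (Δ_{s} + ψ_{s} ε_{s}) w_{s} = Δ + Σ_{s} ψ_{s} ε_{s} w_{s}, (21)

and the overall bias is

bias = Σ_{s} ψ_{s} ε_{s} w_{s}. (22)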

Remarkably it is possible to obtain simple bounds on ε_{s} based only on the proportion of subjects who are missing in each randomized group in stratum s. Let

π_{zs} = pr(R = 1 | Z = z, S = s) (23)

denote the proportion of subjects in randomization group z and stratum s who are not missing outcome (R = 1). The upper bound on the absolute value of ε_{s}, which we call the upper bound factor, is

ε_{(max)s} = max{(1 - π_{0s})/π_{1s}, (1 - π_{1s})/π_{0s}}. (24)

If only 15% of the subjects are missing in each arm, ε_{(max)s} is less than .18 (.15/.85 = .176). If we let ψ_{max} denote the anticipated maximum value of ψ_{s}, then substituting (24) into (22) gives the anticipated maximum bias,

bias_{max} = ± ψ_{max} Σ_{s} ε_{(max)s} w_{s}, (25)

where the anticipated maximum bias under complete confounding, ψ_{max}, is specified by the investigator; the upper bound factor, ε_{(max)s}, is based on the fraction with observed outcomes in stratum s; and w_{s} is the fraction of subjects in stratum s.
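The upper bound factor is simple enough to compute by hand, but a short sketch makes the arithmetic explicit. It assumes the form of the bound stated with the results table, max((1 - π_{0s})/π_{1s}, (1 - π_{1s})/π_{0s}); the function name is ours.

```python
# Sketch of the upper bound factor for one stratum.
# pi0 and pi1 are the fractions with observed outcomes in groups 0 and 1.

def upper_bound_factor(pi0, pi1):
    """Largest possible absolute confounding given only the fractions observed."""
    return max((1 - pi0) / pi1, (1 - pi1) / pi0)

# 15% missing in each arm gives 0.15 / 0.85, which is below 0.18.
eps_15 = upper_bound_factor(0.85, 0.85)

# Ignoring strata, the overall fractions missing in the Polyp Prevention Trial
# (9% control, 7% study) give a factor a little under 0.10.
eps_overall = upper_bound_factor(0.91, 0.93)

print(round(eps_15, 3), round(eps_overall, 3))
```

Because the bound depends only on the fractions observed, it can be computed before any outcome data are examined.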

Thus the investigator need only specify ψ_{max}. One might argue that if _{max }would be close to 1. However because, "eligible subjects had no history of colorectal cancer, surgical resection of adenomas, bowel resection, the polyposis syndrome, or inflammatory bowel disease" _{max}, we suggest estimating ψ_{s}, as defined in (19), based on observed covariates. (See the Results section.) Of course the relationship between observed covariates and missingness could differ substantially from the relationship between an unobserved covariate and missingness. Nevertheless, we believe that estimates of ψ_{s }from observed covariates are helpful for specifying a realistic value for ψ_{max}.


Results

We applied our approach to data from the PPT stratified by age and sex (see the table of results of the Polyp Prevention Trial).

Results of Polyp Prevention Trial

| sex | age | group | no recurrence | recurrence | missing | difference in observed rates of recurrence d_{s} | weight w_{s} | bias factor ε_{(max)s} |
|---|---|---|---|---|---|---|---|---|
| all | all | control | 573 | 374 | 94 (9%) | | | |
| | | study | 578 | 380 | 76 (7%) | | | |
| men | 30–49 | control | 33 | 22 | 5 (8%) | -.23 | .07 | .09 |
| | | study | 58 | 12 | 3 (4%) | | | |
| men | 40–59 | control | 99 | 76 | 7 (4%) | .01 | .17 | .05 |
| | | study | 94 | 76 | 9 (5%) | | | |
| men | 60–69 | control | 122 | 105 | 25 (10%) | -.04 | .23 | .11 |
| | | study | 144 | 105 | 18 (7%) | | | |
| men | 70–79 | control | 65 | 76 | 26 (16%) | -.04 | .13 | .20 |
| | | study | 70 | 71 | 29 (17%) | | | |
| women | 30–49 | control | 54 | 11 | 3 (4%) | .03 | .10 | .07 |
| | | study | 47 | 12 | 4 (6%) | | | |
| women | 40–59 | control | 69 | 24 | 4 (4%) | .02 | .11 | .04 |
| | | study | 69 | 27 | 4 (4%) | | | |
| women | 60–69 | control | 77 | 31 | 13 (11%) | .08 | .12 | .11 |
| | | study | 68 | 40 | 5 (4%) | | | |
| women | 70–79 | control | 54 | 29 | 11 (12%) | .22 | .07 | .12 |
| | | study | 28 | 37 | 4 (6%) | | | |

The overall estimate of the difference in probabilities of recurrence between study and control groups is Σ_{s} d_{s} w_{s} = -.003 with a standard error of .022. We define ε_{(max)s} = max((1 - π_{0s})/π_{1s}, (1 - π_{1s})/π_{0s}), where π_{zs} equals one minus the fraction missing in group z and stratum s. The anticipated maximum bias is ± ψ_{max} Σ_{s} ε_{(max)s} w_{s} = ± .10 ψ_{max}, where ψ_{max} is the anticipated bias if there were complete confounding of the unobserved covariate and treatment.

To compute the anticipated maximum bias (25) we first computed ε_{(max)s} using (24) and estimated w_{s} from the observed fractions (see the table of results), which gave an upper bound factor of Σ_{s} ε_{(max)s} w_{s} = .10. We then specified ψ_{max}, the anticipated maximum bias under complete confounding. To obtain a plausible value for ψ_{max}, we estimated ψ_{s} in (19) pretending either sex or age was the unobserved covariate. To be conservative in specifying ψ_{max}, we specified a value slightly larger than these estimates, ψ_{max} = .25, so that the anticipated maximum bias is bias_{max} = ± .25 × .10 = ± .025. The MAR confidence interval is shifted to the right or left by the anticipated maximum bias (see the figure comparing missing data adjustments).
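The calculation can be retraced from the counts in the results table. The sketch below is ours rather than the authors' code: it recomputes the stratum-specific differences and bias factors, which agree with the tabulated values to two decimals, and a weighted upper bound factor of about .10. As an assumption, the weights are taken as each stratum's share of the tabulated counts, so they need not reproduce the printed weights exactly.

```python
# Recomputing the per-stratum quantities from the results table.
# Each tuple is (no recurrence, recurrence, missing); labels follow the table.
ppt = [
    ("men", "30-49", (33, 22, 5), (58, 12, 3)),
    ("men", "40-59", (99, 76, 7), (94, 76, 9)),
    ("men", "60-69", (122, 105, 25), (144, 105, 18)),
    ("men", "70-79", (65, 76, 26), (70, 71, 29)),
    ("women", "30-49", (54, 11, 3), (47, 12, 4)),
    ("women", "40-59", (69, 24, 4), (69, 27, 4)),
    ("women", "60-69", (77, 31, 13), (68, 40, 5)),
    ("women", "70-79", (54, 29, 11), (28, 37, 4)),
]

def upper_bound_factor(pi0, pi1):
    """Bound on confounding from the fractions observed in groups 0 and 1."""
    return max((1 - pi0) / pi1, (1 - pi1) / pi0)

total = sum(sum(control) + sum(study) for _, _, control, study in ppt)

eps_max, diffs, weights = [], [], []
for sex, age, control, study in ppt:
    no0, yes0, miss0 = control
    no1, yes1, miss1 = study
    pi0 = (no0 + yes0) / sum(control)   # fraction observed, control
    pi1 = (no1 + yes1) / sum(study)     # fraction observed, study
    eps_max.append(upper_bound_factor(pi0, pi1))
    diffs.append(yes1 / (no1 + yes1) - yes0 / (no0 + yes0))
    weights.append((sum(control) + sum(study)) / total)  # assumed weighting

factor = sum(e * w for e, w in zip(eps_max, weights))
psi_max = 0.25                              # value specified in the text
bias_max = psi_max * round(factor, 2)       # factor rounded as in the text

print([round(e, 2) for e in eps_max])
print(round(factor, 2), round(bias_max, 3))
```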

Comparison of missing data adjustments for Polyp Prevention Trial. The graph plots the estimated differences in the probability of adenoma recurrence between the intervention and control groups and the 95% confidence intervals. MAR is missing at random within strata. MAR ± bias shifts the MAR confidence interval based on the anticipated maximum bias. Worst and best case imputes missing data to the randomization group that would give the largest positive and negative effect, respectively.

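For purposes of comparison, we also computed estimates and confidence intervals under a worst (best) case imputation.

Our sensitivity analysis showed that the worst and best case imputations were too extreme. Because the absolute value of the anticipated maximum bias, .025, is smaller than 1.96 × the standard error of the MAR estimate (1.96 × .022 = .043), the anticipated maximum bias would not have affected the conclusion of no treatment effect.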
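The shifted confidence intervals can be reproduced from the three reported numbers: the MAR estimate (-.003), its standard error (.022), and the anticipated maximum bias (.025). The following sketch, with variable names of our choosing, shifts the MAR interval left and right by the bias and checks whether each interval still covers zero.

```python
# Shifting the MAR 95% confidence interval by the anticipated maximum bias.
estimate = -0.003    # overall MAR estimate reported with the results table
se = 0.022           # its standard error
bias_max = 0.025     # anticipated maximum bias

half_width = 1.96 * se

def interval(center):
    """Normal-approximation 95% interval around a given center."""
    return (center - half_width, center + half_width)

mar_ci = interval(estimate)
shifted_left_ci = interval(estimate - bias_max)
shifted_right_ci = interval(estimate + bias_max)

for name, (lo, hi) in [("MAR", mar_ci),
                       ("MAR - bias", shifted_left_ci),
                       ("MAR + bias", shifted_right_ci)]:
    print(name, round(lo, 3), round(hi, 3), "covers 0:", lo < 0 < hi)
```

With these inputs all three intervals cover zero, in line with the conclusion of no treatment effect.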

Discussion

The key idea of our method is to incorporate non-MAR missingness by postulating an unobserved binary covariate. Although similar in spirit to using an unobserved binary covariate with observational data

The proposed method hinges on first selecting the appropriate baseline covariates. We agree with Myers

We also agree with Shih _{zs }of subjects. Then it is more informative to write _{zs }as _{zs}. Because _{zs }contains no information about the effect of _{zs }by π_{zs }- _{zs}, which reduces ε_{(max)s }and hence reduces the anticipated maximum bias.

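Although we applied our methodology to a cross-classification of categorical covariates, it could also be applied to continuous covariates or a univariate combination of covariates in a manner analogous to a propensity score. Let v denote the vector of baseline covariates and, for randomization group z, let e_{z} = pr(R = 1 | v, z) denote the probability that the outcome is observed (R = 1). Then pr(R = 1 | e_{z}) = E(R | e_{z}) = E{E(R | v, z) | e_{z}} = E(e_{z} | e_{z}) = e_{z}. Therefore pr(R = 1 | v, z) = pr(R = 1 | e_{z}), and thus e_{z} contains the same information for the probability of being observed as v. One can therefore use e_{z} to summarize the covariates predicting missingness. To form five strata for randomized group z, one would estimate e_{z} for each subject in group z and divide the estimates of e_{z} into quintiles.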
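The stratification step described above can be sketched as follows. The estimated probabilities of being observed are hypothetical, and how they are estimated (for example, by a regression of the observed-outcome indicator on baseline covariates within each group) is left open here. The sketch uses only the Python standard library.

```python
import bisect
import statistics

def quintile_strata(e_values):
    """Assign each estimated probability to one of five strata (0 to 4)."""
    # statistics.quantiles with n=5 returns the four cut points between quintiles.
    cuts = statistics.quantiles(e_values, n=5)
    return [bisect.bisect_right(cuts, e) for e in e_values]

# Hypothetical estimated probabilities of an observed outcome in one group.
e_hat = [0.70, 0.75, 0.80, 0.84, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98]

labels = quintile_strata(e_hat)
print(labels)
```

The strata would be formed separately within each randomization group, and the resulting stratum labels then play the role of the baseline strata in the bias calculation.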

Conclusion

The bias due to an unobserved binary covariate could arise when the probability of missingness depends on both treatment and outcome. Computation of the anticipated maximum bias is easy because it equals the anticipated maximum bias under complete confounding multiplied by an upper bound factor. The anticipated maximum bias under complete confounding might require some expert input, but some lower bound values can be obtained using observed baseline covariates. The upper bound factor is easily computed from the fraction missing in each group. The methodology is particularly useful in the common situation when no more than 15% of the subjects (in excess of those definitely MAR) have missing outcomes, so that the upper bound factor in the bias is less than .18.

Contributions

SGB devised the basic model with the unobserved covariate, worked out the unconstrained maximization, and wrote the initial draft of the manuscript. LSF worked out the constrained maximization and provided substantive improvements to the manuscript.
