Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Germany

German Cochrane Centre, University Medical Center Freiburg, Germany

Medical Statistics Unit, London School of Hygiene and Tropical Medicine, London, UK

Abstract

Background

The heterogeneity statistic I², interpreted as the percentage of variability due to heterogeneity between studies rather than to sampling error, depends on precision, that is, the size of the studies included.

Methods

Based on a real meta-analysis, we simulate artificially 'inflating' the sample size under the random effects model. For a given inflation factor, each study's sample size is multiplied by that factor and new effect estimates are drawn from the model.

Results

As precision increases, while estimates of the heterogeneity variance τ² remain unchanged on average, estimates of I² increase rapidly to nearly 100%. A similar phenomenon is apparent in a sample of 157 meta-analyses.

Conclusion

When deciding whether or not to pool treatment estimates in a meta-analysis, the yard-stick should be the clinical relevance of any heterogeneity present. τ², rather than I², is the appropriate measure for this purpose.

Background

In meta-analysis, three principal sources of heterogeneity can be distinguished. These are (i) clinical heterogeneity between the patients in the different studies, (ii) statistical heterogeneity between the treatment effect estimates, and (iii) heterogeneity from other sources, such as differences in design.

In this paper, we show that I² increases with the number of patients included in the studies in a meta-analysis. In the light of this, we argue that I² is in general of limited use in assessing clinically relevant heterogeneity.

The article is structured as follows. After introducing existing measures of heterogeneity in meta-analysis and discussing their properties, we illustrate the problem of interpreting the measure I² using an example from the literature. We then present a simulation study which explores the effect of sample size inflation on I², and finally conclude with a discussion.

Methods

Let θ̂ᵢ be the within-study treatment effect estimate of study i (e.g., a log odds ratio), σᵢ² its sampling variance, and wᵢ the weight of study i, where wᵢ = 1/σᵢ² under the fixed effect model and wᵢ = 1/(σᵢ² + τ²) if the random effects model is used (see below for definition and estimation of the heterogeneity variance τ²). Several measures of statistical heterogeneity are widely used:

1. Cochran's Q, which under the hypothesis of homogeneity approximately follows a χ² distribution with k - 1 degrees of freedom, where k is the number of studies;

2. Higgins' and Thompson's I², derived from Cochran's Q as I² = (Q - (k - 1))/Q, truncated at 0 if Q < k - 1;

3. the between-study variance, τ², as estimated in a random effects meta-analysis. There are several proposals for estimating τ² in a meta-analysis, such as the REML estimator or the Hedges-Olkin estimator;

4. H², derived from Cochran's Q as H² = Q/(k - 1);

and

5. R², similar to H² and calculated from τ² and a so-called 'typical' within-study variance σ² (which must be estimated), defined as R² = (σ² + τ²)/σ².

As seen here, and described elsewhere, these measures differ with respect to scale, range, and whether they increase with the number of studies or with the precision of the studies; their properties are summarized in the table below.
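These quantities can all be computed directly from the study estimates and their variances. The sketch below does this, assuming the Hedges-Olkin-type (DerSimonian-Laird) moment estimator for τ²; the function name and the toy data are invented for illustration and are not from the original analysis:

```python
import numpy as np

def heterogeneity_measures(theta, var):
    """Compute Cochran's Q, H^2, I^2 and the moment estimator of tau^2
    from study effect estimates and their within-study variances."""
    w = 1.0 / var                                 # fixed effect weights
    theta_fixed = np.sum(w * theta) / np.sum(w)   # fixed effect pooled estimate
    k = len(theta)
    Q = np.sum(w * (theta - theta_fixed) ** 2)    # weighted sum of squares
    H2 = Q / (k - 1)
    I2 = max(0.0, (Q - (k - 1)) / Q)              # truncated at 0
    # DerSimonian-Laird moment estimator of the between-study variance
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    return Q, H2, I2, tau2

# Hypothetical log odds ratios and sampling variances for five studies
theta = np.array([-0.6, 0.3, -0.4, 0.5, -0.1])
var = np.array([0.04, 0.09, 0.05, 0.12, 0.06])
Q, H2, I2, tau2 = heterogeneity_measures(theta, var)
```

Note that I² and H² are pure functions of Q and k, whereas τ² additionally involves the weights; this is the root of their different behaviour discussed below.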

Properties of measures of heterogeneity.

| Measure | Scale | Range | Increasing with number of studies in meta-analysis | Increasing with precision (size of studies) |
|---|---|---|---|---|
| Q | absolute | [0, ∞) | yes | yes |
| I² | percent | [0, 100%] | no | yes |
| τ² | outcome | [0, ∞) | no | no |
| H² | absolute | [1, ∞) | no | yes |
| R² | absolute | [1, ∞) | no | yes |

1. Q, which follows a χ² distribution with k - 1 degrees of freedom under the null hypothesis of homogeneity H₀, is the weighted sum of squared differences between the study estimates and the fixed effect estimate, Q = Σᵢ wᵢ(θ̂ᵢ - θ̂)². It always increases with the number of studies, k.

2. In contrast to Q, I² was introduced by Higgins and Thompson as a measure that is independent of k, the number of studies. I² is interpreted as the percentage of variability in the treatment estimates which is attributable to heterogeneity between studies rather than to sampling error.

3. τ² describes the underlying between-study variability. Its square root, τ, is measured on the same scale as the treatment effect.

4. H² is a test statistic. It describes the relative difference between the observed Q and its expected value, k - 1, in the absence of heterogeneity.

5. R² is the square of a statistic R that describes the inflation of the confidence interval for the pooled effect under the random effects model compared to the fixed effect model; R² = 1 indicates perfect homogeneity.

Notice that, in contrast to τ², the measures I², H² and R² all increase as the precision, that is, the size of the studies, increases. This is reflected in the interpretation: as I² is the percentage of variability that is due to between-study heterogeneity, 1 - I² is the percentage of variability that is due to sampling error. When the studies become very large, the sampling error tends to 0 and I² tends to 100%, whatever the size of τ². Such heterogeneity may not be clinically relevant.
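This limiting behaviour is made explicit by the identity I² = τ²/(τ² + σ²), where σ² is a 'typical' within-study variance. A small sketch with illustrative numbers (the value of σ² is invented; τ² is taken from the example below):

```python
tau2 = 0.018    # between-study variance, held fixed
sigma2 = 0.09   # illustrative 'typical' within-study variance at original size

# Inflating every sample size by a factor M divides the within-study
# variance by M, so I^2 = tau^2 / (tau^2 + sigma^2 / M) -> 100% as M grows,
# even though tau^2 never changes.
for M in [1, 4, 16, 64, 256]:
    i2 = tau2 / (tau2 + sigma2 / M)
    print(f"M = {M:4d}: I^2 = {100 * i2:5.1f}%")
```

The same τ² thus yields an I² anywhere between a few percent and nearly 100%, depending only on how large the studies are.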

We now explore this further using simulation. Note first that simply looking at the effect of scaling up all sample sizes by a common factor (leaving the treatment effect estimates unchanged) is not appropriate, because if study sizes were truly to increase, the estimates would approach the true value for each study rather than stay fixed at the originally observed values. Instead, we simulate under the random effects model. Under this model, the true study effects and the heterogeneity variance τ² are assumed constant, and the total variance of the estimate from study i is σᵢ² + τ², which decreases with increasing study sample size, eventually tending to τ².

Study size inflation based on the random effects model

Suppose in a meta-analysis trial i yields the treatment effect estimate θ̂ᵢ (e.g., on the log odds scale) with observed sampling variance σᵢ², and let τ² denote the heterogeneity variance. The model is

θ̂ᵢ = μ + uᵢ + εᵢ,    uᵢ ~ N(0, τ²),    εᵢ ~ N(0, σᵢ²),

where μ is the overall mean treatment effect, uᵢ the random study effect and εᵢ the sampling error of study i. Inflating the sample size of each study by a factor M replaces σᵢ² by σᵢ²/M while leaving τ² unchanged.

We generate an illustrative meta-analysis for each inflation factor M. For each trial i in each meta-analysis, we generate a random effect estimate θ̂_{M,i} from this model, drawing it from N(μ̂, σᵢ²/M + τ̂²), where μ̂ and τ̂² are the estimates from the original meta-analysis.
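This generation step can be sketched as follows; the function name and the numerical values standing in for the original estimates are illustrative only, not those of the article:

```python
import numpy as np

rng = np.random.default_rng(1)

def inflate_meta_analysis(mu_hat, tau2_hat, var, M):
    """Simulate one meta-analysis in which every trial's sample size is
    inflated by the factor M under the random effects model: each new
    estimate is drawn from N(mu, sigma_i^2 / M + tau^2)."""
    new_var = var / M                                     # inflated precision
    theta_new = rng.normal(mu_hat, np.sqrt(new_var + tau2_hat))
    return theta_new, new_var

# Illustrative stand-ins for the estimates from a real 70-trial meta-analysis
mu_hat, tau2_hat = -0.1, 0.018
var = rng.uniform(0.02, 0.3, size=70)   # 70 hypothetical sampling variances
theta_16, var_16 = inflate_meta_analysis(mu_hat, tau2_hat, var, M=16)
```

Applying the heterogeneity measures to each simulated meta-analysis then shows how they respond to M while the underlying τ² is held fixed.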

Results

We use data from a large meta-analysis (of 70 trials) to estimate the effect of thrombolytic therapy in acute myocardial infarction. The estimated heterogeneity variance is τ̂² = 0.018 (Q = 85 on 69 degrees of freedom, p = 0.0953; I² = 18.6%, 95% CI [0%; 40.1%]). As I² is small, the heterogeneity between these trials would commonly be judged unimportant.

We now explore the effect of increasing the precision of these trials. As shown in the table below, while the variation in the estimate of τ² is essentially random, the values of Q, I² and H increase rapidly with increasing sample size.

Effect of increasing within-trial precision (inflation factor M).

| Factor M | τ̂² | Q (p-value) | I² [95% CI] | H [95% CI] |
|---|---|---|---|---|
| 1 | 0.018 | 85 (0.0953) | 18.6% [0%; 40.1%] | 1.11 [1; 1.29] |
| 4 | 0.008 | 98 (0.0135) | 29.2% [4.5%; 47.6%] | 1.19 [1.02; 1.38] |
| 16 | 0.027 | 454 (<0.0001) | 84.8% [81.4%; 87.5%] | 2.56 [2.32; 2.83] |
| 64 | 0.028 | 1708 (<0.0001) | 96.0% [95.4%; 96.5%] | 4.98 [4.65; 5.32] |

**Top left panel: Meta-analysis of thrombolytic therapy in acute myocardial infarction**

Figures 2 and 3 show the simulation in more detail. Figure 2 shows how the estimate of τ² varies randomly, while (i) the average of the within-study variances, (ii) the estimated total variance (under the model), and (iii) the observed total variance all decrease rapidly with increasing sample size. Figure 3 shows how I² behaves. Note how rapidly it approaches 100%.

**Within-study variation, decreasing with increasing sample size while heterogeneity remains constant**. Details in text.

**Percentage I² of variation due to heterogeneity rather than to sampling error against sample size (same simulation data as in Figure 2)**.

Empirical evaluation: a sample of meta-analyses

In order to examine the behaviour and the order of magnitude of I² empirically, we further looked at a sample of 157 meta-analyses with binary endpoints. This data set was kindly provided by Peter Jüni. We computed τ̂² and I² for each meta-analysis and, in addition, the median sample size n of the contributing studies. Excluding meta-analyses with τ̂² = 0 (for which I² = 0), we fitted a linear regression with I² as outcome and τ̂ and log n as covariates (thus implicitly assuming a log-normal distribution for study size).

As expected, I² increases with both heterogeneity (β̂_τ = 65.873, SE = 4.788) and study size (β̂_log n = 8.503, SE = 1.460). Thus, in contrast to τ², I² depends strongly on study size. The following figure illustrates this relationship.
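The shape of such a regression can be sketched on simulated summaries; the data below are generated from the theoretical relation between I², τ and study size, not from the Jüni data set, so the coefficients will differ from those reported above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate summaries for 157 hypothetical meta-analyses: I^2 follows from
# tau and the median study size n via I^2 = tau^2 / (tau^2 + sigma^2).
n_meta = 157
tau = rng.uniform(0.0, 0.5, n_meta)                  # sqrt of between-study variance
n = rng.lognormal(mean=4.5, sigma=1.0, size=n_meta)  # median study sizes
sigma2 = 4.0 / n                                     # rough typical within-study variance
i2 = 100 * tau**2 / (tau**2 + sigma2)                # I^2 in percent

# Linear regression of I^2 on tau and log(n), as in the empirical analysis
X = np.column_stack([np.ones(n_meta), tau, np.log(n)])
beta, *_ = np.linalg.lstsq(X, i2, rcond=None)
# Both slopes come out positive: I^2 grows with heterogeneity AND study size
```

Even in this idealized setting, study size enters the fitted model with a clearly positive coefficient, mirroring the empirical finding.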

**I² against median study size in a sample of 157 meta-analyses**. Light, grey and black dots and regression lines correspond to the first, second and third tercile of the distribution of τ̂². Within each class of meta-analyses, I² increases with median study size.

Discussion

The main advantage of the statistic I² is that it does not depend on the number of studies in a meta-analysis; thus, unlike Q, it can be compared across meta-analyses. Moreover, I² is easily interpreted by clinicians as the percentage of variability in the treatment estimates which is attributable to heterogeneity between studies rather than to sampling error.

However, an immediate (but often overlooked) consequence of this interpretation is that I² increases with the number of patients included in the studies of a meta-analysis. In a recent simulation study using continuous outcomes, others found empirically that I² increased with increasing numbers of patients per trial even though τ² was kept fixed. Nevertheless, many authors use I² to decide whether to pool studies in a meta-analysis. Some authors also seem reluctant to call I² a statistic, using instead words such as 'metric'. It has even been proposed to successively exclude studies from a meta-analysis until I² falls below a prespecified level; this proposal has rightly been criticized.

Our simulation highlights the problem of interpreting heterogeneity measured by I² as clinical heterogeneity. This is analogous to interpreting statistically significant effects (small p-values) as clinically significant: a large value of I² does not necessarily correspond to a clinically relevant amount of heterogeneity. Instead, studies with relatively large I² may usefully be pooled when the clinically relevant heterogeneity (in efficacy and covariates) is acceptably small.

Further, the estimate of τ (the square root of τ²) can be judged against a minimal clinically relevant effect: for instance, on the log odds ratio scale an odds ratio of 0.8 corresponds to δ₀ = -log 0.8 = log 1.25 ≈ 0.22, a natural yardstick for assessing the clinical relevance of τ̂.

While Higgins and Thompson avoided giving explicit thresholds in their papers, earlier versions of the Cochrane Handbook stated that a value of I² greater than 50% may be considered substantial heterogeneity. The recent Version 5.0.1, while admitting that 'thresholds for the interpretation of I² can be misleading, since the importance of inconsistency depends on several factors', nevertheless lists overlapping ranges of I² which provide 'a rough guide to interpretation' (see the table below). In practice, many reviewers treat I² > 50% as a trigger for action, for example switching to a random effects model whenever I² > 50%.

Ranges for interpretation of I² following the Cochrane Handbook for Systematic Reviews of Interventions (Version 5.0.1).

| I² | Interpretation |
|---|---|
| 0% to 40% | might not be important |
| 30% to 60% | may represent moderate heterogeneity |
| 50% to 90% | may represent substantial heterogeneity |
| 75% to 100% | considerable heterogeneity |

We believe these interpretation issues stem from the concept of I² as 'the proportion of variance (un)explained', which Higgins and Thompson describe as 'widely familiar' to clinicians from regression, by analogy with R². However, there is an important difference: I² tends to 100% as the number of patients increases. One may argue that the 'unit' corresponding to the 'observation' in a regression is the study, not the patient, but this analogy is only strictly valid if the sample sizes of new studies are distributed similarly to those of the existing studies. This is not universally true: often small trials are followed by larger ones. Thus I² will tend to increase artificially as evidence accumulates.

To address this, more weight should be given to an often overlooked comment by Higgins and Thompson: meta-analyses with identical between-study variance τ², but with different degrees of sampling error σ², 'will produce different measures.... Describing the underlying between-study variability ... can best be achieved simply by estimating the between-study variance, τ².'

Conclusion

When deciding whether or not to pool treatment estimates in a meta-analysis, the yard-stick should be the clinical relevance of any heterogeneity present. τ², rather than I², is the appropriate measure for this purpose.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

GR proposed the model for sample size inflation, did all calculations and wrote the first draft of the manuscript. GS, JC and MS contributed to the writing and approved the final version.

Acknowledgements

GR and JC are funded by Deutsche Forschungsgemeinschaft (FOR 534 Schw 821/2-2). The authors wish to thank Peter Jüni for providing data and all reviewers and Douglas G Altman for helpful discussion.

Pre-publication history

The pre-publication history for this paper can be accessed here: