Abstract
Background
Meta-analysis can be used to pool rate measures across studies, but challenges arise when follow-up duration varies. Our objective was to compare different statistical approaches for pooling count data with varying follow-up times in terms of estimates of effect, precision, and clinical interpretability.
Methods
We examined data from a published Cochrane Review of asthma self-management education in children. We selected the two rate measures with the largest number of contributing studies: school absences and emergency room (ER) visits. We estimated fixed- and random-effects standardized weighted mean differences (SMD), stratified incidence rate differences (IRD), and stratified incidence rate ratios (IRR). We also fit Poisson regression models, which allowed for further adjustment for clustering by study.
Results
For both outcomes, all methods gave qualitatively similar estimates of effect in favor of the intervention. For school absences, the SMD showed modest results in favor of the intervention (SMD -0.14, 95% CI -0.23 to -0.04). The IRD implied that the intervention reduced school absences by 1.8 days per year (IRD -0.15 days/child-month, 95% CI -0.19 to -0.11), while the IRR suggested a 14% reduction in absences (IRR 0.86, 95% CI 0.83 to 0.90). For ER visits, the SMD showed a modest benefit in favor of the intervention (SMD -0.27, 95% CI -0.45 to -0.09). The IRD implied that the intervention reduced ER visits by 1 visit every 2 years (IRD -0.04 visits/child-month, 95% CI -0.05 to -0.03), while the IRR suggested a 34% reduction in ER visits (IRR 0.66, 95% CI 0.59 to 0.74). In Poisson models, adjustment for clustering lowered the precision of the estimates relative to the stratified IRR results. For ER visits, but not school absences, failure to incorporate study indicators resulted in a different estimate of effect (unadjusted IRR 0.77, 95% CI 0.59 to 0.99).
Conclusions
Choice of method among the ones presented had little effect on inference but affected the clinical interpretability of the findings. Incidence rate methods gave more clinically interpretable results than SMD. Poisson regression allowed for further adjustment for heterogeneity across studies. These data suggest that analysts who want to improve the clinical interpretability of their findings should consider incidence rate methods.
Background
Meta-analysis has become recognized as an objective means of summarizing evidence from disparate clinical trials [1]. It is particularly useful when the trials are small and the data are conflicting. Meta-analysis incorporates statistical approaches to pool aggregate data from clinical trials into a summary effect measure [2]. This measure then reflects the effect of an intervention on average across all studies. However, meta-analysis is limited by the inclusion of poor-quality trials that are prone to report biased findings and the exclusion of unpublished trials that do not report findings. Methods for assessing the effect of these limitations on summary measures have been developed and are available [3-5].
At times, data from clinical trials may conform to continuous rate measures (events per person-time), in which the numerator represents a count of total events "x" and the denominator represents a given time duration multiplied by the number of subjects, e.g. health care visits per person-year. Data such as these are being reported more frequently in clinical trials, as evidenced by the inclusion of rate measures in recent Cochrane systematic reviews [6-9]. If the reported length of follow-up is the same across studies, e.g. 12 months, then meta-analysis might involve pooling the weighted within-study differences in the mean number of events per person between intervention and control groups, a method we will call the weighted mean difference (WMD) [10]. The interpretation is straightforward and reflects the change in "x" per unit time. However, if the reported length of follow-up varies across studies, e.g. 6 months versus 1 year, then meta-analysis could involve the conversion of the study differences into a common metric prior to pooling. This is often accomplished by dividing the per-study differences between groups by the pooled standard deviation, a procedure known as the standardized weighted mean difference (SMD) [10]. This method accommodates varying follow-up times. However, the interpretation is more difficult, since it reflects the difference between intervention and control groups measured in standard deviation units rather than natural time units.
In this paper, we examined data from a recently published Cochrane systematic review that included continuous rate measures as outcomes. We compared different statistical approaches to pooling continuous rate measures when they were reported with varying follow-up times. Specifically, we compared the SMD, considered the standard approach, with two alternative methods, incidence rate differences and incidence rate ratios. We examined the results from the different approaches in terms of the point estimates of treatment effect, their precision, and clinical interpretability. We are unaware of previously published studies that have attempted to address this problem.
Methods
Data were taken from a recently published Cochrane systematic review on the effects of asthma self-management education in children [11]. We selected the two outcomes involving continuous rate measures with the greatest number of contributing studies: days of school absence and emergency room (ER) visits. Our goal was to compare the standardized weighted mean difference with two alternative statistical approaches to pooling rate data, incidence rate differences and incidence rate ratios.
The standardized weighted mean difference (SMD) represents a weighted average of the per-study difference in mean events per person between treatment and control groups. We first calculated standardized effect sizes for each study by subtracting the reported mean number of events in the control group from the reported mean number of events in the treatment group and dividing by the pooled standard deviation [10]. The per-study standardized effect sizes were then combined using both fixed- and random-effects models [12,13]. The fixed-effects model is essentially a weighted average of the study-specific results in which the weight for each study is proportional to the inverse of the variance of the study-specific SMD. The random-effects model allows for variability among studies in the SMD by incorporating a term for the among-study variability into the weights. Fixed- and random-effects models will generally agree when there is little heterogeneity among studies.
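As a sketch of the fixed-effects pooling just described, the following Python fragment computes study-specific standardized effect sizes and their inverse-variance weighted average. The per-study summaries are invented for illustration and are not data from the review.

```python
import math

# Illustrative (made-up) per-study summaries:
# (mean_t, sd_t, n_t, mean_c, sd_c, n_c)
studies = [
    (2.0, 3.0, 50, 2.5, 3.2, 48),
    (1.1, 1.8, 120, 1.3, 1.9, 118),
    (4.0, 5.0, 30, 5.1, 5.5, 32),
]

def smd_and_var(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized effect size (Cohen's d) and its approximate variance."""
    sp = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                   / (n_t + n_c - 2))                      # pooled SD
    d = (mean_t - mean_c) / sp
    var = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return d, var

# Fixed-effects pooling: inverse-variance weighted average.
ds, vs = zip(*(smd_and_var(*s) for s in studies))
weights = [1.0 / v for v in vs]
pooled = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))
print(f"pooled SMD = {pooled:.3f} "
      f"(95% CI {pooled - 1.96 * se:.3f} to {pooled + 1.96 * se:.3f})")
```

A random-effects version would add an among-study variance component to each study's variance before weighting, as in DerSimonian and Laird [12].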
To estimate stratified incidence rate differences (IRD) and stratified incidence rate ratios (IRR), we calculated incidence rates taking time explicitly into account. For each study, we knew the mean number of events (days absent or emergency room visits) and the number of months of observation according to the reported study design. We multiplied the mean by the sample size for each treatment arm to get the total number of events observed in each arm, e.g. the total number of days absent for all participants in the control group. We rounded this to the nearest whole number of events. To obtain the total person-time of follow-up, we assumed that there was no loss to follow-up during the study, i.e. all participants were observed for the entire length of the study. We multiplied the number of months of follow-up by the sample size for each arm to obtain the total number of person-months of follow-up. The study-specific rate of events per person-month for each arm was then the total number of events (days absent or emergency room visits) divided by the total number of person-months of observation for each arm.
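The rate construction above can be sketched in a few lines; the numbers here are made up for a single hypothetical trial arm, not taken from the review.

```python
# One hypothetical trial arm: 40 children followed for 6 months,
# a reported mean of 2.4 ER visits per child.
n, months, mean_events = 40, 6, 2.4

total_events = round(mean_events * n)   # 96 events, rounded to a whole count
person_months = n * months              # 240, assuming complete follow-up
rate = total_events / person_months     # 0.4 visits per person-month
print(total_events, person_months, rate)
```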
The analysis of the rates used stratified IRD and IRR methods estimated in STATA (version 7). To obtain a summary stratified IRD, we used a program co-written by one of us (JAB) to implement a fixed-effects Mantel-Haenszel (MH) procedure in STATA. Specifically, the program produced the estimates of the IRD and its variance described in Rothman and Greenland's textbook [14]. We also used an inverse-variance weighted average approach to estimate a random-effects model by first using STATA's "ird" command, saving those study-specific results, and then using the STATA command "meta" to compute the weighted-average IRDs [13].
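The fixed-effects MH summary IRD described in Rothman and Greenland [14] can be sketched as follows. The per-stratum counts and person-time are illustrative, not the review's data.

```python
# Mantel-Haenszel summary incidence rate difference.
# Each stratum (study): treated events a, treated person-time t1,
# control events b, control person-time t0.
strata = [
    (12, 240.0, 20, 240.0),
    (30, 600.0, 45, 580.0),
]

num = sum((a * t0 - b * t1) / (t1 + t0) for a, t1, b, t0 in strata)
den = sum(t1 * t0 / (t1 + t0) for _, t1, _, t0 in strata)
ird_mh = num / den   # treated minus control, events per unit person-time
print(round(ird_mh, 4))
```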
To obtain a summary stratified IRR, we used a fixed-effects MH-type procedure as implemented in the "ir" command in STATA, which should give results similar to fitting a Poisson regression model with indicator variables for "study." This MH approach produces a summary estimate stratified on study. To take study-to-study variability into account, we also fit Poisson regression models allowing for clustering of the data by study, both with and without study indicator variables. The inclusion of indicator variables forces the comparison between treatments to be made within study, thereby mimicking the stratified analysis. In STATA, we also fit Poisson regression models using the "cluster" option, which uses a robust (Huber-White "sandwich") estimator of the variance [15]. The intent of fitting these models that allowed for clustering was to inflate the variance estimates to allow for among-study variability; as will be demonstrated, this does not affect the point estimates of treatment effect.
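For comparison with the IRD sketch above, the MH summary rate ratio (the estimand of STATA's "ir" command) has the following form, again with illustrative counts and person-time rather than the review's data:

```python
# Mantel-Haenszel summary incidence rate ratio.
# Strata: (treated events a, treated person-time t1,
#          control events b, control person-time t0).
strata = [
    (12, 240.0, 20, 240.0),
    (30, 600.0, 45, 580.0),
]

num = sum(a * t0 / (t1 + t0) for a, t1, b, t0 in strata)
den = sum(b * t1 / (t1 + t0) for a, t1, b, t0 in strata)
irr_mh = num / den
print(round(irr_mh, 3))
```

A Poisson regression with study indicators would target the same within-study comparison; the cluster-robust variance option changes only the standard errors, not this point estimate.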
Our interest was in comparing the qualitative and, where possible, the quantitative results across the different methods. We were interested in differences in inference that could be made from the various models, which integrate information about the point estimates of treatment effects and the precision of their estimation but may vary in their assumptions. We also compared conclusions as to the heterogeneity of effects across studies. The methods based on weighted averages use a test of heterogeneity similar in principle to the Cochran Q-statistic. The test for heterogeneity in the Poisson regression models is based on the interactions between the treatment variable and the study indicator variables. Most importantly, we were concerned with the clinical interpretability of the results. All p-values reported are two-sided and all confidence intervals are calculated at the 95% level.
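The weighted-average heterogeneity test mentioned above can be sketched as a Cochran Q statistic computed from study-specific estimates and variances; the values below are invented for illustration.

```python
# Cochran Q test of heterogeneity for k study estimates with known
# variances.  Q is referred to a chi-square with k - 1 degrees of
# freedom; a large Q suggests the effects differ across studies.
estimates = [-0.16, -0.11, -0.21]
variances = [0.041, 0.017, 0.065]

weights = [1.0 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
df = len(estimates) - 1
print(round(q, 3), df)   # small Q here: little evidence of heterogeneity
```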
Results
We illustrate the use of SMD, IRD, and IRR methods for pooling continuous rate measures using data from a published Cochrane systematic review and meta-analysis that examined the effect of self-management education on morbidity and health services outcomes in children and adolescents with asthma [11]. The meta-analysis included 32 separate trials involving 3706 children and adolescents aged 2 to 18 years. The majority were small randomized controlled trials that enrolled children with severe asthma. We abstracted data on two outcomes, days of school absence and ER visits, from the original published study. For each outcome, we abstracted the reported mean number of events, standard deviation, sample size, and observation time in months for the treatment and control groups. We contacted study authors to obtain data missing from the published reports. If appropriate measures of variance were not reported or obtained by author contact, we imputed pooled standard deviations using a conservative approach based on the t-statistic, or the p-value if the t-statistic was not reported [16].
Table 1 lists the treatment and control group sizes, mean number of events, standard deviations, rates (events/person-month), duration of follow-up, and standardized effect size for each of the 16 trials contributing data on school absences. Sample sizes ranged from 19 to 404 participants, and the duration of observation varied widely from 1 to 12 months. Most of the trials favored the treatment arm, i.e. negative effect sizes implied a reduction in school absences. However, larger studies tended to have standardized effect size estimates closer to the null.
Table 1. Characteristics of Studies Reporting on School Absences.*
Similarly, Table 2 lists the treatment and control group sizes, mean number of events, standard deviations, rates (events/person-month), duration of follow-up, and standardized effect size for each of the 12 trials contributing data on ER visits. Again, sample sizes ranged from a low of 14 to a high of 232, but the duration of follow-up was more homogeneous, with most trials reporting 12 months of observation. Again, most trials favored the treatment arm. As with school absences, larger studies tended to have standardized effect size estimates closer to the null.
Table 2. Characteristics of Studies Reporting on Emergency Room Visits.*
Table 3 presents the summary outcome measures for school absences. Effect sizes from the three methods gave qualitatively similar conclusions and suggest that treatment reduces school absences. Fixed- and random-effects SMD gave identical estimates, since there was little to no statistical heterogeneity present (p = 0.61). IRD methods gave clinically interpretable results on the absolute scale. The fixed-effects results suggest that treatment results in an average reduction of 0.15 school absences per child per month (1.8 absences per year). Random-effects estimates were consistent with the fixed-effects results but had wider confidence intervals. IRR methods gave clinically interpretable results on the relative scale. These results suggest that treatment results in a 14% reduction in school absences. IRR estimates obtained using Poisson regression with Huber-White sandwich estimators were more conservative than IRR estimates obtained using MH procedures. The IRR estimate obtained without study indicators was similar to the IRR estimate with study indicators, suggesting that confounding by study was not present for this outcome (see the Appendix for further discussion of this point). Heterogeneity was statistically detected when data were pooled using IRD (p < 0.001) and IRR methods (p < 0.001) but not SMD, suggesting that treatment effects varied across studies when assessed in terms of rates, but not when assessed in terms of standard deviation units.
Table 3. Summary Outcome Measures for Days of School Absence.
Table 4 presents summary outcome measures for ER visits. Results were again qualitatively similar regardless of method and suggest that treatment reduces ER visits. Random-effects SMD gave a more conservative estimate with wider confidence intervals than the corresponding fixed-effects SMD, owing to heterogeneity in effects across the studies (p = 0.05). IRD methods gave clinically interpretable results on the absolute scale: treatment results in an average reduction of 0.04 ER visits per child per month (one ER visit every other year). The estimate obtained by the random-effects model was consistent with that from the fixed-effects model but had wider confidence intervals. IRR methods gave clinically interpretable results on the relative scale: treatment results in a 23 to 34% reduction in ER visits. IRR estimates obtained using Poisson regression with Huber-White sandwich estimators were more conservative than IRR estimates obtained using MH procedures. The IRR estimate obtained without study indicators was closer to the null than the IRR estimate with study indicators, suggesting that confounding by study was present for this outcome (see Appendix). Heterogeneity was statistically present for the IRD (p < 0.001) and IRR (p < 0.001) methods as well as for the SMD for this outcome, suggesting that treatment effects varied across studies.
Table 4. Summary Outcome Measures for Emergency Room Visits.
Discussion
This paper presented three statistical methods of pooling continuous rate measures in which the denominator reflects varying duration of observation. All methods were fairly easy to implement using standard statistical software. Results were statistically consistent regardless of the method employed and suggested a significant treatment effect on average. All methods allowed for explicit adjustment for individual studies. Failure to take stratification by study into account, as illustrated in the Poisson models without study indicators, resulted in a different estimate for one outcome, ER visits, but not the other, school absences.
IRD methods gave clinically interpretable results on an absolute scale. These results suggest that treatment results in an average reduction of 0.15 school absences per person-month, or roughly 2 days per person-year. They also suggest that treatment results in an average of 0.04 fewer ER visits per person-month, or roughly 1 fewer visit per person every 2 years. IRR methods gave clinically interpretable results on a relative scale. These results suggest that treatment results in a 14% reduction in school absences and a 34% reduction in ER visits.
The SMD results were not immediately clinically interpretable. On a standard deviation scale, these results suggest that treatment results in a modest reduction in school absences and ER visits. Conversion back to the original scale would allow for more clinically interpretable results but would require making an assumption about the size of the standard deviation and the event rate in the control group across studies. For standard deviations, it is not clear whether one should use a studyspecific estimate of the standard deviation or an estimate pooled across studies. Additionally, the data can be skewed, in which case mean events might not appropriately represent the central tendency of the data.
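The sensitivity of such a back-conversion to the assumed standard deviation can be illustrated directly. The SMD of -0.14 is the school-absence result reported above; the candidate standard deviations are hypothetical.

```python
# Back-converting a pooled SMD to natural units requires assuming a
# standard deviation; the implied effect scales linearly with it.
smd = -0.14
for assumed_sd in (2.0, 5.0, 10.0):     # hypothetical SDs of days absent
    implied = smd * assumed_sd          # implied reduction in days absent
    print(f"SD = {assumed_sd}: {implied:.2f} days")
```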
Heterogeneity was statistically present for both outcomes when incidence rate-based methods were used, suggesting variability in treatment effects across studies, and for ER visits but not school absences when SMD was used. It should be kept in mind that, although all of these analyses attempt to address the same underlying substantive question (i.e., whether asthma education "works"), the SMD analyses address this question on a fundamentally different scale by converting measurements into standard deviation units. This difference in scale could well account for the different results of the heterogeneity tests.
Another alternative, which we tried but abandoned because of its nonstandard nature, was simply to convert the time units from the various studies into a common scale and pool the data using WMD. We found (data not shown) slight but noticeable differences depending on whether we multiplied up for the shorter studies or down for the longer studies to achieve the common scale. For example, studies with 6-month follow-up and 12-month follow-up could be put on a common scale either by multiplying the 6-month study means and standard deviations by 2 or by dividing the 12-month study means and standard deviations by 2. These different approaches changed the per-study weights and produced slight differences in the summary measures. We believe that the fundamental problem with this approach is that it rests on the assumption that the event rates stay constant over the entire period of observation. This is also true for the rate models we did use, but unlike those models, multiplying up essentially imputes data beyond the actual period of observation. This has implications not only for the mean number of events, but possibly also for the variance estimates. For these reasons, we chose not to consider this approach any further.
There are limitations to these findings. First, we explored differences in the three approaches using data from only a single systematic review. However, the outcomes we chose had a sufficient number of contributing studies to assess for small differences among the approaches. Second, in calculating event rates for the incidence rate-based methods, we assumed complete follow-up of participants in each study. However, this method is robust to incomplete follow-up if the number of events and the amount of time contributed by each participant are known, or if it can be assumed that individuals lost to follow-up contribute no events or follow-up time and that loss to follow-up is not differential between the treatment groups.
Conclusions
In this study, we demonstrated that the choice among the methods presented here for continuous rate measures had little effect on inference. SMD, IRD, and IRR methods all gave qualitatively similar estimates of effect and suggest that the intervention was effective for both outcomes. However, the choice of method clearly affected clinical interpretability. The SMD, reportedly the standard method employed for the analysis of rate measures of varying time duration, was not immediately interpretable. The stratified IRD allowed for clinical interpretability on an absolute scale. The stratified IRR or Poisson models allowed for clinical interpretability on a relative scale. For further discussion of the merits of absolute versus relative effects, we recommend that the reader consult additional references [10]. In addition, as we have shown, failure to incorporate study indicators in the Poisson analysis may produce different (and inappropriate) estimates of treatment effect. (For an explanation of why we consider this an inappropriate approach, see the Appendix.) We recommend that statistical software packages used for meta-analysis consider the addition of stratified IRD and IRR procedures.
Appendix
Table 5 demonstrates the need to perform analyses stratified by study when comparing event rates between treatments. A similar argument would apply to the comparison of risks. The principle demonstrated would be called "confounding by study" among epidemiologists and might be more familiar to statisticians as an example of "Simpson's Paradox." In brief, we have generated in the table a hypothetical example of a situation in which the baseline (control) rates differ markedly between studies. In addition, the feature that generates the problem is an imbalance in the amount of person-time between the treatment and control groups in the second study, perhaps as a result of unequal allocation of subjects to the two conditions.
Table 5. Example of Confounding by Study.
Within each study, the estimate of the relative risk is 0.5. Thus, any reasonable analysis that takes stratification by study into account (and averages the withinstudy treatment effects) would necessarily produce an average treatment effect of 0.5. Because of the associations noted above, the analysis ignoring study produces an estimated treatment effect of 0.32. This result clearly is not at all representative of the results within either of the individual studies. Note that this concept is not the same as the usual concept of "heterogeneity," which is generally used to refer to situations in which the treatment effect varies across studies. In our example, the treatment effect is constant across studies (on the relative rate scale), although the baseline rate varies dramatically between the studies.
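The phenomenon can be reproduced numerically. The example below uses our own made-up numbers rather than those of Table 5: the rate ratio is exactly 0.5 within each study, yet the crude analysis that ignores study gives a very different answer, because person-time is unbalanced in study B and the baseline rates differ between studies.

```python
# Hypothetical strata: treated events a, treated person-time t1,
# control events b, control person-time t0.
studies = {"A": (5, 100.0, 10, 100.0),
           "B": (50, 100.0, 10, 10.0)}

# Within-study rate ratios: both are exactly 0.5.
within = {k: (a / t1) / (b / t0) for k, (a, t1, b, t0) in studies.items()}

# Crude (study-ignoring) rate ratio from the pooled margins.
a = sum(s[0] for s in studies.values())
t1 = sum(s[1] for s in studies.values())
b = sum(s[2] for s in studies.values())
t0 = sum(s[3] for s in studies.values())
crude = (a / t1) / (b / t0)

print(within)             # both within-study rate ratios equal 0.5
print(round(crude, 2))    # the crude ratio is nowhere near 0.5
```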
Competing interests
None declared.
Authors' Contributions
JG conceived of the study, participated in the design and analysis of the study, and wrote the manuscript. JB participated in the design of the study, performed the main statistical analysis, and participated in writing the manuscript. FW participated in the design and analysis of the study. All authors read and approved the final manuscript.
Acknowledgments
We would like to thank Doug Altman for his critical review of the manuscript. We would also like to acknowledge Russell Localio for sharing his STATA program on implementing stratified incidence rate differences for fixed and randomeffects models. This paper was presented at the XI Cochrane Colloquium, Barcelona, Spain, on October 31, 2003.
References

1. Egger Matthias, Smith George Davey, O'Rourke Keith: Principles of and procedures for systematic reviews. In Systematic reviews in health care: meta-analysis in context. 2nd edition. Edited by Egger Matthias, Smith George Davey and Altman Douglas G. London, BMJ Publishing Group; 2001:23-42.
2. Thacker Stephen B: Meta-analysis: a quantitative approach to research integration. JAMA 1988, 259:1685-1689.
3. Jadad Alejandro R, Moore R Andrew, Carroll Dawn, Jenkinson Crispin, Reynolds D John M, Gavaghan David J, McQuay Henry J: Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996, 17:1-12.
4. Egger Matthias, Smith George Davey, Schneider Martin, Minder Cristoph: Bias in meta-analysis detected by a simple, graphical test. BMJ 1997, 315:629-634.
5. Begg CB, Mazumdar M: Operating characteristics of a rank correlation test for publication bias. Biometrics 1994, 50:1088-1099.
6. Victor S, Ryan SW: Drugs for preventing migraine headaches in children (Cochrane Review). The Cochrane Library, Issue 4, 2003. Chichester, UK, John Wiley & Sons; 2003.
7. Shea B, Wells G, Cranney A, Zytaruk N, Robinson V, Griffith L, Hamel C, Ortiz Z, Peterson J, Adachi J, Tugwell P, Guyatt G, The Osteoporosis Methodology Group, The Osteoporosis Research Advisory Group: Calcium supplementation on bone loss in postmenopausal women (Cochrane Review). The Cochrane Library, Issue 4, 2003. Chichester, UK, John Wiley & Sons; 2003.
8. Nannini L, Lasserson TJ, Poole P: Combined corticosteroid and long-acting beta-agonist in one inhaler for chronic obstructive pulmonary disease (Cochrane Review). The Cochrane Library, Issue 4, 2003. Chichester, UK, John Wiley & Sons; 2003.
9. Gray OM, McDonnell GV, Forbes RB: Intravenous immunoglobulins for multiple sclerosis (Cochrane Review). The Cochrane Library, Issue 4, 2003. Chichester, UK, John Wiley & Sons; 2003.
10. Deeks Jonathan J, Altman Douglas G, Bradburn Michael J: Statistical methods for examining heterogeneity and combining results from several studies in meta-analysis. In Systematic reviews in health care: meta-analysis in context. 2nd edition. Edited by Egger Matthias, Smith George Davey and Altman Douglas G. London, BMJ Publishing Group; 2001:285-312.
11. Wolf Frederic M, Guevara James P, Grum Cyril M, Clark Noreen M, Cates Christopher J: Educational interventions for asthma in children (Cochrane Review). The Cochrane Library, Issue 1. [http://www.update-software.com/abstracts/ab000326.htm]
12. DerSimonian R, Laird N: Meta-analysis in clinical trials. Control Clin Trials 1986, 7:177-188.
13. Laird NM, Mosteller F: Some statistical methods for combining experimental results.
14. Rothman KJ, Greenland S: Modern epidemiology. 2nd edition. Philadelphia, Lippincott-Raven; 1998.
15. White H: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity.
16. Guevara James P, Wolf Frederic M, Grum Cyril M, Clark Noreen M: Effects of educational interventions for self management of asthma in children and adolescents: systematic review and meta-analysis. BMJ 2003, 326:1308.