Faculty of Pharmaceutical Sciences, University of British Columbia, Vancouver BC, Canada

Alberta Research Centre for Health Evidence, University of Alberta, Edmonton Alberta, Canada

Evidence-Based Medicine, Department of Family Medicine, University of Alberta, Room 1706 College Plaza, 8215 - 112 Street NW, Edmonton AB, Canada

Abstract

Background

Controversies are common in medicine. Some arise when the conclusions of research publications directly contradict each other, creating uncertainty for frontline clinicians.

Discussion

In this paper, we review how researchers can look at very similar data yet reach completely different conclusions based purely on an over-reliance on statistical significance and an unclear understanding of confidence intervals. Dogmatic adherence to statistical significance thresholds can lead authors to write dichotomized absolute conclusions while ignoring the broader interpretations of very consistent findings. We describe three examples of controversy around the potential benefit of a medication, a comparison between new medications, and a medication with a potential harm. The examples include the highest levels of evidence, both meta-analyses and randomized controlled trials. We will show how in each case the confidence intervals and point estimates were very similar. The only identifiable differences to account for the contrasting conclusions arise from the serendipitous finding of confidence intervals that either marginally cross or just fail to cross the line of statistical significance.

Summary

These opposing conclusions are false disagreements that create unnecessary clinical uncertainty. We provide recommendations for approaching conflicting conclusions when they are associated with remarkably similar results.

Background

Most published reports of clinical studies begin with an abstract – likely the first and perhaps only thing many clinicians, the media and patients will read. Within that abstract, authors/investigators typically provide a brief summary of the results and a 1–2 sentence conclusion. At times, the conclusion of one study will be different, even diametrically opposed, to another despite the authors looking at similar data. In these cases, readers may assume that these individual authors somehow found dramatically different results. While these reported differences may be true some of the time, radically diverse conclusions and ensuing controversies may simply be due to tiny differences in confidence intervals combined with an over-reliance on, and misunderstanding of, a “statistically significant difference.” Unfortunately, this misunderstanding can lead to therapeutic uncertainty for front-line clinicians when in fact the overall data on a particular issue is remarkably consistent.

A key concept of science is the formulation of hypotheses and the testing of these hypotheses by observing a set of data. Typically in medicine one starts with an idea that a therapy will have an effect. A statistical test assumes that an intervention has no effect; this assumption is called the null hypothesis. A statistical evaluation simply provides information on how likely it is that a difference of the observed size could arise by chance if there really was no difference between the treatment groups.

We can NEVER prove a null hypothesis; that is, we can never show that the intervention has absolutely no effect. However, we design clinical studies with the hope they will provide information to help decide if we should reject or fail to reject the null hypothesis. Rejecting the null hypothesis is often interpreted to mean the intervention has an effect; failing to reject the null hypothesis is interpreted to mean the intervention does not have an effect. These simplistic interpretations ignore important factors such as clinical importance, precision of the estimate, and statistical power.
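One way to see why failing to reject the null hypothesis does not mean "no effect" is to simulate many trials of a treatment that truly works. The sketch below is illustrative only: the trial size (5,000 per arm), the control-group risk (5%), the true relative risk (0.90), the random seed, and the function name are all assumptions chosen for the example, not values taken from any study discussed here.

```python
import math
import random

def trial_is_significant(n_per_arm, control_risk, true_rr, rng):
    """Simulate one two-arm trial; return True if the 95% CI for the RR excludes 1."""
    treated_risk = control_risk * true_rr
    events_c = sum(rng.random() < control_risk for _ in range(n_per_arm))
    events_t = sum(rng.random() < treated_risk for _ in range(n_per_arm))
    if events_c == 0 or events_t == 0:
        return False  # log RR undefined; count the trial as inconclusive
    log_rr = math.log(events_t / events_c)  # equal arm sizes cancel out
    # Standard error of log(RR) by the usual large-sample formula.
    se = math.sqrt(1 / events_t - 1 / n_per_arm + 1 / events_c - 1 / n_per_arm)
    lower = math.exp(log_rr - 1.96 * se)
    upper = math.exp(log_rr + 1.96 * se)
    return upper < 1.0 or lower > 1.0

rng = random.Random(2013)  # fixed seed so the run is repeatable
n_trials = 200
significant = sum(
    trial_is_significant(5000, 0.05, 0.90, rng) for _ in range(n_trials)
)
share_significant = significant / n_trials
print(f"{significant} of {n_trials} simulated trials were 'statistically significant'")
```

Every simulated trial studies the same real effect, yet under these assumptions only some of them cross the significance threshold. Which side of the line a given trial lands on is largely a matter of chance.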

A well-designed randomized controlled trial (RCT) is usually the least biased way to evaluate the difference between different therapeutic interventions. Unless an RCT has studied the entire population of interest, the observed difference or ratio that is found is called a point estimate because only a small sample of the entire population has been evaluated. A point estimate is typically presented with a 95% (or less commonly 99%) confidence interval (CI). A CI, while it has other interpretations, can be thought of as a range of numeric outcomes that we are reasonably confident includes the actual result.
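As a concrete illustration of a point estimate and its CI, the sketch below computes a relative risk and an approximate 95% CI from event counts using the common log-based large-sample method. The counts (450 of 10,000 versus 500 of 10,000) and the function name are hypothetical and chosen only for illustration.

```python
import math

def relative_risk_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Relative risk with an approximate CI (z=1.96 gives a 95% interval)."""
    risk_t = events_t / n_t  # observed risk in the treatment group
    risk_c = events_c / n_c  # observed risk in the control group
    rr = risk_t / risk_c     # point estimate
    # Standard error of log(RR); the interval is symmetric on the log scale.
    se_log_rr = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical trial: 450/10,000 deaths on treatment, 500/10,000 on control.
rr, lower, upper = relative_risk_ci(450, 10000, 500, 10000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these made-up counts the interval spans 1.0, so the result would be labelled "not statistically significant" even though most of the interval lies on the side of benefit.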

The choice of a 95% CI typically comes from the convention that a p-value of 0.05 represents statistical significance. This threshold has been criticized as arbitrary but has also been suggested to represent a reasonable threshold.
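The link between a 95% CI and the 0.05 threshold can be made explicit by recovering an approximate p-value from a reported ratio and its interval. This is a sketch under a normal approximation on the log scale; the two sets of results below are hypothetical and differ only slightly, and the function name is our own.

```python
import math

def p_value_from_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided p-value for a ratio measure, given its 95% CI."""
    # Recover the standard error of log(estimate) from the width of the CI.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se  # distance from "no effect" (ratio of 1)
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical studies with the same point estimate.
p_a = p_value_from_ci(0.90, 0.81, 0.99)  # upper limit just below 1
p_b = p_value_from_ci(0.90, 0.80, 1.01)  # upper limit just above 1
print(f"Study A: p = {p_a:.3f}; Study B: p = {p_b:.3f}")
```

A 95% CI that just excludes 1 corresponds to a p-value just under 0.05, and one that just includes 1 corresponds to a p-value just over it. The two hypothetical studies tell essentially the same story despite falling on opposite sides of the threshold.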

We have chosen three examples of this problem – a potential benefit of a medication, a comparison between new medications, and a medication with a potential harm. We will show this problem occurs with the highest-level evidence – randomized controlled trials and meta-analyses. We have framed these into three clinical questions:

1) In patients without a history of cardiovascular disease, do statins reduce mortality?

2) In patients with atrial fibrillation, when compared to warfarin, is apixaban more effective than dabigatran at reducing mortality?

3) In patients who smoke, does the use of varenicline increase the risk of serious cardiovascular adverse events?

Statins

Example 1. In patients without a history of cardiovascular disease, do statins reduce mortality?

Statins are widely used in patients with and without established cardiovascular disease. An important clinically relevant question is: do statins have an effect on overall mortality in patients who have not experienced a cardiovascular event? Because of the relatively low baseline 5-year risk of mortality in this population (roughly 5% over 5 years), no single study has been powered sufficiently to provide a clear answer. For that reason, at least five different meta-analyses examining this question have been published

The authors of these meta-analyses concluded the following:

•Studer

•Thavendiranathan

•Mills

•Brugts

•Ray

Three groups of investigators felt they had found the pooled clinical trial evidence sufficient to state that statins reduce overall mortality; yet two others felt their evidence did not support statins reducing overall mortality. Figure

**Comparison of 5 meta-analyses examining relative risk of overall mortality with statin use in primary prevention.** Footnote: Brugts 2009 point estimate and confidence intervals are odds ratios (not relative risks).

Novel oral anticoagulants

Example 2. In patients with atrial fibrillation, when compared to warfarin, is apixaban more effective than dabigatran at reducing mortality?

A new class of oral anticoagulants (OACs) has recently been released on the market. An important clinical question is which one of these new agents is the most effective; does either agent reduce mortality more than the “gold-standard” warfarin? Two separate studies compare two of the new OACs and warfarin

Connolly

In contrast, Granger

Figure

**Comparison of 2 randomized controlled trials examining the relative risk of overall mortality with 2 novel oral anticoagulants versus warfarin in atrial fibrillation.**

The Granger paper

Varenicline

Example 3. In patients who smoke, does the use of varenicline increase the risk of serious cardiovascular adverse events?

Varenicline is a new smoking cessation medication that is widely used. As with most new medications, the long-term or rare side effects are unknown. For that reason investigators have conducted meta-analyses of many small trials to try to identify any previously unknown adverse effects. Two meta-analyses have examined a possible risk of serious cardiovascular events with varenicline

Singh

In contrast a later meta-analysis by Prochaska

Figure

**Comparison of 2 meta-analyses examining Peto odds ratio of serious cardiovascular events with varenicline use in smoking cessation.**

Pragmatic interpretation of the included studies

Based on the evidence presented, we believe the following represents a reasonable and pragmatic interpretation of the results and how a clinician might use the information.

Statins’ effect on overall mortality in primary prevention

If you were a betting person, you should bet that statins likely reduce mortality in primary prevention. The average point estimate in these meta-analyses was around 0.90 or a 10% relative reduction
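To put a 10% relative reduction into absolute terms, the arithmetic can be sketched as follows. It takes the rough 5-year baseline mortality of 5% mentioned earlier and a relative risk of about 0.90, and it assumes (our simplification) that the relative effect is constant and that the baseline risk applies to the patient in question. It ignores the uncertainty around the point estimate.

```python
baseline_risk = 0.05  # roughly 5% mortality over 5 years (from the text)
relative_risk = 0.90  # approximate average point estimate (from the text)

treated_risk = baseline_risk * relative_risk
absolute_risk_reduction = baseline_risk - treated_risk
number_needed_to_treat = 1 / absolute_risk_reduction

print(f"Risk on treatment: {treated_risk:.1%}")
print(f"Absolute risk reduction: {absolute_risk_reduction:.1%}")
print(f"Number needed to treat for 5 years: about {number_needed_to_treat:.0f}")
```

Under these assumptions, a 10% relative reduction corresponds to an absolute reduction of about half a percentage point over 5 years.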

Novel anticoagulants’ effect on overall mortality in atrial fibrillation

The evidence does not suggest any differences between apixaban and dabigatran and their effect on mortality compared to warfarin. As with the statins, dabigatran and apixaban likely do reduce mortality, approximately 10% with a CI of 0% to 20%

Varenicline’s effect on serious cardiovascular outcomes

If you were a betting person, you should bet that varenicline likely does increase the risk of cardiovascular events. The point estimate was roughly a 60-70% increase, but the effect could be as high as a 176% increase or as low as an actual 10% reduction in cardiovascular events
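The percentages quoted above come from subtracting 1 from a ratio. A minimal sketch, assuming confidence limits of 0.90 and 2.76 (implied by the 10% reduction and 176% increase in the text) and a hypothetical point estimate of 1.65 picked from within the 60-70% range:

```python
def percent_change(ratio):
    """Express a ratio measure as a percent change relative to no effect (1.0)."""
    return (ratio - 1.0) * 100.0

# 1.65 is an illustrative value; the limits are implied by the text.
for label, ratio in [("lower limit", 0.90), ("point estimate", 1.65), ("upper limit", 2.76)]:
    change = percent_change(ratio)
    direction = "increase" if change > 0 else "reduction"
    print(f"{label}: ratio {ratio:.2f} -> {abs(change):.0f}% {direction}")
```

Reading an odds ratio as a percent change in risk is itself an approximation that holds reasonably well only when events are rare.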

Discussion

It appears that medical authors feel the need to make black and white conclusions when their data almost never allows for such dichotomous statements. This is particularly true when comparing results to similar studies with largely overlapping CIs. Virtually all of the conclusion confusion discussed in this paper can be linked to slavish adherence to an arbitrary threshold for statistical significance. Even if the threshold is reasonable, it still cannot be used to make dichotomous conclusions.
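Rather than comparing which side of the threshold each study falls on, two independent estimates can be compared directly through the ratio of their ratios. The sketch below uses a normal approximation on the log scale and assumes the two studies are independent. The two results shown are hypothetical and the function name is our own.

```python
import math

def compare_ratios(est_a, lo_a, hi_a, est_b, lo_b, hi_b, z_crit=1.96):
    """Ratio of two independent ratio estimates, with 95% CI and two-sided p-value."""
    # Recover each standard error (log scale) from the reported 95% CI.
    se_a = (math.log(hi_a) - math.log(lo_a)) / (2 * z_crit)
    se_b = (math.log(hi_b) - math.log(lo_b)) / (2 * z_crit)
    diff = math.log(est_a) - math.log(est_b)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # independence assumed
    ratio = math.exp(diff)
    lower = math.exp(diff - z_crit * se_diff)
    upper = math.exp(diff + z_crit * se_diff)
    p = math.erfc(abs(diff / se_diff) / math.sqrt(2))
    return ratio, lower, upper, p

# Hypothetical: Study A is "significant", Study B is "not significant".
ratio, lower, upper, p = compare_ratios(0.89, 0.80, 0.99, 0.88, 0.77, 1.01)
print(f"Ratio of estimates = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), p = {p:.2f}")
```

In this made-up case one study would be reported as showing a benefit and the other as showing none, yet the direct comparison gives no indication that the two results differ from each other.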

Although we have selected three examples here, these are certainly not the only ones. In another example we considered, the authors of two meta-analyses of primary prevention with aspirin reported the exact same point estimate and confidence interval, 0.94 (0.88-1.00), but reached differing conclusions

We are not the first authors to write about the misinterpretation of CIs and statistical significance. About 60 years ago, RA Fisher introduced the p-value for hypothesis and significance testing

We encourage authors to avoid statements like “X has no effect on mortality” as they are likely to be both untrue and misleading. This is especially true as results get “close” to being statistically significant. Results should speak for themselves. For that to happen, readers (clinicians and science reporters) need to understand the language of statistics and approach authors’ conclusions with a critical eye. We are not saying that readers should skip the abstract, but when authors’ conclusions differ from those of others, readers must examine and compare the actual results. In fact, all but one of the meta-analyses provided point estimates and CIs in the abstracts. This facilitates quick comparisons with other studies reported to be “completely different,” and helps determine whether the CIs demonstrate clinically important differences. The problem lies in the authors’ conclusions, which often have little to do with their results but rather reflect what they want the results to show. We encourage journal editors to challenge authors’ conclusions, particularly when they argue they have found something unique or different from other researchers but the difference is based solely on tiny variations in CIs or p-values (statistically significant or not).

We are not suggesting the elimination of statistical testing or statistical significance, but rather that all people (authors, publishers, regulators etc.) who write about medical interventions use common sense and good judgment when presenting results that differ from others and not be so beholden to the “magical” statistical significance level of 0.05. We urge them to consider the degree to which the results of the “differing” study overlap with their own, the true difference in the point estimates and range of possible effects, where the preponderance of the effect lies and how clinicians might apply the evidence.

It appears that readers of the papers discussed here would be better served by reviewing the actual results than reading the authors’ conclusions. To do that, clinicians need to be able to interpret the meaning of CIs and statistical significance.

Summary

Dogmatic adherence to statistical significance thresholds can lead authors to write dichotomized absolute conclusions while ignoring the broader interpretations of very consistent findings. These opposing conclusions are false disagreements that create unnecessary clinical uncertainty. Authors are encouraged to report data more pragmatically. Readers and clinicians are encouraged to compare the actual data and precision of the results rather than rely on the conclusions of the authors.

Abbreviations

CI: Confidence interval; RCT: Randomized controlled trial.

Competing interests

We do not have any competing interest related to this article.

Authors’ contributions

GMA conceived of the paper, collected the examples, completed the first draft of the figures, managed the manuscript, edited the article substantially, and is the guarantor. JM helped refine the idea, completed the first draft, edited the figures and edited the article. BV contributed to early discussions and crafting of the article and edited the article. All authors read and approved the final manuscript.

Authors’ information

GMA is a general practitioner and academic physician with a focus on evidence-based medicine and knowledge translation. JM is a doctor of pharmacy and an academic with a focus on evidence-based medicine and knowledge translation. BV is a biostatistician and has written on statistical methodology and interpretation.

Acknowledgments

We would like to acknowledge Lynda Cranston, a medical writer, who reviewed a near-final draft of the article for grammatical errors and typos.
