John Ioannidis, Nikolaos Patsopoulos, and Evangelos Evangelou argue that, although meta-analyses often measure heterogeneity between studies, these estimates can have large uncertainty, which must be taken into account when interpreting evidence.
An important aim of systematic reviews and meta-analyses is to assess the extent to which different studies give similar or dissimilar results.1 Clinical, methodological, and biological heterogeneity are often topic specific, but statistical heterogeneity can be examined with the same methods in all meta-analyses. Therefore, the perception of statistical heterogeneity or homogeneity often influences meta-analysts and clinicians in important decisions. These decisions include whether the data are similar enough to combine different studies; whether a treatment is applicable to all or should be “individualised” because of variable benefits or harms in different types of patients; and whether a risk factor affects all people exposed or only select populations. How uncertain is the extent of statistical heterogeneity in meta-analyses? Moreover, is this uncertainty properly factored in when interpreting the results?
Many statistical tests are available for evaluating heterogeneity between studies.2 3 Until recently, the most popular was Cochran's Q, a statistic based on the χ2 test.4 Cochran's Q usually has low power to detect heterogeneity, however. It also depends on the number of studies and cannot be compared across different meta-analyses.2 3 Higgins and colleagues, in two highly cited papers,5 6 proposed the routine use of the I2 statistic. I2 is calculated as [(Q−df)/Q]×100%, where df is degrees of freedom (number of studies minus 1), and negative values are set to 0%. Values of I2 therefore range from 0% to 100%, and the statistic describes what proportion of the total variation across studies is beyond chance. This statistic can be used to compare the amount of inconsistency across different meta-analyses even with different numbers of studies.7 I2 is routinely implemented in all Cochrane reviews (standard option in RevMan) and is increasingly used in meta-analyses published in medical journals.
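The calculation of Q and I2 described above can be sketched in a few lines of Python. This is a minimal illustration, not the RevMan implementation: it assumes each study supplies an effect estimate and its variance, uses standard inverse variance weights for Q, and the function name cochran_q_and_i2 and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def cochran_q_and_i2(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, int, float]:
    """Return (Q, df, I2 in percent) for a set of study results.

    Assumption: inverse variance weights (w = 1 / variance), the usual
    fixed effect weighting for Cochran's Q.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")

    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted sum of squared deviations from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2 = [(Q - df) / Q] x 100%, with negative values set to 0%.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


if __name__ == "__main__":
    # Hypothetical log odds ratios and variances, for illustration only.
    q, df, i2 = cochran_q_and_i2([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
    print(f"Q={q:.2f}, df={df}, I2={i2:.0f}%")
```

With these made-up inputs Q is about 8 on 2 degrees of freedom, so I2 is about 75%. The point estimate alone says nothing about its precision, which is the subject of the rest of this article.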
Higgins and colleagues suggested that we could “tentatively assign adjectives of low, moderate, and high to I2 values of 25%, 50%, and 75%.”6 Like any metric, however, I2 has some uncertainty, and Higgins and Thompson provided methods to calculate this uncertainty.5 Recently, other investigators compared the performance of I2 and Q in Monte-Carlo simulations across diverse simulated meta-analytic conditions. They found that I2 also has low statistical power with small numbers of studies and its confidence intervals can be large.8
Inferences about the extent of heterogeneity must be especially cautious when the 95% confidence intervals around I2 are wide, ranging from low to high heterogeneity. Such uncertainty is usually ignored in systematic reviews, however. This can result in misconceptions. For example, a systematic review of corticosteroids for Kawasaki disease found a point estimate I2=59%.9 The authors decided to exclude the two studies that were most different, saying that their removal eliminated all of the across study heterogeneity (Q=5.59, P=0.588, I2=0.00). In fact, the 95% confidence interval for this I2=0% estimate still extends from 0% to 56%. With two small randomised trials and six non-randomised comparisons remaining, the meta-analysis concluded that corticosteroids consistently halve the risk of coronary aneurysms. However, the two largest randomised trials on this topic were published after the meta-analysis. Heterogeneity resurfaced: the largest trial found no effect on coronary dimensions,10 while the other trial showed an 80% reduction in the risk of coronary artery abnormalities.11
Eight systematic reviews published in the BMJ between 1 July 2005 and 1 January 2006 performed meta-analyses of randomised trials and seven of them performed some statistical analysis of heterogeneity between studies (table on bmj.com).12 13 14 15 16 17 18 Each review stated that they had tried to interpret heterogeneity, and seven meta-analyses provided enough information for us to calculate the 95% confidence interval of I2. The lower 95% confidence interval was always as low as 0% (rounded to integer percentage), with one exception. The upper 95% confidence interval always exceeded the 50% threshold, and in four cases it also exceeded the 75% threshold. A conclusive statement was feasible in only one case, where I2 was 69%, the 95% confidence interval was 40% to 80%, the Q statistic had P<0.001, and the authors justifiably concluded that “there was significant heterogeneity among these trials.”13 This meta-analysis had 15 studies, so the power of both Q and I2 was good. In all other meta-analyses (two to 12 studies each), strong statements in interpreting heterogeneity would be difficult to make. Only one review presented 95% confidence intervals for an I2 estimate.12 The authors concluded that “we could not observe significant heterogeneity.” Indeed the Q statistic had P=0.19. However, with only five studies, the power to detect heterogeneity was negligible. The I2 statistic was 35% and the 95% confidence interval ranged from 0% (no heterogeneity) to 76% (high heterogeneity).
This limitation is not confined to the selected examples presented here—it is probably the rule rather than the exception. We used two large datasets of meta-analyses to evaluate empirically the extent of uncertainty in I2 estimates. Firstly, we looked at meta-analyses of the Cochrane Database of Systematic Reviews (Issue 4, 2005) that had four or more synthesised studies and binary outcomes. Because each Cochrane review may include several meta-analyses, we looked only at the one with the highest number of studies; in the case of ties, we used the one with the largest sample size. We did not look at meta-analyses of two or three studies. Such studies form a sizeable proportion of the Cochrane Library,19 but their 95% confidence intervals of I2 almost always span a wide range of heterogeneity, unless the studies are large and they give very different results. In total, we calculated the I2 statistic and its 95% confidence intervals for 1011 meta-analyses. The second dataset was a previously described database of 50 meta-analyses of gene-disease associations that had found a nominally statistically significant effect (P<0.05) for the proposed genetic risk factors.20
Figure 1 shows the upper and lower 95% confidence intervals of I2 for the two sets of meta-analyses. The pattern is similar. Of the meta-analyses where I2 is ≤25% (low heterogeneity), 83% of the Cochrane meta-analyses and 73% of the genetic risk factor meta-analyses have upper 95% confidence intervals that cross into the range of large heterogeneity (I2 ≥50%). Of the meta-analyses where I2 is ≥50% (large heterogeneity), 67% of the Cochrane meta-analyses and 52% of the genetic risk factor meta-analyses have lower 95% confidence intervals that cross into the range of low heterogeneity (I2 ≤25%).
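The tally described in this paragraph can be expressed as a small Python sketch. It assumes each meta-analysis is summarised as a tuple of (I2, lower bound, upper bound) in percent. The function name share_crossing_thresholds and the five example rows are hypothetical; they are not the Cochrane or genetic datasets.

```python
from typing import Iterable, Tuple

Row = Tuple[float, float, float]  # (I2, lower 95% bound, upper 95% bound), in percent


def share_crossing_thresholds(rows: Iterable[Row]) -> Tuple[float, float]:
    """Return two proportions:

    1. among rows with I2 <= 25%, the share whose upper bound is >= 50%;
    2. among rows with I2 >= 50%, the share whose lower bound is <= 25%.
    """
    rows = list(rows)
    low = [r for r in rows if r[0] <= 25.0]
    high = [r for r in rows if r[0] >= 50.0]

    low_crossing = sum(1 for _, _, upper in low if upper >= 50.0)
    high_crossing = sum(1 for _, lower, _ in high if lower <= 25.0)

    share_low = low_crossing / len(low) if low else float("nan")
    share_high = high_crossing / len(high) if high else float("nan")
    return share_low, share_high


if __name__ == "__main__":
    # Made-up intervals, for illustration only.
    example = [(0, 0, 60), (20, 0, 45), (10, 0, 70), (60, 20, 85), (80, 55, 90)]
    print(share_crossing_thresholds(example))
```

For the made-up rows, two of the three low I2 estimates have upper bounds in the large heterogeneity range, and one of the two high I2 estimates has a lower bound in the low heterogeneity range.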
Meta-analyses where I2 is estimated at 0% are affected by an especially important misconception. Many reviews interpret this as absence of heterogeneity, but the upper 95% confidence interval may be substantial (as in the Kawasaki example discussed above9). Figure 2 shows the uncertainty for the upper 95% confidence interval of I2 for the two sets of meta-analyses, limited to those with I2=0% (n=373 for Cochrane reviews, n=12 for genetic studies). The upper 95% confidence interval exceeds 33% in all these meta-analyses. For 81% of the meta-analyses with I2=0%, the upper 95% confidence interval is 50% or higher. Because of the way that research is currently reported, considerable heterogeneity between studies cannot be excluded with confidence in most meta-analyses. Some heterogeneity between studies is probably present in most meta-analyses. Claims for homogeneity may sometimes be stronger than the evidence allows. Trusting a non-significant P value for the Q statistic and an I2 estimate of 0% may sometimes lead to spurious certainty about the comparability and similarity of study results.
The confidence interval of I2 can be calculated by several methods.5 Two methods, a test based approach and a non-central χ2 based approach, have been implemented in Stata (heterogi module). The performance of these two methods is comparable, although the test based approach often gives lower values for lower and upper confidence intervals, so that the non-central χ2 based approach may be preferable.
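As an illustration of the test based approach, the Python sketch below builds an interval for ln(H), where H2=Q/df, and converts it to I2=(H2−1)/H2. The standard error formulas are the ones generally attributed to Higgins and Thompson5. Whether the heterogi module uses exactly this variant is an assumption, as is the choice not to truncate H at 1 before adding the margin. The function name i2_test_based_ci is hypothetical.

```python
import math
from typing import Tuple


def i2_test_based_ci(q: float, k: int, z: float = 1.959964) -> Tuple[float, float]:
    """Approximate 95% interval for I2 (in percent) from Q and k studies.

    Assumptions: the test based standard error of ln(H), bounds below 0%
    set to 0%, and H itself left untruncated when Q < df.
    """
    if k < 3:
        raise ValueError("the test based standard error needs at least 3 studies")
    if q <= 0:
        raise ValueError("Q must be positive")

    df = k - 1
    log_h = 0.5 * math.log(q / df)  # ln(H), with H^2 = Q / df

    if q > k:
        se = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2.0 * q) - math.sqrt(2.0 * k - 3.0)
        )
    else:
        se = math.sqrt((1.0 / (2.0 * (k - 2))) * (1.0 - 1.0 / (3.0 * (k - 2) ** 2)))

    def to_i2(log_h_value: float) -> float:
        h2 = math.exp(2.0 * log_h_value)
        return max(0.0, (h2 - 1.0) / h2) * 100.0

    return to_i2(log_h - z * se), to_i2(log_h + z * se)


if __name__ == "__main__":
    # Q back-calculated from a rounded I2 of 35% with five studies (an assumption).
    lower, upper = i2_test_based_ci(q=6.154, k=5)
    print(f"I2 95% CI: {lower:.0f}% to {upper:.0f}%")
```

For the five study review discussed earlier (I2=35%), Q is not reported in this article, so the value of about 6.15 used here is back-calculated from the rounded I2. With that input the sketch gives an interval of roughly 0% to 76%, in line with the interval quoted in the text. The non-central χ2 based approach needs the non-central χ2 distribution and is not sketched here.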
All statistical tests for heterogeneity are weak, including I2. The clinical implications of this are considerable and must be examined on a case by case basis. Putting too much trust in homogeneity of effects may give a false sense of reassurance that one size fits all. Lack of evidence of heterogeneity is not evidence of homogeneity. Conversely, putting too much trust in the presence of heterogeneity of effects may lead to spurious subgroup and exploratory analyses. Given that I2 is not precise, 95% confidence intervals should always be given.
Contributors and sources: JPAI has a long standing interest in meta-analyses and heterogeneity and had the original idea for this article. NAP and EE collected the data. NAP performed statistical analyses with help from JPAI and EE. JPAI wrote the manuscript and NAP and EE commented on it. JPAI is guarantor.
Competing interests: None declared.
Provenance and peer review: Not commissioned; externally peer reviewed.