A test for heterogeneity examines the null hypothesis that all studies are evaluating the same effect. The usual test statistic (Cochran's Q) is computed by summing the squared deviations of each study's estimate from the overall meta-analytic estimate, weighting each study's contribution in the same manner as in the meta-analysis.7 P values are obtained by comparing the statistic with a χ² distribution with k − 1 degrees of freedom (where k is the number of studies).
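The calculation of Q and its P value can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes inverse-variance weights (one common choice for a fixed effect meta-analysis), the function names `cochran_q` and `chi2_sf` are ours, and the log odds ratios and variances in the example are hypothetical values rather than data from any review.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df.

    Standard library only; scipy.stats.chi2.sf(x, df) would be the usual
    library call if SciPy is available.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting points: exp(-h) for df = 2, erfc(sqrt(h)) for df = 1.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Add two degrees of freedom per step (incomplete-gamma recurrence).
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def cochran_q(estimates, variances):
    """Return (Q, degrees of freedom, P value) for a set of study estimates.

    Assumption: each study is weighted by the inverse of its variance.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    return q, df, chi2_sf(q, df)

# Hypothetical log odds ratios and their variances for four studies.
log_odds_ratios = [-0.2, -1.1, -0.5, -2.0]
variances = [0.10, 0.30, 0.15, 0.60]
q, df, p = cochran_q(log_odds_ratios, variances)
print(f"Q = {q:.2f} on {df} degrees of freedom, P = {p:.3f}")
```

The pooled estimate used inside `cochran_q` is the same weighted average that the meta-analysis itself would report, which is what "weighting each study's contribution in the same manner" amounts to under this assumption.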
The test is known to have low power to detect true heterogeneity among studies. Meta-analyses often include small numbers of studies,6,8 and the power of the test in such circumstances is low.9,10 For example, consider the meta-analysis of randomised controlled trials of amantadine for preventing influenza ().11 The treatment effects in the eight trials seem inconsistent: the reductions in odds vary from 16% to 93%, and some of the confidence intervals do not overlap. But the test of heterogeneity yields a P value of 0.09, conventionally interpreted as non-significant. Because the test is poor at detecting true heterogeneity, a non-significant result cannot be taken as evidence of homogeneity. Using a cut-off of 10% for significance12 ameliorates this problem but increases the risk of drawing a false positive conclusion (type I error).10
Conversely, the test arguably has excessive power when there are many studies, especially when those studies are large. One of the largest meta-analyses in the
Cochrane Database of Systematic Reviews is of clinical trials of tricyclic antidepressants and selective serotonin reuptake inhibitors for treatment of depression.13 Over 15 000 participants from 135 trials are included in the assessment of comparative drop-out rates, and the test for heterogeneity is significant (P = 0.005). However, this P value does not reasonably describe the extent of heterogeneity in the results of the trials. As we show later, a little inconsistency exists among these trials but it does not affect the conclusion of the review (that serotonin reuptake inhibitors have lower discontinuation rates than tricyclic antidepressants).
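The dependence of the test on the number and size of studies can be explored with a small simulation. The sketch below is illustrative only and rests on several assumptions of ours: study estimates are normally distributed, every study in a scenario has the same within-study variance, weights are inverse variances, and the between-study variance `tau2` is the same modest value in both scenarios. The function names and all numerical settings are hypothetical; only the study counts (8 and 135) echo the two examples above, and nothing here reproduces the data of those reviews.

```python
import math
import random

def chi2_sf(x, df):
    """Upper-tail chi-squared probability for integer df (standard library only)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def q_test_p_value(estimates, variances):
    """P value of Cochran's Q with inverse-variance weights (an assumption)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    return chi2_sf(q, len(estimates) - 1)

def rejection_rate(k, within_var, tau2, alpha=0.05, reps=2000, seed=1):
    """Proportion of simulated meta-analyses in which the Q test is significant."""
    rng = random.Random(seed)
    tau, se = math.sqrt(tau2), math.sqrt(within_var)
    rejections = 0
    for _ in range(reps):
        # Each estimate = common mean (0) + between-study deviation + sampling error.
        estimates = [rng.gauss(0.0, tau) + rng.gauss(0.0, se) for _ in range(k)]
        if q_test_p_value(estimates, [within_var] * k) < alpha:
            rejections += 1
    return rejections / reps

# Same true heterogeneity (tau2 = 0.01) in both hypothetical scenarios.
few_small = rejection_rate(k=8, within_var=0.25, tau2=0.01)
many_large = rejection_rate(k=135, within_var=0.02, tau2=0.01)
print(f"8 small studies:   significant in {few_small:.0%} of simulations")
print(f"135 large studies: significant in {many_large:.0%} of simulations")
```

Under these assumptions the same amount of true heterogeneity is rarely declared significant with a few small studies and almost always declared significant with many large ones, so the P value says more about the amount of evidence than about the extent of heterogeneity.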
Since systematic reviews bring together studies that are diverse both clinically and methodologically, heterogeneity in their results is to be expected.6 For example, heterogeneity is likely to arise through diversity in doses, lengths of follow up, study quality, and inclusion criteria for participants. So there seems little point in simply testing for heterogeneity when what matters is the extent to which it affects the conclusions of the meta-analysis.