The evaluation of a methodological test is directly analogous to the evaluation of a clinical diagnostic test. Fryback and Thornbury have proposed a six level model for evaluating a diagnostic test.[8] This provides a good discussion framework. The six expectations of a clinical diagnostic test are technical feasibility, diagnostic accuracy, diagnostic effect, treatment effect, effect on patient outcome, and societal effect. If the conclusions of evidence based medicine are based on poor tests, the negative effect eventually may be considerable. So we must examine closely at least the technical feasibility and diagnostic accuracy of these methods.
An evaluation of the technical feasibility of the funnel plot shows many problems that are difficult to solve. Strong empirical evidence exists that the appearance of the plot may be affected by the choice of the coding of the outcome (binary versus continuous),[9] the choice of the metric (risk ratio, odds ratio, or logarithms thereof), and the choice of the weight on the vertical axis (inverse variance, inverse standard error, sample size, etc).[10,11] An accompanying figure gives an example of how these choices can make a difference.
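To make the dependence on these choices concrete, the following is a minimal sketch that computes two common metrics and three candidate vertical-axis quantities for a two-arm trial with a binary outcome. The function name and both trials are hypothetical and chosen only for illustration; the standard errors use the usual large-sample formulas for the log odds ratio and log risk ratio.

```python
import math


def effect_and_weights(events_t, n_t, events_c, n_c):
    """Metrics and candidate vertical-axis quantities for one two-arm trial.

    Assumes every cell of the 2x2 table is non-zero (no continuity correction).
    """
    a, b = events_t, n_t - events_t  # treated arm: events, non-events
    c, d = events_c, n_c - events_c  # control arm: events, non-events

    log_or = math.log((a * d) / (b * c))
    log_rr = math.log((a / n_t) / (c / n_c))

    # Large-sample standard errors on the log scale.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / n_t + 1 / c - 1 / n_c)

    return {
        "log_or": log_or,
        "log_rr": log_rr,
        "inv_var_log_or": 1 / se_log_or ** 2,
        "inv_se_log_or": 1 / se_log_or,
        "inv_se_log_rr": 1 / se_log_rr,
        "sample_size": n_t + n_c,
    }


# Hypothetical trials: A is large with few events, B is small with many events.
trial_a = effect_and_weights(2, 500, 4, 500)
trial_b = effect_and_weights(30, 100, 45, 100)

for name, trial in (("A", trial_a), ("B", trial_b)):
    print(name, {key: round(value, 3) for key, value in trial.items()})
```

In this made-up pair, trial A sits above trial B when the vertical axis is sample size but below it when the axis is inverse variance, so the same two studies occupy different relative positions depending on the weight chosen.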
Even in the unlikely event that agreement is reached on what metric and what expression of weight to use on the axes, enormous uncertainty and subjectivity remain in the visual interpretation of the same plot by different researchers. Our team recently designed a survey to examine this question using simulated plots with or without publication bias.[12] The ability of researchers to identify publication bias using a funnel plot was practically identical to chance (53% accuracy).
Formal statistical tests may eliminate the subjectivity in visual inspection of asymmetry. Investigators commonly use the rank correlation test[13] or one of many tests based on regression.[2,7,10,11,14] The validity of these tests depends on assumptions often unmet in practice, however, and the choice of test introduces further subjectivity into the procedure. The methods theoretically require a considerable number of available studies, generally at least 30 for sufficient power. But the number needed depends on the size of the studies and on the true treatment effect: for an odds ratio of 0.67, for example, even 60 studies are not adequate.[7] Most meta-analyses of clinical trials, however, have far fewer studies. For instance, the average Cochrane meta-analysis has fewer than 10.[15] Thus the tests typically have low power[16] and may be inappropriate.
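As an illustration of what a regression based test computes, the sketch below implements one common variant, usually known as Egger's regression: each study's standardized effect (effect divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and an intercept far from zero is read as asymmetry. This is a minimal sketch under stated assumptions rather than a reproduction of any particular cited method; the function name and the data are hypothetical, and ordinary unweighted least squares is assumed.

```python
import math


def egger_intercept(effects, std_errors):
    """Regress effect/SE on 1/SE by ordinary least squares.

    Returns (intercept, slope, se_intercept). The intercept measures
    asymmetry; the slope estimates the underlying effect.
    """
    n = len(effects)
    if n < 3:
        raise ValueError("need at least three studies")

    x = [1.0 / se for se in std_errors]                 # precision
    y = [e / se for e, se in zip(effects, std_errors)]  # standardized effect

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with n - 2 degrees of freedom.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, slope, se_intercept


# Hypothetical log odds ratios for five studies of decreasing precision.
std_errors = [0.1, 0.2, 0.3, 0.4, 0.5]
symmetric = [-0.4 for _ in std_errors]           # same effect at every size
skewed = [-0.2 - 1.0 * se for se in std_errors]  # smaller studies look better

print(egger_intercept(symmetric, std_errors))
print(egger_intercept(skewed, std_errors))
```

With only a handful of studies, as in most meta-analyses, the standard error of the intercept is large relative to any plausible intercept once sampling noise is present, which is the power problem described above.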
Even ignoring statistical concerns of power and choice of metric and weights, it is still unclear if funnel plots really diagnose publication bias. Strictly speaking, funnel plots probe whether studies with little precision (small studies) give different results from studies with greater precision (larger studies). Asymmetry in the funnel plot may therefore result not from a systematic under-reporting of negative trials but from an essential difference between smaller and larger studies that arises from inherent between-study heterogeneity. For example, small studies may focus on high risk patients, for whom the treatment is more effective because such patients have more events that could potentially be prevented[17]; or studies with small weight may generally have shorter follow-up and differ because the treatment effect decreases with time.[18] Early studies may target different populations (with different effect sizes) than subsequent studies,[19] and subsequent studies may be much larger, trying to test the concept on less selected patients. Variation in quality can affect the shape of the funnel plot, with smaller, lower quality studies showing greater benefit of treatment.[20]

Summary points
Methods used by evidence based medicine should be evaluated with rigorous standards
The funnel plot is widely used in systematic reviews and meta-analyses as a test for publication bias
Asymmetry of the funnel plot, either visually interpreted or statistically tested, does not accurately predict publication bias
Inappropriate or misleading use of funnel plot tests may do more harm than good

Heterogeneity may sometimes be both statistically and clinically obvious; that is, studies may be examining different questions.[21] Yet the authors of a meta-analysis, such as the one investigating the relation between garlic consumption and cancer,[21] may still pool all studies together when it comes to the funnel plot, even though they have analysed them separately for the main analysis. In other cases, it may not be possible to identify a source for the existing heterogeneity.[22] Simulation studies of funnel plots have found that bias may be incorrectly inferred if studies are heterogeneous.[21,23]
For example, an accompanying figure shows the funnel plot for a meta-analysis of inhaled disodium cromoglicate as maintenance therapy in children with asthma.[24] The authors found both statistical and clinical heterogeneity, yet they published a funnel plot (top panel), stating: “Studies with low precision and negative outcome are under-represented, indicating publication bias.” Grouping the studies according to age of participants (middle panel) and study design (bottom panel) creates a different impression.
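The point that heterogeneity alone can mimic publication bias can be shown with a small constructed example. The sketch below applies a rank correlation in the style of Begg and Mazumdar (Kendall's tau between standardized deviations from the pooled estimate and the study variances) to a set of studies in which nothing has been withheld. The data are entirely hypothetical: small trials in high risk patients share one true log odds ratio, large trials in less selected patients share another, and sampling noise is omitted so the result is deterministic. The function names are illustrative, and ties are not handled.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


def rank_correlation_asymmetry(effects, variances):
    """Tau between standardized deviations from the pooled effect and variances."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, effects)) / total_weight
    # Variance of (effect - pooled) under a fixed-effect model.
    standardized = [
        (t - pooled) / math.sqrt(v - 1.0 / total_weight)
        for t, v in zip(effects, variances)
    ]
    return kendall_tau(standardized, variances)


# Hypothetical (log odds ratio, variance) pairs; every study is "published".
small_high_risk = [(-0.8, 0.25), (-0.8, 0.20), (-0.8, 0.16), (-0.8, 0.30)]
large_low_risk = [(-0.2, 0.02), (-0.2, 0.03), (-0.2, 0.04), (-0.2, 0.025)]

all_studies = small_high_risk + large_low_risk
effects = [t for t, _ in all_studies]
variances = [v for _, v in all_studies]

tau = rank_correlation_asymmetry(effects, variances)
print(f"tau = {tau:.3f}")
```

The correlation comes out clearly negative, which a reader applying the test mechanically would take as evidence of missing studies, although the asymmetry here arises only from pooling two clinically different groups.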
Finally, we have no gold standard against which to compare the results of funnel plot tests. A true standard measure of publication bias would require prospective registries of trials with detailed knowledge of which studies have been published and which are unpublished. It would then be feasible to test whether tests of publication bias capture accurately the presence of unpublished studies and whether one variant performs better than others. Given that efforts for study registration have only recently started,[25] this evaluation is currently difficult. Although a large number of alternative tests for publication bias exist,[26] none has been validated against a standard.