This article proposes several new indices that measure the heterogeneity for individual studies in a meta-analysis. These indices directly assess how inconsistent an individual study is compared to the rest of studies used in the meta-analysis, that is, how much impact the specific study has on the scientific conclusion of the meta-analysis and further on the generalization of the conclusion. The proposed indices can be intuitively interpreted as the proportion of total variance from all studies in a meta-analysis that can be accounted for by the heterogeneity from specific studies. Further, each proposed index over all the studies sums to the collective measure of heterogeneity for the meta-analysis. Therefore our proposed study-specific indices of heterogeneity can be regarded as a generalization of the collective index of heterogeneity in meta-analyses proposed by various authors. We examine the difference among the proposed study-specific measures of heterogeneity and assess the variation associated with each proposed index of heterogeneity through a large simulation study. Finally, we demonstrate the proposed methodology by assessing the effect of individual studies on the overall estimate to the difference of an antecedent biomarker of Alzheimer’s disease (AD) between different Apolipoprotein E (ApoE) genotypes.
Making decisions on medicine and health care policies is a very complicated process in which existing scientific evidence plays a crucial role. Medical practitioners and their patients make decisions within the context of a rapidly changing body of scientific evidence on medicine and a health care system that influences the availability, accessibility, and cost of diagnostic tests and therapies (Sackett and Haynes 1995). Timely, useful evidence from the biomedical literature should be an integral component of clinical and medical decision making. The importance of basing medical practice more firmly on the results of existing scientific evidence through systematic reviews was starkly demonstrated by a paper in the early 1990s, which compared the results of meta-analyses of treatment trials for people who have suffered a heart attack with the recommendations of experts published in review articles and textbooks over the same time period. This showed a significant divergence between the recommendations and the meta-analytic summaries of the trials. Ineffective treatments were being recommended, and highly effective treatments were not. There were also significant time delays between the publication of the studies and changes in the recommendations of the experts (Antman et al. 1992). As a result, lives that could have been saved were lost, and resources were wasted.
Systematic reviews are very useful medical decision-making tools because they objectively summarize large amounts of information, identify gaps in medical research and evidence, and identify beneficial or harmful interventions. Clinicians can use systematic reviews to guide their patient care. Consumers and patients as well as policymakers can use systematic reviews to help make health care decisions. Systematic reviews provide convincing and reliable evidence relevant to many aspects of medical and biological research and health care (Egger and Smith 1997), especially when the results of individual studies they include show clinically important effects of comparable magnitude. Such reviews aim to comprehensively identify and assess all studies relevant to a given scientific question, and meta-analysis has been the major statistical methodology for the quantitative synthesis of study results. Many methods for meta-analysis are available, and the most popularly applied in the medical research focus on the optimum combination of published summary statistics in some form of weighted averages (DerSimonian and Laird 1986; Egger, Smith, and Phillips 1997; Whitehead and Whitehead 1991). Usually, each study is given a weight according to the precision of its results on summary statistics. Studies with good precision are weighted more heavily than studies with greater uncertainty. The variance for the overall estimate of the parameter under study in meta-analyses is in general from two different sources, one is associated with the individual studies (i.e., the within-study variance), and the other is associated with the possible difference between different studies (i.e., between-study variance). When the between-study variance is assumed to be 0, each study is simply weighted according to its own variance. 
This approach characterizes a fixed effects model, which is exemplified by the Mantel-Haenszel method (Mantel and Haenszel 1959; Laird and Mosteller 1990) or the Peto method (Yusuf 1985). When the between-study variance is not zero, methods that incorporate a between-study component of variation for the overall effect are based on random effects models (Laird and Mosteller 1990). The between-study variance represents the excess variation in observed individual study effects over that expected from the imprecision of results within each study. Heterogeneity in a meta-analysis refers to this between-study variance, that is, the variation of the individual study effects around the overall mean of the random effects. Fixed effects and random effects models for general continuous outcomes and for survival outcomes have been described by Hedges and Olkin (1985); Earle and Wells (2000); Srinivasan and Zhou (1993); and Parmar, Torri, and Stewart (1998).
When individual studies used in a meta-analysis have very different results, however, the results from systematic reviews may be less convincing and reliable. In an attempt to establish whether study results are consistent, reports on a meta-analysis commonly present a statistical test of heterogeneity among the studies used in the meta-analysis. This test seeks to determine whether there are genuine differences underlying the results of the studies, or whether the variation in these results is compatible with chance alone (i.e., homogeneity). A common statistical test used for this purpose is Cochran’s chi-squared test, or the Q-test (Whitehead and Whitehead 1991; Cochran 1954). It has been widely realized, however, that this test has poor power when the number of studies in a meta-analysis is small, and excessive power to detect clinically insignificant heterogeneity when there are many studies (Higgins and Thompson 2002; Hardy and Thompson 1998).
Addressing statistical heterogeneity of studies is one of the most fundamental aspects of many systematic reviews. The interpretative aspects of statistical inferences from a meta-analysis depend on the degree of heterogeneity of the studies used in the meta-analysis. Because the heterogeneity may determine the extent to which the conclusions of a meta-analysis can be generalized, it is important to quantify the extent of heterogeneity among a collection of studies. Realizing the potential limitations of statistical tests to characterize the degree of heterogeneity in a meta-analysis, Higgins and Thompson (2002) proposed new measures of the extent of heterogeneity in a meta-analysis that overcome the shortcomings of existing measures. Their focus is on the impact of heterogeneity on the results of a meta-analysis and therefore, on the degree to which scientific conclusions might be generalized to situations outside those investigated in the studies at hand. Their measures are easily interpreted by nonstatisticians as the proportion of variation that was explained by the difference among studies. Further, these measures do not intrinsically depend on the number of studies or the type of outcome data, therefore offering the possibility that statistical heterogeneity can be compared across different meta-analyses with differing numbers of studies and types of outcome data. Because of the fact that their proposed measures of heterogeneity in a meta-analysis measure the overall or collective heterogeneity within the group of studies used in a meta-analysis, the interpretation of the index on heterogeneity has to refer to the collection of studies used in the meta-analysis.
Often times, however, a scientifically important question to be answered in a meta-analysis is how inconsistent one specific study is compared to the rest of studies used in the meta-analysis, that is, how much impact each individual study has on the scientific conclusion of the meta-analysis and further on the generalization of the conclusion. Because heterogeneity comes about due to the fact that the effects under study in the population which the studies represent are not the same, it is important to understand the sources and possible explanations of the heterogeneity, including study sample characteristics, the design and analytic features used to report results, and the scientific interpretations of the study results. All these can only be facilitated when heterogeneity of individual studies can be directly measured in comparison to the rest of the studies in the meta-analyses.
In this article, we propose several new indices that measure the specific inconsistency for an individual study as compared to the rest of studies used in a meta-analysis. We seek to develop indices that will measure the study-specific degree of inconsistency in such a way that sheds light on the degree of contribution of this specific study to the overall conclusion of the meta-analysis. The proposed methodology can be regarded as a generalization of the collective index of heterogeneity proposed by Higgins and Thompson (2002). We also examine the difference among the proposed study-specific measures of heterogeneity and study the variation of each proposed measure when a large number of simulated meta-analyses are conducted. Finally, we demonstrate our proposed methodology by presenting an example to study possible biomarkers that can be used to identify subjects with high risk of developing Alzheimer’s disease (AD) when they are still cognitively normal.
We assume that a total of k studies are used in a meta-analysis to address a scientific question represented by a parameter θ. Let $\hat\theta_i$ be the estimate from the $i$th study and $\hat\sigma_i^2$ be the associated estimate of its variance. Let $\omega_i = 1/\hat\sigma_i^2$ denote the precision of the estimate. In a classic fixed effect meta-analysis, the θi’s are assumed identical and a summary estimate of the common parameter, $\hat\theta$, is computed as a weighted average of the study-specific estimates, using the precisions as weights:
$$\hat\theta = \frac{\sum_{i=1}^{k} \omega_i \hat\theta_i}{\sum_{i=1}^{k} \omega_i}.$$
The variance of the summary estimate is given by
$$\mathrm{Var}(\hat\theta) = \frac{1}{\sum_{i=1}^{k} \omega_i}.$$
A random effects meta-analysis can be conceptualized by incorporating a random effect, distributed as $N(0, \tau^2)$, to account for the between-study variation in the estimated study-specific parameters, in addition to the within-study random variation, $N(0, \sigma_i^2)$. The summary estimate of the mean parameter across the distribution of the studies, $\hat\theta_r$, has exactly the same form as above, but with weights replaced by
$$\omega_i^{*} = \left(\omega_i^{-1} + \hat\tau^2\right)^{-1}.$$
The estimated variance of the summary estimate is now given by
$$\widehat{\mathrm{Var}}(\hat\theta_r) = \frac{1}{\sum_{i=1}^{k} \omega_i^{*}}.$$
A test of homogeneity of the θi is given by
$$Q = \sum_{i=1}^{k} \omega_i \left(\hat\theta_i - \hat\theta\right)^2,$$
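which has a chi-squared distribution with k − 1 degrees of freedom under the assumption of homogeneity in the fixed effects model. A method of moments estimate of τ2 can be obtained as
$$\hat\tau^2 = \frac{Q - (k-1)}{\sum_{i=1}^{k} \omega_i - \sum_{i=1}^{k} \omega_i^2 \big/ \sum_{i=1}^{k} \omega_i}.$$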
Notice that in the fixed effect model, the assumptions of known sampling variances and normally distributed effect size estimates are usually approximations based on the large sample theory of maximum likelihood estimates. Further, the random effects model weights ignore the uncertainty in $\hat\tau^2$. The meta-analysis results are therefore only valid with large within-study sample sizes, to approximate known sampling variances and normally distributed estimates, and with a large number of studies (i.e., k), to reduce the imprecision in the estimate of τ2.
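The fixed effect estimate, the Q statistic, the method of moments estimate of τ2, and the random effects estimate described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the input numbers are hypothetical, and the truncation of the τ2 estimate at 0 is an assumed convention.

```python
def fixed_effect(estimates, variances):
    """Inverse-variance pooled estimate and its variance (fixed effect model)."""
    weights = [1.0 / v for v in variances]  # precision of each study
    total = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / total
    return pooled, 1.0 / total


def cochran_q(estimates, variances):
    """Q: precision-weighted squared deviations from the fixed effect estimate."""
    pooled, _ = fixed_effect(estimates, variances)
    return sum((t - pooled) ** 2 / v for t, v in zip(estimates, variances))


def tau2_moment(estimates, variances):
    """Method of moments between-study variance, truncated at 0 (assumed convention)."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    q = cochran_q(estimates, variances)
    return max(0.0, (q - (k - 1)) / (sw - sw2 / sw))


def random_effects(estimates, variances):
    """Pooled estimate and variance using weights 1 / (variance_i + tau2)."""
    tau2 = tau2_moment(estimates, variances)
    return fixed_effect(estimates, [v + tau2 for v in variances])


# Hypothetical effect estimates and within-study variances (not from the article).
theta = [1.0, 2.0, 4.0]
var = [1.0, 0.5, 0.25]
print(fixed_effect(theta, var))    # pooled estimate 3.0 with variance 1/7
print(cochran_q(theta, var))       # 10.0
print(tau2_moment(theta, var))     # 2.0
print(random_effects(theta, var))  # pulled toward the unweighted mean, larger variance
```

With these inputs the random effects weights are much closer to equal than the fixed effect weights, which is why the pooled estimate moves toward the simple average and its variance grows.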
Higgins and Thompson (2002) proposed a simple index to quantify the overall heterogeneity among studies in a meta-analysis:
$$I^2 = \frac{\tau^2}{\tau^2 + \sigma^2},$$
where σ2 is the shared within-study variance among individual studies, or when the studies have differing within-study variations, the “typical” within-study variance in the terms of Higgins and Thompson (2002). This intuitive definition of the heterogeneity has several major advantages as compared to the standard statistical test based on Q. First, the measure does not inherently depend on the number of studies in the meta-analysis. Second, the measure is not specific to a particular metric of treatment effect and therefore can be applied similarly irrespective of the type of outcome data (e.g., dichotomous, continuous, and survival). Third, the measure is easy to compute and has a very appealing interpretation as the percentage of the total variation across studies due to heterogeneity.
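The estimation of overall heterogeneity among studies in a meta-analysis requires estimates of both the between-study variance and the “typical” within-study variance. Higgins and Thompson (2002) used the following estimator
$$\hat\sigma^2_{HT} = \frac{(k-1)\sum_{i=1}^{k} \omega_i}{\left(\sum_{i=1}^{k} \omega_i\right)^2 - \sum_{i=1}^{k} \omega_i^2}$$
to estimate the “typical” within-study variance, and derived the index of overall heterogeneity
$$I^2_{HT} = \frac{\hat\tau^2}{\hat\tau^2 + \hat\sigma^2_{HT}} = \frac{Q - (k-1)}{Q}.$$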
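Takkouche, Cadarso-Suárez, and Spiegelman (1999) suggested another estimate of the “typical” within-study variance σ2 by taking the reciprocal of the arithmetic mean of the weights:
$$\hat\sigma^2_{T} = \frac{k}{\sum_{i=1}^{k} \omega_i}.$$
This gives another index of overall heterogeneity
$$I^2_{T} = \frac{\hat\tau^2}{\hat\tau^2 + \hat\sigma^2_{T}}.$$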
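Taking the simple arithmetic average of the within-study variances
$$\hat\sigma^2_{A} = \frac{1}{k}\sum_{i=1}^{k} \omega_i^{-1}$$
to estimate the “typical” within-study variance results in one more index of overall heterogeneity
$$I^2_{A} = \frac{\hat\tau^2}{\hat\tau^2 + \hat\sigma^2_{A}}.$$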
We follow the convention that all of these indices of heterogeneity are set to 0 if Q ≤ (k − 1).
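As a rough illustration, the three overall indices differ only in the estimate of the “typical” within-study variance that is plugged into τ2/(τ2 + σ2). The sketch below assumes the Higgins and Thompson estimator (k − 1)Σω/((Σω)2 − Σω2), the reciprocal of the mean weight, and the arithmetic mean of the within-study variances; the dictionary keys and the input numbers are hypothetical labels chosen for this example.

```python
def overall_indices(estimates, variances):
    """Three overall heterogeneity indices, each of the form tau2 / (tau2 + s2)."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sw
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))

    # Three estimates of the "typical" within-study variance.
    typical = {
        "higgins_thompson": (k - 1) * sw / (sw ** 2 - sw2),
        "reciprocal_mean_weight": k / sw,
        "mean_variance": sum(variances) / k,
    }
    if q <= k - 1:  # convention: no measurable heterogeneity
        return {name: 0.0 for name in typical}

    tau2 = (q - (k - 1)) / (sw - sw2 / sw)  # method of moments estimate
    return {name: tau2 / (tau2 + s2) for name, s2 in typical.items()}


# Hypothetical inputs (not from the article).
indices = overall_indices([1.0, 2.0, 4.0], [1.0, 0.5, 0.25])
for name, value in indices.items():
    print(name, round(value, 4))
```

With the first choice the index reduces to (Q − (k − 1))/Q. When all within-study variances are equal, the three estimates of the typical variance coincide, and so do the three indices.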
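For a specific study i, we accordingly propose three different indices to measure its heterogeneity from the collection of studies used in the meta-analysis:
$$I^2_{HT}(i) = \frac{\omega_i(\hat\theta_i - \hat\theta)^2 - \delta_i}{C\,(\hat\tau^2 + \hat\sigma^2_{HT})}, \qquad I^2_{T}(i) = \frac{\omega_i(\hat\theta_i - \hat\theta)^2 - \delta_i}{C\,(\hat\tau^2 + \hat\sigma^2_{T})}, \qquad I^2_{A}(i) = \frac{\omega_i(\hat\theta_i - \hat\theta)^2 - \delta_i}{C\,(\hat\tau^2 + \hat\sigma^2_{A})},$$
where $\delta_i = 1 - \omega_i / \sum_{j} \omega_j$, $C = \sum_{j} \omega_j - \sum_{j} \omega_j^2 / \sum_{j} \omega_j$, and $\hat\sigma^2_{HT}$, $\hat\sigma^2_{T}$, and $\hat\sigma^2_{A}$ are the three estimates of the “typical” within-study variance described above (that of Higgins and Thompson, the reciprocal of the mean weight, and the arithmetic mean of the within-study variances). In particular, $C\,(\hat\tau^2 + \hat\sigma^2_{HT}) = Q$. If all within-study variations are exactly the same, then δi = (k − 1)/k and $I^2_{HT}(i) = I^2_{T}(i) = I^2_{A}(i)$. We also follow the convention that if the numerator is negative in these indices, that is, $\omega_i(\hat\theta_i - \hat\theta)^2 \le \delta_i$, then $I^2_{HT}(i) = I^2_{T}(i) = I^2_{A}(i) = 0$.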
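The following sketch shows one way to compute study-specific indices of this kind. It rests on two assumptions that should be checked against the published formulas: δi is taken to be 1 − ωi/Σωj (the expected contribution of study i to Q when τ2 = 0), and each index uses the same denominator as the corresponding overall index, so that the study-specific values add up to the overall value when none is truncated. Function and key names are hypothetical.

```python
def study_specific_indices(estimates, variances):
    """Return (overall, per_study) heterogeneity indices for three choices of
    the 'typical' within-study variance. delta_i = 1 - w_i / sum(w) is an
    assumption of this sketch."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sw
    contrib = [w * (t - pooled) ** 2 for w, t in zip(weights, estimates)]
    q = sum(contrib)              # Cochran's Q is the sum of the contributions
    c = sw - sw2 / sw             # scaling constant in E(Q)

    typical = {
        "higgins_thompson": (k - 1) * sw / (sw ** 2 - sw2),
        "reciprocal_mean_weight": k / sw,
        "mean_variance": sum(variances) / k,
    }
    if q <= k - 1:                # convention: all indices are 0
        zero = {name: 0.0 for name in typical}
        return zero, {name: [0.0] * k for name in typical}

    tau2 = (q - (k - 1)) / c
    delta = [1.0 - w / sw for w in weights]
    overall, per_study = {}, {}
    for name, s2 in typical.items():
        denom = c * (tau2 + s2)   # equals Q for the Higgins-Thompson choice
        overall[name] = c * tau2 / denom
        per_study[name] = [max(0.0, (ci - di) / denom)
                           for ci, di in zip(contrib, delta)]
    return overall, per_study


# Hypothetical inputs (not from the article).
overall, per_study = study_specific_indices([1.0, 2.0, 4.0], [1.0, 0.5, 0.25])
for name in overall:
    print(name, round(overall[name], 4), [round(x, 4) for x in per_study[name]])
```

In this example no study-specific value is truncated at 0, so each list sums to the matching overall index.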
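From Equation (1) of Higgins and Thompson (2002), the expected value of the Q statistic is
$$E(Q) = (k-1) + \tau^2\left(\sum_{i=1}^{k} \omega_i - \frac{\sum_{i=1}^{k} \omega_i^2}{\sum_{i=1}^{k} \omega_i}\right).$$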
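If there is no heterogeneity among studies, that is, τ2 = 0, then E(Q) = k − 1. A similar mathematical derivation gives
$$E\left[\omega_i(\hat\theta_i - \hat\theta)^2\right] = \left(1 - \frac{\omega_i}{\sum_{j} \omega_j}\right) + \tau^2\,\omega_i\left(1 - \frac{2\omega_i}{\sum_{j} \omega_j} + \frac{\sum_{j} \omega_j^2}{\left(\sum_{j} \omega_j\right)^2}\right).$$
If there is no heterogeneity among studies, that is, τ2 = 0, then
$$E\left[\omega_i(\hat\theta_i - \hat\theta)^2\right] = 1 - \frac{\omega_i}{\sum_{j} \omega_j} = \delta_i.$$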
This also results in another method of moment estimate of τ2 as
Notice that the denominator of all the proposed overall and study-specific measures of heterogeneity is the unconditional variance of the estimated effect from a typical study in the meta-analysis, which contains additive components due to the within-study variance (i.e., from between-patient variation within the study) and the between-study variation (i.e., heterogeneity).
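For readers who want to verify the expected contribution of a single study to Q, the derivation can be sketched as follows. It assumes the random effects model with independent estimates $\hat\theta_i \sim N(\theta, \omega_i^{-1} + \tau^2)$, treats the weights as known, and writes $W = \sum_j \omega_j$; the symbol $W$ is introduced here only for brevity.

```latex
\begin{align*}
\hat\theta &= \frac{1}{W}\sum_{j=1}^{k}\omega_j\hat\theta_j, \qquad
\mathrm{Var}(\hat\theta) = \frac{1}{W} + \tau^2\,\frac{\sum_j \omega_j^2}{W^2}, \qquad
\mathrm{Cov}(\hat\theta_i,\hat\theta) = \frac{1+\omega_i\tau^2}{W},\\[4pt]
\mathrm{Var}(\hat\theta_i-\hat\theta)
 &= \left(\frac{1}{\omega_i}+\tau^2\right) - \frac{2(1+\omega_i\tau^2)}{W}
    + \frac{1}{W} + \tau^2\,\frac{\sum_j \omega_j^2}{W^2}\\
 &= \frac{1}{\omega_i}-\frac{1}{W}
    + \tau^2\left(1-\frac{2\omega_i}{W}+\frac{\sum_j \omega_j^2}{W^2}\right),\\[4pt]
E\!\left[\omega_i(\hat\theta_i-\hat\theta)^2\right]
 &= \left(1-\frac{\omega_i}{W}\right)
    + \tau^2\,\omega_i\left(1-\frac{2\omega_i}{W}+\frac{\sum_j \omega_j^2}{W^2}\right),\\[4pt]
\sum_{i=1}^{k} E\!\left[\omega_i(\hat\theta_i-\hat\theta)^2\right]
 &= (k-1) + \tau^2\left(W-\frac{\sum_j \omega_j^2}{W}\right) = E(Q).
\end{align*}
```

The last line recovers the expected value of Q, and setting τ2 = 0 in the third line leaves $1 - \omega_i/W$, which equals (k − 1)/k when all weights are equal.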
By Schwarz’s inequality (Noble and Daniel 1977),
It then follows that
Similarly, it is clear that for any study i,
Notice that if $I^2(i) > 0$ for all i, then, for each of the three indices,
$$\sum_{i=1}^{k} I^2(i) = I^2,$$
that is, the total heterogeneity in a meta-analysis can be partitioned as the simple sum of the heterogeneity from the individual studies. Therefore, the intuitive interpretation of the overall heterogeneity I2 carries over to the study-specific measures of heterogeneity I2(i), which can be read as the proportion of total variance that can be accounted for by the heterogeneity from study i. Notice also that the three study-specific measures depend on k and decrease when k increases. Therefore, as k increases, less and less of the total heterogeneity can be attributed to any single study.
Although mathematically, for each study i, it is important to understand how different these measures are when they are used to measure the study-specific heterogeneity in a meta-analysis and how much variation each index has when a large number of meta-analyses are conducted. Given that these measures are identical to each other when all studies have exactly the same degree of within-study variation, that is, when all ωi’s are the same, we anticipate that they will be close to each other when the difference among within-study variations is relatively small.
We performed a simulation study to examine the performance of our proposed indices of study-specific heterogeneity. For this purpose, we first generated a specific study whose parameter estimate follows the random effects model, with the between-study component following the normal distribution N(5, τ2) with τ2 = 0, 1, 4, through a linear transformation of the SAS function RANNOR (SAS 1999). The within-study precision for the specific study takes one of three possible values, 0.5 + υ, 0.5 + 2υ, or 0.5 + 3υ, for υ = 0, 0.5, and 2.0. The proposed study-specific measures of heterogeneity are computed for this specific study in each simulated meta-analysis. In addition to this specific study, 3s other studies (for s = 4 and 8) in the meta-analysis are generated by the same random effects model, with within-study precisions equally distributed among the three possible values 0.5 + υ, 0.5 + 2υ, and 0.5 + 3υ, that is, s studies have each of the three possible within-study precision values. Therefore, the total number of studies used in each meta-analysis is k = 3s + 1, where s was chosen as 4 and 8. For each possible value of τ2, s, and υ, 1000 independent simulated meta-analyses were performed such that study estimates for the specific study and the other 3s studies were independently generated across the 1000 meta-analyses. Table 1 presents the mean and standard error for the three proposed measures of study-specific heterogeneity over 1000 simulated meta-analyses as a function of τ2, k, and υ (notice that υ is a measure of heterogeneity among the study precisions). In addition, Table 1 also presents the true overall measures of heterogeneity for each scenario.
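A simulation of this design can be sketched in Python, using `random.gauss` in place of SAS RANNOR. Several details are assumptions made for illustration only: the study-specific index is computed with Q in the denominator and δ = 1 − ω/Σω, the reported spread is the standard error of the mean, and the seed and function names are hypothetical.

```python
import random


def simulate_once(tau2, s, upsilon, level, rng):
    """Study-specific index of the designated study in one simulated
    meta-analysis with k = 3s + 1 studies (assumed index definition)."""
    precisions = [0.5 + level * upsilon]          # designated study
    for j in (1, 2, 3):                           # s studies per precision value
        precisions += [0.5 + j * upsilon] * s
    estimates = []
    for w in precisions:
        true_effect = rng.gauss(5.0, tau2 ** 0.5)                   # between-study part
        estimates.append(rng.gauss(true_effect, (1.0 / w) ** 0.5))  # within-study noise
    k = len(estimates)
    sw = sum(precisions)
    pooled = sum(w * t for w, t in zip(precisions, estimates)) / sw
    contrib = [w * (t - pooled) ** 2 for w, t in zip(precisions, estimates)]
    q = sum(contrib)
    if q <= k - 1:                                # convention: index set to 0
        return 0.0
    delta0 = 1.0 - precisions[0] / sw
    return max(0.0, (contrib[0] - delta0) / q)    # truncated at 0


def simulate(tau2, s, upsilon, level, reps=1000, seed=2024):
    """Mean and standard error of the index over `reps` simulated meta-analyses."""
    rng = random.Random(seed)
    values = [simulate_once(tau2, s, upsilon, level, rng) for _ in range(reps)]
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, (var / reps) ** 0.5


for tau2 in (0.0, 1.0, 4.0):
    print(tau2, simulate(tau2, s=4, upsilon=0.5, level=1))
```

Because the index is truncated at 0, the mean at τ2 = 0 is expected to be slightly positive rather than exactly 0.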
Notice that our simulation results in Table 1 cover a wide range of true degrees of heterogeneity, with the true index ranging from 0% to almost 95%. From our simulated meta-analyses, it is clear that the three different measures of overall and study-specific heterogeneity are very consistent within the specified ranges of parameters. In fact, under the assumption that the three measures of heterogeneity are estimating the same underlying heterogeneity, we computed the intraclass correlation coefficient (ICC) (Shrout and Fleiss 1979) over 1000 simulated meta-analyses for each choice of τ2, k, and υ. All of these computed ICCs were at least 0.99, indicating extremely high consistency among these measures. When τ2 = 0, there is no heterogeneity across studies in the meta-analyses, that is, $I^2 = 0$, which should then imply that $I^2(i) = 0$ for each individual study i. However, because of a positive probability that $\omega_i(\hat\theta_i - \hat\theta)^2 \le \delta_i$, we adopted the convention of defining $I^2(i) = 0$ in this case. This truncation therefore leads to a possible positive bias. The results in Table 1 when τ2 = 0 present estimates of the degree of the positive bias due to the truncation at 0.
Alzheimer’s disease (AD) is a highly complex and multi-factorial progressive neurological disease that results in the irreversible loss of neurons in one or multiple regions of the brain. We present an application of our proposed overall and study-specific measures of heterogeneity to study possible biomarkers that can be used to identify individuals with high risk of developing Alzheimer’s disease (AD) when they are still cognitively normal. Recent research advances in Alzheimer’s disease have found Apolipoprotein E4 (ApoE4) alleles to be a genetic risk factor of AD (Myers 1996). Although the pathological hallmarks of AD are the neurofibrillary tangles and the senile plaques in the brain (Braak and Braak 1991; McKeel et al. 2004; Fagan et al. 2007), the diagnosis of AD in living patients is still largely a clinical judgment based on careful neurological and/or neuropsychological examination combined with results from other clinical tests. Therefore, the search for biomarkers that can be used to differentiate AD from normal aging remains one of the primary research activities in AD. In several publications (Fagan et al. 2007; Sunderland et al. 2003), individuals with AD were found to have decreased levels of cerebrospinal fluid (CSF) β-amyloid42 as compared to individuals with normal aging. Because AD is a progressive neurodegenerative disorder that leads to the irreversible death of brain cells, it is important to assess the potential of the CSF biomarker to identify individuals who are at high risk of AD while they are still cognitively normal. The importance of such antecedent biomarkers is further highlighted by the fact that no pharmaceutical treatments are effective for the disease’s later stages. We chose to study whether CSF β-amyloid42 is decreased among individuals of normal aging who are ApoE4 positive as compared to those who are ApoE4 negative.
Although many publications have compared CSF β-amyloid42 levels between individuals with AD and those with normal aging (Fagan et al. 2007; Sunderland et al. 2003), very few have actually reported CSF β-amyloid42 as a function of ApoE4 status among subjects who were still cognitively normal. In fact, our comprehensive MEDLINE search identified a total of only six published studies on CSF β-amyloid42 during the period of 1990 to 2007 that reported summary statistics as a function of ApoE4 status for individuals who were not demented (Sunderland 2004; Jensen et al. 1999; Andreasen et al. 1999; Tapiola et al. 2000; Riemenschneider et al. 2000; Prince et al. 2004). The summary statistics reported from these six published studies are presented in Table 2 [summary statistics from the study by Prince et al. (2004) were read by eye from a figure because only a graphical presentation of the summary statistics was available in the publication].
Based on our proposed methodology and a random effect model, the pooled estimate to the mean difference of CSF β-amyloid42 between individuals of normal aging who are ApoE4 positive and those who are ApoE4 negative is −31.69 pg/mL, and an asymptotic 95% confidence interval estimate to the mean difference of CSF β-amyloid42 is from −128.93 pg/mL to 65.56 pg/mL, suggesting a nonsignificant difference at a 5% significance level. The measures of overall heterogeneity from this meta-analysis are estimated as , and , respectively, indicating from low to moderate degree of heterogeneity among studies used in the meta-analysis (Higgins et al. 2003). If the heterogeneity is ignored in the meta-analysis, that is, the between-study variance τ2 is assumed as 0 (therefore ), then a fixed effect model would be used for the meta-analysis. The estimated overall mean difference of CSF β-amyloid42 between individuals of normal aging who are ApoE4 positive and those who are ApoE4 negative under the fixed effect model is −45.35 pg/mL. An asymptotic 95% confidence interval estimate to the mean difference of CSF β-amyloid42 under the fixed effect model is from −74.89 pg/mL to −15.82 pg/mL, suggesting a statistically significant difference at a 5% significance level on CSF β-amyloid42 between individuals of normal aging who are ApoE4 positive and those who are ApoE4 negative. This discrepancy on the statistical inference between the fixed effect model and the random effect model is partly due to the fact that one approach (i.e., the random effect model) takes into account of heterogeneity across studies whereas the other (i.e., the fixed effect model) ignores such heterogeneity, suggesting the importance to measure the heterogeneity in meta-analyses when it does exist. 
The fixed-effects model provides a conditional inference about the set of studies included in the meta-analysis, while the random-effects model provides an unconditional inference about a hypothetical population of studies (from which the included studies are assumed to be a random sample). Either model provides the appropriate inferences under the specific assumptions under the model (Hedges and Vevea 1998).
Columns 3 to 5 of Table 3 display the study-specific measures of heterogeneity for all six studies. All three indices indicated that the study by Prince et al. (2004) has the largest heterogeneity from the rest of studies. In fact, the study by Prince et al. (2004) alone accounts for from 12% to 40% of overall heterogeneity in the meta-analysis. The last column of Table 3 presents the pooled estimate to the mean difference of CSF β-amyloid42 between individuals of normal aging who are ApoE4 positive and those who are ApoE4 negative when one study is excluded from the meta-analysis with a random effect model. When the study by Prince et al. (2004) was excluded from the meta-analysis, the pooled estimate to the mean difference of CSF β-amyloid42 was −9.13 pg/mL, giving the largest deviation from the pooled estimate when all six studies were included in the meta-analysis.
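The leave-one-out comparison in the last column of Table 3 can be sketched as below. The numbers are hypothetical and are not the values in Table 2; the random effects estimate uses the method of moments τ2 truncated at 0, which is an assumption of this sketch, and the function names are illustrative.

```python
def pooled_random_effects(estimates, variances):
    """Random effects pooled estimate with a method of moments tau2 (truncated at 0)."""
    k = len(estimates)
    weights = [1.0 / v for v in variances]
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    fixed = sum(w * t for w, t in zip(weights, estimates)) / sw
    q = sum(w * (t - fixed) ** 2 for w, t in zip(weights, estimates))
    tau2 = max(0.0, (q - (k - 1)) / (sw - sw2 / sw))
    star = [1.0 / (v + tau2) for v in variances]   # random effects weights
    return sum(w * t for w, t in zip(star, estimates)) / sum(star)


def leave_one_out(estimates, variances):
    """Pooled estimate recomputed with each study removed in turn."""
    return [
        pooled_random_effects(estimates[:i] + estimates[i + 1:],
                              variances[:i] + variances[i + 1:])
        for i in range(len(estimates))
    ]


# Hypothetical mean differences and variances; the last study is an outlier.
theta = [-10.0, -12.0, -8.0, -11.0, -60.0]
var = [4.0, 5.0, 4.0, 6.0, 5.0]
full = pooled_random_effects(theta, var)
loo = leave_one_out(theta, var)
shifts = [abs(x - full) for x in loo]
print(round(full, 2), [round(x, 2) for x in loo])
print("most influential study index:", shifts.index(max(shifts)))
```

In this made-up example, dropping the outlying study moves the pooled estimate the most, which mirrors how the study with the largest study-specific heterogeneity is flagged in the article's example.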
We proposed several new indices that measure the heterogeneity of individual studies as compared to the rest of the studies used in a meta-analysis. By estimating the “typical” within-study precisions, we developed indices that measure the degree of inconsistency among studies by their impact on the overall conclusion of the meta-analysis. The proposed methodology can be regarded as a generalization of the collective index of heterogeneity proposed by Higgins and Thompson (2002). We assessed the variation associated with each proposed index of heterogeneity through a large simulation study. We also examined the difference among the proposed study-specific measures of heterogeneity and found that these indices provided quite consistent results in measuring the study-specific heterogeneity in the simulated meta-analyses. Finally, we demonstrated our proposed methodology by presenting a real world application to study a CSF biomarker that can be used to identify individuals with high risk of developing Alzheimer’s disease (AD) when they are still cognitively normal. We further identified the studies that have the most heterogeneity in this example, and assessed their individual effect on the overall estimate of the effect size of ApoE4 genotypes.
Our proposed study-specific measures of heterogeneity directly assess how inconsistent one specific study is compared to the rest of studies used in the meta-analysis, that is, how much impact the specific study has on the scientific conclusion of the meta-analysis and further on the generalization of the conclusion. Further each proposed index has another simple appealing property that its sum over all the studies used in the meta-analysis is the same as the overall measure of heterogeneity for the meta-analysis. This simple property allows the interpretation of study-specific measures of heterogeneity within the context of overall measures of heterogeneity in a meta-analysis and therefore inherits the appealing conceptualization that the study-specific measures represent the proportion of total variance across studies that can be accounted for by the heterogeneity from specific studies.
Addressing statistical heterogeneity of studies is one of the most important aspects of many systematic reviews. The interpretative aspects of statistical inferences from a meta-analysis depend on the degree of heterogeneity of the studies used in the meta-analysis. Because heterogeneity comes about due to the fact that the effects under study in the populations which the studies represent are not the same, it is important to understand the sources and possible explanations of the heterogeneity. When individual studies used in a meta-analysis have very differing results, knowing the exact contribution of individual studies to the total heterogeneity becomes the first step to understand the sources of heterogeneity. This information can not only identify studies with the largest heterogeneity in a meta-analysis but also help more careful assessments on the individual studies to make sure they are consistent in patient characteristics and study designs as well as analytic approaches. If there is enough evidence suggesting that the heterogeneity of a specific study is extremely large compared to other studies and mainly due to different patient populations or different study designs or less-than-optimal analytic approaches, the protocol of the meta-analysis may be revised to exclude the study, or meta-analytic results with and without the study may be both reported to allow an assessment on the impact of the single study on the scientific conclusions. In fact, with our proposed study-specific indices of heterogeneity, it becomes possible that future meta-analyses report the study-specific heterogeneity indices to give an estimate to the proportion of total variance in the reported effect sizes that can be accounted for by individual studies.
Dr. Xiong’s work was supported by grant K25 AG025189 from the National Institute on Aging. Financial support for this study was also provided in part by National Institute on Aging grants AG003991, AG005681, and AG026276 for Chengjie Xiong, J. Philip Miller, and John C. Morris.
Chengjie Xiong, TKKK, Division of Biostatistics, Washington University, St. Louis, MO 63110.
J. Philip Miller, Professor of Biostatistics, Washington University, St. Louis, MO 63110.
John C. Morris, Friedman Distinguished Professor of Neurology, Washington University, St. Louis, MO 63110.