In a re-analysis of the data from 3 GWA studies on type 2 diabetes, we found that for 5 of the 11 genetic variants that are considered “confirmed” susceptibility loci for type 2 diabetes there was still moderate to very large between-study heterogeneity across the different GWA investigations. Given the between-study heterogeneity, the level of statistical significance was more conservative with random effects calculations. Further examination of these potentially heterogeneous associations suggested possible explanations for the observed inconsistency. In several cases, this probably reflected the fact that the identified marker was not the culprit polymorphism, but had a different linkage disequilibrium pattern with the culprit polymorphism across different studies. In the case of FTO, it probably reflected the fact that it was associated with type 2 diabetes through its effect on the correlated phenotype of obesity; the phenotype correlation varied across different studies. Additional explanations for the heterogeneity may also need to be considered, as discussed below. Conversely, we should caution that homogeneity of effects for the other 6 variants provides limited information on whether a causative locus has been identified. Lack of heterogeneity is not proof of causality.
Overall, detection of heterogeneity is very useful. Some polymorphisms are shown to reach genome-wide statistical significance by fixed effects calculations, but not by random effects calculations due to large between-study heterogeneity. In these cases, priority should be given to the consideration of other, correlated phenotypes and fine mapping for identifying linked, true culprit polymorphisms that yield less heterogeneous association signals. These situations are likely to be very common in the GWA setting. Tag markers are not selected based on “candidate gene variant” considerations. Thus it is more likely that one may hit upon a variant that is a linked marker rather than hit directly upon the culprit causative variant. Markers will often have variable linkage disequilibrium across different populations. This will result in heterogeneous genetic effects across studies.
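The contrast between fixed and random effects significance can be illustrated with a minimal sketch. It assumes per-study log odds ratios with standard errors and uses the DerSimonian and Laird moment estimator of the between-study variance. The three studies, the function names, and the 5e-8 genome-wide threshold are hypothetical choices made for illustration only; they are not data or settings from the 3 GWA investigations.

```python
import math

def pooled_estimates(log_ors, ses):
    """Fixed-effect and DerSimonian-Laird random-effects summaries."""
    w = [1.0 / se ** 2 for se in ses]            # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_ors)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect summary
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_ors))
    df = len(log_ors) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    sw_re = sum(w_re)
    random_ = sum(wi * y for wi, y in zip(w_re, log_ors)) / sw_re
    return {
        "fixed": (fixed, math.sqrt(1.0 / sw)),
        "random": (random_, math.sqrt(1.0 / sw_re)),
        "Q": q,
        "tau2": tau2,
    }

def two_sided_p(estimate, se):
    """Two-sided p-value from a normal approximation."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))

# Hypothetical heterogeneous studies (log odds ratios and standard errors).
result = pooled_estimates([0.05, 0.15, 0.25], [0.03, 0.03, 0.03])
p_fixed = two_sided_p(*result["fixed"])
p_random = two_sided_p(*result["random"])
GENOME_WIDE = 5e-8                               # assumed threshold
print(f"fixed p = {p_fixed:.2e}, random p = {p_random:.2e}")
```

In this constructed example the summary effect is the same under both models, but the random effects standard error is wider, so only the fixed effects p-value crosses the assumed threshold.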
Correlated phenotypes are also a major issue. Many common diseases and traits (e.g. diabetes, myocardial infarction, obesity, hypertension, metabolic syndrome) are modestly or even highly correlated. Inconsistent susceptibility signals for one of them may reflect consistent associations with another correlated phenotype. Moreover, most common diseases that are assumed to have a complex genetic background are probably a complex mix of different phenotypes in terms of their molecular pathogenesis. Genetic variants may have specific molecular functional effects that cumulatively build a complex clinical phenotype. However, depending on their molecular background, the relative representation of these phenotypes may vary in different people and populations with seemingly the same clinical disease. The case definition of this broad clinical phenotype may not do justice to the underlying molecular complexity. Molecular and clinical phenotypes may exhibit some correlation pattern, but this may vary in different sub-populations depending on the presence of other gene variants. Again, statistical heterogeneity may offer a window to this complexity.
Another possibility is bias. Incorporating between-study heterogeneity in the summary calculations has the advantage of penalizing associations where results are inconsistent across studies due to population-specific biases, and it gives higher ranks to the consistent associations. The 3 GWA investigations on type 2 diabetes paid meticulous attention to methodological detail and their design was exemplary. Careful genotyping controls were set and population stratification was controlled with principal component analysis. Nevertheless, minute biases affecting particular polymorphisms with small odds ratios around 1.12 cannot be excluded. Even if some major systematic errors (e.g. population stratification, genotyping error, phenotype misclassification) are controlled, not all biases are foreseeable. Moreover, minimized average biases do not exclude much larger differential biases for a few polymorphisms. P-values for testing the observed genetic effects against the null effect hypothesis account for random, not systematic, error.
Another potential reason for heterogeneity is the winner's curse, a manifestation of chance and regression-to-the-mean, especially under circumstances of multiple testing with limited power. The first study that claims an association that passes a very demanding required significance threshold may exhibit a genetic effect that is larger than the true average effect of this association.
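The winner's curse can be sketched with a small simulation. It assumes a true odds ratio of 1.12, a per-study standard error of 0.04 on the log scale, and a significance threshold of 5e-8; all three values, as well as the function name, are hypothetical choices for illustration. Among simulated studies that pass the threshold, the average estimated effect exceeds the true effect.

```python
import math
import random
from statistics import NormalDist, mean

def simulate_winners_curse(true_log_or, se, alpha, n_studies, seed=0):
    """Draw study estimates and keep those passing the significance threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    estimates = [rng.gauss(true_log_or, se) for _ in range(n_studies)]
    winners = [b for b in estimates if abs(b / se) > z_crit]
    return estimates, winners

true_log_or = math.log(1.12)      # assumed true effect
estimates, winners = simulate_winners_curse(
    true_log_or, se=0.04, alpha=5e-8, n_studies=100_000
)

print(f"true OR: {math.exp(true_log_or):.2f}")
print(f"mean OR, all studies: {math.exp(mean(estimates)):.2f}")
if winners:
    print(f"mean OR, significant studies only: {math.exp(mean(winners)):.2f}")
```

Because power is limited under these assumptions, only a small fraction of simulated studies are "winners", and they are selected precisely because chance pushed their estimates upward.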
Finally, another possibility is gene-environment interactions (e.g. as proposed for rs1801282 and low physical activity) with differential non-genetic environmental exposures across different populations. Moreover, genuine genetic heterogeneity in effect sizes across different ethnic backgrounds and population-specific gene-gene epistatic effects are sometimes postulated. However, interaction effects (effect modification) may require huge studies to confirm, much larger than even the very large consortia that have been put together in the genetics of type 2 diabetes.
We should stress that estimation of between-study heterogeneity carries considerable uncertainty, and in the typical situation it would be impossible to have a large number of large studies to fully power detection and accurate estimation of heterogeneity. Moreover, breaking down populations into sub-studies may sometimes lead to loss of estimated between-study heterogeneity, if the sub-studies are small and the confidence intervals of their effects are very large. This would offer misleading reassurance that no heterogeneity exists. While the number of datasets may increase by such splitting, each dataset would have very limited, inconclusive information about the magnitude of the effect and it would again be very difficult to show the between-study heterogeneity, even if present.
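The dependence of estimated heterogeneity on study precision can be sketched as follows. The example uses Cochran's Q and the I² statistic as descriptive measures, which is an assumption of this sketch, and it simplifies the splitting scenario: the same three hypothetical point estimates are evaluated once with narrow and once with wide standard errors, without changing the number of datasets.

```python
def heterogeneity(effects, ses):
    """Return Cochran's Q and I^2 (as a proportion) for the given studies."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at zero when Q does not exceed its degrees of freedom
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

effects = [0.05, 0.15, 0.25]          # hypothetical log odds ratios

# Precise datasets: the spread of estimates is large relative to their error.
q_precise, i2_precise = heterogeneity(effects, [0.03, 0.03, 0.03])

# Imprecise datasets: same point estimates, four-fold wider standard errors.
q_wide, i2_wide = heterogeneity(effects, [0.12, 0.12, 0.12])

print(f"precise:   Q = {q_precise:.2f}, I2 = {i2_precise:.0%}")
print(f"imprecise: Q = {q_wide:.2f}, I2 = {i2_wide:.0%}")
```

With wide standard errors the estimated I² drops to zero even though the point estimates differ just as much, which is the misleading reassurance described above.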
In general, when between-study heterogeneity is demonstrated or cannot be excluded, random effects models have been accepted as the default across different applications of meta-analysis and this should be accepted also for GWA investigations. Fixed effects may sometimes result in misleading inferences. In the presence of heterogeneity, the main assumption of fixed effects is violated and their application is inappropriate. However, a caveat for random effects is that they tend to diminish the difference in the relative weighting of small vs. larger studies. This is a drawback in situations where small studies may suffer more from errors or biases than larger studies. Disproportionate weighting of the biased small studies would then lead to erroneous results. This situation may typically arise when the data to be synthesized have been collected retrospectively from published information and publication bias is operating in the field. Small studies may have been published preferentially when they show significant results, while the evidence from larger studies may be available regardless of the results. Thus the total available evidence from larger studies may be less biased, even if single larger studies are not necessarily less biased than single smaller studies.
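The shift in relative weighting can be seen in a short sketch. It assumes inverse-variance weights, with the between-study variance added to each study's own variance under random effects. The two standard errors and the between-study variance of 0.01 are hypothetical values, and `relative_weights` is an illustrative helper name.

```python
def relative_weights(ses, tau2):
    """Share of the total weight given to each study.

    tau2 = 0 corresponds to fixed effects; tau2 > 0 to random effects.
    """
    raw = [1.0 / (se ** 2 + tau2) for se in ses]
    total = sum(raw)
    return [w / total for w in raw]

ses = [0.20, 0.05]                    # hypothetical small and large study

fixed = relative_weights(ses, tau2=0.0)
random_ = relative_weights(ses, tau2=0.01)   # assumed between-study variance

print(f"small-study weight, fixed effects:  {fixed[0]:.1%}")
print(f"small-study weight, random effects: {random_[0]:.1%}")
```

As the between-study variance grows relative to the within-study variances, the weights move toward equality, so a biased small study gains influence on the summary estimate.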
While this bias is a concern for retrospective meta-analyses, it should not be an issue for a prospective collaborative GWA investigation performed within a consortium of investigators. In this setting, there is no reason why investigators would choose to include in the calculations only the most impressive results. However, a particular threat to the credibility of GWA results arises if several GWA investigations are performed and results are made available only for the most significant p-values in each GWA investigation. While this deficit will hopefully be remedied by quick release of genome-wide data in the future, the majority of studies have not done so yet.
We should also mention that there are different models that can incorporate between-study heterogeneity in the calculations. We used a conservative approach, the DerSimonian and Laird model, which is the most frequently used random effects model in the literature. Other, fully Bayesian approaches may also be used, including hierarchical Bayesian models. Some of these models may also incorporate other parameters, such as minor deviations from Hardy-Weinberg equilibrium in the observed genotyping data. These models usually tend to give even larger uncertainty and they widen the 95% credibility intervals of the estimates.
In all, heterogeneity is a useful aspect of the data, rather than a nuisance, as it can often point to leads that can better clarify the nature of postulated associations in the context of meta-analysis. Heterogeneity should not be ignored and should be carefully factored into the interpretation of emerging genetic associations from GWA studies. Heterogeneity also has implications for the epidemiological design of GWA studies and their replication efforts. Consistency in the definition of phenotypes, meticulous attention to quality control in genotyping, and avoidance of population stratification are warranted, so as to avoid heterogeneity due to bias. However, heterogeneity due to genuine differences should not be avoided. Thus one should encourage diversity in secondary aspects of the study design across studies, such as the use of matching or not for other population characteristics, and targeting of populations of diverse racial descent with different linkage disequilibrium patterns. Finally, proper evaluation of between-study heterogeneity would ideally require complete and transparent individual-level information on genotype results from all conducted GWA investigations. Ensuring full public data availability would enhance the credibility of GWA evidence.