A key requirement for any imputation-based approach is that alleles be labeled consistently across studies. In our evaluation of FUSION and HGDP samples, using the HapMap as a reference, we were fortunate that a subset of the HapMap individuals was genotyped in each study for quality control. Contrasting the genotypes for these quality control samples with those generated by the HapMap Consortium made the usually laborious process of ensuring consistent allele labeling across labs much easier. We strongly recommend that all labs conducting GWAS genotype a small number of HapMap individuals for this purpose.
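As an illustration of how shared quality control samples can expose inconsistent allele labels, the following minimal sketch compares study genotypes with reference genotypes for the same individuals at one SNP. It is not the procedure used in our analyses; the function names, the sample identifiers, the two-letter genotype encoding, and the 0.95 concordance threshold are all assumptions made for the example.

```python
# Sketch only: names, genotype encoding ("AG", "CC", ...) and the threshold
# are illustrative assumptions, not part of any published pipeline.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_called(genotype):
    """True for a non-missing genotype made only of A/C/G/T alleles."""
    return bool(genotype) and all(allele in COMPLEMENT for allele in genotype)


def flip(genotype):
    """Return the genotype as read from the opposite strand."""
    return "".join(COMPLEMENT[allele] for allele in genotype)


def normalize(genotype):
    """Ignore allele order, so that 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype))


def concordance(study, reference, transform=lambda g: g):
    """Fraction of shared, called samples whose genotypes agree."""
    shared = [s for s in study
              if s in reference and is_called(study[s]) and is_called(reference[s])]
    if not shared:
        return None
    matches = sum(normalize(transform(study[s])) == normalize(reference[s])
                  for s in shared)
    return matches / len(shared)


def classify_labeling(study, reference, threshold=0.95):
    """Classify one SNP from genotypes of the shared QC samples."""
    direct = concordance(study, reference)
    flipped = concordance(study, reference, flip)
    if direct is None:
        return "no_overlap"
    if direct >= threshold and flipped >= threshold:
        # Happens for A/T or C/G SNPs when every shared sample is heterozygous.
        return "ambiguous"
    if direct >= threshold:
        return "consistent"
    if flipped >= threshold:
        return "strand_flip"
    return "inconsistent"


# Hypothetical QC samples genotyped in both the study and the reference panel.
study_calls = {"qc1": "AG", "qc2": "GG", "qc3": "AA"}
reference_calls = {"qc1": "CT", "qc2": "CC", "qc3": "TT"}
print(classify_labeling(study_calls, reference_calls))  # strand_flip
```

Because the comparison uses genotypes of the same individuals rather than allele frequencies, even A/T and C/G SNPs can usually be resolved, provided at least some of the shared samples are homozygous.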
Another practical consideration arises when integrating data from studies that use diverse genotyping platforms. Superficially, it is tempting to first impute missing genotypes in each sample and then to conduct a pooled analysis of all available data. However, this is almost never a good idea, as illustrated by a particularly extreme case in which a set of cases and controls have been genotyped on two different platforms and a marker of interest has been genotyped in cases but must be imputed in controls. If the marker of interest cannot be well predicted by flanking markers, imputation will default to suggesting that the genotype distribution at that marker matches the reference panel. This could be a very poor assumption if the reference panel and study sample have drifted apart, potentially resulting in spurious association. Even if the marker can be well predicted by flanking markers, it is possible that the reference panel and the case sample used different genotyping assays that, for technical reasons such as the presence of a polymorphism that overlaps assay primers, give consistently distinct results, again resulting in spurious association. To avoid these sources of spurious association, we recommend that, when analyzing genotype data generated using different platforms, different versions of the same platform, or the same platform but with experiments carried out at different labs, an initial round of association analysis be carried out separately on the data from each platform/version/site combination. The results from this initial round of analysis can then be meta-analyzed, minimizing the risk of artifacts. Note that this recommendation precludes analyses in which all cases are genotyped at one site and all controls are genotyped at a different site, because no platform/version/site combination would then contain both cases and controls.
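The second step of this recommendation, combining per-stratum association results, can be sketched as follows. The text above does not prescribe a particular meta-analysis method; the example assumes fixed-effects inverse-variance weighting, one common choice, and assumes that each platform/version/site combination has already yielded an effect estimate (for example a log odds ratio) and its standard error for the same allele. The function name and the input values are hypothetical.

```python
import math


def inverse_variance_meta(estimates):
    """Fixed-effects inverse-variance meta-analysis (assumed method).

    estimates: list of (beta, se) pairs, one per platform/version/site
    combination, all expressed relative to the same reference allele.
    Returns the combined beta, its standard error, the z statistic,
    and a two-sided p-value from the normal approximation.
    """
    if not estimates:
        raise ValueError("at least one stratum is required")
    if any(se <= 0 for _, se in estimates):
        raise ValueError("standard errors must be positive")

    # Each stratum is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)

    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return beta, se, z, p_value


# Hypothetical per-stratum results for one SNP: (log odds ratio, standard error).
per_stratum = [(0.20, 0.10), (0.40, 0.10)]
combined_beta, combined_se, z_stat, p_val = inverse_variance_meta(per_stratum)
```

Because each stratum contributes only its own within-stratum comparison of cases and controls, systematic differences between platforms or sites affect cases and controls equally within a stratum and do not enter the combined estimate as a case-control difference.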
In the experiments described so far, we illustrated the accuracy of genotype imputation that relies on existing resources (such as the Phase II HapMap) and genotyping technologies (including a variety of commercial genotyping chips). It is likely that both these resources and technologies will continue to evolve rapidly, and it is interesting to consider how these developments might impact imputation-based approaches. For example, it is clear that genotyping chips of the future will be able to examine an ever larger number of tag SNPs in a cost-effective manner. Extrapolating from our results, it is clear that these chips should provide improved genomic coverage, eventually allowing investigators to impute nearly all HapMap SNPs with near-perfect accuracy. Nevertheless, it is also clear from our results that, when coupled with imputation-based analyses, even relatively low-density SNP chips can provide excellent coverage of the genome in populations with LD patterns similar to those of the CEU, JPT, and CHB. Thus, we expect that the main advantages of new higher-density chips will be in the study of populations with less extensive LD, such as the YRI, and in the analysis of rarer variants.