The advance of high-throughput technology makes it possible to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) simultaneously, which allows researchers to examine genetic variation across the whole genome in genome-wide association studies (GWAS). By testing the association between SNPs and complex traits and diseases, GWAS have successfully uncovered hundreds of novel susceptibility loci to date [Hindorff et al. (2009)].
Even though current GWAS platforms include markers for hundreds of thousands or even millions of SNPs, they still directly assay only a portion of the whole genome. If only directly genotyped SNPs are considered, associated SNPs can go undetected. Another drawback of the partial coverage is that the selected SNP panel often varies across platforms [Barrett et al. (2006)]. When different studies use different platforms, combining them leaves a much reduced set of SNPs genotyped in all the studies. For example, the overlap between the Affymetrix SNP Array 6.0 and Illumina OmniExpress genotyping array is less than 30%. An effective approach to overcome these problems is to impute the untyped SNPs based on a common reference panel.
The basic idea behind genotype imputation is to take advantage of the linkage disequilibrium (LD) information among SNPs. Because of the LD and haplotype structure, genotyped variants can provide information about untyped SNPs. It is feasible to use data on genotyped SNPs along with an appropriate reference panel containing information on a larger set of SNPs to predict the genotypes of the ungenotyped SNPs. Currently the HapMap project [The International HapMap Consortium (2005)] provides such reference panels, and future studies are likely to extend to the 1000 Genomes Project [The 1000 Genomes Project Consortium (2010)] or other whole genome or exome sequence data. The most popular imputation programs include MACH [Li et al. (2010)], IMPUTE [Marchini et al. (2007)], and Beagle [Browning and Browning (2009)], among others.
There are several approaches to using imputed values in the association analysis. Suppose a SNP of a given subject $i$ has genotype $g_i$, where $g_i$ takes one of the three values 0, 1 and 2, the number of copies of one of the alleles (typically the “minor” or lower frequency allele). The output of an imputation program usually includes three probabilities: $p_{i0} = P(g_i = 0)$, $p_{i1} = P(g_i = 1)$ and $p_{i2} = P(g_i = 2)$. One method is to use the most likely genotype (the genotype with the highest probability) as if it were the true genotype. However, Lin and Huang (2007) showed that this method leads to intrinsically biased estimates because of the unavoidable discrepancy between the most likely genotype and the true genotype. Another popular approach is the so-called expectation-substitution method. Instead of using the most likely genotype, this method uses the dosage, that is, the expected number of minor alleles $p_{i1} + 2p_{i2}$, as if it were the true genotype. In the haplotype analysis framework, several studies [Kraft et al. (2005); Kraft and Stram (2007); Cordell (2007)] have shown through a series of simulation experiments that the expectation-substitution method has no noticeable bias under practical settings. It is also possible to use Bayesian methods [Marchini et al. (2007); Servin and Stephens (2007)] to perform the imputation and the association test at the same time; however, these methods are usually computationally intensive and hence not feasible on a genome-wide scale. Therefore, in the remainder of the paper, we will focus on the expectation-substitution method.
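The two ways of summarizing imputation output described above can be sketched in a few lines of Python. This is only an illustration: the function names `dosage` and `most_likely_genotype` and the tolerance parameter `tol` are hypothetical and do not correspond to any imputation program's interface.

```python
def dosage(p0, p1, p2, tol=1e-3):
    """Expected allele count 0*p0 + 1*p1 + 2*p2 (expectation-substitution).

    `tol` is an assumed tolerance for rounding in the reported probabilities.
    """
    if abs(p0 + p1 + p2 - 1.0) > tol:
        raise ValueError("genotype probabilities should sum to 1")
    return p1 + 2.0 * p2


def most_likely_genotype(p0, p1, p2):
    """Genotype (0, 1 or 2) with the highest imputed probability."""
    probs = (p0, p1, p2)
    return max(range(3), key=lambda g: probs[g])


# One subject with an uncertain imputation result (made-up probabilities).
probs = (0.1, 0.3, 0.6)
print(dosage(*probs))                # expected allele count
print(most_likely_genotype(*probs))  # best-guess genotype
```

With these made-up probabilities the best-guess genotype discards the 40% chance that the subject carries fewer than two copies, whereas the dosage retains that uncertainty as a fractional allele count.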
If multiple studies are imputed using the same reference, then the different studies have data on a common set of SNPs, making meta-analysis across studies possible. Because combining studies increases sample size, meta-analysis increases power and allows detection of loci not found in individual studies. One way of performing meta-analysis is to use the regular z-score meta-analysis (MetaZ), which combines z-scores weighted by the square root of the sample sizes. Alternatively, the effect size meta-analysis (MetaBeta) combines effect sizes by computing a weighted average of the estimates. For meta-analysis that involves imputed genotypes, imputation quality is an important factor, so it seems natural that it should also be reflected in the meta-analysis weights.
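As a rough sketch of the two meta-analysis schemes, the Python code below combines per-study summary statistics for a single SNP. The function names are hypothetical. Two choices here are assumptions rather than statements from the text: the MetaZ statistic is normalized by the square root of the sum of squared weights (the usual normalization that keeps it standard normal under the null), and MetaBeta uses the conventional inverse-variance weights.

```python
import math


def meta_z(z_scores, sample_sizes):
    """Combine z-scores with weights proportional to sqrt(sample size)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    # Assumed normalization: divide by sqrt of the sum of squared weights.
    return numerator / math.sqrt(sum(w * w for w in weights))


def meta_beta(betas, std_errors):
    """Inverse-variance weighted average of effect sizes and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


# Two studies of equal size and equal precision (made-up numbers).
print(meta_z([2.0, 2.0], [100, 100]))
print(meta_beta([0.1, 0.3], [0.1, 0.1]))
```

MetaZ needs only z-scores and sample sizes, while MetaBeta needs effect estimates on a common scale together with their standard errors.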
For MetaZ, de Bakker et al. (2008) suggested scaling the weighted sum of z-scores by the imputation quality measure. Based on this idea, Zaitlen and Eskin (2010) have recently proposed an “imputation aware” method to combine z-scores. In the “imputation aware” method, the weight for the z-score of each study is proportional to $\sqrt{R^2 n}$, where $R^2$ is the imputation quality measure and $n$ is the sample size. Their results show that the “imputation aware” method is more powerful than the regular z-score meta-analysis when the imputation quality varies among studies [Zaitlen and Eskin (2010)].
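The effect of quality-dependent weights can be illustrated with a short Python sketch. It assumes the weight of each study takes the form $\sqrt{R^2 n}$, that is, the sample size discounted by the imputation quality; this form, the function name `imputation_aware_meta_z`, and the normalization by the square root of the sum of squared weights are assumptions made for illustration and are not taken from a published implementation.

```python
import math


def imputation_aware_meta_z(z_scores, sample_sizes, r2_values):
    """Combine z-scores with weights sqrt(R^2 * n) (assumed weight form)."""
    if not (len(z_scores) == len(sample_sizes) == len(r2_values)):
        raise ValueError("inputs must have one entry per study")
    if any(r2 < 0.0 or r2 > 1.0 for r2 in r2_values):
        raise ValueError("imputation quality R^2 should lie in [0, 1]")
    weights = [math.sqrt(r2 * n) for r2, n in zip(r2_values, sample_sizes)]
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("at least one study needs a positive weight")
    return sum(w * z for w, z in zip(weights, z_scores)) / norm


# Two equally sized studies; the SNP is well imputed in the first (R^2 = 1)
# and poorly imputed in the second (R^2 = 0.25). Made-up numbers.
z = [3.0, 1.0]
n = [1000, 1000]
print(imputation_aware_meta_z(z, n, [1.0, 1.0]))   # same as sample-size weights
print(imputation_aware_meta_z(z, n, [1.0, 0.25]))  # poorly imputed study down-weighted
```

When every study has $R^2 = 1$ the weights reduce to the square root of the sample size, so the sketch coincides with the regular z-score meta-analysis in that case.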
For MetaBeta, most studies in current practice use the traditional inverse variance weighting to combine estimates from imputed and genotyped SNPs [Soranzo et al. (2009); Willer et al. (2008)]. However, it is unknown whether inverse variance weighting is the optimal weighting scheme in this situation. In this paper, we address this question. For imputed SNPs, we find that the optimal weight is proportional to both the expected value and the inverse variance of the estimates given by the expectation-substitution method. While the expectation-substitution method does not give unbiased estimators in general, the bias is usually very small under practical GWAS settings. Based on this finding, we show that the inverse variance weighting scheme is a good approximation of the optimal weight for the meta-analysis of imputed SNPs. These results are important because they validate that the expectation-substitution method and the inverse variance weighting scheme currently used in GWAS meta-analysis are adequate and close to optimal in GWAS settings.