Genet Epidemiol. Author manuscript; available in PMC 2012 November 1.

Published in final edited form as:

Genet Epidemiol. 2011 November; 35(7): 597–605.

Published online 2011 July 18. doi: 10.1002/gepi.20608

PMCID: PMC3201718

NIHMSID: NIHMS308065

In genome-wide association studies (GWAS), it is common practice to impute the genotypes of untyped single-nucleotide polymorphisms (SNPs) by exploiting the linkage disequilibrium structure among SNPs. Use of imputed genotypes improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on different platforms. A popular way of using imputed data is the “expectation-substitution” method, which treats the imputed dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are usually combined using an inverse variance weighting scheme in meta-analysis. However, inverse variance weighting is not optimal, as the estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to both the inverse variance and the expected value of the effect size estimates. We show both theoretically and numerically that the bias of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the expectation-substitution method and shows that the inverse variance is a good approximation of the optimal weight. Through simulation, we compared the power of the inverse variance weighting method with several methods, including the optimal weight, the regular *z*-score meta-analysis and a recently proposed “imputation aware” meta-analysis method [Zaitlen and Eskin (2010)]. Our results show that the performance of the inverse variance weight is always indistinguishable from the optimal weight and similar to or better than the other two methods.

Advances in high-throughput technology make it possible to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) simultaneously, which allows researchers to examine genetic variation across the whole genome in genome-wide association studies (GWAS). By testing the association between SNPs and complex traits and diseases, GWAS have successfully uncovered hundreds of novel susceptibility loci to date [Hindorff et al. (2009)].

Even though current GWAS platforms include markers for hundreds of thousands or even millions of SNPs, they still directly assay only a proportion of the whole genome. If only directly genotyped SNPs are considered, associated SNPs can go undetected. Another drawback of the partial coverage is that the selected SNP panel often varies across platforms [Barrett et al. (2006)]. When different studies use different platforms, combining across studies leads to a much reduced set of SNPs genotyped in all the studies. For example, the overlap between the Affymetrix SNP Array 6.0 and Illumina OmniExpress genotyping array is less than 30%. An effective approach to overcome these problems is to impute the untyped SNPs based on a common reference panel.

The basic idea behind genotype imputation is to take advantage of the linkage disequilibrium (LD) information among SNPs. Because of the LD and haplotype structure, genotyped variants can provide information about untyped SNPs. It is feasible to use data on genotyped SNPs along with an appropriate reference panel containing information on a larger set of SNPs to predict the genotypes of the ungenotyped SNPs. Currently the HapMap project [The International HapMap Consortium (2005, 2007)] provides such reference panels, and future studies are likely to extend to the 1000 Genomes Project [The 1000 Genomes Project Consortium (2010)] or other whole genome or exome sequence data. The most popular imputation programs include MACH [Li et al. (2010)], IMPUTE [Marchini et al. (2007)], and Beagle [Browning and Browning (2009)], among others.

There are several approaches to using imputed values in the association analysis. Suppose a SNP of a given subject *i* has genotype *g*_{i}, where

If multiple studies are imputed using the same reference, then the different studies have data on a common set of SNPs, making meta-analysis across studies possible. Because combining studies increases sample size, meta-analysis increases power and allows detection of loci not found in individual studies. One way of performing meta-analysis is to use the regular *z*-score meta-analysis (MetaZ), which combines *z*-scores weighted by the square root of the sample sizes. Alternatively, the effect size meta-analysis (MetaBeta) combines effect sizes by computing a weighted average of the estimates. For meta-analysis that involves imputed genotypes, the imputation quality is an important factor. Hence, it seems natural that the imputation quality should also be reflected in the weight for meta-analysis.

For MetaZ, de Bakker et al. (2008) suggested scaling the weighted sum of *z*-scores by the imputation quality measure. Based on this idea, Zaitlen and Eskin (2010) have recently proposed an “imputation aware” method to combine *z*-scores. In the “imputation aware” method, the weight for the *z*-score of each study is proportional to $R\sqrt{n}$, where *R*^{2} is the imputation quality measure and *n* is the sample size. Results have shown that the “imputation aware” method is more powerful than the regular *z*-score meta-analysis when the imputation quality varies among studies [Zaitlen and Eskin (2010)].
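The two *z*-score schemes differ only in their weights, which a few lines of Python can make concrete. This is a minimal sketch, not code from the cited papers: the function name `combine_z` and the example numbers are hypothetical, and the combined statistic is normalized by the square root of the sum of squared weights so that it keeps unit variance under the null.

```python
import math

def combine_z(z_scores, sample_sizes, r2=None):
    """Weighted z-score meta-analysis.

    With r2=None the weights are sqrt(n) (regular MetaZ); otherwise the
    weights are R * sqrt(n) = sqrt(R^2 * n) ("imputation aware" weighting).
    """
    if r2 is None:
        r2 = [1.0] * len(z_scores)
    weights = [math.sqrt(q * n) for q, n in zip(r2, sample_sizes)]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    # Normalizing by sqrt(sum of squared weights) keeps unit variance under the null.
    return numerator / math.sqrt(sum(w * w for w in weights))

# Hypothetical example: two equally sized studies, the second poorly imputed.
z = [2.0, 1.0]
n = [1000, 1000]
meta_z = combine_z(z, n)                      # regular MetaZ
aware_z = combine_z(z, n, r2=[1.0, 0.25])     # imputation aware weights
```

In this made-up example the poorly imputed study also has the smaller *z*-score, so down-weighting it increases the combined statistic.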

For MetaBeta, most studies currently use the traditional inverse variance weighting to combine estimates from imputed and genotyped SNPs [Soranzo et al. (2009); Willer et al. (2008)]. However, it is unknown whether inverse variance weighting is the optimal weighting scheme in this situation. In this paper, we address this question. For imputed SNPs, we find that the optimal weight is proportional to both the expected value and the inverse variance of the estimates given by the expectation-substitution method. While the expectation-substitution method does not give unbiased estimators in general, the bias is usually very small under practical situations of GWAS. Based on this finding, we show that the inverse variance weighting scheme is a good approximation of the optimal weight for the meta-analysis of imputed SNPs. These results are important because they validate that the expectation-substitution method and the inverse variance weighting scheme currently being used in GWAS meta-analysis are adequate and close to optimal in GWAS settings.

Consider a case-control study of *n* individuals. For a given SNP, suppose for subject *i*, *i* = 1, …, *n*, the true genotype is *g*_{i} = 0, 1, or 2 and the disease status is

$$\mu ({g}_{i};{b}_{0},{b}_{1})\equiv P({d}_{i}=1;{b}_{0},{b}_{1})=\frac{exp({b}_{0}+{b}_{1}{g}_{i})}{1+exp({b}_{0}+{b}_{1}{g}_{i})}.$$

(1)

Note that model (1) is designed for a prospective study where subjects are first selected, then followed up for disease development. However, in many GWAS, the study design is retrospective. In a seminal paper, Prentice and Pyke (1979) showed that it is valid to apply model (1) to a case-control study as if the data were prospectively collected, and that the resulting estimator of *b*_{1} is consistent for the true value and asymptotically normal. Because of its simplicity and the appealing interpretation of exp(*b*_{1}), which approximates the relative risk for a rare disease, model (1) has been widely used in practice and will be used throughout this paper.

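If the genotype for this given SNP is unknown, the expectation-substitution method replaces the unknown genotype by the dosage from the imputation, ${\overline{g}}_{i}$ = *p*_{i1} + 2*p*_{i2}, where *p*_{ik} is the imputed probability that subject *i* has genotype *k* (*k* = 0, 1, 2), and fits the model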

$$\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})=P({d}_{i}=1;{b}_{0},{b}_{1})=\frac{exp({b}_{0}+{b}_{1}{\overline{g}}_{i})}{1+exp({b}_{0}+{b}_{1}{\overline{g}}_{i})}.$$

(2)

The likelihood function can be written as:

$$L({b}_{0},{b}_{1})=\prod _{i=1}^{n}\mu {({\overline{g}}_{i};{b}_{0},{b}_{1})}^{{d}_{i}}{\{1-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\}}^{1-{d}_{i}}.$$

(3)
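The expectation-substitution fit can be sketched in a few lines. The code below is an illustrative sketch rather than the authors' implementation: the helper names `dosage` and `fit_logistic` and the toy data are hypothetical, the dosage is taken to be the expected minor-allele count P(g = 1) + 2 P(g = 2), and likelihood (3) is maximized by plain Newton-Raphson with NumPy.

```python
import numpy as np

def dosage(p1, p2):
    """Expected allele count from imputed probabilities P(g=1) and P(g=2)."""
    return np.asarray(p1, dtype=float) + 2.0 * np.asarray(p2, dtype=float)

def fit_logistic(g_bar, d, n_iter=25):
    """Maximize likelihood (3) by Newton-Raphson.

    Returns the estimate (b0_hat, b1_hat) and the inverse of the observed
    information matrix, used as the estimated covariance.
    """
    X = np.column_stack([np.ones_like(g_bar), g_bar])
    b = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))             # model (2)
        score = X.T @ (d - mu)
        info = X.T @ (X * (mu * (1.0 - mu))[:, None])
        b = b + np.linalg.solve(info, score)
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    info = X.T @ (X * (mu * (1.0 - mu))[:, None])
    return b, np.linalg.inv(info)

# Toy data: 100 subjects with dosage 0 and 100 subjects with dosage 1.
p1 = np.array([0.0] * 100 + [0.5] * 100)
p2 = np.array([0.0] * 100 + [0.25] * 100)
d = np.array([1] * 30 + [0] * 70 + [1] * 50 + [0] * 50, dtype=float)

g_bar = dosage(p1, p2)
b_hat, cov = fit_logistic(g_bar, d)
```

Because the toy dosage takes only two values, the fitted slope equals the log odds ratio of the corresponding 2×2 table, which gives a simple check on the routine.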

By Taylor’s expansion, the maximum likelihood estimator (${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$) for (*b*_{0}, *b*_{1}) satisfies

$${({\widehat{b}}_{0},{\widehat{b}}_{1})}^{\prime}={({b}_{0},{b}_{1})}^{\prime}+I{({b}_{0}^{\ast},{b}_{1}^{\ast})}^{-1}U({b}_{0},{b}_{1}),$$

(4)

where

$$U({b}_{0},{b}_{1})={n}^{-1}{[\sum _{i=1}^{n}\{{d}_{i}-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\},\sum _{i=1}^{n}{\overline{g}}_{i}\{{d}_{i}-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\}]}^{\prime},$$

(5)

$$I({b}_{0}^{\ast},{b}_{1}^{\ast})=-{n}^{-1}\left(\begin{array}{cc}{\scriptstyle \frac{{\partial}^{2}\mathit{logL}({b}_{0}^{\ast},{b}_{1}^{\ast})}{\partial {{b}_{0}}^{2}}}& {\scriptstyle \frac{{\partial}^{2}\mathit{logL}({b}_{0}^{\ast},{b}_{1}^{\ast})}{\partial {b}_{0}\partial {b}_{1}}}\\ {\scriptstyle \frac{{\partial}^{2}\mathit{logL}({b}_{0}^{\ast},{b}_{1}^{\ast})}{\partial {b}_{1}\partial {b}_{0}}}& {\scriptstyle \frac{{\partial}^{2}\mathit{logL}({b}_{0}^{\ast},{b}_{1}^{\ast})}{\partial {{b}_{1}}^{2}}}\end{array}\right),$$

(6)

and (${b}_{0}^{\ast},{b}_{1}^{\ast}$) is on the line segment joining (${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$) and (*b*_{0}, *b*_{1}).

Taking the expectation of *U* (*b*_{0}, *b*_{1}) in equation (4), we have

$$E\{U({b}_{0},{b}_{1})\}=\left(\begin{array}{c}{n}^{-1}{\sum}_{i=1}^{n}\{E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\}\\ {n}^{-1}{\sum}_{i=1}^{n}{\overline{g}}_{i}\{E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\}\end{array}\right),$$

(7)

When *b*_{1} = 0 (no association) or one of *p _{i}*

Suppose for a given imputed SNP, the *b*_{1} (≠ 0) estimate from (4) in the *i*th study (*i* = 1, …, *M*) is ${\widehat{b}}_{1}^{i}$; the estimated variance for ${\widehat{b}}_{1}^{i}$ is ${\widehat{V}}_{i}$; the weight for the *i*th study is *w*_{i}. The meta-analysis estimator is

$${\widehat{b}}_{1}^{\mathit{meta}}=\sum _{i=1}^{M}{w}_{i}{\widehat{b}}_{1}^{i}.$$

Denoting $E({\widehat{b}}_{1}^{i})$ by *μ*_{i}, the test statistic is

$${\widehat{b}}_{1}^{\mathit{meta}}/se({\widehat{b}}_{1}^{\mathit{meta}})=\frac{{\sum}_{i=1}^{M}{w}_{i}{\widehat{b}}_{1}^{i}}{\sqrt{{\sum}_{i=1}^{M}{w}_{i}^{2}{\widehat{V}}_{i}}}\to N(\frac{{\sum}_{i=1}^{M}{w}_{i}{\mu}_{i}}{\sqrt{{\sum}_{i=1}^{M}{w}_{i}^{2}{V}_{i}}},1)$$

(8)

Based on (8), choosing the weights to maximize the power to detect the association is equivalent to maximizing

$${(\frac{{\sum}_{i=1}^{M}{w}_{i}{\mu}_{i}}{\sqrt{{\sum}_{i=1}^{M}{w}_{i}^{2}{V}_{i}}})}^{2}.$$

(9)

A simple derivation shows that *w*_{i} needs to be proportional to *μ*_{i}/*V*_{i}.

Fortunately, we can show theoretically that the bias of ${\widehat{b}}_{1}$ is very small when the true *b*_{1} is small, regardless of the imputation quality. For example, when *b*_{0} = 0 and *b*_{1} = log(1.2), the bias of ${\widehat{b}}_{1}$ satisfies |*E*(${\widehat{b}}_{1}$) − *b*_{1}| < 0.002; when *b*_{1} = log(1.5), |*E*(${\widehat{b}}_{1}$) − *b*_{1}| < 0.02. Further theoretical details showing the upper bound of the bias are provided in the Appendix. The theoretical results about the approximate unbiasedness are also verified by extensive simulations in the Results section.

Given the approximate unbiasedness of *b*_{1} estimators, the optimal weight can therefore be approximated by the regular inverse variance weight.
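The size of the gap between the two weighting schemes can be checked numerically from the mean of the test statistic in (8). The sketch below is illustrative only: the function name `noncentrality` and the per-study values of *μ*_{i} and *V*_{i} are hypothetical. By the Cauchy-Schwarz inequality, the quantity in (9) cannot exceed the sum of *μ*_{i}^{2}/*V*_{i}, and weights proportional to *μ*_{i}/*V*_{i} attain that bound.

```python
import numpy as np

def noncentrality(w, mu, V):
    """Mean of the meta-analysis test statistic in (8) for weights w."""
    w, mu, V = map(np.asarray, (w, mu, V))
    return (w @ mu) / np.sqrt((w ** 2) @ V)

# Hypothetical per-study values: nearly equal expected estimates (small bias),
# unequal variances (e.g. different imputation quality).
mu = np.array([0.18, 0.17])    # E(b1_hat) in each study
V = np.array([0.004, 0.010])   # variance of b1_hat in each study

ncp_optimal = noncentrality(mu / V, mu, V)        # w_i proportional to mu_i / V_i
ncp_ivw = noncentrality(1.0 / V, mu, V)           # inverse variance weights
ncp_equal = noncentrality(np.ones(2), mu, V)      # unweighted average

# Cauchy-Schwarz upper bound on the square root of (9).
upper_bound = np.sqrt(np.sum(mu ** 2 / V))
```

With nearly equal *μ*_{i}, the inverse variance weights give a noncentrality very close to the optimal one, while the unweighted average is clearly lower.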

We have shown that the inverse variance weight can approximate the optimal weight. For imputed SNPs, it seems natural that the weight for ${\widehat{b}}_{1}$ should increase as imputation quality increases. For this reason, we explore whether the inverse variance weighting scheme incorporates imputation quality. In the expectation-substitution method, the variance of (${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$)′ can be estimated by *I*^{−1}(${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$). Letting *h*(*g*; ${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$) = *μ*(*g*; ${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$){1 − *μ*(*g*; ${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$)}, we have

$$\widehat{var}({\widehat{b}}_{1})=\frac{{\sum}_{i=1}^{n}h({\overline{g}}_{i};{\widehat{b}}_{0},{\widehat{b}}_{1})}{{\sum}_{i=1}^{n}h({\overline{g}}_{i};{\widehat{b}}_{0},{\widehat{b}}_{1}){\sum}_{i=1}^{n}{\overline{g}}_{i}^{2}h({\overline{g}}_{i};{\widehat{b}}_{0},{\widehat{b}}_{1})-{\{{\sum}_{i=1}^{n}{\overline{g}}_{i}h({\overline{g}}_{i};{\widehat{b}}_{0},{\widehat{b}}_{1})\}}^{2}}.$$

(10)
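Equation (10) is the lower-right element of the inverse of the 2×2 information matrix, which is easy to confirm numerically. The following sketch uses hypothetical function names and an arbitrary grid of dosages; it also evaluates the small-effect simplification in which *h* is replaced by a constant *c*, which reproduces (10) exactly when the slope estimate is 0.

```python
import numpy as np

def h_values(g_bar, b0_hat, b1_hat):
    """h(g; b0, b1) = mu (1 - mu) evaluated at each dosage."""
    mu = 1.0 / (1.0 + np.exp(-(b0_hat + b1_hat * g_bar)))
    return mu * (1.0 - mu)

def var_b1_formula(g_bar, b0_hat, b1_hat):
    """Estimated variance of b1_hat from the closed form in equation (10)."""
    h = h_values(g_bar, b0_hat, b1_hat)
    s0, s1, s2 = h.sum(), (g_bar * h).sum(), (g_bar ** 2 * h).sum()
    return s0 / (s0 * s2 - s1 ** 2)

def var_b1_from_information(g_bar, b0_hat, b1_hat):
    """Same quantity obtained by inverting the 2x2 information matrix."""
    h = h_values(g_bar, b0_hat, b1_hat)
    X = np.column_stack([np.ones_like(g_bar), g_bar])
    return np.linalg.inv(X.T @ (X * h[:, None]))[1, 1]

def var_b1_small_effect(g_bar, c):
    """Simplification with h treated as a constant c: 1 / (n c var(g_bar))."""
    return 1.0 / (len(g_bar) * c * np.var(g_bar))

g_bar = np.linspace(0.0, 2.0, 50)        # arbitrary dosages for illustration
v_formula = var_b1_formula(g_bar, 0.3, 0.2)
v_matrix = var_b1_from_information(g_bar, 0.3, 0.2)

# With a zero slope, h is exactly constant and the simplification is exact.
c = h_values(g_bar, 0.3, 0.0)[0]
v_null = var_b1_formula(g_bar, 0.3, 0.0)
v_null_simplified = var_b1_small_effect(g_bar, c)
```

The simplified form makes the dependence on the sample variance of the dosages explicit: the less the dosages vary, the larger the estimated variance of the slope.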

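The first derivative of *h*(*g*; ${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$) with respect to *g* is ${\widehat{b}}_{1}\,exp({\widehat{b}}_{0}+{\widehat{b}}_{1}g)\{1-exp({\widehat{b}}_{0}+{\widehat{b}}_{1}g)\}/{\{1+exp({\widehat{b}}_{0}+{\widehat{b}}_{1}g)\}}^{3}$, which is approximately 0 when ${\widehat{b}}_{1}$ is sufficiently small. Hence, we can treat *h*(${\overline{g}}_{i}$; ${\widehat{b}}_{0}$, ${\widehat{b}}_{1}$) as approximately a constant *c*, which gives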

$$\widehat{var}({\widehat{b}}_{1})\approx {(nc)}^{-1}\frac{1}{{n}^{-1}{\sum}_{i=1}^{n}{\overline{g}}_{i}^{2}-{({n}^{-1}{\sum}_{i=1}^{n}{\overline{g}}_{i})}^{2}}\approx {(nc)}^{-1}var({g}_{i}){({R}^{2})}^{-1},$$

(11)

where *R*^{2} is the imputation quality measure in MACH [Li et al. (2010)], defined as the ratio of the sample variance of ${\overline{g}}_{i}$ to the expected variance of *g*_{i}.

Another interesting observation is that there is a connection between the inverse variance weighting scheme and the “imputation aware” method in Zaitlen and Eskin (2010) through (11). Note that the inverse variance weighting estimator can be written as

$$\sum _{i=1}^{M}{V}_{i}^{-1}{\widehat{b}}_{1}^{i}\approx \sum _{i=1}^{M}\frac{{nR}^{2}}{var({g}_{i})/c}{\widehat{b}}_{1}^{i},$$

(12)

and the “imputation aware” method can be written as

$$\sum _{i=1}^{M}R\sqrt{n}{z}^{i}=\sum _{i=1}^{M}R\sqrt{n}{\widehat{b}}_{1}^{i}/\sqrt{{\widehat{V}}_{i}}\approx \sum _{i=1}^{M}\frac{{nR}^{2}}{\sqrt{var({g}_{i})/c}}{\widehat{b}}_{1}^{i},$$

(13)

We can see the only difference between (12) and (13) is the var(*g*_{i}) part. Since var(

In this section, we first use simulation to demonstrate the finite sample properties of ${\widehat{b}}_{1}$ given by the expectation-substitution method, such as the approximate unbiasedness and the relationship between var(${\widehat{b}}_{1}$) and imputation quality. Then, we compare the power of the inverse variance weighting method in the meta-analysis with various other methods.

We generated the genotypes of two SNPs, considering a range of minor allele frequency (MAF) combinations (*f*_{1}, *f*_{2}) for the two SNPs and a range of values of the linkage disequilibrium (LD) measure *D*′. To mimic the imputation scenario, we assumed that the genotypes of the second SNP were unknown and imputed its dosage based on the genotypes of the first SNP. We varied the imputation quality by changing the LD measure *D*′. A population of 10,000 was generated based on the logistic regression model in equation (1) with genotypes at the second SNP as the *g*_{i}’s,

Table I. Simulation results of the expectation-substitution method under various parameter settings, based on 10,000 simulated data sets, each with 1000 cases and 1000 controls. *b*_{1} is the true value; *f*_{1} and *f*_{2} are the MAFs for SNP 1 (the marker) and 2 (the disease **...**

When *b*_{1} = 0, all the estimated type I error rates are well controlled at the nominal *α* level of 0.05. When *b*_{1} = log(1.2) or log(1.5), the relative bias of ${\widehat{b}}_{1}$ is very small (< 2%) regardless of the MAF of both SNPs. In contrast, when *b*_{1} is larger, log(2), ${\widehat{b}}_{1}$ slightly underestimates the true *b*_{1}, and the bias is greater as the imputation quality worsens. Under the simulation settings in Table I, for any given MAF combination (*f*_{1}, *f*_{2}), *b*_{0}, *b*_{1}, and *D*′, we obtained a numeric solution of ${b}_{1}^{\ast}$, where ${\widehat{b}}_{1}\to {b}_{1}^{\ast}$, by solving the following system of equations:

$$\left\{\begin{array}{c}E\left\{{d}_{i}-{\scriptstyle \frac{exp({b}_{0}^{\ast}+{b}_{1}^{\ast}{\overline{g}}_{i})}{1+exp({b}_{0}^{\ast}+{b}_{1}^{\ast}{\overline{g}}_{i})}}\right\}=0\\ E\left[{\overline{g}}_{i}\left\{{d}_{i}-{\scriptstyle \frac{exp({b}_{0}^{\ast}+{b}_{1}^{\ast}{\overline{g}}_{i})}{1+exp({b}_{0}^{\ast}+{b}_{1}^{\ast}{\overline{g}}_{i})}}\right\}\right]=0\end{array}\right.$$

(14)

Figure 1 shows that even with the worst imputation quality in Table I (when *D*′ = 0.5), the bias of ${\widehat{b}}_{1}$ is still less than 5% for odds ratios as large as 2. Since it is less common for the associated alleles identified by GWAS to have an odds ratio greater than 2 [Hindorff et al. (2009); NHGRI GWAS Catalog (2011)], this bias is not a practical concern in GWAS settings.
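A numeric solution of system (14) can be sketched for the two-SNP setting as follows. This is a simplified illustration under several assumptions that are not taken from the paper: haplotypes pair at random, *D* is non-negative, the dosage of SNP 2 is its conditional expectation given the genotype at SNP 1, and the expectations in (14) are taken over the general population rather than over a case-control sample. The function names are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_genotype_dist(f1, f2, d_prime):
    """P(g1, g2) as a 3x3 array under random mating, assuming D >= 0."""
    D = d_prime * min(f1 * (1 - f2), (1 - f1) * f2)
    # Haplotypes coded as (allele at SNP 1, allele at SNP 2); 1 = minor allele.
    hap = {(1, 1): f1 * f2 + D, (1, 0): f1 * (1 - f2) - D,
           (0, 1): (1 - f1) * f2 - D, (0, 0): (1 - f1) * (1 - f2) + D}
    joint = np.zeros((3, 3))
    for (a1, a2), pa in hap.items():
        for (c1, c2), pc in hap.items():
            joint[a1 + c1, a2 + c2] += pa * pc
    return joint

def limiting_b1(f1, f2, d_prime, b0, b1, n_iter=50):
    """Solve the population version of (14) for b1* by Newton's method."""
    joint = joint_genotype_dist(f1, f2, d_prime)
    p_g1 = joint.sum(axis=1)
    cond = joint / p_g1[:, None]           # P(g2 | g1)
    g2 = np.arange(3)
    g_bar = cond @ g2                      # dosage of SNP 2 for each g1
    m = cond @ expit(b0 + b1 * g2)         # E(d | g1) under model (1)
    X = np.column_stack([np.ones(3), g_bar])
    b = np.array([b0, b1], dtype=float)    # start from the true values
    for _ in range(n_iter):
        mu = expit(X @ b)
        score = X.T @ (p_g1 * (m - mu))    # left-hand sides of (14)
        info = X.T @ (X * (p_g1 * mu * (1 - mu))[:, None])
        b = b + np.linalg.solve(info, score)
    return b[1]

b1_true = np.log(1.5)
b1_star = limiting_b1(0.2, 0.2, 0.5, 0.0, b1_true)
relative_bias = (b1_true - b1_star) / b1_true
```

Under these assumptions the limit equals the true value when *b*_{1} = 0 or when the two SNPs are in perfect LD, and it falls slightly below the true value for the moderate-LD setting shown.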

In Table I, the mean of the standard errors (SE) and the standard deviation (SD) of the estimates over 10,000 simulated data sets agree with each other very well, suggesting that the standard error estimates are reliable. Furthermore, the standard errors of ${\widehat{b}}_{1}$ decrease as the imputation quality *R*^{2} increases; as a result, the power (Power) increases. As a comparison, we also show the standard deviations of the parameter estimates (SD^{*}) and the power (Power^{*}) if the genotypes for SNP 2 are known. As we can see, SD^{*} is always less than SD and Power^{*} is always greater than Power, which implies that there is efficiency loss when using imputed genotypes. For example, when *b*_{1} = log(1.2) and *f*_{1} = *f*_{2} = 0.2, the power loss decreases from 67% to 0.6% as the imputation quality increases. Taken together, we can see that even with very small *R*^{2}, the power is still acceptable in many cases using imputed genotypes. The estimated coverage probabilities are all very close to the nominal value 0.95, indicating that the confidence interval estimates are very accurate.

In order to explore the performance of the expectation-substitution method in a more realistic setting, we used GWAS scans from Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) [Prorok et al. (2000); Hayes et al. (2000)]. PLCO is a randomized, two-arm trial coordinated by the NCI in ten U.S. centers.

The PLCO data include 2520 samples, genotyped on Illumina Human Hap 300k&240k, 550k and 610k platforms. We randomly selected 1000 genotyped SNPs on chromosome 22 and masked their genotypes. Then we used MACH to impute the genotypes of the 1000 SNPs as if they were untyped, using HapMap II release 24 as the reference panel. In this way, we have both the true genotypes and the imputed dosages. As in the previous section, case-control samples were generated based on model (1) using the true genotypes, and ${\widehat{b}}_{1}$ was estimated by fitting model (2) with the imputed dosages. We set *b*_{1} to be 0, log(1.2), log(1.5) and log(2). For each value of *b*_{1}, we replicated the procedure 50 times for each of the 1000 SNPs. Figure 2 shows a boxplot of the percent bias of ${\widehat{b}}_{1}$ for SNPs grouped by MAF and *R*^{2}. We can see that ${\widehat{b}}_{1}$ is approximately unbiased regardless of the imputation quality *R*^{2}, which agrees with the theoretical results. On the other hand, the variability of the estimates is much greater when *R*^{2} < 0.3 and MAF < 0.05.

We generated the data in the same way as in the previous section. Here we let *b*_{1} take 10 equally spaced values from 0.05 to log(2), set the MAFs (*f*_{1}, *f*_{2}) of the two SNPs to (0.2, 0.2) for both studies, and set the LD measure *D*′ to 0.5 and 0.99 for the two studies, respectively. We conducted meta-analysis for the two studies using the following four methods and compared their power:

- The optimal weighting, which is proportional to *μ*_{i}/*V*_{i}. In practice, it is usually impossible to estimate *μ*_{i}. However, with *b*_{1}, *f*_{1}, *f*_{2} and *D*′ known in the simulation, we can compute *μ*_{i} from (14). Hence, we can estimate the optimal weight for the purpose of comparison.
- The inverse variance weighting method, which is an approximation of the optimal weighting under practical situations in GWAS.
- The “imputation aware” method by Zaitlen and Eskin (2010).
- The regular z-score meta-analysis (MetaZ) method without correcting for imputation quality.

As we can see from Figure 3, the optimal weighting, inverse variance weighting and the “imputation aware” method have indistinguishable performance. In addition, they are all more powerful than the regular MetaZ method which does not account for imputation quality. This confirms that the inverse variance weighting method is a good approximation of the optimal weight and it automatically incorporates the imputation quality.

The power of optimal weighting (optimal), inverse variance weighting (IVW) method, “imputation aware” method (‘Imputation Aware’ Z) and the regular *z*-score meta-analysis without imputation quality (MetaZ) from the meta-analysis **...**

We also simulated a situation where the MAFs are different between the two studies, which results in different var(*g*_{i}). Instead of letting the MAF = 0.2 for both studies, we let the MAF = 0.1 for the first study and 0.4 for the second study. The power comparison is shown in Figure 4. As expected, the inverse variance weighting method performs better than the “imputation aware” method in this case because it is an approximation to the optimal weight. In practice, we would not expect MAFs to differ substantially between studies of similar populations. However, for a cross-ethnicity meta-analysis, inverse variance weighting is superior to the “imputation aware” method since it accounts for the MAF variation among different ethnic groups.

As imputation has been widely used to recover information from GWAS data, the expectation-substitution method is the most commonly used method to analyze imputed SNPs while accounting for genotype uncertainty. Our work shows, both numerically and theoretically, that the expectation-substitution method gives approximately unbiased estimates under practical conditions of low effect sizes for GWAS of common diseases. We also show that the inverse variance weighting scheme approximates the optimal weight well and always has the best power among the meta-analysis methods compared.

Two recent papers have outlined the advantages of using meta-analysis, and discussed study design, quality control and analysis issues to consider when implementing meta-analysis of GWAS data [Cantor et al. (2010); Zeggini and Ioannidis (2009)]. These papers address weighting schemes for combining results, but focus more on random-effects vs. fixed-effects analysis, rather than on methods to include imputation quality.

The different imputation software packages provide not only the probability of each genotype, but also an overall imputation quality measure. This measure is typically defined as the ratio of the sample variance of the genotype to the expected variance, with lower scores indicating less well imputed SNPs. Studies often exclude SNPs with either low *R*^{2} or low MAF. A threshold of imputation *R*^{2} = 0.3 has been recommended by MACH as the imputation quality cut-off for estimates [MACH Homepage (2010)]. Our results show that in terms of bias, the combination of imputation quality and MAF seems to be most relevant. In particular, we show that the variability of estimates is large for lower imputation quality and lower MAF. In current practice, rare variants (MAF < 0.05) are often excluded from imputation and subsequent meta-analysis. In this situation, either not using a filter or using a filter based only on *R*^{2} is likely sufficient. However, as meta-analyses grow larger and data become available to impute rare variants, we recommend using both the imputation quality and the MAF to set filtering criteria. For example, in our simulation results (Figure 2) the optimal filter appears to be excluding SNPs with both MAF < 0.05 and *R*^{2} < 0.3, rather than all SNPs with *R*^{2} < 0.3. In this paper, we used the imputation quality measure *R*^{2} defined by MACH [Li et al. (2010)], which is the squared correlation between true genotypes and imputed dosages. In Beagle [Browning and Browning (2009)], *R*^{2} is defined as the squared correlation between the true and the most likely genotypes. To investigate whether the choice of quality measure makes much difference, we randomly chose 10,000 imputed SNPs on chromosome 22 in the PLCO data [Prorok et al. (2000); Hayes et al. (2000)] and computed their MACH *R*^{2} and Beagle *R*^{2}. It turns out that the two *R*^{2}’s are highly correlated (r > 0.99). Thus, although the cut-offs for the two *R*^{2}’s could be slightly different, the general conclusion should still hold.
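The variance-ratio quality measure and the joint filter discussed above can be sketched as follows. This is an illustration under stated assumptions rather than the MACH implementation: the expected variance is taken to be the Hardy-Weinberg value 2*f*(1 − *f*) with the allele frequency *f* estimated from the mean dosage, and the function names and example dosages are hypothetical.

```python
import numpy as np

def imputation_r2(dosages):
    """Variance-ratio quality measure: var(dosage) / {2 f (1 - f)}."""
    dosages = np.asarray(dosages, dtype=float)
    f = dosages.mean() / 2.0               # estimated allele frequency
    expected_var = 2.0 * f * (1.0 - f)     # assumes Hardy-Weinberg equilibrium
    return dosages.var() / expected_var

def keep_snp(dosages, r2_cutoff=0.3, maf_cutoff=0.05):
    """Exclude a SNP only when both MAF < maf_cutoff and R^2 < r2_cutoff."""
    dosages = np.asarray(dosages, dtype=float)
    f = dosages.mean() / 2.0
    maf = min(f, 1.0 - f)
    return not (maf < maf_cutoff and imputation_r2(dosages) < r2_cutoff)

# Hypothetical dosages for three SNPs.
well_imputed = np.array([0.0] * 25 + [1.0] * 50 + [2.0] * 25)  # exact genotypes
shrunk = 1.0 + 0.5 * (well_imputed - 1.0)   # common SNP, dosages pulled to the mean
rare_flat = np.full(100, 0.04)              # rare SNP with uninformative dosages

r2_values = [imputation_r2(x) for x in (well_imputed, shrunk)]
decisions = [keep_snp(x) for x in (well_imputed, shrunk, rare_flat)]
```

Under this joint rule the common but poorly imputed SNP is retained, whereas a filter based on *R*^{2} alone would drop it; only the rare, poorly imputed SNP is excluded.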

As we move into the post-GWAS era, our results provide important guidance for investigators on how to optimally conduct meta-analysis in the presence of imputed genotypes for marginal SNP associations. We support the current practice of using the expectation-substitution method and the inverse variance weighting in meta-analysis. Additional theoretical and numerical work is needed to evaluate the use of imputed data in more sophisticated analysis, including proposed methods for gene-gene and gene-environment interactions.

We thank two reviewers for their helpful comments. This work was supported by the National Institutes of Health (5R01 CA059045 and 5U01 CA137088, R01AG14358, P01CA53996).

Genotype data included in these analyses from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics and supported by contracts from the Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, Department of Health and Human Services. The authors thank Drs. Christine Berg and Philip Prorok, Division of Cancer Prevention, National Cancer Institute, the Screening Center investigators and staff of the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, Mr. Tom Riley and staff, Information Management Services, Inc., Ms. Barbara O’Brien and staff, Westat, Inc., and Drs. Bill Kopp, Wen Shao, and staff, SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible.

Data included in these analyses were also generated from the GWAS of Lung Cancer and Smoking. Funding for this work was provided through the National Institutes of Health Genes, Environment and Health Initiative [NIH GEI] (Z01 CP 010200). The human subjects participating in the GWAS derive from The Environment and Genetics in Lung Cancer Etiology (EAGLE) case-control study and the Prostate, Lung, Colon and Ovarian Screening Trial, and these studies are supported by intramural resources of the National Cancer Institute. Assistance with genotype cleaning, as well as with general study coordination, was provided by the Gene Environment Association Studies, GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01 HG 004438). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number ph000093 v2.p2.c1.

In addition, data generated from the Cancer Genetic Markers of Susceptibility (CGEMS) [CGEMS (2010)] prostate cancer scan were also included in this analysis. The datasets used for the analyses described in this manuscript were accessed with appropriate approval through the dbGaP online resource (http://www.cgems.cancer.gov/data) through dbGaP accession number 000207 v.1p1.c1.

First, we introduce some notation. Let *μ*(* _{i}*,

Lemma 1 shows that the extrema of |*E*(*d _{i}*|

Lemma 2 shows that there exists some *δ*(*b*_{0}, *b*_{1}) (which depends on the upperbound given by Lemma 1), such that when _{1} ≥ *b*_{1} + *δ*(*b*_{0}, *b*_{1}), *μ*(* _{i}*,

$\mid E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\mid \le M({b}_{0},{b}_{1})\mathit{min}({\overline{g}}_{i},2-{\overline{g}}_{i})$, where $M({b}_{0},{b}_{1})=\mathit{max}({D}_{U}[0,2],{D}_{I}[0,2],{D}_{U}[0,1],{D}_{I}[0,1],{D}_{U}[1,2],{D}_{I}[1,2])$

We can rewrite *E*(*d _{i}*|

When *p _{i}*

Similarly, we can show that when *p _{i}*

Combining all the results above we have

$$\mid E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})-\mu ({\overline{g}}_{i};{b}_{0},{b}_{1})\mid \phantom{\rule{0.16667em}{0ex}}\le M({b}_{0},{b}_{1})\mathit{min}({\overline{g}}_{i},2-{\overline{g}}_{i}).$$

(15)

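Let $\delta ({b}_{0},{b}_{1})={\mathit{sup}}_{{\overline{g}}_{i}\in [0,2]}M({b}_{0},{b}_{1})/[\{1-\mu ({\overline{g}}_{i},{b}_{0},{b}_{1})\}\mu ({\overline{g}}_{i},{b}_{0},{b}_{1})]$. Then when ${\stackrel{\sim}{b}}_{1}\ge {b}_{1}+\delta ({b}_{0},{b}_{1})$, $\mu ({\overline{g}}_{i},{b}_{0},{\stackrel{\sim}{b}}_{1})-E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})>0$ for any ${\overline{g}}_{i}\in [0,2]$; when ${\stackrel{\sim}{b}}_{1}\le {b}_{1}-\delta ({b}_{0},{b}_{1})$, $\mu ({\overline{g}}_{i},{b}_{0},{\stackrel{\sim}{b}}_{1})-E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})<0$ for any ${\overline{g}}_{i}\in [0,2]$.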

Consider the following equation in ${\stackrel{\sim}{b}}_{1}$:

$$\mu ({\overline{g}}_{i},{b}_{0},{\stackrel{\sim}{b}}_{1})-E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})=0$$

(16)

The root for this equation would be

$${\stackrel{\sim}{b}}_{1}=\frac{-log(E{({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})}^{-1}-1)-{b}_{0}}{{\overline{g}}_{i}}$$

(17)

Denote *E*(*d _{i}*|

$\mid E({\widehat{b}}_{1})-{b}_{1}\mid <\delta ({b}_{0},{b}_{1})$

As ${b}_{1}^{\ast}$ is the root of the following equation in ${\stackrel{\sim}{b}}_{1}$:

$$\sum _{i=1}^{n}{\overline{g}}_{i}\{\mu ({\overline{g}}_{i},{b}_{0},{\stackrel{\sim}{b}}_{1})-E({d}_{i}\mid {p}_{i0},{p}_{i1},{p}_{i2})\}=0$$

(18)

Applying Lemma 2, when ${\stackrel{\sim}{b}}_{1}$ ≥ *b*_{1} + *δ*(*b*_{0}, *b*_{1}), the LHS of equation (18) will be positive; when ${\stackrel{\sim}{b}}_{1}$ ≤ *b*_{1} − *δ*(*b*_{0}, *b*_{1}), the LHS of equation (18) will be negative. As the LHS of equation (18) is also an increasing function of ${\stackrel{\sim}{b}}_{1}$, the root of equation (18), ${b}_{1}^{\ast}$, must lie in the interval [*b*_{1} − *δ*(*b*_{0}, *b*_{1}), *b*_{1} + *δ*(*b*_{0}, *b*_{1})]. Given that ${\widehat{b}}_{1}\to {b}_{1}^{\ast}$, we have |*E*(${\widehat{b}}_{1}$) − *b*_{1}| < *δ*(*b*_{0}, *b*_{1}).

To show the magnitude of *δ*(*b*_{0}, *b*_{1}), which is the upper bound of the bias of ${\widehat{b}}_{1}$, we tried a few different values of *b*_{1}. For example, when *b*_{0} = 0, *b*_{1} = log(1.2), *δ*(*b*_{0}, *b*_{1}) = *D*_{U}([0, 2])/[(1−

- Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet. 2006;38:659–662. [PubMed]
- Browning BL, Browning SR. A unified approach to genotype imputation and haplotype phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–223. [PubMed]
- Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am J Hum Genet. 2010;86(1):6–22. [PubMed]
- Cancer Genetic Markers of Susceptibility (CGEMS) Data. 2009 http://cgems.cancer.gov/data/. 10-5-2009.
- Chapman K, Takahashi A, Meulenbelt I, et al. A meta-analysis of European and Asian cohorts reveals a global role of a functional SNP in the 5′ UTR of GDF5 with osteoarthritis susceptibility. Hum Mol Genet. 2008;17(10):1497–504. [PubMed]
- Cordell HJ. Estimation and testing of genotype and haplotype effects in case-control studies: comparison of weighted regression and multiple imputation procedures. Genet Epidemiol. 2006;30:259–275. [PubMed]
- de Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S, Voight BF. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet. 2008;17(R2):R122–R128. [PMC free article] [PubMed]
- Hayes RB, Reding D, Kopp W, et al. Etiologic and early marker studies in the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control Clin Trials. 2000;21(6 Suppl):349S–355S. [PubMed]
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106(23):9362–7. [PubMed]
- Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. A Catalog of Published Genome-Wide Association Studies. Available at: www.genome.gov/gwastudies. Accessed 03/29/2011.
- Kraft P, Cox DG, Paynter RA, Hunter D, De Vivo I. Accounting for haplotype uncertainty in matched association studies: a comparison of simple and flexible techniques. Genet Epidemiol. 2005;28:261–272. [PubMed]
- Kraft P, Stram OD. RE: The Use of Inferred Haplotypes in Downstream Analysis. Am J Hum Genet. 2007;81:863–865. [PubMed]
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34(8):816–834. [PMC free article] [PubMed]
- Lin DY, Huang BE. The use of inferred haplotypes in downstream analyses. Am J Hum Genet. 2007;80:577–579. [PubMed]
- MACH Homepage. http://www.sph.umich.edu/csg/yli/mach/tour/imputation.html.
- Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics. 2007;39:906–913. [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
- Prorok PC, Andriole GL, Bresalier RS, et al. Design of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials. 2000;21(6 Suppl):273S–309S. [PubMed]
- Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114. [PMC free article] [PubMed]
- Soranzo N, Rivadeneira F, Chinappen-Horsley U, et al. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS Genet. 2009;5(4):e1000445. [PMC free article] [PubMed]
- The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
- The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
- The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
- Willer CJ, Speliotes EK, Loos RJF, et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2008;41:25–34. [PMC free article] [PubMed]
- Xiong DH, Liu XG, Guo YF, et al. Genome-wide Association and Follow-Up Replication Studies Identified ADAMTS18 and TGFBR3 as Bone Mass Candidate Genes in Different Ethnic Groups. Am J Hum Genet. 2009;84(3):388–398. [PubMed]
- Zaitlen N, Eskin E. Imputation aware meta-analysis of genome-wide association studies. Genet Epidemiol. 2010;34(6):537–42. [PMC free article] [PubMed]
- Zeggini E, Ioannidis JP. Meta-analysis in genome-wide association studies. Pharmacogenomics. 2009;10(2):191–201. [PMC free article] [PubMed]
