Genet Epidemiol. Author manuscript; available in PMC 2012 November 1.
Published in final edited form as:
PMCID: PMC3201718
NIHMSID: NIHMS308065

The Use of Imputed Values in the Meta-Analysis of Genome-Wide Association Studies

Abstract

In genome-wide association studies (GWAS), it is common practice to impute the genotypes of untyped single-nucleotide polymorphisms (SNPs) by exploiting the linkage disequilibrium structure among SNPs. Use of imputed genotypes improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on different platforms. A popular way of using imputed data is the “expectation-substitution” method, which treats the imputed dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are usually combined using an inverse variance weighting scheme in meta-analysis. However, inverse variance weighting is not optimal, as the estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to the product of the inverse variance and the expected value of the effect size estimate. We show both theoretically and numerically that the bias of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the expectation-substitution method, and shows that the inverse variance is a good approximation of the optimal weight. Through simulation, we compared the power of the inverse variance weighting method with several methods, including the optimal weight, the regular z-score meta-analysis and a recently proposed “imputation aware” meta-analysis method [Zaitlen and Eskin (2010)]. Our results show that the performance of the inverse variance weight is always indistinguishable from the optimal weight and similar to or better than the other two methods.

Keywords: GWAS, imputation, bias, meta-analysis, weight

Introduction

The advance of high-throughput technology makes it possible to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) simultaneously, which allows researchers to examine genetic variation across the whole genome in genome-wide association studies (GWAS). By testing the association between SNPs and complex traits and diseases, GWAS have successfully uncovered hundreds of novel susceptibility loci to date [Hindorff et al. (2009)].

Even though current GWAS platforms include markers for hundreds of thousands or even millions of SNPs, they still directly assay only a proportion of the whole genome. If only directly genotyped SNPs are considered, associated SNPs can go undetected. Another drawback of partial coverage is that the selected SNP panel often varies between platforms [Barrett et al. (2006)]. When different studies use different platforms, combining across studies leads to a much reduced set of SNPs genotyped in all the studies. For example, the overlap between the Affymetrix SNP Array 6.0 and the Illumina OmniExpress genotyping array is less than 30%. An effective approach to overcome these problems is to impute the untyped SNPs based on a common reference panel.

The basic idea behind genotype imputation is to take advantage of the linkage disequilibrium (LD) information among SNPs. Because of the LD and haplotype structure, genotyped variants can provide information about untyped SNPs. It is feasible to use data on genotyped SNPs along with an appropriate reference panel containing information on a larger set of SNPs to predict the genotypes of the ungenotyped SNPs. Currently the HapMap project [The International HapMap Consortium (2005, 2007)] provides such reference panels, and future studies are likely to extend to the 1000 Genomes Project [The 1000 Genomes Project Consortium (2010)] or other whole genome or exome sequence data. The most popular imputation programs include MACH [Li et al. (2010)], IMPUTE [Marchini et al. (2007)], and Beagle [Browning and Browning (2009)], among others.

There are several approaches to using imputed values in the association analysis. Suppose a SNP of a given subject i has genotype gi, where gi takes one of the three values 0, 1 and 2, the number of copies of one of the alleles (typically the “minor” or lower frequency allele). The output of an imputation program usually includes three probabilities: pi0=P(gi=0); pi1=P(gi=1); pi2=P(gi=2). One method is to use the most likely genotype (the genotype with the highest probability) as if it were the true genotype. However, Lin and Huang (2007) showed that this method leads to intrinsically biased estimates because of the unavoidable discrepancy between the most likely genotype and the true genotype. Another popular approach is the so-called expectation-substitution method. Instead of using the most likely genotype, this method uses the dosage, that is, the expected number of minor alleles pi1 + 2pi2, as if it were the true genotype. In the haplotype analysis framework, several studies [Kraft et al. (2005); Kraft and Stram (2007); Cordell (2007)] have shown through a series of simulation experiments that the expectation-substitution method has no noticeable bias under practical settings. It is also possible to use Bayesian methods [Marchini et al. (2007); Servin and Stephens (2007)] to perform the imputation and the association test at the same time; however, these methods are usually computationally intensive and hence not feasible on a genome-wide scale. Therefore, in the remainder of the paper, we will focus on the expectation-substitution method.
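
As a concrete illustration of the two substitution rules described above, the following minimal Python sketch computes the dosage and the most likely genotype from one subject's imputation probabilities. The function names and the example probabilities are illustrative assumptions and do not correspond to the interface of any imputation program.

```python
def dosage(p0, p1, p2):
    """Expected number of minor alleles, p1 + 2*p2 (expectation-substitution)."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("genotype probabilities should sum to 1")
    return p1 + 2.0 * p2


def most_likely_genotype(p0, p1, p2):
    """Genotype (0, 1 or 2) with the highest imputation probability."""
    probs = (p0, p1, p2)
    return max(range(3), key=lambda g: probs[g])


# Hypothetical imputation output for one subject at one SNP.
p = (0.1, 0.6, 0.3)
example_dosage = dosage(*p)              # 0.6 + 2 * 0.3
example_best = most_likely_genotype(*p)  # genotype 1 has the largest probability
```

The dosage keeps the uncertainty in the three probabilities, whereas the most likely genotype discards it, which is the source of the discrepancy noted above.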

If multiple studies are imputed using the same reference, then the different studies have data on a common set of SNPs, making meta-analysis across studies possible. Because combining studies increases sample size, meta-analysis increases power and allows detection of loci not found in individual studies. One way of performing meta-analysis is to use the regular z-score meta-analysis (MetaZ), which combines z-scores weighted by the square roots of the sample sizes. Alternatively, the effect size meta-analysis (MetaBeta) combines effect sizes by computing a weighted average of the estimates. For meta-analysis that involves imputed genotypes, the imputation quality is an important factor. Hence, it seems natural that the imputation quality should also be reflected in the weight for meta-analysis.

For MetaZ, de Bakker et al. (2008) suggested scaling the weighted sum of z-scores by the imputation quality measure. Based on this idea, Zaitlen and Eskin (2010) have recently proposed an “imputation aware” method to combine z-scores. In the “imputation aware” method, the weight for the z-score of each study is proportional to $\sqrt{R^2 n}$, where R2 is the imputation quality measure and n is the sample size. Their results show that the “imputation aware” method is more powerful than the regular z-score meta-analysis when the imputation quality varies among studies [Zaitlen and Eskin (2010)].
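
The two z-score schemes differ only in their weights, which the following Python sketch makes explicit. It assumes the “imputation aware” weight has the form sqrt(R2 × n); the function names and the input values are hypothetical.

```python
import math


def combine_z(z_scores, weights):
    """Weighted z-score meta-analysis: sum(w * z) / sqrt(sum(w^2))."""
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def sample_size_weights(n):
    """Regular MetaZ weights: square root of each study's sample size."""
    return [math.sqrt(n_i) for n_i in n]


def imputation_aware_weights(n, r2):
    """Assumed 'imputation aware' weights: sqrt(R^2 * n) for each study."""
    return [math.sqrt(r2_i * n_i) for n_i, r2_i in zip(n, r2)]


# Hypothetical inputs: two studies of equal size, one well imputed, one poorly imputed.
z = [3.0, 1.0]
n = [2000, 2000]
r2 = [1.0, 0.25]

z_regular = combine_z(z, sample_size_weights(n))
z_aware = combine_z(z, imputation_aware_weights(n, r2))
```

With equal sample sizes the regular scheme weights both studies equally, while the imputation-aware scheme halves the relative weight of the study with R2 = 0.25.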

For MetaBeta, most studies in current practice use the traditional inverse variance weighting to combine estimates from imputed and genotyped SNPs [Soranzo et al. (2009); Willer et al. (2008)]. However, it is unknown whether inverse variance weighting is the optimal weighting scheme in this situation. In this paper, we address this question. For imputed SNPs, we find that the optimal weight is proportional to both the expected value and the inverse variance of the estimates given by the expectation-substitution method. While the expectation-substitution method does not give unbiased estimators in general, the bias is usually very small under practical situations of GWAS. Based on this finding, we show that the inverse variance weighting scheme is a good approximation of the optimal weight for the meta-analysis of imputed SNPs. These results are important because they validate that the expectation-substitution method and the inverse variance weighting scheme currently being used in GWAS meta-analysis are adequate and close to optimal in GWAS settings.

MATERIALS AND METHODS

Models

Consider a case-control study of n individuals. For a given SNP, suppose that for subject i, i = 1, …, n, the true genotype is gi = 0, 1, or 2 and the disease status is di = 0 or 1, where 0 indicates control and 1 indicates case. The standard logistic model for the association between the SNP and disease status is:

$$\operatorname{logit} P(d_i = 1 \mid g_i) = \log\frac{P(d_i = 1 \mid g_i)}{1 - P(d_i = 1 \mid g_i)} = b_0 + b_1 g_i$$
(1)

Note that model (1) is designed for a prospective study where subjects are first selected, then followed up for disease development. However, in many GWAS, the study design is retrospective. In a seminal paper, Prentice and Pyke (1979) showed that it is valid to apply model (1) to a case-control study as if the data were prospectively collected, and that the resulting estimators of b1 are consistent for the true values and asymptotically normal. Because of its simplicity and the appealing interpretation of exp(b1), which approximates the relative risk for a rare disease, model (1) has been widely used in practice and will be used throughout this paper.

If the genotype for this given SNP is unknown, the expectation-substitution method replaces the unknown genotype by the dosage from the imputation, gi = pi1 + 2pi2. In this case, model (1) becomes

$$\operatorname{logit} P(d_i = 1 \mid g_i) = b_0 + b_1 g_i, \qquad g_i = p_{i1} + 2 p_{i2}$$
(2)

The likelihood function can be written as:

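$$L(b_0, b_1) = \prod_{i=1}^{n} \mu(g_i; b_0, b_1)^{d_i}\,\{1 - \mu(g_i; b_0, b_1)\}^{1 - d_i}, \qquad \mu(g; b_0, b_1) = \frac{\exp(b_0 + b_1 g)}{1 + \exp(b_0 + b_1 g)}$$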
(3)

By Taylor’s expansion, the maximum likelihood estimator (b̂0, b̂1) for (b0, b1) satisfies

equation M5
(4)

where

equation M6
(5)

equation M7
(6)

and ( equation M8) is on the line segment joining (b0, b1) and (b̂0, b̂1).
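
In practice the maximum likelihood estimator is obtained by Newton-Raphson iterations on the score and information. The sketch below assumes that U and I in (5) and (6) are the usual score vector and information matrix of the two-parameter logistic model; the function names and the toy data are illustrative only.

```python
import math


def mu(g, b0, b1):
    """Logistic mean exp(b0 + b1*g) / (1 + exp(b0 + b1*g))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))


def score_and_information(g, d, b0, b1):
    """Score vector (u0, u1) and information entries (i00, i01, i11)."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for g_i, d_i in zip(g, d):
        m = mu(g_i, b0, b1)
        h = m * (1.0 - m)
        u0 += d_i - m
        u1 += g_i * (d_i - m)
        i00 += h
        i01 += g_i * h
        i11 += g_i * g_i * h
    return (u0, u1), (i00, i01, i11)


def fit_logistic(g, d, max_iter=50, tol=1e-10):
    """Newton-Raphson fit; returns (b0_hat, b1_hat, estimated var of b1_hat)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        (u0, u1), (i00, i01, i11) = score_and_information(g, d, b0, b1)
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det   # first row of I^{-1} U
        step1 = (i00 * u1 - i01 * u0) / det   # second row of I^{-1} U
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Re-evaluate the information at the final estimate for the variance.
    _, (i00, i01, i11) = score_and_information(g, d, b0, b1)
    var_b1 = i00 / (i00 * i11 - i01 * i01)    # (2,2) entry of I^{-1}
    return b0, b1, var_b1


# Toy data: dosages (here exactly 0 or 1) and case/control status.
g = [0.0] * 4 + [1.0] * 4
d = [1, 0, 0, 0, 1, 1, 1, 0]
b0_hat, b1_hat, var_b1 = fit_logistic(g, d)
```

The same routine applies unchanged to true genotypes (model (1)) and to fractional dosages (model (2)), since only the covariate values differ.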

Taking the expectation of U (b0, b1) in equation (4), we have

equation M9
(7)

When b1 = 0 (no association) or one of pi0, pi1, pi2 is 1 (perfectly imputed), it is obvious that E(di|pi0, pi1, pi2) − μ(gi; b0, b1) = pi0μ(0; b0, b1) + pi1μ(1; b0, b1) + pi2μ(2; b0, b1) − μ(pi1 + 2pi2; b0, b1) = 0 and b̂1 is unbiased. Therefore, the expectation-substitution method does not inflate the type I error rate. On the other hand, if b1 ≠ 0 and the imputation is imperfect, b̂1 from (4) is biased, which, as we show below, could cause problems.
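
The mismatch term E(di|pi0, pi1, pi2) − μ(pi1 + 2pi2; b0, b1) can be evaluated directly. The following Python sketch (function names and input values are illustrative) checks that it vanishes under no association or perfect imputation and is nonzero otherwise.

```python
import math


def mu(g, b0, b1):
    """Logistic mean exp(b0 + b1*g) / (1 + exp(b0 + b1*g))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))


def mismatch(p, b0, b1):
    """E(d | p0, p1, p2) minus mu evaluated at the dosage p1 + 2*p2."""
    p0, p1, p2 = p
    expected_d = p0 * mu(0, b0, b1) + p1 * mu(1, b0, b1) + p2 * mu(2, b0, b1)
    return expected_d - mu(p1 + 2.0 * p2, b0, b1)


no_association = mismatch((0.5, 0.0, 0.5), b0=0.0, b1=0.0)
perfect_imputation = mismatch((0.0, 1.0, 0.0), b0=0.0, b1=math.log(2))
imperfect = mismatch((0.5, 0.0, 0.5), b0=0.0, b1=math.log(2))
```

In the last case the subject is equally likely to carry 0 or 2 copies, so E(d) = 0.5 × 0.5 + 0.5 × 0.8 = 0.65, while μ at the dosage of 1 is 2/3; the difference of −1/60 is the per-subject contribution to the bias of the score.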

Optimal weight for meta-analysis with imputed values

Suppose that, for a given imputed SNP with b1 ≠ 0, the estimate from (4) in the ith study (i = 1, …, M) is b̂1i, the estimated variance of b̂1i is Vi, and the weight for the ith study is wi. Then the estimate for b1 from the meta-analysis is

$$\hat b_1 = \frac{\sum_{i=1}^{M} w_i \hat b_{1i}}{\sum_{i=1}^{M} w_i}$$

Denote E(b̂1i) by μi. The test statistic is

$$Z = \frac{\sum_{i=1}^{M} w_i \hat b_{1i}}{\sqrt{\sum_{i=1}^{M} w_i^2 V_i}}$$
(8)

Based on (8), finding the optimal weight to maximize the power to detect the association is equivalent to maximizing

$$\frac{\sum_{i=1}^{M} w_i \mu_i}{\sqrt{\sum_{i=1}^{M} w_i^2 V_i}}$$
(9)

A simple derivation shows that wi needs to be proportional to μi/Vi in order to maximize (9). Note that even if the effect size is the same across studies, μi may still vary among studies, because variation in imputation quality between studies yields different degrees of bias in the b1 estimates. This contrasts with directly genotyped data, where μi = b1 for all studies, so wi needs only to be proportional to 1/Vi. However, this optimal weight, which incorporates both the variance and μi, is hard to use in practice because of the difficulty in estimating μi.
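
The maximization step can be sketched with the Cauchy-Schwarz inequality, assuming the criterion in (9) is the expected value of the weighted sum divided by its standard deviation, with μi the expected value of the ith study's estimate:

$$
\sum_{i=1}^{M} w_i \mu_i
= \sum_{i=1}^{M} \left(w_i \sqrt{V_i}\right)\left(\frac{\mu_i}{\sqrt{V_i}}\right)
\le \sqrt{\sum_{i=1}^{M} w_i^2 V_i}\;\sqrt{\sum_{i=1}^{M} \frac{\mu_i^2}{V_i}},
\qquad\text{so}\qquad
\frac{\sum_{i} w_i \mu_i}{\sqrt{\sum_{i} w_i^2 V_i}} \le \sqrt{\sum_{i=1}^{M} \frac{\mu_i^2}{V_i}} .
$$

Equality holds exactly when $w_i \sqrt{V_i} \propto \mu_i / \sqrt{V_i}$, that is, when $w_i \propto \mu_i / V_i$. If all $\mu_i$ are equal, the common factor cancels and the weight reduces to $w_i \propto 1/V_i$.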

Fortunately, we can show theoretically that the bias of b̂1 is very small when the true b1 is small, regardless of the imputation quality. For example, when b0 = 0 and b1 = log(1.2), the bias |E(b̂1) − b1| is bounded by approximately 0.002; when b1 = log(1.5), it is bounded by approximately 0.02. Further theoretical details showing the upper bound of the bias are provided in the Appendix. The theoretical results about the approximate unbiasedness are also verified by extensive simulations in the Results section.

Given the approximate unbiasedness of the b̂1 estimators, the optimal weight can therefore be approximated by the regular inverse variance weight.
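
A minimal Python sketch of the weighted-average estimator with the two weighting schemes is given below. The function names and the numerical inputs are hypothetical; the standard error assumes independent studies with known variances.

```python
import math


def weighted_meta(betas, variances, weights):
    """Weighted-average estimate, its standard error, and the z statistic."""
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(sum(w * w * v for w, v in zip(weights, variances))) / total_w
    return beta, se, beta / se


def inverse_variance_weights(variances):
    """Weights proportional to 1 / V_i."""
    return [1.0 / v for v in variances]


def optimal_weights(expected_betas, variances):
    """Weights proportional to mu_i / V_i (requires the usually unknown mu_i)."""
    return [m / v for m, v in zip(expected_betas, variances)]


# Hypothetical per-study estimates and variances.
betas = [0.20, 0.10]
variances = [0.01, 0.04]
beta_ivw, se_ivw, z_ivw = weighted_meta(
    betas, variances, inverse_variance_weights(variances)
)
```

When the μi are equal across studies, `optimal_weights` and `inverse_variance_weights` differ only by a constant factor and give the same combined estimate, which is the sense in which inverse variance weighting approximates the optimal weight.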

Inverse variance incorporates imputation quality

We have shown that the inverse variance weight can approximate the optimal weight. For imputed SNPs, it seems natural that the weight for b̂1 should increase as imputation quality increases. For this reason, we explore whether the inverse variance weighting scheme incorporates imputation quality. In the expectation-substitution method, the variance of (b̂0, b̂1)′ can be estimated by I⁻¹(b̂0, b̂1). Letting h(g; b0, b1) = μ(g; b0, b1){1 − μ(g; b0, b1)}, we have

equation M16
(10)

The first derivative of h(g; b0, b1) with respect to g is b1 exp(b0 + b1g){1 − exp(b0 + b1g)}/{1 + exp(b0 + b1g)}³, which is approximately 0 when b1 is sufficiently small. Hence, we can treat h(gi; b0, b1) as a constant c, and write equation (10) as

equation M17
(11)

where R2 is the imputation quality measure in MACH [Li et al. (2010)], defined as the ratio of the sample variance of the imputed dosage gi to the expected variance of the true genotype, which is equivalent to the squared correlation between true and imputed genotypes. From equation (11), we can see that the inverse variance of b̂1 is approximately proportional to the imputation quality. Thus, the current inverse variance weighting scheme automatically incorporates imputation quality in the meta-analysis. Simulation results confirm the positive correlation between the imputation quality and the inverse variances (see Results section).
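
The link between the inverse variance and R2 can be sketched as follows, assuming the information matrix has the usual logistic form and that h is replaced by the constant c. Here $\bar g$ and $s_g^2$ denote the sample mean and sample variance (with divisor n) of the dosages, and $\sigma_g^2$ denotes the expected variance of the true genotype; these symbols are introduced for this sketch only.

$$
I(b_0, b_1) \approx c \sum_{i=1}^{n}
\begin{pmatrix} 1 & g_i \\ g_i & g_i^2 \end{pmatrix}
\quad\Longrightarrow\quad
\widehat{\operatorname{var}}(\hat b_1) = \left[I^{-1}\right]_{22}
\approx \frac{n}{c\left\{ n \sum_i g_i^2 - \left(\sum_i g_i\right)^2 \right\}}
= \frac{1}{c \sum_i (g_i - \bar g)^2}
= \frac{1}{c\, n\, s_g^2} .
$$

With $R^2 = s_g^2 / \sigma_g^2$, this gives

$$
\frac{1}{\widehat{\operatorname{var}}(\hat b_1)} \approx c\, n\, R^2\, \sigma_g^2 ,
$$

so for a fixed sample size and allele frequency the inverse variance scales linearly with R2.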

Another interesting observation is that there is a connection between the inverse variance weighting scheme and the “imputation aware” method in Zaitlen and Eskin (2010) through (11). Note that the inverse variance weighting estimator can be written as

equation M18
(12)

and the “imputation aware” method can be written as

equation M19
(13)

We can see that the only difference between (12) and (13) is the var(gi) part. Since var(gi) depends on the MAF, we expect the two methods to perform similarly when the MAFs of the SNP are similar across studies. Generally, we do not expect the MAF to vary much among studies of similar ethnicity. However, if a meta-analysis is conducted across different ethnic groups [Xiong et al. (2009); Chapman et al. (2008)], the MAF variation can be substantial. In such cases, we expect the inverse variance weighting method to have better power.

RESULTS

In this section, we first use simulation to demonstrate the finite sample properties of b̂1 given by the expectation-substitution method, such as the approximate unbiasedness and the relationship between var(b̂1) and imputation quality. Then, we compare the power of the inverse variance weighting method in the meta-analysis with various other methods.

Finite sample properties of b̂1

Simulation Situations

We generated the genotypes of two SNPs, considering a range of minor allele frequency (MAF) combinations (f1, f2) for the two SNPs and a range of values of the linkage disequilibrium (LD) measure D′. To mimic the imputation scenario, we assumed that the genotypes of the second SNP were unknown and imputed its dosage based on the genotypes of the first SNP. We varied the imputation quality by changing the LD measure D′. A population of 10,000 was generated based on the logistic regression model in equation (1) with genotypes at the second SNP as the gi’s, b0 = 0, and b1 = log(1.2), log(1.5), log(2), corresponding to odds ratios of 1.2, 1.5, and 2. Then 1000:1000 case-control samples were randomly selected from this population of 10,000. We fit model (2) to the case-control samples with the imputed dosage at the second SNP as gi. For comparison, we also fit model (1) with the true genotype gi. For each parameter setting, we replicated the above procedure 10,000 times. The results are summarized in Table I.
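
The text does not spell out how the dosage of the second SNP was derived from the first. One simple possibility, sketched below in Python, is the conditional expectation of the second genotype given the first, assuming Hardy-Weinberg equilibrium, positive LD, and D = D′ × Dmax; the function names and these modelling choices are assumptions made for illustration.

```python
def haplotype_freqs(f1, f2, d_prime):
    """Two-SNP haplotype frequencies for minor alleles A (SNP 1) and B (SNP 2).

    Assumes positive LD, so D = D' * min(f1 * (1 - f2), f2 * (1 - f1)).
    """
    d = d_prime * min(f1 * (1.0 - f2), f2 * (1.0 - f1))
    return {
        "AB": f1 * f2 + d,
        "Ab": f1 * (1.0 - f2) - d,
        "aB": (1.0 - f1) * f2 - d,
        "ab": (1.0 - f1) * (1.0 - f2) + d,
    }


def expected_dosage(g1, f1, f2, d_prime):
    """E(g2 | g1): expected minor-allele count at SNP 2 given the SNP 1 genotype.

    Under Hardy-Weinberg equilibrium the two haplotypes are independent, so
    each of the g1 haplotypes carrying A contributes P(B | A) and each of the
    2 - g1 haplotypes carrying a contributes P(B | a).
    """
    h = haplotype_freqs(f1, f2, d_prime)
    p_b_given_big_a = h["AB"] / f1
    p_b_given_small_a = h["aB"] / (1.0 - f1)
    return g1 * p_b_given_big_a + (2 - g1) * p_b_given_small_a


# Dosages for g1 = 0, 1, 2 with f1 = f2 = 0.2 and D' = 0.5.
dosages_half_ld = [expected_dosage(k, 0.2, 0.2, 0.5) for k in (0, 1, 2)]
```

With D′ = 1 and equal allele frequencies the dosage reproduces the first SNP's genotype (perfect imputation), and with D′ = 0 it collapses to the constant 2 × f2, so D′ controls the imputation quality as described above.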

Table I
Simulation Results of the Expectation-Substitution Method Under Various Parameter Settings based on 10,000 simulated data sets, each has 1000 cases and 1000 controls. b1 is the true value; f1 and f2 are the MAFs for SNP 1 (the marker) and 2 (the disease ...

When b1=0, all the estimated type I error rates are well controlled at the nominal α level of 0.05. When b1 = log(1.2) or log(1.5), the relative bias of b̂1 is very small (< 2%) regardless of the MAF of both SNPs. In contrast, when b1 is larger, log(2), b̂1 slightly underestimates the true b1 and the bias is greater as the imputation quality worsens. Under the simulation settings in Table I, for any given MAF combination (f1, f2), b0, b1, and D′, we obtained a numeric solution of equation M20, where equation M21, by solving the following system of equations:

equation M22
(14)

Figure 1 shows that even with the worst imputation quality in Table I (when D′=0.5), the bias of b̂1 is still less than 5% for an odds ratio as large as 2. Since it is uncommon for the associated alleles identified by GWAS to have an odds ratio greater than 2 [Hindorff et al. (2009); NHGRI GWAS Catalog (2011)], this bias is not problematic in GWAS settings.

Figure 1
The theoretical relative bias (%) of b̂1 as a function of the true b1. The biases are computed from (14) with different f1, f2 and b1. b0 is fixed at 0 and D′ is fixed at 0.5.

In Table I, the mean of the standard errors (SE) and the standard deviation (SD) of the estimates over 10,000 simulated data sets agree with each other very well, suggesting that the standard error estimates are reliable. Furthermore, the standard errors of b̂1 decrease as the imputation quality R2 increases; as a result, the power (Power) increases. As a comparison, we also show the standard deviations of the parameter estimates (SD*) and the power (Power*) if the genotypes for SNP 2 are known. As we can see, SD* is always less than SD and Power* is always greater than Power, which implies that there is efficiency loss from using imputed genotypes. For example, when b1=log(1.2) and f1=f2=0.2, the power loss decreases from 67% to 0.6% as the imputation quality increases. Taken together, we can see that even with very small R2, the power is still acceptable in many cases using imputed genotypes. The estimated coverage probabilities are all very close to the nominal value 0.95, indicating that the confidence interval estimates are very accurate.

Real imputation data

In order to explore the performance of the expectation-substitution method in a more realistic setting, we used GWAS scans from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) [Prorok et al. (2000); Hayes et al. (2000)]. PLCO is a randomized, two-arm trial coordinated by the NCI in ten U.S. centers.

The PLCO data include 2520 samples, genotyped on Illumina Human Hap 300k&240k, 550k and 610k platforms. We randomly selected 1000 genotyped SNPs on chromosome 22 and masked their genotypes. Then we used MACH to impute the genotypes of the 1000 SNPs as if they were untyped, using HapMap II release 24 as the reference panel. In this way, we have both the true genotypes and the imputed dosages. As in the previous section, case-control samples were generated based on model (1) using the true genotypes, and b1 was estimated by fitting model (2) with the imputed dosages. We set b1 to be 0, log(1.2), log(1.5) and log(2). For each value of b1, we replicated the procedure 50 times for each of the 1000 SNPs. Figure 2 shows a boxplot of the percent bias of b̂1 for SNPs grouped by MAF and R2. We can see that b̂1 is approximately unbiased regardless of the imputation quality R2, which agrees with the theoretical results. On the other hand, the variability of the estimates is much greater when R2 < 0.3 and MAF < 0.05.

Figure 2
Boxplot of the bias of b̂1 for SNPs grouped by different MAF and R2 categories.

Performance of inverse variance weighting in the meta-analysis

We generated the data in the same way as in the previous section. Here we let b1 take 10 equally spaced values from 0.05 to log(2), set the MAFs (f1, f2) of the two SNPs to (0.2, 0.2) for both studies, and set the LD measure D′ to 0.5 and 0.99 for the two studies, respectively. We conducted a meta-analysis of the two studies using the following four methods and compared their power:

  1. The optimal weighting, which is proportional to μi/Vi. In practice, it is usually impossible to estimate μi. However, with b1, f1, f2 and D′ known in the simulation, we can compute μi from (14). Hence, we can estimate the optimal weight for the purpose of comparison.
  2. The inverse variance weighting method, which is an approximation of the optimal weighting under practical situations in GWAS.
  3. The “imputation aware” method by Zaitlen and Eskin (2010).
  4. The regular z-score meta-analysis (MetaZ) method without correcting for imputation quality.

As we can see from Figure 3, the optimal weighting, inverse variance weighting and the “imputation aware” method have indistinguishable performance. In addition, they are all more powerful than the regular MetaZ method which does not account for imputation quality. This confirms that the inverse variance weighting method is a good approximation of the optimal weight and it automatically incorporates the imputation quality.

Figure 3
The power of optimal weighting (optimal), inverse variance weighting (IVW) method, “imputation aware” method (‘Imputation Aware’ Z) and the regular z-score meta-analysis without imputation quality (MetaZ) from the meta-analysis ...

We also simulated a situation where the MAFs differ between the two studies, which results in different var(gi). Instead of letting the MAF be 0.2 for both studies, we let the MAF be 0.1 for the first study and 0.4 for the second study. The power comparison is shown in Figure 4. As expected, the inverse variance weighting method performs better than the “imputation aware” method in this case because it is an approximation to the optimal weight. In practice, we would not expect MAFs to differ substantially for studies of similar populations. However, for a cross-ethnicity meta-analysis, inverse variance weighting is superior to the “imputation aware” method since it accounts for the MAF variation among different ethnic groups.
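
A short worked calculation indicates how large the var(gi) effect is in this setting. It assumes Hardy-Weinberg equilibrium, so that the genotype variance is 2f(1 − f) for MAF f, and it assumes that the inverse variance is approximately proportional to n × R2 × var(g) with the same constant in both studies:

$$
\operatorname{var}(g \mid f = 0.1) = 2(0.1)(0.9) = 0.18,
\qquad
\operatorname{var}(g \mid f = 0.4) = 2(0.4)(0.6) = 0.48,
\qquad
\frac{0.48}{0.18} = \frac{8}{3} \approx 2.67 .
$$

Under these assumptions, at equal sample size and equal R2 the inverse variance of the second study is about 2.67 times that of the first, whereas a weight based only on n and R2 treats the two studies identically.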

Figure 4
The power of optimal weighting (optimal), inverse variance weighting (IVW) method, “imputation aware” method (‘Imputation Aware’ Z) and the regular z-score meta-analysis without imputation quality (MetaZ) from the meta-analysis ...

Discussion

As imputation has been widely used to recover information from GWAS data, the expectation-substitution method has become the most commonly used method to analyze imputed SNPs while accounting for genotype uncertainty. Our work shows, both numerically and theoretically, that the expectation-substitution method gives approximately unbiased estimates under practical conditions of low effect sizes for GWAS of common diseases. We also show that the inverse variance weighting scheme approximates the optimal weight well and, in our simulations, has power similar to or better than that of the other meta-analysis methods compared.

Two recent papers have outlined the advantages of using meta-analysis, and discussed study design, quality control and analysis issues to consider when implementing meta-analysis of GWAS data [Cantor et al. (2010); Zeggini and Ioannidis (2009)]. These papers address weighting schemes for combining results, but focus more on random-effects vs. fixed-effects analysis, rather than on methods to include imputation quality.

The different imputation software packages provide information not only on the probability of each genotype, but also an overall imputation quality measure. This measure is typically defined as the ratio of the sample variance of the imputed genotype to the expected variance, with lower scores indicating less well imputed SNPs. Studies often exclude SNPs with either low R2 or low MAF. A threshold of imputation R2=0.3 has been recommended by MACH as the imputation quality cut-off for estimates [MACH Homepage (2010)]. Our results show that, in terms of bias, the combination of imputation quality and MAF seems to be most relevant. In particular, we show that the variability of estimates is large for lower imputation quality and lower MAF. In current practice, rare variants (MAF<0.05) are often excluded from imputation and subsequent meta-analysis. In this situation, either not using a filter or using a filter based only on R2 is likely sufficient. However, as meta-analyses grow larger and data become available to impute rare variants, we recommend using both the imputation quality and the MAF to set filtering criteria. For example, in our simulation results (Figure 2) the optimal filter appears to be excluding SNPs with both MAF<0.05 and R2<0.3, rather than all SNPs with R2<0.3. In this paper, we used the imputation quality measure R2 defined by MACH [Li et al. (2010)], which is the squared correlation between true genotypes and imputed dosages. In Beagle [Browning and Browning (2009)], R2 is defined as the squared correlation between the true and the most likely genotypes. To investigate whether the choice of quality measure makes much difference, we randomly chose 10,000 imputed SNPs on chromosome 22 in the PLCO data [Prorok et al. (2000); Hayes et al. (2000)] and computed their MACH R2 and Beagle R2. It turns out that the two R2’s are highly correlated (r>0.99). Thus, although the cut-offs for the two R2’s could be slightly different, the general conclusion should still hold.

As we move into the post-GWAS era, our results provide important guidance for investigators on how to optimally conduct meta-analysis in the presence of imputed genotypes for marginal SNP associations. We support the current practice of using the expectation-substitution method and the inverse variance weighting in meta-analysis. Additional theoretical and numerical work is needed to evaluate the use of imputed data in more sophisticated analysis, including proposed methods for gene-gene and gene-environment interactions.

Acknowledgments

We thank two reviewers for their helpful comments. This work was supported by the National Institutes of Health (5R01 CA059045 and 5U01 CA137088, R01AG14358, P01CA53996).

Genotype data included in these analyses from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics and supported by contracts from the Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, Department of Health and Human Services. The authors thank Drs. Christine Berg and Philip Prorok, Division of Cancer Prevention, National Cancer Institute, the Screening Center investigators and staff of the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, Mr. Tom Riley and staff, Information Management Services, Inc., Ms. Barbara O’Brien and staff, Westat, Inc., and Drs. Bill Kopp, Wen Shao, and staff, SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible.

Data included in these analyses were also generated from the GWAS of Lung Cancer and Smoking. Funding for this work was provided through the National Institutes of Health Genes, Environment and Health Initiative [NIH GEI] (Z01 CP 010200). The human subjects participating in the GWAS derive from The Environment and Genetics in Lung Cancer Etiology (EAGLE) case-control study and the Prostate, Lung, Colon and Ovarian Screening Trial and these studies are supported by intramural resources of the National Cancer Institute. Assistance with genotype cleaning, as well as with general study coordination, was provided by the Gene Environment Association Studies, GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01 HG 004438). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number ph000093 v2.p2.c1.

In addition, data generated from the Cancer Genetic Markers of Susceptibility (CGEMS) [CGEMS (2010)] prostate cancer scan were also included in this analysis. The datasets used for the analyses described in this manuscript were accessed with appropriate approval through the dbGaP online resource (http://www.cgems.cancer.gov/data) through dbGaP accession number 000207 v.1p1.c1.

Appendix

First, we introduce some notation. Let μ(gi, b0, b1) = exp(b0 + b1gi)/{1 + exp(b0 + b1gi)}. For convenience, we will interchangeably use the notation μ(·, b0, b1) and μ(·) in the Appendix. Denote the first derivative of μ(g) with respect to g as μ′(g). For an interval int, define Uint = sup{μ′(g): g ∈ int} and Lint = inf{μ′(g): g ∈ int}. Further define DU[x1, x2] = |U[x1,x2] − {μ(x2) − μ(x1)}/(x2 − x1)| and DL[x1, x2] = |L[x1,x2] − {μ(x2) − μ(x1)}/(x2 − x1)|.

Lemma 1 shows that the extrema of |E(di|pi0, pi1, pi2) − μ(gi; b0, b1)| can only be achieved on the boundary. It also computes the extrema of |E(di|pi0, pi1, pi2) − μ(gi; b0, b1)| under each boundary condition and takes the largest as the upper bound for |E(di|pi0, pi1, pi2) − μ(gi, b0, b1)|.

Lemma 2 shows that there exists some δ(b0, b1) (which depends on the upper bound given by Lemma 1) such that, writing b̃1 for a candidate value of the slope, when b̃1 ≥ b1 + δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) > 0 for any gi ∈ [0, 2]; and when b̃1 ≤ b1 − δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) < 0 for any gi ∈ [0, 2]. As a result, equation M23, the root of score equation equation M24, lies between b1 − δ(b0, b1) and b1 + δ(b0, b1). Given that equation M25, the theorem is proved.

Lemma 1

|E(di|pi0, pi1, pi2) − μ(gi; b0, b1)| ≤ M(b0, b1) min(gi, 2 − gi), where M(b0, b1) = max(DU[0, 2], DL[0, 2], DU[0, 1], DL[0, 1], DU[1, 2], DL[1, 2]).

Proof

We can rewrite E(di|pi0, pi1, pi2) − μ(gi) in terms of pi0 and gi using the constraints pi0 + pi1 + pi2 = 1 and pi1 + 2pi2 = gi. This gives f(gi, pi0) = E(di|pi0, pi1, pi2) − μ(gi) = pi0μ(0) + (2 − 2pi0 − gi)μ(1) + (pi0 + gi − 1)μ(2) − μ(gi). The extrema of f(gi, pi0) occur where the derivative equals 0 or at the boundary. Taking the first derivative of f(gi, pi0) with respect to pi0, we can see that there is no solution for the derivative equaling 0. So the extrema can only occur at the boundary: pi0 = 1 − gi/2 or pi0 = 1 − gi or pi0 = 0. We can calculate the extrema for each boundary condition.

When pi0 = 1 − gi/2, f(gi, pi0) = (1 − gi/2)μ(0) + (gi/2)μ(2) − μ(gi). We can see that μ(gi) lies in the interval [μ(0) + L[0,2]gi, μ(0) + U[0,2]gi] and also in [μ(2) − U[0,2](2 − gi), μ(2) − L[0,2](2 − gi)]. Plugging the upper and lower bounds of μ(gi) into f(gi, pi0), we have |f(gi, pi0)| < max(DU[0, 2], DL[0, 2]) min(gi, 2 − gi).

Similarly, we can show that when pi0 = 1 − gi, |f(gi, pi0)| < max(DU[0, 1], DL[0, 1]) min(gi, 2 − gi); when pi0 = 0, |f(gi, pi0)| < max(DU[1, 2], DL[1, 2]) min(gi, 2 − gi).

Combining all the results above we have

equation M26
(15)

Lemma 2

Let δ(b0, b1) = sup{M(b0, b1)/[(1 − μ(gi, b0, b1))μ(gi, b0, b1)]: gi ∈ [0, 2]}. Then, writing b̃1 for a candidate value of the slope, when b̃1 ≥ b1 + δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) > 0 for any gi ∈ [0, 2]; when b̃1 ≤ b1 − δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) < 0 for any gi ∈ [0, 2].

Proof

Consider the following equation of b1

equation M27
(16)

The root for this equation would be

equation M28
(17)

Denote E(di|pi0, pi1, pi2) − μ(gi) by Δi, and write b̃1 for the root of equation (16). When Δi is small, log{E(di|pi0, pi1, pi2)⁻¹ − 1} in equation (17) can be approximated by log{μ(gi)⁻¹ − 1} − Δi/[(1 − μ(gi))μ(gi)] by Taylor’s expansion. As a result, b̃1 ≈ b1 + (Δi/gi)/[(1 − μ(gi))μ(gi)]. From equation (15), |Δi| ≤ M(b0, b1) min(gi, 2 − gi). It follows that |b̃1 − b1| < M(b0, b1)/[(1 − μ(gi))μ(gi)]. Let δ(b0, b1) = sup{M(b0, b1)/[(1 − μ(gi))μ(gi)]: gi ∈ [0, 2]}. As μ(gi, b0, b̃1) is an increasing function of b̃1, combining with the fact that the root of equation (16) lies in [b1 − δ(b0, b1), b1 + δ(b0, b1)], we can see that when b̃1 ≥ b1 + δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) > 0 for any gi ∈ [0, 2]; and when b̃1 ≤ b1 − δ(b0, b1), μ(gi, b0, b̃1) − E(di|pi0, pi1, pi2) < 0 for any gi ∈ [0, 2].

Theorem

|E(b̂1) − b1| < δ(b0, b1), where b̂1 denotes the estimate of b1.

Proof

As equation M29 is the root of the equation of b1:

equation M30
(18)

Let b1* denote the argument of equation (18). Applying Lemma 2, when b1* ≥ b1 + δ(b0, b1), the LHS of equation (18) will be positive; when b1* ≤ b1 − δ(b0, b1), the LHS of equation (18) will be negative. As the LHS of equation (18) is also an increasing function of b1*, the root of equation (18), equation M31, must lie in [b1 − δ(b0, b1), b1 + δ(b0, b1)]. Given that equation M32, we have |E(b̂1) − b1| < δ(b0, b1).

To show the magnitude of δ(b0, b1), which is the upper bound on the bias of the estimate of b1, we evaluated it at a few values of b1. For example, when b0 = 0 and b1 = log(1.2), δ(b0, b1) = DU[0, 2]/[(1 − μ(2, b0, b1))μ(2, b0, b1)] = 0.002; when b1 = log(1.5), δ(b0, b1) = DU[0, 2]/[(1 − μ(2, b0, b1))μ(2, b0, b1)] = 0.02. These upper bounds on the bias have also been confirmed by the simulation studies.

References

  • Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet. 2006;38:659–662.
  • Browning BL, Browning SR. A unified approach to genotype imputation and haplotype phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–223.
  • Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am J Hum Genet. 2010;86(1):6–22.
  • Cancer Genetic Markers of Susceptibility (CGEMS) Data. 2009. http://cgems.cancer.gov/data/. 10-5-2009.
  • Chapman K, Takahashi A, Meulenbelt I, et al. A meta-analysis of European and Asian cohorts reveals a global role of a functional SNP in the 5′ UTR of GDF5 with osteoarthritis susceptibility. Hum Mol Genet. 2008;17(10):1497–504.
  • Cordell HJ. Estimation and testing of genotype and haplotype effects in case-control studies: comparison of weighted regression and multiple imputation procedures. Genet Epidemiol. 2006;30:259–275.
  • de Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S, Voight BF. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet. 2008;17(R2):R122–R128.
  • Hayes RB, Reding D, Kopp W, et al. Etiologic and early marker studies in the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control Clin Trials. 2000;21(6 Suppl):349S–355S.
  • Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106(23):9362–7.
  • Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. A Catalog of Published Genome-Wide Association Studies. Available at: www.genome.gov/gwastudies. Accessed 03/29/2011.
  • Kraft P, Cox DG, Paynter RA, Hunter D, De Vivo I. Accounting for haplotype uncertainty in matched association studies: a comparison of simple and flexible techniques. Genet Epidemiol. 2005;28:261–272.
  • Kraft P, Stram OD. RE: The Use of Inferred Haplotypes in Downstream Analysis. Am J Hum Genet. 2007;81:863–865.
  • Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34(8):816–834.
  • Lin DY, Huang BE. The use of inferred haplotypes in downstream analyses. Am J Hum Genet. 2007;80:577–579.
  • MACH Homepage. http://www.sph.umich.edu/csg/yli/mach/tour/imputation.html.
  • Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nat Genet. 2007;39:906–913.
  • Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
  • Prorok PC, Andriole GL, Bresalier RS, et al. Design of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials. 2000;21(6 Suppl):273S–309S.
  • Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007;3:e114.
  • Soranzo N, Rivadeneira F, Chinappen-Horsley U, et al. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS Genet. 2009;5(4):e1000445.
  • The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
  • The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861.
  • The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
  • Willer CJ, Speliotes EK, Loos RJF, et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2008;41:25–34.
  • Xiong DH, Liu XG, Guo YF, et al. Genome-wide Association and Follow-Up Replication Studies Identified ADAMTS18 and TGFBR3 as Bone Mass Candidate Genes in Different Ethnic Groups. Am J Hum Genet. 2009;84(3):388–398.
  • Zaitlen N, Eskin E. Imputation aware meta-analysis of genome-wide association studies. Genet Epidemiol. 2010;34(6):537–42.
  • Zeggini E, Ioannidis JP. Meta-analysis in genome-wide association studies. Pharmacogenomics. 2009;10(2):191–201.