Imputing individual-level genotypes (genotype imputation) is now a standard procedure in genome-wide association studies (GWAS) for examining disease associations at untyped common genetic variants. Meta-analysis of publicly available GWAS summary statistics can allow more disease-associated loci to be discovered, but these data are usually provided for different variant sets. Imputing summary statistics from different variant sets into a common reference panel for meta-analysis is therefore impossible with traditional genotype imputation methods, which require individual-level genotypes. Here we develop a fast and accurate P-value imputation (FAPI) method that uses summary statistics of common variants only. Its computational cost is linear in the number of untyped variants, and its accuracy is similar to that of IMPUTE2 with prephasing, one of the leading methods in genotype imputation. In addition, based on the FAPI idea, we develop a metric to detect abnormal association at a variant and show that it has significantly greater power than LD-PAC, a method that quantifies the evidence of spurious associations based on a likelihood ratio. Our method is implemented in a user-friendly software tool, which is available at http://statgenpro.psychiatry.hku.hk/fapi.
Genome-wide association studies (GWAS) are now widely used to investigate associations of common genetic variants with various human traits and diseases.1, 2 Although >50 million genetic variants have been discovered in the human genome and cataloged in dbSNP,3 no more than a few million of them are usually typed in a GWAS (often using high-density commercial genotyping arrays). As obtaining whole-genome sequencing data of GWAS samples is still costly, imputing the individual-level genotypes at untyped common variants (genotype imputation) is common practice, and multiple genotype imputation methods have been developed, including IMPUTE2,4 MACH,5 and BEAGLE.6, 7, 8 For a comprehensive review of traditional imputation approaches, see the work by Nothnagel et al9 as well as by Marchini and Howie.10 As most GWAS now release their summary statistics, a meta-analysis of these GWAS is attractive.11 However, these summary statistics are provided for different sets of common variants, depending on the genotyping platform and on the reference panel used for genotype imputation (such as HapMap212 versus 1000G13). Because traditional genotype imputation methods require individual-level genotypes, they cannot impute these summary statistics of different variant sets into a common reference panel for meta-analysis.
As researchers in most GWAS are mainly interested in assessing the evidence of disease association at untyped variants rather than in the individual-level genotypes themselves, the traditional two-step imputation approach (ie, genotype imputation followed by association analysis) can be simplified to a single-step approach (ie, imputing the significance of association, or P-values, at untyped common variants directly), which we call P-value imputation. As the P-values of associations of two variants in strong linkage disequilibrium (LD) must be correlated, one may impute the P-values of untyped variants from the P-values of nearby typed variants. This allows imputation of summary statistics from different variant sets into a common reference panel without the need for raw genotype data.
Here we determine the relationship between LD (r²) and the covariance of normal test statistics (ie, z-scores) at any two common variants by simulation and curve fitting. We then compute the distribution of the normal test statistic of an untyped common variant, conditional on the P-values at neighboring typed variants and the observed LD structure, using a multivariate normal distribution. We compare the speed and accuracy of our method, called fast and accurate P-value imputation (FAPI), with those of IMPUTE2 with prephasing,4 a leading method in traditional genotype imputation, in GWAS data sets of simulated traits as well as schizophrenia. Furthermore, based on the FAPI idea, we develop a metric to detect abnormal association at a variant using the P-values at neighboring variants in LD and compare it with LD-PAC,14 a method that quantifies the evidence of spurious associations based on a likelihood ratio, using simulations.
Genotypes of two biallelic common variants were simulated for various sample sizes (4000 and 10000 for a quantitative trait, as well as 2000 cases/2000 controls and 5000 cases/5000 controls for a binary trait) for a range of LD (in terms of r², from 0 to 1) and allele frequencies (from 0.05 to 0.95), under Hardy–Weinberg equilibrium. We then performed an association analysis for each of the two variants to obtain two normal test statistics (ie, z-scores). We repeated this procedure 10000 times for each set of parameters and computed the covariance of the normal test statistics of the two variants empirically under each scenario. We approximated the covariance of the normal test statistics with a polynomial of r² in R and considered the most parsimonious model that maximized adjusted R².
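A scaled-down sketch of this simulation and curve-fitting procedure is given below in Python with NumPy. It is not the authors' R code: the sample size, the number of replicates, the fixed allele frequency of 0.3, the correlation-based z-score for a null quantitative trait, and the function names `simulate_z_cov` and `best_polynomial` are all assumptions chosen to keep the example small.

```python
import numpy as np

def simulate_z_cov(r, p1=0.3, p2=0.3, n=500, reps=300, rng=None):
    """Empirical covariance of the z-scores of two variants with LD coefficient r.

    Genotypes are built from pairs of independent haplotypes (Hardy-Weinberg
    equilibrium); the trait is a null standard-normal quantitative trait.
    """
    rng = np.random.default_rng() if rng is None else rng
    q1, q2 = 1 - p1, 1 - p2
    d = r * np.sqrt(p1 * q1 * p2 * q2)          # LD coefficient D from r
    # Haplotype frequencies in the order AB, Ab, aB, ab.
    freqs = np.array([p1 * p2 + d, p1 * q2 - d, q1 * p2 - d, q1 * q2 + d])
    freqs = np.clip(freqs, 0, None)             # guard against tiny negative rounding
    freqs /= freqs.sum()
    haps = rng.choice(4, size=(reps, n, 2), p=freqs)
    g1 = (haps < 2).sum(axis=2).astype(float)   # count of allele A at variant 1
    g2 = (haps % 2 == 0).sum(axis=2).astype(float)  # count of allele B at variant 2
    y = rng.standard_normal((reps, n))          # null quantitative trait
    yc = y - y.mean(axis=1, keepdims=True)

    def zscore(g):
        gc = g - g.mean(axis=1, keepdims=True)
        corr = (gc * yc).sum(axis=1) / np.sqrt((gc**2).sum(axis=1) * (yc**2).sum(axis=1))
        return np.sqrt(n) * corr                # approximate z-score of the association

    return np.cov(zscore(g1), zscore(g2))[0, 1]

def best_polynomial(x, y, max_degree=4):
    """Return (adjusted R^2, degree, coefficients) of the best polynomial fit.

    Ties are broken in favor of the lower degree (the more parsimonious model).
    """
    m, best = len(x), None
    ss_tot = ((y - y.mean()) ** 2).sum()
    for k in range(1, max_degree + 1):
        coef = np.polyfit(x, y, k)
        ss_res = ((y - np.polyval(coef, x)) ** 2).sum()
        adj_r2 = 1 - (ss_res / ss_tot) * (m - 1) / (m - k - 1)
        if best is None or adj_r2 > best[0]:
            best = (adj_r2, k, coef)
    return best

rng = np.random.default_rng(1)
r2_grid = np.linspace(0.0, 1.0, 9)
covs = np.array([simulate_z_cov(np.sqrt(r2), rng=rng) for r2 in r2_grid])
adj_r2, degree, coef = best_polynomial(r2_grid, covs)
print("selected degree:", degree)
```

With so few replicates the fitted curve is noisy; the sample sizes and the 10000 replicates described above would be needed to approach the precision of the reported fit.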
Assume that there are n untyped variants and m typed variants with normal test statistics x_untyped and x_typed, respectively. The mean vector μ and the covariance matrix Σ of these m+n statistics can be partitioned as follows (subscripts u and t denote the untyped and typed variants):

$$
\mu = \begin{pmatrix} \mu_{u} \\ \mu_{t} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{uu} & \Sigma_{ut} \\ \Sigma_{tu} & \Sigma_{tt} \end{pmatrix}
$$

where μ equals 0 under the null and Σ is approximated using a polynomial of pairwise r² between the corresponding variants determined previously. The distribution of x_untyped given x_typed = a (where a is obtained from the probit function in case only P-values are available) follows a multivariate normal distribution with estimated mean

$$
\bar{\mu} = \mu_{u} + \Sigma_{ut} \Sigma_{tt}^{-1} (a - \mu_{t})
$$

and estimated covariance matrix

$$
\bar{\Sigma} = \Sigma_{uu} - \Sigma_{ut} \Sigma_{tt}^{-1} \Sigma_{tu}
$$

(Note that given μ̄, we can compute the P-values for untyped variants from the cumulative distribution function.)
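The conditional mean and covariance described above can be sketched in a few lines of Python. This is an illustration rather than the FAPI implementation: the function names `p_to_z` and `impute_z` are hypothetical, the covariance entries in the example are made-up values rather than outputs of the fitted polynomial, and the two-sided P-value from the standard normal distribution is an assumption about how the imputed mean is converted back.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_to_z(p, sign=1):
    """Probit transform of a two-sided P-value into a z-score.

    A P-value carries no direction, so the sign of the effect must be
    supplied separately (assumed positive by default).
    """
    return sign * STD_NORMAL.inv_cdf(1 - p / 2)

def impute_z(z_typed, cov_tt, cov_ut, cov_uu):
    """Conditional mean and covariance of untyped z-scores given typed ones.

    Assumes a zero mean vector (the null), so that
      mean = S_ut S_tt^{-1} a   and   cov = S_uu - S_ut S_tt^{-1} S_tu.
    """
    # solve() avoids forming the explicit inverse; S_tt is symmetric.
    weights = np.linalg.solve(cov_tt, cov_ut.T).T
    mean = weights @ z_typed
    cov = cov_uu - weights @ cov_ut.T
    return mean, cov

# Illustrative values only: two typed variants and one untyped variant.
z_typed = np.array([3.0, 2.0])
cov_tt = np.array([[1.0, 0.6], [0.6, 1.0]])
cov_ut = np.array([[0.8, 0.7]])
cov_uu = np.array([[1.0]])

mean, cov = impute_z(z_typed, cov_tt, cov_ut, cov_uu)
p_untyped = [2 * (1 - STD_NORMAL.cdf(abs(m))) for m in mean]
print(mean, cov, p_untyped)
```

Solving the linear system instead of inverting the typed-variant block is a numerical design choice; in practice the block would be restricted to typed variants within a local LD window.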
We compared the imputation performance of FAPI with IMPUTE2 (with prephasing), a leading method in genotype imputation, and DIST,15 a similar algorithm that imputes summary statistics for untyped common variants, using the genotype data of the 1958 Birth Cohort and National Blood Services (n=2938) in the Wellcome Trust Case Control Consortium 1 (WTCCC1). We used chromosome 1 data only and removed variants with minor allele frequencies <5%, Hardy–Weinberg P-value <0.001, and/or missing rate >1%. We then randomly simulated a binary trait (with an equal number of cases and controls) and a normally distributed trait, masked the data of one out of every five variants, and re-imputed their P-values for each trait using each of the three methods (FAPI, IMPUTE2 with prephasing, and DIST).
We measured their running time and maximum RAM usage on a 3.47GHz Intel processor. In addition, we assessed their accuracy by comparing the imputed P-values with the actual P-values for the subset of randomly masked variants analyzed by all three methods. 1000G European Phase I interim data was used as reference, and badly imputed genotypes and P-values with a score <0.6 were removed.
To demonstrate that FAPI also works outside of simulated data, we analyzed a previously published in-house GWAS data set of schizophrenia18 using FAPI. We applied the same quality-control procedure to this data set as to the simulated data.
Assume that a variant has an observed P-value po (corresponding to a normal test statistic xo) and m neighboring variants with known normal test statistics or P-values. Based on the FAPI idea, we can construct the conditional normal distribution, N(θ, σ²), with cumulative distribution function F(x), of the test statistic at that variant given the normal test statistics of the m variants (which can be transformed from P-values by a probit function in case only P-values are available). We compute a metric, ProbC, which is given by F(xo) if xo<θ and by 1−F(xo) otherwise. When the variant and its m neighboring variants are in strong LD, xo is expected to be close to θ with high confidence. If genotyping errors or other artifacts are present at that variant, the observed P-value po will deviate from its expected value, which leads to a small ProbC.
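The ProbC definition above translates directly into code. The sketch below uses only the Python standard library; the function name `prob_c` and the example values of θ and σ are assumptions for illustration, and in practice θ and σ would come from the conditional distribution given the neighboring variants.

```python
from statistics import NormalDist

def prob_c(x_obs, theta, sigma):
    """ProbC: the smaller tail probability of the observed test statistic
    under the conditional normal distribution N(theta, sigma^2)."""
    f = NormalDist(theta, sigma).cdf(x_obs)
    return f if x_obs < theta else 1 - f

# Hypothetical conditional distribution: mean 2.0, standard deviation 0.5.
consistent = prob_c(2.1, theta=2.0, sigma=0.5)   # observed close to expected
suspicious = prob_c(0.5, theta=2.0, sigma=0.5)   # observed 3 SDs below expected
print(consistent, suspicious)
```

A value such as `suspicious`, far below any reasonable threshold, would mark the variant for further inspection, whereas `consistent` would not.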
We compared the ProbC criterion for detecting abnormal associations with LD-PAC,14 a method that quantifies the evidence of spurious associations based on a likelihood ratio, using computer simulations. Consider a disease-causing bi-allelic variant in LD with five other bi-allelic variants, with genotypes at all variants under Hardy–Weinberg equilibrium. Given the allele frequencies of these variants (say 0.4), a variety of pairwise LD coefficients r among these loci, disease prevalence (5%), inheritance model (multiplicative), and relative allelic risk, the genotypes and case–control status of a population of 5 million individuals were simulated in the same way as above. A sample of 2000 cases and 2000 controls was randomly drawn without replacement from the population and, for a genotyping error rate of ω%, ω% of the genotypes at the risk locus in each simulated data set were altered. An allelic association test was then used to compute the association P-value at the risk locus in the sample. A hypothesis test using the proposed metric was used to assess whether the computed P-value differed from its expected value (ie, the imputed P-value in FAPI). Under the null hypothesis of no abnormal associations, the observed z-statistic of a variant should come from the conditional distribution derived from the P-values of neighboring typed variants, and a large deviation of the observed z-statistic from its expected value indicates abnormal association. We used ProbC≤0.025, which is expected to give a type-I error rate of 5%, to reject the null hypothesis that the observed association is normal. We repeated this procedure to generate 1000 data sets and counted the proportion of rejected hypotheses. Relative allelic risks of 1.0 and 1.2, as well as genotyping error rates of 0, 10, 30, and 50%, were considered in the simulations.
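The link between the ProbC threshold of 0.025 and the 5% type-I error rate follows from a short derivation, sketched here under the assumption that the conditional normal model holds exactly under the null:

```latex
% Under the null, U = F(X_o) is uniform on (0,1) by the probability
% integral transform, and ProbC is the smaller of the two tail areas.
\begin{align*}
\mathrm{ProbC} &= \min\{U,\; 1-U\}, \qquad U = F(X_o) \sim \mathrm{Uniform}(0,1),\\
\Pr(\mathrm{ProbC} \le c) &= \Pr(U \le c) + \Pr(U \ge 1-c) = 2c,
  \qquad 0 \le c \le \tfrac{1}{2},\\
\Pr(\mathrm{ProbC} \le 0.025) &= 0.05 .
\end{align*}
```

ProbC thus behaves like a one-sided tail probability, and thresholding it at 0.025 corresponds to a two-sided test at the 5% level.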
The same simulation procedure was also conducted to evaluate the performance of LD-PAC14 with default parameters except that the allelic association was performed by another method19 as suggested by the LD-PAC authors.
We introduced a FAPI method to directly infer the association P-values at untyped common variants from the association P-values of neighboring typed variants in LD (see Materials and methods). Briefly, we obtained the conditional means and (co)variances of the normal test statistics at untyped variants given the observed normal test statistics (transformed from the observed association P-values in case only P-values are available) at the typed variants and the known covariance among the normal test statistics of typed and untyped variants, assuming that the test statistics follow a multivariate normal distribution (see Materials and methods). The P-values at the untyped variants can then be obtained based on the conditional distribution of the normal test statistics.
We found from simulations that the covariance of normal test statistics of any two common variants can be approximated by a fourth-order polynomial of squared genotype correlation (r²) of the two common variants, when we considered the most parsimonious model that maximized adjusted R² (see Figure 1b and Supplementary Table S1). It should be noted that Figure 1b includes variants with a range of allele frequencies from 0.05 to 0.95, strongly suggesting that the approximation is independent of allele frequencies. This approximation works for both binary and quantitative traits as well as for different sample sizes (see Supplementary Figure S1).
We compared the imputation performance of FAPI with IMPUTE2 (with prephasing), a leading method in genotype imputation, and DIST,15 a similar algorithm that imputes summary statistics for untyped variants, using the chromosome 1 genotype data (number of samples=2938; number of variants=28369) provided by the WTCCC1. We randomly simulated a binary trait (with an equal number of cases and controls) and a normally distributed trait, masked the data of one out of every five variants, and re-imputed their P-values for each trait using the three imputation approaches (see Materials and methods).
We also assessed the accuracy of the three approaches by comparing their imputed P-values with the actual P-values for the subset of masked variants analyzed by all three methods (Figure 2). As expected, P-values from IMPUTE2 were the most accurate among the three, with a correlation of 0.98 with the actual P-values. Nearly 86% of the IMPUTE2-imputed P-values lay within 0.1 log10 units of their corresponding actual P-values, and all lay within 0.5 log10 units. FAPI was similar in accuracy to IMPUTE2, as indicated by a correlation of 0.96 between the imputed P-values and the actual ones. About 76% of its imputed P-values lay within 0.1 log10 units of their corresponding actual P-values, and all lay within 0.5 log10 units. The accuracy of the imputed P-values was similar in a real data set with a real phenotype (see Supplementary Figure S2). DIST performed the worst: its imputed values were only modestly correlated with the actual ones (around 0.72 for both traits) and were often seriously underestimated (as seen from the large number of points below the diagonal in the scatterplots). Only around 50, 92, and 98% of its imputed P-values lay within 0.1, 0.5, and 1 log10 units of their corresponding actual P-values, respectively.
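The accuracy summaries used here (a correlation and the fraction of imputed P-values within a given number of log10 units of the actual ones) can be computed as in the following sketch. The function name `accuracy_summary` and the toy P-values are hypothetical, and computing the correlation on the −log10 scale is an assumption, since the scale used for the reported correlations is not stated.

```python
import numpy as np

def accuracy_summary(p_imputed, p_actual, tolerances=(0.1, 0.5, 1.0)):
    """Correlation of -log10 P-values and the fraction of imputed P-values
    lying within each tolerance (in log10 units) of the actual P-values."""
    log_imputed = -np.log10(np.asarray(p_imputed, dtype=float))
    log_actual = -np.log10(np.asarray(p_actual, dtype=float))
    diff = np.abs(log_imputed - log_actual)
    corr = float(np.corrcoef(log_imputed, log_actual)[0, 1])
    within = {tol: float(np.mean(diff <= tol)) for tol in tolerances}
    return corr, within

# Toy values for illustration only (not data from the study).
p_actual = [0.5, 0.01, 1e-4, 0.2]
p_imputed = [0.45, 0.02, 1e-4, 0.9]
corr, within = accuracy_summary(p_imputed, p_actual)
print(corr, within)
```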
Table 1 shows the running time and maximum RAM usage of all three methods measured on a 3.47 GHz Intel processor. FAPI was the fastest and finished the task in 10 min. The running time of DIST was almost double that of FAPI, whereas the traditional method (IMPUTE2) was the slowest and ran 200 times slower than the others. In terms of RAM usage, DIST was the most memory efficient and used 1 GB of RAM, whereas FAPI used 8 GB and was the least efficient.
We also derived a metric, ProbC, to detect abnormal association at a variant using the association P-values at its neighboring variants in LD. In brief, a conditional normal distribution of its normal test statistic is constructed from the P-values of its neighboring variants in LD using the FAPI idea, and the observed test statistic, xo, at that variant is examined for deviation from the mean of the distribution using the cumulative distribution function (see Materials and methods). A small ProbC suggests that the observed test statistic (or association P-value) differs from its expected value and thus may require further attention.
We compared our approach of detecting abnormal associations with LD-PAC,14 a method that quantifies the evidence of spurious associations based on likelihood ratio, using simulations. We simulated a series of situations in which genotyping errors were introduced at various rates in a sample of 2000 cases and 2000 controls to see whether the metric can detect the abnormal associations introduced (see Materials and methods). A ProbC threshold of 0.025 was used to reject the null hypothesis of no abnormal associations, while LD-PAC was run with the default parameters. Table 2 shows the type-I error rates and powers of our approach and LD-PAC. When no genotyping error was introduced, at most 5 and 2.5% of the associations were flagged as abnormal by FAPI and LD-PAC, respectively, no matter whether a null or risk locus was examined. However, for a genotyping error rate of 10%, our approach had around 20–40% power of detecting the abnormal associations when a null locus was simulated and around 50–70% power when a risk locus was simulated, whereas LD-PAC only had around 3–5% in both scenarios. The power of FAPI increased further when the genotyping error rate increased, but there was little increase in power for LD-PAC. The conclusion is independent of sample size.
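All the methods proposed in this paper are implemented in a software tool freely available at http://statgenpro.psychiatry.hku.hk/fapi. The software tool has three major functions: imputation of association P-values at untyped variants, reliability checking of P-values, and meta-analysis.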
It can run on most operating systems and automatically checks for updates and downloads necessary resources before performing the user-specified functions. For a GWAS with around 360000 typed variants, imputation of association P-values at 5.7 million untyped variants found in the 1000G Project using FAPI requires ~72 min of CPU time and 10 GB of memory on a single 3.1 GHz Intel processor.
In this paper, we propose a method, FAPI, for assessing the associations at untyped common variants that is fast, accurate, and independent of raw genotype data. In our simulations, it finished imputing chromosome 1 based on HapMap data in less than an hour, whereas the same task required 3 days for IMPUTE2 even with prephasing. At the same time, the P-values imputed by FAPI were unbiased and showed little loss in accuracy compared with those produced by IMPUTE2. In addition, based on the FAPI idea, we developed a metric to detect abnormal associations at a target variant using the association P-values at neighboring variants in LD. Our method was shown to have significantly greater power than LD-PAC, a method that quantifies the evidence of spurious associations based on a likelihood ratio. These methods are implemented in a user-friendly software tool freely available at http://statgenpro.psychiatry.hku.hk/fapi.
Compared with DIST, which also models summary statistics of untyped variants using a multivariate normal distribution, FAPI has some important advantages. First, DIST approximates the variance–covariance matrix of the summary statistics directly by the genotype correlation, which is inaccurate, whereas FAPI models the relationship using an empirical fourth-order polynomial function. Second, DIST accepts only z-scores as inputs, whereas FAPI also accepts P-values and so is more flexible. Finally, FAPI provides further useful functions derived from the imputation function (including reliability checking for P-values and meta-analysis) to facilitate genetic association analysis.
As FAPI relies on LD, it is not expected to work for low-frequency and rare variants that do not exhibit strong LD with either rare or common variants. Also, the results are likely to be sensitive to the provided LD structure, which can differ considerably among reference data sets. The simple ‘best-match' strategy, which requires that the ancestry of the GWAS sample match closely that of the reference panel, is what we recommend at the moment. However, a clear match may not always be available, especially for GWAS in admixed populations, and a popular strategy in genotype imputation for dealing with this is to use a ‘cosmopolitan' reference panel including as many haplotypes as possible. This strategy works in genotype imputation because genotype imputation searches for haplotype segments shared between each GWAS sample and a reference panel of densely typed individuals within a limited genomic region only. Whether it will work in FAPI requires further investigation.
This fast and easy imputation tool may encourage many more studies to make use of publicly available data for meta-analyses of complex diseases. Because different genotyping arrays cover different variants and so leave many genotypes missing, genotype imputation with raw genotypes is often a prelude to a meta-analysis. However, because of the difficulties of sharing raw genotypes in practice, this procedure is often very inefficient. With FAPI, one needs only the association P-values, which can often be downloaded directly from public domains. With the reference data from the HapMap or 1000 Genomes Projects integrated in FAPI, the imputation and meta-analysis can be performed quickly on a desktop computer with an ordinary configuration. In this way, more reliable genetic risk factors may be suggested to explain part of the missing heritability for a variety of complex diseases.
This work was funded by Hong Kong Research Grants Council GRF HKU 768610M, HKU 776412M, and HKU 777511M; Hong Kong Research Grants Council Theme-Based Research Scheme T12-705/11; European Community Seventh Framework Programme Grant on European Network of National Schizophrenia Networks Studying Gene-Environment Interactions (EU-GEI); the Hong Kong Health and Medical Research Fund 01121436 and 02132236; the HKU Seed Funding Programme for Basic Research 201302159006; the HKU Small Project Funding 201309176244; and The University of Hong Kong Strategic Research Theme on Genomics.
The authors declare no conflict of interest.
Supplementary Information accompanies this paper on European Journal of Human Genetics website (http://www.nature.com/ejhg)