For each of the three methods (MACH, IMPUTE and BEAGLE), we used the following statistical measures of imputation performance: concordance accuracy (CA), the κ coefficient and the power to detect imputed SNPs significantly associated with simulated phenotypes. The details of each imputation performance metric are given in the Materials and methods section. Previous studies on imputation of African Americans took into account the admixture of European and African genetic components in African American subjects.6,16,27
Therefore, we used the following reference panels for our imputation experiments: (1) ASW, CEU and YRI data sets from HapMap III; (2) the pilot 1 YRI data from the June 2010 release; (3) EUR and AFR data from the August 2010 release; and (4) EUR and AFR data from the June 2011 release of the 1000 Genomes Project. The various combinations of these reference panels investigated in this study are presented in . As only IMPUTE allows more than one reference panel as input, it was also run with the combined reference panel of YRI samples from the June 2010 release of the 1000 Genomes Project and all HapMap III panels (denoted 1000 G(YRI) + All HapMap III). In addition, only IMPUTE was used for imputation on chromosome 18 with the panel from the June 2011 release of the 1000 Genomes Project.
For each imputation method and reference panel combination, 10% of the SNPs on chromosomes 18, 20 and 22 in the study sample were set to missing () and imputation performance was evaluated following a round of imputation with the masked data and using only the SNPs common across the HapMap and 1000 Genomes Project panels. Henceforth we use the phrase masked SNPs to simply refer to this set of SNPs, unless otherwise mentioned.
In this section, we first present the imputation results for the three methods individually, followed by a comparison using the best common reference panel for each method. To determine the extent to which imputation performance metrics are sensitive to the allele frequencies of the SNPs being imputed, we explored the performance results for the three chromosomes after grouping the masked SNPs into four MAF bins (≤ 0.05, 0.05–0.1, 0.1–0.3 and >0.3) for each of the three methods. Because the imputation performance of the SNPs in the MAF ≤ 0.05 bin was the most sensitive to allele frequency, we focus on the results for this bin for each method. The results for the other MAF bins are presented in the Supplementary Data.
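The grouping of masked SNPs by allele frequency can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0/1/2 genotype coding and the choice to close each bin on its upper edge are all assumptions made for the example.

```python
from collections import defaultdict

# Upper edges of the first three bins; the last bin is MAF > 0.3.
# Assumption: bins are closed on the right, e.g. (0.05, 0.1].
BIN_EDGES = [0.05, 0.1, 0.3]
BIN_LABELS = ["<=0.05", "0.05-0.1", "0.1-0.3", ">0.3"]


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def maf_bin(maf):
    """Return the label of the MAF bin that contains `maf`."""
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if maf <= edge:
            return label
    return BIN_LABELS[-1]


def group_by_maf_bin(genotypes_by_snp):
    """Map each bin label to the SNP ids falling in it, given {snp_id: genotypes}."""
    groups = defaultdict(list)
    for snp_id, genotypes in genotypes_by_snp.items():
        groups[maf_bin(minor_allele_frequency(genotypes))].append(snp_id)
    return dict(groups)
```

Computing the MAF from the study sample itself, rather than from a reference panel, is also an assumption of this sketch.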
For each of the three imputation methods, we determined the distribution of the CA for the minor allele homozygotes (denoted CA(aa)), heterozygotes (denoted CA(Aa)) and the major allele homozygotes (denoted CA(AA)) for masked SNPs with MAF ≤ 0.05. The distribution of a statistic (CA or κ) was plotted as the fraction of the masked SNPs whose value of the statistic exceeded each of a range of cutoffs: 0.0–0.9 in steps of 0.1. Thus, for CA(aa), a cutoff of 0.0 was the most lenient and included all masked SNPs with MAF ≤ 0.05, whereas a cutoff of 0.9 was the most stringent and retained only the highest-quality imputed masked SNPs, those with MAF ≤ 0.05 that met the 0.9 cutoff for CA(aa). The κ statistic combined the agreement across the three genotypes into an overall score of imputation concordance. It measured the concordance between the original and imputed genotypes and adjusted this value for the amount of agreement that could be expected due to chance alone. In addition to CA for each genotype category, we computed the total CA (denoted CA(Tot)) by simply summing the concordances of the three genotypes, as used in previous studies.6,16
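As a rough illustration of these statistics, the sketch below computes a per-genotype concordance, Cohen's κ and the fraction of SNPs retained at each cutoff for a single masked SNP. The exact definitions used in the study are given in its Materials and methods. Here it is assumed that genotypes are coded as minor allele counts (0 = AA, 1 = Aa, 2 = aa), that CA for a genotype is the fraction of samples truly carrying it that were imputed correctly, and that a SNP is retained when its statistic is at least the cutoff (so that a cutoff of 0.0 keeps every SNP). All function names are hypothetical.

```python
def genotype_concordance(true_gt, imputed_gt):
    """Per-genotype concordance for one SNP (None if a genotype class is absent)."""
    if len(true_gt) != len(imputed_gt):
        raise ValueError("genotype vectors must have the same length")
    result = {}
    for label, code in (("AA", 0), ("Aa", 1), ("aa", 2)):
        carriers = [i for i, g in enumerate(true_gt) if g == code]
        if carriers:
            matches = sum(imputed_gt[i] == code for i in carriers)
            result[label] = matches / len(carriers)
        else:
            result[label] = None
    return result


def cohen_kappa(true_gt, imputed_gt):
    """Chance-corrected agreement between original and imputed genotypes."""
    n = len(true_gt)
    observed = sum(t == m for t, m in zip(true_gt, imputed_gt)) / n
    # Agreement expected by chance, from the marginal genotype frequencies.
    expected = sum(
        (true_gt.count(c) / n) * (imputed_gt.count(c) / n) for c in (0, 1, 2)
    )
    if expected == 1.0:
        return None  # kappa is undefined when only one genotype class occurs
    return (observed - expected) / (1.0 - expected)


def fraction_meeting_cutoffs(values, cutoffs=None):
    """Fraction of SNPs whose statistic is at least each cutoff (the yield)."""
    if cutoffs is None:
        cutoffs = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9
    usable = [v for v in values if v is not None]
    if not usable:
        raise ValueError("no usable values")
    return {c: sum(v >= c for v in usable) / len(usable) for c in cutoffs}
```

Passing the per-SNP κ values (or CA(aa) values) of one MAF bin to `fraction_meeting_cutoffs` gives one curve of the kind plotted for each reference panel.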
Visualization of the distribution of the concordances and κ demonstrated the trade-off between imputation quality and yield: a higher imputation quality threshold resulted in lower yield, that is, fewer SNPs retained after imputation.
The ultimate goal of imputation is to boost the power of GWAS by predicting genotypes at markers that are not directly genotyped. Therefore, it is essential to evaluate how imputation quality, as measured using CA and κ, affects the ability to detect SNPs associated with a given trait; that is, reference panels achieving higher CA and κ should also attain higher power and vice versa. We therefore computed the statistical power by simulating phenotypes for each masked SNP and regressing the corresponding simulated phenotype on the imputed genotypes (see Materials and methods for details).
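The general idea of such a power calculation can be sketched as below. This is only an illustration under stated assumptions, not the simulation described in the Materials and methods: it assumes a quantitative phenotype generated from the true genotype with an additive effect and unit-variance Gaussian noise, a simple linear regression on the imputed allele dosage, and a normal approximation to the t test for the slope. The function names, effect size and number of replicates are hypothetical.

```python
import random
from statistics import NormalDist


def slope_p_value(x, y):
    """Two-sided p-value for the slope of y ~ x (normal approximation)."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx < 1e-12:
        return 1.0  # a constant predictor carries no association signal
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    if se == 0.0:
        return 0.0
    z = abs(slope / se)
    return 2.0 * (1.0 - NormalDist().cdf(z))


def estimate_power(true_gt, imputed_dosage, effect=0.5, n_sim=200,
                   alpha=0.05, seed=1):
    """Fraction of simulated phenotypes for which the imputed SNP is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # The phenotype depends on the TRUE genotype; the test uses the imputed dosage.
        phenotype = [effect * g + rng.gauss(0.0, 1.0) for g in true_gt]
        if slope_p_value(imputed_dosage, phenotype) < alpha:
            hits += 1
    return hits / n_sim
```

With this setup, a poorly imputed SNP (dosages weakly correlated with the true genotypes) yields a smaller estimated power than a well-imputed one at the same effect size, which is the relationship the text examines across reference panels.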
Across all imputation methods and reference panels used (described in detail below), we observed the following:
- For a given method and reference panel, the concordances increased in the order: minor allele homozygotes, heterozygotes and major allele homozygotes. For masked SNPs with low MAF, CA(aa) was the most sensitive to the choice of reference panel and effectively distinguished the imputation performance of each reference panel, followed by CA(Aa) (). At the same time, because the vast majority of genotypes were major allele homozygotes, CA(AA) and CA(Tot) for the various reference panels differed only slightly from one another (Supplementary Figure 1). This indicates the advantage of estimating the imputation performance for each genotype category separately, with a focus on the genotypes containing the minor allele, as the performance differences were more visible for minor allele homozygotes and heterozygotes, especially at low MAF. CA(AA) and CA(Tot) provided little or no comparative information. Instead, κ is a more informative overall measure ().
Distribution of concordance accuracy (CA) of minor allele homozygotes and heterozygotes for (a, b) MACH, (c, d) IMPUTE and (e, f) BEAGLE. (M=MACH, I=IMPUTE and B=BEAGLE).
Distribution of kappa for (a) MACH, (c) IMPUTE and (e) BEAGLE. Power is shown in (b) for MACH, (d) for IMPUTE and (f) for BEAGLE. (M=MACH, I=IMPUTE and B=BEAGLE).
- For MACH and BEAGLE, the combined HapMap III reference panel ASW + CEU + YRI III performed better than, or at least as well as, the other HapMap panels, whereas the combined 1000 Genomes Project panel EUR + AFR was better than AFR alone. For IMPUTE, EUR + AFR-2011 performed better than or as well as the other panels.
- The proportion of masked SNPs retained at each cutoff for an imputation quality statistic (that is, the imputation yield) decreased with increasing value of the cutoff (for example, see for the κ statistic). This demonstrates the critical choice a researcher has to face—the cutoff to choose for balancing imputation quality with imputation yield.
Figure 3 Kappa vs yield for the three algorithms with ASW + CEU + YRI III for minor allele frequency (MAF) bins (a) ≤ 0.05, (b) 0.05–0.1, (c) 0.1–0.3 and (d) 0.3–0.5. (M=MACH, I=IMPUTE and B=BEAGLE).
- For each reference panel, imputation concordances as measured by CA, κ and power generally increased with increasing MAF, irrespective of the imputation method used (for example, see and for reference panel ASW + CEU + YRI III).
For panel ASW + CEU + YRI III, comparison of (a, b) mean concordance accuracy (CA) for minor allele homozygotes and heterozygotes and (c) mean kappa for each method at different minor allele frequency (MAF) bins. (M=MACH, I=IMPUTE and B=BEAGLE).
Comparison of power of each method using the panel ASW + CEU + YRI III at the four minor allele frequency bins.
- MACH and IMPUTE performed equally well and better than BEAGLE.
Next, we discuss the imputation results in detail.
Imputation quality and yield using the concordance and κ statistics
shows the CA obtained with each method for the minor allele homozygotes and the heterozygotes for the masked SNPs with MAF ≤ 0.05 that are common across the HapMap and 1000 Genomes Project panels. Using MACH and BEAGLE, for minor allele homozygotes () and heterozygotes (), the combined reference panels ASW + CEU + YRI III (blue) and CEU + YRI III (light green) yielded more SNPs with better CA than the other panels. For both algorithms, the panel ASW + CEU + YRI III (blue) was slightly better than CEU + YRI III (light green), followed by EUR + AFR (turquoise). With IMPUTE (), the panels EUR + AFR-2011 (yellow, on chromosome 18) and 1000 G(YRI) + All HapMap III (brown) were the best-performing panels, followed by ASW + CEU + YRI III (blue), CEU + YRI III (light green) and EUR + AFR (turquoise). We found that at MAF ≤ 0.05, the minor allele concordance CA(aa) was slightly improved for EUR + AFR-2011 compared with panels containing HapMap haplotypes, and the increase in CA(aa) was greater compared with EUR + AFR. For panel EUR + AFR-2011, 54% of the masked SNPs had CA(aa) >0.8 compared with 52, 53 and 48% achieved by ASW + CEU + YRI III, 1000 G(YRI) + All HapMap III and EUR + AFR, respectively. For all methods, the purely African panels of ASW III (magenta), YRI III (gray) and AFR (dark green) had poor CA(aa) and CA(Aa).
Supplementary Figure 1 shows the distribution of the CA for the major allele homozygotes and that of the total CA, neither of which provides substantial information to distinguish the performance of the reference panels. Only with BEAGLE were CA(AA) and CA(Tot) able to distinguish the AFR and YRI III panels from the others, as BEAGLE discarded 20–25% of the masked SNPs during imputation when used with the AFR and YRI III panels.
The distribution of the overall agreement across the three genotypes as measured by the κ statistic is shown for MACH, IMPUTE and BEAGLE in , respectively. For MACH and BEAGLE, the combined reference panel ASW + CEU + YRI III (blue) outperformed the others, whereas the panels ASW III (magenta), YRI III (gray) and AFR (dark green) performed the worst with all three methods. For IMPUTE, the distribution of κ for the panels EUR + AFR-2011 (yellow) and 1000 G(YRI) + All HapMap III (brown) closely followed that of ASW + CEU + YRI III (blue). These results agreed with what we had observed with CA(aa) and CA(Aa). Imputation accuracies with the 1000 G(YRI) + All HapMap III panel also show that the presence of reference populations unrelated to African American ancestry does not adversely affect imputation performance.
Visualization of the distribution of the concordances and κ statistics also highlights the fact that increasing the threshold for each of these metrics would increase the quality of imputation (as measured by κ and CA) but, at the same time, would reduce the imputation yield. Using MACH with the panel ASW + CEU + YRI III (blue), 67% of the masked SNPs (MAF ≤ 0.05) had a κ statistic >0.8, but this dropped to 58% with BEAGLE (). Using IMPUTE, at a κ cutoff of 0.8, 67, 66, 70 and 55% of the masked SNPs were retained using EUR + AFR-2011 (yellow), ASW + CEU + YRI III (blue), 1000 G(YRI) + All HapMap III (brown) and EUR + AFR (turquoise), respectively (). These results show that the imputation accuracies of EUR + AFR-2011 are improved over EUR + AFR and comparable to those of ASW + CEU + YRI III and 1000 G(YRI) + All HapMap III. Both the 1000 Genomes Project panels EUR + AFR (turquoise) and AFR (dark green) had lower yield in comparison with ASW + CEU + YRI III (blue), indicating poorer quality of imputation with these panels.
show the power of detecting the masked SNPs with simulated phenotypes for each algorithm, computed using allele dosage. The performance of the panels follows the same trend as the CA(aa), CA(Aa) and κ statistics, that is, ASW + CEU + YRI III (blue), CEU + YRI III (light green), 1000 G(YRI) + All HapMap III (brown, only for IMPUTE) and EUR + AFR-2011 (yellow, with IMPUTE on chromosome 18) had higher power than the other panels, followed by EUR + AFR (turquoise). This demonstrates that reference panels attaining higher concordance and κ, computed using maximum likelihood genotypes of the masked SNPs, can improve power in subsequent association analysis that uses allele dosage.
The distributions of the CA and κ and the power of each reference panel for the remaining MAF bins 0.05–0.1, 0.1–0.3 and >0.3 are presented in Supplementary Figures 2–7.
depicts the fraction of SNPs retained at each cutoff of the κ statistic for the three methods using the reference panel ASW + CEU + YRI III. Based on the imputation performance using the masked SNPs, we estimated that using MACH and IMPUTE with the panel ASW + CEU + YRI III, 68%, 90%, 96% and 95% of the untyped SNPs in African Americans can be imputed with imputation accuracy κ of 0.92 at the MAF bins ≤ 0.05, 0.05–0.1, 0.1–0.3 and >0.3, respectively. With BEAGLE, the percentages dropped to 53%, 77%, 87% and 87% for the above MAF bins, respectively. Using IMPUTE with the combined reference panel 1000 G(YRI) + All HapMap III (Supplementary Figure 8), the yield remained extremely close to that of ASW + CEU + YRI III. This suggests that adding reference panels beyond ASW, CEU and YRI is unlikely to improve the imputation accuracies significantly. Both the 1000 Genomes Project panels EUR + AFR and AFR had lower yield in comparison with ASW + CEU + YRI III because of the lower imputation accuracy with these panels.
Comparison between the imputation algorithms
We compared the imputation performance of the three imputation methods using the combined HapMap III panel ASW + CEU + YRI III, which achieved high concordance and power with each imputation algorithm. As described in the preceding paragraph, also serves to compare the three methods. Using the κ statistic, MACH (green) and IMPUTE (blue) performed similarly and both were better than BEAGLE (red). compares the imputation performance of the methods using the concordance statistics and κ computed with all the masked SNPs at the different MAF bins. We observed that MACH and IMPUTE consistently achieved higher CA(aa) (), CA(Aa) () and κ () than BEAGLE at all allele frequencies. Additional results using the distributions of CA(aa) and CA(Aa) are presented in Supplementary Figure 9.
Using the dosage data, a detailed comparison of the power of each algorithm at all MAF bins using panel ASW + CEU + YRI III is given in . MACH and IMPUTE performed similarly and steadily outperformed BEAGLE at all MAF bins, the difference in power being higher for masked SNPs with lower MAFs.
Our experiments also indicated that BEAGLE is computationally faster than MACH and IMPUTE, whereas IMPUTE is computationally much faster than MACH. The details of the runtimes of each method can be found in the Supplementary Data (Supplementary Tables 1 and 2). Although 1000 G(YRI) + All HapMap III has many more haplotypes than AFR and EUR + AFR, the HapMap III panels have fewer SNPs than the 1000 G(YRI) panel. At the same time, only 118 haplotypes are present in 1000 G(YRI) compared with 1914 in the combined HapMap III panel. This results in decreased runtime of IMPUTE with the 1000 G(YRI) + All HapMap III reference panel.
Relationship of imputation quality metrics with the concordance and κ statistics
The quality of imputation for each imputed SNP was measured by a different statistical metric for each method. MACH produced the imputation quality measure (r²) that estimates the squared correlation between the estimated allele dosage and the true allele dosage. It represents the ratio of the empirically observed variance of the allele dosage to the expected binomial variance at Hardy–Weinberg equilibrium.28 BEAGLE generated a similar imputation quality metric (R²) that estimates the squared correlation between the most likely allele dosage and the true allele dosage. The output from IMPUTE contained the Info statistic, which represents a measure of the relative statistical information about the SNP allele frequency.25
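The variance ratio described above for the MACH measure can be illustrated with a short sketch. It follows only the verbal definition given in the text (observed dosage variance divided by the binomial variance 2p(1 − p) expected at Hardy–Weinberg equilibrium, with the allele frequency p estimated from the dosages); the estimator implemented in MACH itself may differ in detail, and the function name is hypothetical.

```python
def dosage_variance_ratio(dosages):
    """Observed dosage variance over the variance expected at Hardy-Weinberg equilibrium."""
    n = len(dosages)
    if n == 0:
        raise ValueError("no dosages supplied")
    mean = sum(dosages) / n
    p = mean / 2.0                      # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)      # binomial variance under HWE
    if expected == 0.0:
        return None                     # monomorphic SNP: ratio undefined
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return observed / expected
```

Dosages that are shrunk toward the population mean, as happens when the imputation is uncertain, have a small observed variance and therefore a ratio near 0, whereas confidently called genotypes give a ratio near 1.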
For each method, imputed SNPs with higher values of the corresponding statistic are assumed to be more reliably imputed. Here, we explored the effect of MAF on the imputation quality measures CA and κ when r² was used to stratify the SNPs. We show the results only for MACH, as the results of IMPUTE and BEAGLE are similar and are presented in the Supplementary Data (Supplementary Figures 10 and 11). Using the panel ASW + CEU + YRI III, show how CA and κ are affected by allele frequencies when masked SNPs with r² ≥ a cutoff value were used to compute the corresponding statistics. The four MAF bins are denoted as ≤ 0.05 with red, 0.05–0.1 with blue, 0.1–0.3 with green and >0.3 with magenta.
Figure 5 With MACH, (a–c) mean concordance accuracy (CA) for each genotype and (d) mean kappa using masked single-nucleotide polymorphisms (SNPs) exceeding a given r² for panel ASW + CEU + YRI III. Four minor allele frequency (MAF) bins are shown as ...
We observed that r² is a poor indicator of imputation quality for SNPs with low MAF. shows the CA(aa) against r² at each MAF bin. The differences in the imputation quality score CA(aa) were more pronounced at lower values of r² and gradually converged as r² increased. The CA(Aa) and CA(AA) vs r² are shown in , respectively. We observed that the differences in concordance at lower values of r² decreased in the order CA(aa), CA(Aa) and CA(AA), suggesting that imputation accuracy for similar r² scores was more affected by allele frequencies in minor allele homozygotes and heterozygotes. The overall concordance score κ () also highlights this fact and suggests that there is a great deal of uncertainty in imputation quality even when imputed SNPs with r² ≥ 0.3 are considered for subsequent analysis, as practiced in some studies.29,30
As a result, we recommend that when maximum likelihood genotypes from imputation are considered, higher r² cutoffs should be used to retain well-imputed SNPs at lower allele frequencies. Alternatively, κ can be used to ascertain imputation quality more reliably; however, a small, randomly chosen fraction of the observed genotypes must be masked before imputation so that the κ statistic can be computed.
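The masking step needed to compute κ can be sketched as follows. This minimal example hides a random fraction of typed SNPs and keeps their observed genotypes aside for comparison after imputation, in the spirit of the 10% masking used in this study. The data layout (a dictionary from SNP id to a list of genotype codes), the function name and the use of `None` for a masked call are assumptions of the sketch, not features of any imputation program's input format.

```python
import random


def mask_random_snps(genotypes_by_snp, fraction=0.1, seed=0):
    """Hide a random fraction of SNPs; return (masked data, held-out truth)."""
    snp_ids = sorted(genotypes_by_snp)
    if not snp_ids:
        raise ValueError("no SNPs supplied")
    rng = random.Random(seed)
    n_masked = max(1, round(fraction * len(snp_ids)))
    masked_ids = set(rng.sample(snp_ids, n_masked))
    # Observed genotypes of the masked SNPs, kept for evaluating the imputation.
    truth = {snp: list(genotypes_by_snp[snp]) for snp in masked_ids}
    masked = {
        snp: [None] * len(gts) if snp in masked_ids else list(gts)
        for snp, gts in genotypes_by_snp.items()
    }
    return masked, truth
```

After imputation, the held-out genotypes in `truth` are compared with the imputed genotypes at the same SNPs to obtain CA and κ. Fixing the seed keeps the masked set reproducible across imputation methods and reference panels, so that all of them are evaluated on the same SNPs.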