|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
Most ongoing genome-wide association studies (GWAS) rely on a commercial SNP genotyping panel that directly assays only a small fraction of SNPs in the human genome [Carlson et al., 2003; The International HapMap Consortium 2005]. In these scans, the majority of SNPs in the genome must be evaluated indirectly using one or more of the genotyped SNPs as proxies [Barrett and Cardon, 2006; Pe’er et al., 2006]. Despite the ability of individual genome-wide association scans to identify common alleles that make large contributions to disease risk and a subset of the loci with smaller effect [Hirschhorn and Daly, 2005], many alleles that contribute to complex disease can only be identified through the meta-analysis of multiple genome-wide scans [for specific examples, see Lettre et al., 2008; Sanna et al., 2008; Willer et al., 2008, 2009]. Although it is possible to assign SNPs genotyped in each study as proxies for SNPs genotyped in the other studies [Carlson et al., 2004; de Bakker et al., 2005; Lin et al., 2004; Nicolae, 2006; Zaitlen et al., 2007], meta-analyses of GWAS conducted in this manner would be cumbersome because of the limited overlap between the different commercial panels and because different choices of proxies for a particular SNP might lead to somewhat different conclusions.
A much more attractive approach for cross study analyses is to combine genotypes generated by the International HapMap Consortium, [The International HapMap Consortium, 2005] with genotypes from individual studies, and then use a haplotyping algorithm that can handle genome scale data to impute genotypes at untyped markers in each study [Scheet and Stephens, 2006]. This strategy results in a situation where all studies are “genotyped” at all the markers examined by the HapMap consortium (albeit some markers are genotyped using conventional means and others are genotyped in silico [Burdick et al., 2006]). The approach relies on the intuition that even two apparently “unrelated” individuals can share short stretches of haplotype inherited from distant common ancestors. Once one of these stretches is identified using genotypes for a few SNPs, alleles for intervening SNPs that are measured in one of the individuals, but not the other, can be imputed. Provided shared haplotype stretches are identified correctly, imputed genotypes will be accurate unless they have been disrupted by gene conversion or mutation events.
Here, we systematically evaluate the genotype imputation approach outlined in the paragraph above using our Markov Chain Haplotyping algorithm (MaCH 1.0; see Appendix for implementation details). To estimate haplotypes, our approach starts by randomly generating a pair of haplotypes that is compatible with observed genotypes for each sampled individual. These initial haplotype estimates are then refined through a series of iterations. In each iteration, a new pair of haplotypes is sampled for each individual in turn using a Hidden Markov Model (HMM) that describes the haplotype pair as an imperfect mosaic of the other haplotypes. Model parameters that characterize the probability of change in the mosaic pattern between every pair of consecutive markers and the probability of observing an imperfection in the mosaic at each specific point are also updated. After many iterations (typically 20–100), a consensus haplotype can be constructed by merging the haplotypes sampled in each round.
Our approach was inspired by the Markov models commonly used for pedigree analysis [for examples, see Abecasis et al., 2002; Kruglyak et al., 1996; Lander and Green, 1987] and shares several features with other HMMs used to describe sampled haplotypes as a mosaic of a set of reference haplotypes [Daly et al., 2001; Li and Stephens, 2003; Mott et al., 2000; Stephens and Scheet, 2005a]. In order to evaluate its performance, we simulated two sets of 100 1 Mb regions that mimic the degree of linkage disequilibrium (LD) in the HapMap CEU and YRI samples [Schaffner et al., 2005]. In each region, we simulated genotypes for ~200 markers, ascertained to mimic HapMap I allele frequency patterns [Marchini et al., 2006], in 90 individuals with 2% of the genotypes missing at random. We then used our method to reconstruct individual haplotypes and tallied three measures of haplotyping quality [Marchini et al., 2006]: (1) the number of incorrectly imputed missing genotypes, (2) among heterozygous sites, the number of consecutive sites that are phased incorrectly with respect to each other (this is the number of “flips” required to transform estimated haplotypes into the true haplotypes, after masking incorrectly imputed sites), and (3) the number of perfectly inferred haplotypes. The three measures were averaged over all 100 regions and the results are summarized in Table I. For comparison, the table also includes results for PHASE [Stephens and Scheet, 2005b; Stephens et al., 2001] and fastPHASE [Scheet and Stephens, 2006], two state of the art haplotyping algorithms [Marchini et al., 2006], and for BEAGLE [Browning, 2006] and PL-EM [Qin et al., 2002], two alternative haplotyping algorithms that are very computationally efficient. Table I clearly shows that our method is competitive in all three measures: our method results in slightly fewer incorrectly imputed genotypes, requires slightly fewer flips to transform imputed haplotypes into the true haplotypes, and produces slightly more correctly inferred haplotypes over the entire 1 Mb stretch than PHASE, which was the second best method. Furthermore, note that estimates of haplotypes and missing genotypes obtained in 5–20 min using our method are comparable in quality to those produced by PHASE runs averaging ~1 day.
Encouraged by these initial results, we proceeded to apply our method to impute genotypes for untyped markers in the Finland United States Investigation of NIDDM genetics (FUSION) GWAS [Scott et al., 2007]. Since a previous analysis suggested LD patterns in the HapMap CEU and in FUSION are similar [Willer et al., 2006], we used genotypes for 290,690 autosomal markers with allele frequency >5% in the Illumina 317K SNP chip and haplotypes for 2.5M polymorphic markers in the phased HapMap CEU chromosomes as input. After running the haplotyping procedure described above, we estimated the most likely genotype at each position (taking a majority vote across all iterations) and the expected number of copies of the minor allele at each position (a fractional value between 0 and 2) for each individual. We obtained similar results running the haplotyping procedure for 50–100 iterations or using only a smaller number of iterations (10–20) to estimate model parameters and then calculating maximum likelihood estimates for the missing genotypes and allele counts. Different chromosomes were analyzed in parallel and, overall, imputing genotypes for 2,335 unrelated individuals took <2 days for each of the largest chromosomes on a 2006 vintage 2.40GHz Pentium Xeon processor. In total, we imputed genotypes for 2,266,562 SNPs per individual. On average, our method used stretches of ~150 kb from the HapMap CEU panel to reconstruct haplotypes for individuals in the FUSION sample.
To evaluate the quality of imputed genotypes, we contrasted our estimates of the most likely genotypes and the expected number of copies of the minor allele with actual genotype data for three sets of markers: 521 SNP markers in a region of chromosome 14 previously examined to fine-map a candidate linkage region [Willer et al., 2006], 1,234 SNP markers selected to augment coverage of the Illumina 317K panel in regions surrounding 222 candidate genes [Gaulton et al., 2008] and 12,702 markers with MAF <5% not included in the set of 290,690 markers used for imputation. We expected the last two panels of markers to be harder to impute, because they represent SNPs that are not well tagged by the Illumina 317K SNP chip or that have lower MAF. We observed that 98.60% of imputed alleles matched actual genotyped alleles in the fine-mapping panel, 96.24% in the candidate gene panel, and 98.73% in the low MAF SNP panel. Furthermore, the average r2 between imputed genotypes and actual genotypes was 90.4, 79.1, and 74.0% in the three SNP panels, respectively. This represents an improvement of 14–39% compared to the best available single marker tags, which provided an average r2 of 76.5, 52.8, and 35.5% in the three SNP panels, respectively.
Our Markov Chain produces three estimates of imputation quality and these can be used to focus analyses on subsets of high-quality genotypes. First, it produces a quality score that estimates the accuracy of each imputed genotype and is simply the proportion of iterations where the final imputed genotype (by taking a majority vote across all iterations) was selected. Second, it produces an overall measure of the accuracy of imputation for each marker, which is the genotype quality score averaged across all individuals. Finally, by comparing the distribution of sampled genotypes in each iteration with the estimated allele counts that result from averaging over all iterations, it produces an estimate of the r2 between imputed and true genotypes (see Methods for more details). Quality measures for individual genotypes were good predictors of imputation accuracy (Supplementary Figure 1, Right Panel) and show that most imputed genotypes are called with a high degree of confidence (Supplementary Figure 1, Left Panel). For example, as measured by their quality scores, the top 95% of genotypes had average quality scores of 98.9% and actually matched experimental genotypes 98.6% of the time. Most of the errors affect a single allele so that, when measured on a per allele basis, concordance increases to 99.3%.
To avoid preferential removal of rare genotypes or alleles at each marker, we recommend using the per marker quality scores to select a subset of imputed SNPs for analysis, instead of the per genotype quality scores. Overall, we saw a correlation of 0.77 between the estimated and actual accuracy of imputed genotypes for each marker. We also saw a correlation of 0.84 between the r2 estimated by our method and the actual r2 that resulted from comparing experimentally derived allele counts with their imputed estimates. Figure 1 shows the ROC curve [Pepe, 2003] for the two quality measures, showing that the estimated r2 measure is a more effective way to identify poorly imputed markers. In the FUSION GWAS scan [Scott et al., 2007], we used an r2 threshold of 0.30 to decide which markers were well imputed and should be included in further analyses, and which were not. At this threshold, we expect to remove 70% of poorly imputed markers (those where r2 with experimental genotypes is <20%) but only 0.50% of better imputed markers (those where r2 with experimental genotypes is >50%).
The results summarized so far compare a variety of imputed genotypes with experimentally derived counterparts. However, a more interesting comparison focuses on imputed genotypes that appear to show strong evidence for association, as those might motivate further downstream experiments. To evaluate the accuracy of imputed genotypes for these “strongly associated SNPs,” we compared imputed and experimental genotypes in regions that were only selected for follow-up genotyping after imputation (for example, because imputed genotypes resulted in strong evidence for association but nearby genotyped markers did not). Table II summarizes the comparison of allele frequencies, association test statistics, and individual genotype calls between imputed genotypes and actual genotypes later determined by genotyping. Overall, it is clear that even among these strongly associated SNPs imputation provided accurate estimates of the true P-values. The largest observed discrepancies were for rs17384005, rs11646114, and rs4812831, which were also the three markers for which our imputation approach estimated lower r2 with actual genotypes. Imputation is particularly useful because it allows evidence for association at SNPs with no reliable proxies to be evaluated more accurately. For instance, after imputation, average r2 increased from 0.22 to 0.66 in the set of SNPs whose best genotyped proxy had r2<0.30 and from 0.33 to 0.75 in the set of SNPs whose best genotyped proxy had r2<0.5 [for specific examples of disease susceptibility loci that would be missed without imputation, see Li et al., 2009b].
Remarkably, we observed that imputed genotypes could also be used to obtain very accurate estimates of LD between pairs of untyped markers, or of LD between a genotyped marker and an untyped marker. As shown in Figure 2, estimates of LD between two SNPs obtained using imputed data are much closer to the results obtained by actually genotyping the two SNPs than estimates obtained by looking up the two markers in the HapMap CEU database (Supplementary Figure 2 shows a similar comparison for D’ estimates). Even with some imprecision in estimates of individual genotypes, the increased sample size compensates to reduce variation in the estimated LD measures.
Our experience with the FUSION GWAS, summarized above, shows that imputation can be an effective way to estimate unobserved genotypes and/or allele counts. These genotypes can then be used in a variety of downstream analyses, including logistic regression analyses for discrete traits and linear regression analyses for quantitative traits, and to facilitate meta-analysis of studies based on different platforms. A key issue when considering imputation-based approaches is whether similarly accurate estimates of unobserved data points can be obtained with different genotyping panels or in different populations [Clark and Li, 2007], and to evaluate this we conducted two additional experiments.
In the first experiment, we used genotype data generated by the International HapMap Consortium. We considered each of the HapMap samples in turn and masked available genotypes so as to mimic an experiment using one of several commercially available chips. For example, to evaluate the Affymetrix 500K SNP chip, we marked genotypes for all markers that are not on the chip as missing for the individual being considered. We then used haplotypes for the remaining individuals on the same HapMap analysis panel (either YRI, CEU, or JPT+CHB) to impute the missing genotypes. The results are summarized in Table III and clearly show that a large number of SNPs can be imputed very accurately using any of the commercially available panels (e.g. with r2>0.80 to experimental genotypes) and that, compared to relying on single marker tagging, imputation results in improved coverage of the genome.
Depending on the commercial panel and population being investigated, coverage of HapMap SNPs (proportion of SNPs with r2>0.80) increased by 10–30% for low MAF alleles (MAF<5%) and by 10–20% for more common alleles (MAF>5%). In agreement with this result, the average r2 between each untyped SNP and imputed genotypes was up to 40% higher on average when using imputed genotypes than when using the best available single marker proxy. Imputation remained valuable even for panels with ~1 million directly genotyped SNPs. In practice, the results shown in Table III are likely to represent an upper bound on the performance of our method in real settings, because additional errors will result from discrepancies in genotyping protocols between individual laboratories and the HapMap and from differences in LD patterns between the HapMap and the samples being studied. Nevertheless, they suggest our method is likely to be helpful for a variety of currently available commercial SNP panels.
In a second experiment, we evaluated the performance of our method in 927 samples from 52 populations in the Human Genome Diversity Project (HGDP). In a previous evaluation of tag SNP portability, these 927 samples were genotyped for 1,864 SNPs in 32 autosomal regions (average minor allele frequency 0.15–0.24, depending on population) [Conrad et al., 2006]. The regions were selected to represent regions of high and low LD across the genome. Each region spanned ~330 kb, including a central “core” region of ~90 kb, where ~60 SNPs were attempted, and two ~120 kb flanking regions on either side, where ~12 SNPs were attempted. To evaluate the performance of genotype imputation across these diverse populations, we selected a thinned marker set including 872 SNPs spaced ~10 kb apart across all 32 regions. We then used these SNPs to impute genotypes for the remaining 992 SNPs and evaluated our approach.
Figure 3 shows the proportion of incorrectly imputed alleles in each of the populations. Results are presented using a single HapMap analysis panel as a reference (either the CEU, YRI, or CHB+JPT) or using all HapMap samples as a larger reference panel. For each of the populations, the reference panel that resulted in the smallest overall error rate is highlighted. Overall, African samples were the most difficult to impute, with error rates ranging between 5.13% for the Yoruba and 11.86% for the San when the HapMap YRI panel was used as a reference. In other parts of the world, we generally observed that the HapMap CEU provided a good reference panel for European populations and that the HapMap CHB+JPT provided a good reference panel for East Asian populations, resulting in error rates of <3.34 and <2.89%, respectively. Outside Europe and East Asia, when imputation was applied to populations from the Middle East, Central and South Asia, the Americas or Oceania, it was generally better to use the combined HapMap sample as a reference than to use any single HapMap analysis panel as a reference. It is interesting to note that, in all cases, combining the three HapMap panels into a single reference set was either the best option or the second best option. Furthermore, in situations where this combined reference panel reduced imputation accuracy, it resulted in an average increase of only 0.15% in error rates. Our results are consistent with those of Huang et al. [Huang et al., 2009] who showed, in a smaller subset of HGDP populations and a different set of genotyped SNPs, that combined reference panels could outperform panels that included only one population. The figure also illustrates that, when a large number of individuals are genotyped in study samples, it may be possible to bypass the HapMap reference panel altogether. In the last panel, rather than using the HapMap genotypes to impute missing data, we used a combined dataset including all other HGDP populations.
Figure 4 focuses on the estimated r2 between imputed and observed allele counts. In each stripe, accuracy of imputation is assessed using a different reference panel. Superimposed in pink is the coverage that would be provided by single marker tagging approaches. Broadly, it is clear that imputation using an appropriate reference panel will improve coverage. Using an inappropriate reference panel (for example using the HapMap CEU to impute genotypes for one of the African populations), can result in imputed genotypes and allele counts that are not as strongly correlated with the true genotypes as the best available single marker tag but, even then, the loss appears to be small. Importantly—in all cases—combining the three HapMap panels resulted in substantial improvements in coverage over single marker tagging—suggesting that this might be a cautious approach when the choice of reference panel is unclear. Combining the three HapMap panels is also a good choice for genotype imputation in admixed populations [Mathias et al., 2010] where, depending on the ancestry of each stretch of the genome, the best matching haplotype will likely originate from a different HapMap reference panel. Our conclusion that the combined panel is a sensible reference for all populations facilitates practical decision making on the choice of reference panel. The conclusion is also supported by Huang et al. [Huang et al., 2009]. Although their aim was to find an optimal population-specific reference panel for each HGDP sample, their Figure 6 shows that a combined panel, including all HapMap haplotypes is the best compromise choice, in the sense that it performs almost optimally in each of the 39 HGDP populations examined. In the future, we expect that imputation methods that weigh the different reference panels could further improve imputation quality.
Our evaluation of imputed genotypes in the FUSION, HapMap, and HGDP samples clearly shows that imputation can be very accurate in a variety of populations. In this way, we believe it will be an important tool for combining results across studies that rely on different marker panels. To investigate whether using imputed genotypes might also improve power in individual studies, we carried out a simulation experiment. As previously described [Schaffner et al., 2005], we simulated 10,000 chromosomes for a series of 1 Mb regions. Within each region, simulated LD patterns mimicked the HapMap CEU or YRI [Schaffner et al., 2005]. We then used a subset of 120 simulated chromosomes to generate a region specific “HapMap.” As described in the methods, we then picked the minor allele for a randomly selected polymorphic site in each region as the “disease susceptibility allele” and simulated a set of 500 case and 500 control individuals using the remaining chromosomes. The susceptibility allele varied in frequency between 2.5 and 50%, with larger simulated effect sizes assigned to rarer alleles to ensure comparable power in a hypothetical fully genotyped sample. We also simulated 2,000 datasets where the disease allele had no effect to calibrate region-wide type I error rates for each approach.
To analyze each region, we thinned SNPs in the simulated HapMap to match the density and allele frequency spectrum of the Phase II HapMap [The International HapMap Consortium, 2007]. Using the thinned data, we selected a panel of 100 tag SNPs for each region that included the 90 tag SNPs with the largest number of proxies and 10 additional SNPs selected at random among the remaining tags. This approach resulted in panels that captured ~78% of the common variants (MAF>5%) in the simulated CEU HapMap, similar to the real life performance of the Illumina 317K SNP genotyping chip. Finally, we analyzed each of the simulated datasets using the selected marker panel and one of three analysis strategies: (a) single marker chi-squared association tests, (b) single and multi-marker association tests [Pe’er et al., 2006] as suggested by the PLINK [Purcell et al., 2007] program based on the simulated HapMap, or (c) tests using imputed allele counts for all the markers in the simulated HapMap. Results are summarized in Table IV. The first row in the table shows the significance thresholds used for each analysis (since approaches (b) and (c) both increase the total number of tests, note that the P-value threshold increases slightly when multi-marker tests are used and increases further when imputation is used). Subsequent rows summarize power for markers of different allele frequencies. In populations with strong LD, it is clear that for common susceptibility alleles the single marker tests provide high power and that imputation or multi-marker analyses provide only small gains in power. However, for rarer alleles (such as those with frequencies <5%), imputation can provide dramatic increases in power. For instance, power increased from 24.4 to 56.2% when the disease allele frequency was 2.5% and imputation was used in the panel with CEU-like LD. As large genome scans and meta-analyses that are well-powered to evaluate rarer variants with modest effects are completed, we believe that imputation will become an increasingly important primary analysis and there are now examples of confirmed disease susceptibility loci that would have been missed without genotype imputation [Li et al., 2009b].
A key ingredient for any imputation-based approach is to ensure that alleles are consistently labeled across studies. In our evaluation of FUSION and HGDP samples, using the HapMap as a reference, we were fortunate that a subset of the HapMap individuals were genotyped in each study for quality control. Contrasting the genotypes for these quality control samples with those generated by the HapMap Consortium made the usually laborious process of ensuring consistent allele labeling across labs much easier. We strongly recommend that all labs conducting GWAS genotype a small number of HapMap individuals for this purpose.
Another practical consideration arises when integrating data from studies that use diverse genotyping platforms. Superficially, it is tempting to first impute missing genotypes in each sample and to then conduct a pooled analysis of all available data. However, this is almost never a good idea, as illustrated by a particularly extreme case where a set of cases and controls have been genotyped on two different platforms and a marker of interest has been genotyped in cases but must be imputed in controls. If the marker of interest cannot be well predicted by flanking markers, imputation will default to suggesting that the genotype distribution at that marker matches the reference panel—but this could be a very poor assumption if the reference panel and study sample have drifted apart, potentially resulting in spurious association. Even if the marker can be well predicted by flanking markers, it is possible that the reference panel and the case sample used different genotyping assays that, for technical reasons such as the presence of a polymorphism that overlaps assay primers, give consistently distinct results—again resulting in spurious association. To avoid these sources of spurious association, we recommend that, when analyzing genotype data generated using different platforms, different versions of the same platform, or using the same platform but with experiments carried out at different labs, an initial round of association analysis should be carried out using data from each platform/version/site combination. The results from this initial round of analysis can then be meta-analyzed, minimizing the risk of artifacts. This recommendation does preclude analyses where all cases are genotyped at one site, and all controls are genotyped at a different site.
In the experiments described so far, we illustrated the accuracy of genotype imputation that relies on existing resources (such as the Phase II HapMap) and genotyping technologies (including a variety of commercial genotyping chips). It is likely that both these resources and technologies will continue to evolve rapidly and it is interesting to consider how these developments might impact imputation-based approaches. For example, it is clear that genotyping chips of the future will be able to examine an ever larger number of tag SNPs in a cost-effective manner. Extrapolating from Table III, it is clear these should provide improved genomic coverage, eventually allowing investigators to impute nearly all HapMap SNPs with near perfect accuracy. Nevertheless, it is also clear from Table III that when coupled with imputation-based analyses even relatively low-density SNP chips can provide excellent coverage of the genome in populations with LD patterns similar to the CEU, JPT, and CHB. Thus, we expect the main advantages of new higher-density chips will be in the study of populations with less extensive LD, such as the YRI, and in the analysis of rarer variants.
Another interesting possibility to consider is the impact of larger HapMap reference panel on imputation or, similarly, the utility of using extra genotype data on a subset of individuals in a study to aid imputation in the remaining individuals in the study. To evaluate these possibilities, we generated a reference panel with varying numbers of Finnish individuals (between 30 and 500, see Table V) and used these reference panels to impute genotypes for 521 SNPs in an independent set of 500 individuals from the FUSION study of type II diabetes. Imputation accuracy and genomic coverage increase noticeably with the larger reference panels, with overall discrepancy rates between typed and untyped alleles as low as 0.40% when a reference panel of 500 unrelated individuals is available. One of the reasons for this increase in accuracy is that the length of haplotypes shared between individuals in the reference panel and those in the study sample increases gradually as the size of the reference panel increases. For example, mosaic fragments used to reconstitute the FUSION samples using the individuals in the 500-sample reference panel were slightly >1 Mb long on average. These long stretches are easier for our Markov model to identify and are also likely to descend from a more recent common ancestor. This means they will have undergone fewer rounds of gene conversion and mutation, which gradually erode haplotype similarities and reduce the quality of our imputed genotypes. Overall, our results suggest that either genotyping a number of the study samples for markers of interest or increasing the size of the public reference panels will greatly improve the quality of genotype imputation.
With the rapid development of very high-throughput re-sequencing technologies [Bentley, 2006], it is oft proposed that genotyping-based approaches will soon become outdated. Re-sequencing-based approaches capture variants that are absent from public databases including, potentially, population specific variants. Our haplotyping approach can use whole genome re-sequencing data as input. In this setting, it uses information from individuals with similar haplotypes to reconstruct patterns of variation in regions where deep coverage is not available. In principle, the approach could be useful to help describe regions that, due to chance, are poorly covered in a particular sequencing experiment or to allow for economical evaluation of many individuals. To evaluate the possibilities, we simulated data for ten 1 Mb regions and simulated shotgun sequence data for each region. We simulated reads that were only 32 base pairs long and with a per base-pair error rate of 0.2%. Very roughly, these correspond to the performance of early versions of next generation re-sequencing technologies; newer versions of these technologies can generate longer and more accurate reads and should thus outperform the simulations presented here. We then re-sequenced between 100 and 400 individuals at different depths and used our approach to reconstruct haplotypes and genotypes for each individual. Note that the simulated reads are typically too short to include useful information on phase (because they will generally include only zero or one sites that truly differ from the reference sequence). In addition, given the large number of bases examined, they will also suggest a large number of false polymorphic sites. To control false-positive variant calls, it is imperative to confirm true polymorphic sites either by examining overlapping similar reads from the same individual or, potentially, from other individuals who share a similar haplotype.
For each site, we counted the number of times that the reference base or an alternative base was sequenced for each individual. For computational convenience, we only considered sites where both bases were observed several times (see Appendix for detailed methods and implementation details) in downstream analyses and assigned the most frequently sampled base to all other sites. On this scale, the shotgun re-sequencing approach typically characterized ~4,000 polymorphic sites across the sampled individuals - ~4 × the SNP density of the Phase II HapMap. Even relatively light shotgun re-sequencing provided very accurate haplotypes for each individual. For example, when 400 individuals were sequenced at 4 × depth, there were only 18.97 errors per individual on average (over 1,000,000 base-pairs). Across ~980,000 sites that were monomorphic in the population only 82 false polymorphisms were called on average. Accuracy was also excellent at sites that were polymorphic in the population. For example, 3,558 of the 3,641 simulated polymorphic sites with MAF>0.5% were identified and, at these sites, alleles were called with an accuracy of 99.93% (see Tables VI and VII). For any given depth, imputed accuracy increased with the number of sequenced individuals (for example, accuracy at sites with MAF >0.5% was ~98.8% when 100 individuals were sequenced at 2 × coverage but increased to ~99.7% when 400 individuals were sequenced at the same depth; the number of errors per individual decreased similarly from 106.3 per individual to 40.3 per individual). In addition, the depth required to achieve a given accuracy decreased as the number of sequenced individuals increased: achieving 99.9% accuracy for sites with population MAF >0.5% requires ~8 × depth for 100 individuals, ~6 × depth in 200 individuals and only 4 × depth in 400 individuals. In each case, note that error rates are higher at heterozygous sites than at homozygous sites. Again, performance of the approach with larger numbers of individuals improves because the mosaic fragments described by our model increase in length and, thus, become easier to find. This is also reflected in the accuracy of estimated haplotypes, which—when compared with simulated haplotypes—have ~1 switch per 50 kb when 100 individuals are examined, but ~1 switch per 500 kb when 400 individuals are examined. We expect that combining shotgun re-sequencing of whole genomes with imputation-based approaches such as ours will allow economical association studies that evaluate SNP variation in large numbers of individuals even more exhaustively than is currently possible. Furthermore, we expect that whatever the characteristics of the re-sequencing technology used, it will be possible to improve the quality of estimated genotypes and haplotypes at each site by combining information across individuals, rather than simply increasing the depth at which each individual is sequenced.
In summary, we have described and evaluated a very effective model for haplotyping and genotype imputation in whole genome studies. The idea of genotype imputation is not new and was outlined as early as 2006 [Scheet and Stephens, 2006]. Here, we evaluate the practical performance of imputation based on a variety of genotyping platforms and populations, using both simulations and real data. We show that our model leads to imputed genotypes whose quality improves as more data becomes available, either because a larger reference panel is used or because study samples are genotyped in finer detail. Similarly, haplotype estimates improve in quality as more individuals are genotyped. Furthermore, we have introduced novel approaches for the analysis of short read shotgun sequencing data, which is likely to become extremely important as human geneticists move beyond chip-based genotyping to resequencing (as in the 1,000 Genomes Project, whose initial design was partly based on the simulations summarized in our Table VI, see http://www.1000genomes.org for more details).
Other approaches for genotype imputation have been developed independently [Marchini et al., 2007; Servin and Stephens, 2007]. We expect that our results demonstrating the utility of larger reference panels, showing that the three HapMap analysis panels can be combined to better impute genotypes in populations that are genetically distant from the HapMap analysis panels, illustrating the ability of imputation-based approaches to estimate LD between untyped markers, and comparing the relative performance of imputation-based approaches for different commercial marker panels will apply when these alternative approaches for genotype imputation are used. The approaches differ in the precise details of how they search for shared haplotype stretches and also in the efficiency of their computational implementations. For example, whereas [Marchini et al., 2007] rely on recombination rates generated by the HapMap Consortium and assume a uniform mutation/error rate for all markers, we estimate “recombination rates” within each dataset and allow “mutation rates” to vary. These parameters capture not only intrinsic characteristics of the markers and regions being examined, but also—for example—the genetic distance between the samples being imputed and the reference panel (which can impact apparent “recombination rates”) and differences in genotyping protocols between the two samples (which can impact apparent “mutation rates”).
We expect that, in small samples, the use of external recombination rate estimates (as in IMPUTE) might be beneficial, but that with large sample sizes or in the presence of genotyping error our approach, which uses available data to model “recombination” and “mutation” rates should become advantageous. We performed two sets of preliminary comparisons of MaCH and IMPUTE. In the first experiment, we applied IMPUTE [Marchini et al., 2007] to the FUSION GWAS data for chromosome 14 and estimated genotypes for 521 previously genotyped markers [Willer et al., 2006]. Genotypes estimated by IMPUTE and MaCH were identical in 99.2% of cases. In the cases where the two estimates differed, IMPUTE matched experimental genotypes 44.6% of the time, MaCH matched experimental genotypes 52.3% of the time, and both estimates were wrong 3.06% of the time. For the second experiment, we applied IMPUTE to the HGDP data of Conrad et al. . Table VIII tabulates the proportion of markers imputed with r2>0.80 in each population using either MaCH or IMPUTE (in each case, we selected the HapMap reference panel that provided the best imputed genotypes). Overall, the two methods perform similarly. MaCH slightly outperforms IMPUTE in 37 out of 52 populations, slightly underperforms in 13 populations and the two methods are tied in the remaining two populations. Our results are consistent with other published comparisons [Biernacka et al., 2009; Pei et al., 2008], which include detailed comparisons of the performance of MaCH and IMPUTE with each other and with alternative imputation approaches such as BEAGLE and fastPHASE.
Our method uses an HMM to describe genetic variation along each haplotype. It is clear that when HMM models are applied to genetic data, many opportunities for identifying computational efficiencies exist [Abecasis et al., 2002; Gudbjartsson et al., 2000; Idury and Elston, 1997; Kruglyak and Lander, 1998; Lander and Green, 1987]. In the methods section we describe several optimizations that we have already implemented, including a general strategy for reducing memory requirements for the Baum algorithm [Baum, 1972; Wheeler and Hughey, 2000]. We expect that further efficiencies will be forthcoming. Our model is implemented in the MaCH package (freely available with C++ source code from our website, see http://www.sph.umich.edu/csg/abecasis/mach/). Our implementation can be used to carry out all the analyses described in this paper. Specifically, it can estimate haplotypes, impute missing genotypes in a variety of populations, using the HapMap sample or another set of densely genotyped individuals as a reference, analyze shotgun re-sequencing data from high-throughput technologies now being developed, and carry out simple tests of association.
We thank Mike Boehnke, Karen Mohlke, and the other FUSION investigators for helpful discussions. This research was supported by research grants from the NIMH, NHLBI, and the NHGRI to GRA.
Our model resolves a set of unphased genotypes G into an imperfect mosaic of several template haplotypes. We assume that H template haplotypes are each genotyped at L loci and let Tj(i) denote the allele observed at locus j in reference haplotype i. Furthermore we define a series of indicator variables S1, S2, …, SL that denote an hypothetical (and unobserved) mosaic state underlying the unphased genotypes. At a specific position j there are H2 possible states. A specific state, such as Sj = (xj, yj), indicates that the first chromosome uses haplotype xj as a template, whereas the second chromosome uses haplotype yj as a template.
We are interested in making inferences about the sequence of mosaic states S that best describe the observed genotypes. Knowledge of S will implicitly order alleles at heterozygous sites and suggest an allele for each untyped location. We define the joint probability of the observed genotypes and an underlying haplotype state as:
In the model above, P(S1) denotes the prior probability of the initial mosaic state and is usually assumed to be equal for all possible configurations, P(Sj | Sj−1) denotes the transition probability between two mosaic states and reflects the likelihood of historical recombination events in the interval between j and j−1, P(Gj|Sj) denotes the probability of observed genotypes at each position conditional on the underlying mosaic state and reflects the combined effects of gene conversion, mutation, and genotyping error. Interestingly, note that, whereas, our model and IMPUTE both use a large number of haplotypes as templates, fastPHASE [Scheet and Stephens, 2006] uses a smaller set of estimated haplotype “groupings” as templates in an otherwise similar HMM, resulting in improved computational efficiency at the cost of some fuzziness in haplotype templates.
To estimate haplotypes in a sample of genotyped individuals we first assign a random pair of haplotypes to each individual, consistent with the observed genotypes. This involves randomly ordering alleles at each heterozygous site and sampling alleles at untyped sites according to population frequencies. Then, we update the haplotypes for each individual in turn by using the current set of haplotype estimates for all individuals as templates and sampling S proportional to the likelihood L(S|G) P(G,S). Note that since the Sj define a Markov Chain this sampling can be done conveniently using Baum’s forward and backward algorithm [Baum, 1972]. A new set of haplotypes for an individual is then defined according to sampled mosaic and edited to ensure it matches the observed genotypes. We repeat the update procedure several times, looping over all individuals (more updates result in gradual refinement of the estimated haplotypes, but very accurate haplotype estimates can often be obtained in ~20 rounds, see Table I). After a pre-specified number of rounds are completed, we generate a pair of consensus haplotypes for each individual. This consensus haplotype pair is defined as the pair that minimizes total switch error when compared to the haplotypes sampled at each round.
Key ingredients in the above procedure are the transition probabilities P(Sj | Sj−1) and emission probabilities P(Gj | Sj). We define the transition probabilities as a function of the crossover parameter θj:
The possible values of P(Sj | Sj−1) reflect both the overall rate of changes in the mosaic for the interval, given by θj, and the fact that when a change occurs a new mosaic state is selected at random among all possible states.
We let T(Sj) = T(xj)+T(yj) denote the genotype implied by state Sj and define the emission probabilities P(Gj | Sj) as a function of the error parameter εj:
Initially, we let set θj = θ = 0.01 and εj = ε = 0.01 or some other suitable constant. As we sample a new mosaic state for each individual we keep track of the number and location of change points in the mosaic and of the number of times that the genotype implied by the sampled mosaic state matches the observed genotype (or not). These quantities are then used to update the θj and εj parameters for the next iteration. It is important to avoid setting either θj = 0 or εj = 0, as that could make it difficult for our Markov sampler to investigate different mosaic configurations. To avoid this, a combined crossover parameter is estimated for intervals with a small number of sampled changes in mosaic state and an analogous procedure is employed for markers with a small number of observed mismatches between the mosaic and observed genotypes.
Overall, we expect the θj will reflect a combination of population recombination rates and the relatedness between the haplotypes being resolved and the true underlying haplotypes (for example, if CEU chromosomes are used as templates to resolve CHB genotypes we expect, on average, higher θ estimates than when other CHB individuals are used as templates). We considered using distance between flanking markers to inform estimates of θj (since θ’s are generally larger in larger intervals), but did not find noticeable improvements. Overall, we expect that εj will reflect a combination of genotyping error, gene conversion events, recurrent mutation and, when genotype data from multiple platforms or laboratories is used, assay inconsistencies between different platforms. We observed slightly lower data quality measures (completeness, duplicate concordance, Hardy–Weinberg test statistics) for markers with large estimates of εj in the FUSION GWAS.
Genotype imputation analyses proceed similarly to the haplotyping analyses described above, but do not require each sampled haplotype configuration to be stored. Instead, after each iteration, a series of counters is updated to indicate the number of times each genotype was sampled at a particular position. Once all iterations are completed, these counters give an indication of the relative probability of observing each possible genotype and can be used to impute the most likely genotype and to calculate various measures of the quality of imputed genotypes.
Without loss of generality, consider a SNP with alleles A and B. Let nA/A, nA/B, and nB/B be the number of times each possible genotype was sampled after I = nA/A + nA/B+nB/B iterations. For downstream analysis of imputed alleles, we typically consider either the most likely genotype or the expected number of copies of allele A. The most likely genotype is simply the genotype that was sampled most frequently. The expected number of counts of allele A is the genotype score g = (2nA/A + nA/B)/I. Both of these quantities can be conveniently incorporated into a variety of analysis, including regression-based association analysis of discrete or quantitative traits.
To measure the accuracy of imputation for a single imputed genotype IG, we define the genotype quality score Q = nIG/I. This quantity can be averaged over all genotypes for a particular marker to quantify the average accuracy of imputation for that marker. We have found that a better measure of imputation quality for a marker is the estimated r2 between true allele counts and estimated allele counts (Fig. 1). This quantity can be estimated by comparing the variance of the estimated genotype scores with what would be expected if genotype scores were observed without error. For a given SNP, let Var(g) be the variance of estimated genotype and let p = Mean(g)/2 be the estimated frequency of allele A. The estimated r2 with true genotypes can then be defined as
An alternative definition is
Empirically, we have found that while both definitions lead to similar conclusions, the first definition appears to be marginally better.
When analyzing the FUSION data [Scott et al., 2007], we included imputed genotype scores as predictors in a logistic regression that also included age, sex, and geographic origin as covariates. For analyzing simulated case-control data, we simply used a t-test to compare the average genotype scores in cases and controls. Other approaches to the analysis of imputed data are possible but, in our experience, the imputed genotype scores provide a good balance between computationally demanding multiple imputation procedures [Servin and Stephens, 2007] and analyses that simply use the most likely genotype.
When shotgun re-sequencing, or another single molecule re-sequencing technology, is used on diploid individuals, genotypes are not directly observed. In this case, we assume the data consists of counts Aj and Bj indicating how many times base A (or B) was observed at site j. We then define our HMM as
Here, we sum over possible genotypes at each site and calculate the probability of the observed traits for each possible genotype set. In addition, we define the probability of observing a specific set of traces given the underlying genotype as
The parameter δ denotes the per base sequencing error rate and can be separated from the effects of mutation and gene conversion captured in ε, unless the re-sequencing depth is very low.
In principle, the method could be applied to all sites where an alternative base call is observed at least once. However, since we simulated many short reads and an error rate of 0.2%, the minor allele was observed at least once at nearly every simulated position. For reasons of computational efficiency, we applied the MaCH 1.0 haplotyper only to positions were the minor allele was observed multiple times. Specifically, we defined mkj as the number of traces where the minor allele was observed at position j in individual k. Then, we defined the score wj = Σk mkj(mkj + 1)/2 and applied our haplotyping algorithm to all sites where wj exceeded a predefined threshold (other sites were assumed to contain the major allele). The score gives higher weight to sites where the minor allele is observed multiple times in the same individual. We used thresholds for wj of 5, 7, 9, 11, 13 depending on whether the total coverage (defined as depth × individuals) was 200, 400, 800, 1,200, or 1,600 ×. When the number of individuals sequenced was 400, these thresholds were reduced to 4, 6, 8, 10, and 12, respectively. This means that, for example, when 400 individuals were re-sequenced at 4 × depth (total depth = 1,600 ×) we considered only sites where the minor allele was observed in at least 12 traces from different individuals or slightly fewer traces concentrated in one or more individuals.
The model described above is convenient for the analysis of simulated data where the per base error rate is constant. For analyses of real data, where base quality scores are associated with individual bases, we adapted our implementation to use P(base calls, quality scores | G) as stored in Genotype Likelihood Files generated by samtools [Li et al., 2008, 2009a].
A number of optimizations are possible to increase the computational efficiency of our model. For example, since haplotype states are unordered only H(H+1)/2 distinct states must be considered at each location, rather than H2 distinct states. Below, we summarize some of the other efficiencies that we identified and how these are implemented in MaCH.
When sampling a mosaic state S conditional on the observed genotypes G, we rely on the Baum algorithm. The algorithm requires a series left and right conditioned probability vectors which provide an indication of the relative probability of a specific state at a given location conditional observed genotypes at markers to its left (or right). For example, the probability of observing state (x,y) at location j conditional on all preceding genotypes is simply:
The calculation of these probabilities can be sped up by taking advantage of the regular patterns in the transition matrices. Specifically, we define the quantities:
Then, the previous definition becomes:
When this updated definition is used to calculate left conditional probabilities for each possible state, computational requirements become O(H2) rather than O(H4) using the original definition, provided that C(a) and C are pre-computed. An analogous speed up is available for right-conditioned probabilities.
One large computational constraint when applying our algorithm on a genomic scale is the storage required to track left-conditioned probabilities. Typically, this requires storage of L vectors each with H2 elements (or, as noted above H(H+1)/2 elements). It is clear that this requirement becomes cumbersome as the number of polymorphic sites increases. We devised a solution that requires storage of only 2*sqrt(L) vectors. For notational convenience let K = sqrt(L). Our algorithm pre-allocates 2K vectors and organizes these into two groups: a framework set of K vectors, and a working set of another K vectors. When left-conditional probabilities are first calculated, proceeding left to right, we store every Kth vector in the framework set and discard other intermediate results. Then, as these vectors are used in the second pass of the chain (which combines left and right conditional probabilities, proceeding right to left), we recalculate K of these vectors at a time (starting from the nearest vector in the framework set) and store them in the working set of vectors. Completing the full chain requires calculation of all L vectors of left conditional probabilities, recalculation of K of these vectors L/K times, and calculation of L vectors of right conditional probabilities. Overall, our solution no more than doubles computing time (since each vector of left conditional probabilities must be calculated twice), but reduces memory requirements from O(L) to O(L1/2). The solution is general and can be applied to many other HMMs (see also [Wheeler and Hughey, 2000]).
If all available chromosomes are used as templates, the complexity of our algorithm will increase cubically with sample size (because the cost of each update increases quadratically and the number of updates increases linearly with sample size). One way to avoid this is to restrict the size of the template pool. When there are more than a pre-specified number of potential templates (say H > 300), we typically select a random subset of these for each update. With this restriction, the complexity of our algorithm increases only linearly with sample size (because the cost of each update now remains fixed and only the number of updates to be performed grows). Furthermore, even though each update is based on only a random sample of the available haplotypes, the overall quality of solutions still increases with sample size. When the focus is on genotype imputation, rather than haplotyping, an alternative is to use as templates individuals who have been genotyped for the markers being imputed (e.g. the HapMap reference samples). Both of the above solutions are heuristics that trade-off some accuracy for computational efficiency. An alternative strategy for reducing the size of the template pool is to consider local similarities and redundancies among the haplotypes in the pool. These redundancies are already exploited to increase computational efficiency in the handling of other Markov models [Abecasis et al., 2002; Markianos et al., 2001], and our preliminary implementations suggest that speed-ups of 5–10 × are possible for our haplotyping model.
Additional Supporting Information may be found in the online version of this article.