In this study, we investigated and compared the performance of five popular genotype imputation methods (MACH, IMPUTE, fastPHASE, PLINK and Beagle) under various conditions. Using both simulated and real data sets, we determined that LD level, MAF of untyped SNPs, marker density and the number of reference haplotypes have varying effects on imputation accuracy rates (ARs). Specifically, stronger LD, lower MAF, or higher marker density led to better ARs; a larger number of haplotypes in the reference sample resulted in higher ARs for MACH, IMPUTE, PLINK and Beagle, but had little influence on ARs for fastPHASE. Comparing the methods with one another, MACH and IMPUTE produced similar results that were generally better than those of fastPHASE, PLINK and Beagle. In addition, MACH performed better than IMPUTE under low LD levels and high marker densities.
One reason that missing genotypes can be imputed is that unrelated individuals descended from common ancestors usually share extended haplotypes over short regions. The approach by which haplotype sharing is captured differs among the five methods. In the following discussion, we do not summarize the model underlying PLINK because it was not available at the time this study was performed. The remaining four methods all infer individual genotypes as mosaics of a set of background haplotypes through an HMM process. Despite their conceptual similarities, implementation differences between these methods have produced some differences in relative performance. fastPHASE relies on a fixed number of haplotype clusters to form the underlying hidden states of the Markov chain. Provided that this number is correctly specified, fastPHASE should give acceptably good performance. However, the cluster number is usually restricted to a small value in real applications as a trade-off against computational cost, which makes this approach slightly inferior to the alternatives under most conditions. Beagle uses a haplotype clustering approach similar to that of fastPHASE, but it allows the cluster number to change dynamically to better fit the localized LD patterns exhibited by the data. Nonetheless, empirical estimates of parameters in Beagle may bias specification of the model to some extent, particularly when the sequence exhibits a low average LD level. Both MACH and IMPUTE model genotypes directly on the set of haplotypes without clustering, and both appear to outperform fastPHASE and Beagle, which adopt haplotype clustering strategies. This improvement is probably attributable to their capacity to capture more information on haplotypic variation without clustering. IMPUTE explicitly specifies a set of reference haplotypes (e.g., haplotypes from the HapMap project) as the pool of hidden states of the Markov chain, and infers haplotypes and missing genotypes in test samples according to these hidden states. In contrast, MACH implicitly combines reference and test samples to estimate parameters and updates haplotypes for all individuals in turn by a Monte Carlo procedure. Generally, the two approaches have approximately equal performance. However, MACH performs slightly better than IMPUTE under certain conditions, as we show in this study, probably because it makes better use of the data by combining reference and test samples to train model parameters.
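The mosaic idea shared by the HMM-based methods can be illustrated with a minimal sketch. It is a simplified haploid model in which the hidden state is the reference haplotype being copied at each site, and it is not the implementation used by any of the programs compared here. The function name `impute_haploid`, the constant switch probability `rho` and the mismatch probability `eps` are assumptions introduced for illustration only.

```python
def impute_haploid(ref_haps, target, rho=0.05, eps=0.01):
    """Posterior P(allele = 1) at each site of one target haplotype.

    ref_haps: list of reference haplotypes, each a list of 0/1 alleles.
    target:   list of 0/1 alleles, with None marking untyped sites.
    rho:      assumed probability of switching template between adjacent sites.
    eps:      assumed probability that an observed allele differs from the template.
    """
    k, m = len(ref_haps), len(target)

    def emit(h, j):
        # Untyped sites carry no information about the hidden state.
        if target[j] is None:
            return 1.0
        return 1.0 - eps if ref_haps[h][j] == target[j] else eps

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Forward pass (rescaled at every site, so each vector sums to 1).
    fwd = [normalize([emit(h, 0) / k for h in range(k)])]
    for j in range(1, m):
        prev = fwd[-1]
        fwd.append(normalize(
            [emit(h, j) * ((1 - rho) * prev[h] + rho / k) for h in range(k)]
        ))

    # Backward pass with the same transition model.
    bwd = [[1.0] * k for _ in range(m)]
    for j in range(m - 2, -1, -1):
        w = [emit(h, j + 1) * bwd[j + 1][h] for h in range(k)]
        jump = rho * sum(w) / k
        bwd[j] = normalize([(1 - rho) * w[h] + jump for h in range(k)])

    # Posterior over templates at each site, summed over templates carrying allele 1.
    result = []
    for j in range(m):
        post = normalize([fwd[j][h] * bwd[j][h] for h in range(k)])
        result.append(sum(post[h] for h in range(k) if ref_haps[h][j] == 1))
    return result


refs = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 0, 0, 0, 1]]
probs = impute_haploid(refs, [1, 1, None, 1, 1])
print(round(probs[2], 3))  # posterior probability of allele 1 at the untyped site
```

In this toy example the target matches the second reference haplotype at every typed site, so the untyped site is imputed from that template with high posterior probability. Clustering-based methods would replace the `k` reference haplotypes with a smaller set of cluster states.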
Among the various factors influencing imputation AR, the level of LD plays a central role for all methods: stronger background LD improves imputation AR. The effect of marker density operates largely through LD, because denser markers usually produce stronger patterns of local LD. Thus, denser markers also help to improve imputation AR. The influence of MAF on imputation AR can likewise be interpreted as ultimately caused by the level of LD. Our results demonstrated that a decrease in the MAF of untyped variants resulted in an increase in imputation AR. A lower MAF usually corresponds to a “younger” ancestral mutation, and therefore to stronger LD with nearby markers, provided recombination plays the primary role in LD decay. To confirm this, we calculated average values of r² between typed and untyped markers under different MAF interval settings, for different LD regions in our simulation data. However, we did not find an obvious relationship between r² and ARs. For example, in one of the simulated low LD regions, average values of r² varied only slightly around 0.006 regardless of MAF interval. Nonetheless, when the level of LD was measured by D′, the trends in D′ confirmed our explanation. For example, in the same region, average values of D′ decreased from 0.33 to 0.15 as the MAF interval increased from 0.05 to 0.45. The discordance between r² and D′ was likely caused by the fact that the calculation of r² was less sensitive to MAF than that of D′.
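The two LD measures discussed above can be computed from phased haplotypes with their standard definitions. The sketch below is a minimal illustration and not the code used in this study; the function name `ld_stats` and the toy data are assumptions.

```python
def ld_stats(haplotypes):
    """Return (r2, d_prime) for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, alleles coded 0/1 at SNP A and SNP B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom

    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime


# Toy data: a rare allele (frequency 0.05) found only on one background
# of a common SNP (frequency 0.5).
haps = [(1, 1)] * 5 + [(0, 1)] * 45 + [(0, 0)] * 50
r2, d_prime = ld_stats(haps)
print(round(r2, 3), round(d_prime, 3))
```

In this toy example the rare allele never occurs without allele 1 at the second SNP, so D′ reaches its maximum of 1 while r² stays near 0.05. The two measures can therefore rank the same pair of markers very differently when allele frequencies are unequal.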
An interesting observation from our simulations is that MAF influences imputation AR differently in regions with different LD levels. The influence of MAF was relatively minor in high LD regions, while it was considerably larger in low LD regions. One potential explanation is that in high LD regions the imputation AR is determined primarily by the high level of LD between markers, so the capacity for MAF to influence AR is greatly diminished. In low LD regions, on the other hand, markers with low MAF likely exhibit locally high levels of LD with nearby markers even though the overall LD level across the region is low. This locally elevated LD results in a much higher imputation AR for low-MAF markers than for high-MAF markers. In our simulations, D′ decreased from 0.71 to 0.41 as the MAF interval increased from 0.05 to 0.45 in high LD regions, whereas it decreased from 0.33 to 0.15 in low LD regions.
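A comparison of AR across MAF intervals can be sketched as follows. The sketch assumes that AR is the proportion of imputed genotypes identical to the true genotypes; the function name `accuracy_by_maf`, the input layout and the bin edges are hypothetical and are not taken from the study's own analysis code.

```python
def accuracy_by_maf(snps, edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)):
    """Accuracy rate per MAF interval.

    snps:  iterable of (maf, true_genotypes, imputed_genotypes), where the
           genotype lists hold minor-allele counts (0/1/2) per individual.
    edges: assumed MAF bin boundaries; each bin is (lo, hi].
    """
    counts = {}
    for maf, truth, imputed in snps:
        for lo, hi in zip(edges, edges[1:]):
            if lo < maf <= hi:
                correct, total = counts.get((lo, hi), (0, 0))
                correct += sum(t == g for t, g in zip(truth, imputed))
                counts[(lo, hi)] = (correct, total + len(truth))
                break
    # Report only bins that contain at least one genotype.
    return {b: c / n for b, (c, n) in counts.items() if n}


toy = [
    (0.05, [0, 0, 1, 0], [0, 0, 1, 0]),  # low-MAF SNP, all genotypes correct
    (0.45, [0, 1, 2, 1], [0, 1, 1, 1]),  # high-MAF SNP, one genotype wrong
]
print(accuracy_by_maf(toy))
```

Running the same summary separately on high LD and low LD regions would give the kind of stratified comparison described above.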
Larger reference samples introduce extra information and produce more consistent parameter estimates, resulting in generally improved AR for most methods. For fastPHASE, however, we observed that the number of reference haplotypes had little influence on imputation AR. One potential explanation for this insensitivity to reference sample size may be its fixed, small number of clusters. With a low cluster number, increasing the reference sample can only change parameter estimates within each cluster and may not capture the added haplotypic variation. Consequently, increasing the reference sample has only a limited capacity to improve imputation AR. Increasing the cluster number may resolve this issue, but doing so is time-consuming, and our simulations showed that the increase in AR was not significant even when the cluster number increased from 20 to 100 (data not shown). An alternative that appears to improve AR is to let the cluster number be determined dynamically by the data from the local sequence context. This is the approach that Beagle adopted and, under these conditions, increasing the number of reference haplotypes improved imputation AR to a remarkable extent.
In the current study, test data and reference data were sampled from the same population, which is the basic assumption of most of the methods studied here. In many practical studies, however, this condition does not hold; investigators often obtain their reference data from HapMap, which contains high-resolution haplotype information for a small number of relatively homogeneous human populations. Importantly, several previous studies have demonstrated the feasibility of using such homogeneous samples as reference data. For example, Marchini et al. imputed genotypes for a UK sample using CEU HapMap haplotypes, and the imputation AR was high. Additionally, a worldwide survey of haplotype variation and LD patterns in 52 populations demonstrated that there is considerable sharing of haplotype structure across groups and that the locations of inferred recombination hotspots generally match across groups. These studies support the conclusion that imputation can still be accurate when there is mild heterogeneity between test samples and reference data.
In the current study, the phase of the reference haplotypes was assumed to be known, even though this is usually not true for real data. Inferring haplotypes from genotypes can introduce additional errors, with a consequent decrease in ARs for imputation using real data. Fortunately, it has previously been demonstrated that current haplotype inference programs (e.g. PHASE) can infer phase with high accuracy, thereby minimizing the errors in subsequent imputation attributable to these inferred haplotypes.
One remaining issue is how to use imputed genotypes in subsequent analyses. In this study, the most likely genotype was taken as the imputed genotype, but it is also possible to work with the posterior distribution over genotypes provided by certain methods (such as the HMM-based IMPUTE). Both strategies, selecting the most likely genotype and using the posterior distribution over all possible genotypes, have demonstrated the capacity to improve power in follow-up association analysis. However, more comprehensive analyses appear to be warranted to better evaluate this issue.
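The difference between the two strategies can be made concrete with a small sketch. It assumes that a method reports, for each individual and SNP, posterior probabilities for the three genotypes coded as minor-allele counts 0, 1 and 2; the function names are hypothetical, and the expected allele count shown is one common way of summarizing a posterior distribution, not necessarily the one used in the cited work.

```python
def best_guess(posterior):
    """Most likely genotype (0, 1 or 2 copies of the minor allele)."""
    return max(range(3), key=lambda g: posterior[g])


def expected_dosage(posterior):
    """Posterior mean allele count, which retains imputation uncertainty."""
    return posterior[1] + 2.0 * posterior[2]


# An uncertain call: no genotype has a clear majority of the probability.
post = (0.45, 0.40, 0.15)
print(best_guess(post), round(expected_dosage(post), 2))
```

For this uncertain call the most likely genotype is 0, yet the expected allele count is 0.70, so the two strategies can feed noticeably different values into a downstream association test.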