Genome-wide association studies have identified many putative disease susceptibility loci in recent years. This approach to studying disease has succeeded largely because of improved catalogues of human genetic variation and advances in genotyping technology, but it has also been bolstered by the rise of genotype imputation methods, which have allowed researchers to tease increasingly subtle signals out of large and complex genetic datasets.
Imputation methods work by combining a reference panel of individuals genotyped at a dense set of polymorphic sites (usually single-nucleotide polymorphisms, or “SNPs”) with a study sample collected from a genetically similar population and genotyped at a subset of these sites. The accompanying schematic shows an example of such a dataset. Imputation methods predict unobserved genotypes in the study sample by using a population genetic model to extrapolate allelic correlations measured in the reference panel. The imputed genotypes expand the set of SNPs that can be tested for association, and this more comprehensive view of the genetic variation in a study can enhance true association signals and facilitate meta-analysis.
Schematic drawing of imputation Scenario A.
To date, most imputation analyses have used reference panels composed of haplotypes from Phase II of the International HapMap Project, together with study samples genotyped on commercial genome-wide SNP arrays. The schematic above depicts this arrangement, which we call Scenario A. To understand how imputation methods work in this setting, it helps to observe that the SNPs exist in a natural hierarchy, such that they can be partitioned into two disjoint sets: a set T that is typed in both the study sample and the reference panel, and a set U that is untyped in the study sample but typed in the reference panel. Informally, most imputation methods phase the study genotypes at SNPs in T and look for perfect or near matches between the resulting haplotypes and the corresponding partial haplotypes in the reference panel; haplotypes that match at SNPs in T are assumed to also match at SNPs in U. This is the fundamental basis of genotype imputation.
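The partition into T and U, and the informal matching idea, can be illustrated with a minimal sketch. This is not the IMPUTE model: it replaces the population genetic model with a plain nearest-match rule, and the SNP identifiers, function names, and toy haplotypes are all hypothetical.

```python
def partition_snps(reference_snps, study_snps):
    """Split the reference SNPs into T (typed in both) and U (reference only)."""
    study = set(study_snps)
    typed = [s for s in reference_snps if s in study]
    untyped = [s for s in reference_snps if s not in study]
    return typed, untyped


def impute_haplotype(study_hap, reference_haps, typed, untyped):
    """Copy alleles at U from the reference haplotype that matches best at T.

    Assumption: a simple mismatch count stands in for a real model.
    """
    def mismatches(ref):
        return sum(ref[s] != study_hap[s] for s in typed)

    best = min(reference_haps, key=mismatches)
    return {s: best[s] for s in untyped}


# Hypothetical reference panel: three haplotypes typed at four SNPs.
reference_haps = [
    {"rs1": 0, "rs2": 0, "rs3": 1, "rs4": 0},
    {"rs1": 1, "rs2": 1, "rs3": 0, "rs4": 1},
    {"rs1": 1, "rs2": 0, "rs3": 1, "rs4": 1},
]

# The study sample is typed only at rs1 and rs3.
typed, untyped = partition_snps(["rs1", "rs2", "rs3", "rs4"], ["rs1", "rs3"])

# A phased study haplotype at the SNPs in T.
imputed = impute_haplotype({"rs1": 1, "rs3": 0}, reference_haps, typed, untyped)
print(typed, untyped, imputed)
```

Here the study haplotype matches the second reference haplotype perfectly at T, so its alleles at U are copied across. Real methods weight many partial matches rather than committing to a single best one.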
Several important points emerge from this description. First, the accuracy with which the study haplotypes are phased at SNPs in T should determine how well they can be matched to haplotypes in the reference panel, which should in turn influence the accuracy of imputation at SNPs in U. Second, accounting for the unknown phase of the SNPs in T can be computationally expensive; if the haplotypes at these SNPs were known, most methods would be able to impute genotypes at SNPs in U more quickly. Third, many existing methods do not use all of the available information to phase the study genotypes at SNPs in T. In principle, a phasing algorithm should be able to “learn” about desirable phasing configurations for a given study individual by pooling information across the reference panel and all other individuals in the study, and the phasing accuracy should increase with the sample size; in standard practice, most imputation methods gain phasing information about each study individual only from the reference panel, and phasing accuracy does not depend on the size of the study sample. (This description applies to imputation methods based on hidden Markov models, or “HMMs”; non-HMM methods often discard other kinds of information.) The BEAGLE imputation model is one notable exception to this point, and we discuss its alternative modeling strategy in detail in this work.
We have developed a new algorithm that seeks to improve imputation accuracy at untyped SNPs by improving phasing accuracy at typed SNPs, building on the points raised above. Most HMM-based imputation methods simultaneously estimate missing genotypes and analytically integrate over the unknown phase of SNPs in T. By contrast, we propose to alternately estimate haplotypes at SNPs in T and impute alleles at SNPs in U, assuming the haplotype guesses are correct. We account for the phasing uncertainty in the data by iterating these steps in a Markov chain Monte Carlo (MCMC) framework. Separating the phasing and imputation steps allows us to focus more computational effort on phasing and use more of the available information; the extra computation used in this step is largely balanced by the quick haploid imputation in the step that follows.
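The alternation between a phasing step and a haploid imputation step can be sketched with a toy Metropolis sampler for a single study individual. This is only an illustration under strong assumptions, not the IMPUTE v2 algorithm: the mismatch-based score, the `penalty` parameter, and all function names are hypothetical stand-ins for the HMM-based model, and no information is pooled across study individuals.

```python
import math
import random


def nearest(hap, reference_haps, typed_idx):
    """Return (mismatch count, reference haplotype) for the closest match at T."""
    scored = ((sum(ref[i] != hap[i] for i in typed_idx), ref) for ref in reference_haps)
    return min(scored, key=lambda pair: pair[0])


def log_score(h1, h2, reference_haps, typed_idx, penalty=2.0):
    """Toy log-weight: fewer mismatches to the panel gives a higher score."""
    m1 = nearest(h1, reference_haps, typed_idx)[0]
    m2 = nearest(h2, reference_haps, typed_idx)[0]
    return -penalty * (m1 + m2)


def toy_phase_and_impute(genotype, reference_haps, typed_idx, untyped_idx,
                         n_iter=200, seed=0):
    """Alternate a phasing move at T with haploid imputation at U.

    genotype maps each typed position to a minor-allele count (0, 1 or 2).
    Returns the allele dosage at each untyped position, averaged over iterations.
    """
    rng = random.Random(seed)
    hets = [i for i in typed_idx if genotype[i] == 1]
    # Initial haplotype guess consistent with the genotypes.
    h1 = {i: 1 if genotype[i] >= 1 else 0 for i in typed_idx}
    h2 = {i: 1 if genotype[i] == 2 else 0 for i in typed_idx}
    totals = {j: 0 for j in untyped_idx}
    for _ in range(n_iter):
        # Phasing step: propose swapping the alleles at one heterozygous site.
        if hets:
            i = rng.choice(hets)
            p1, p2 = dict(h1), dict(h2)
            p1[i], p2[i] = h2[i], h1[i]
            delta = (log_score(p1, p2, reference_haps, typed_idx)
                     - log_score(h1, h2, reference_haps, typed_idx))
            if delta >= 0 or rng.random() < math.exp(delta):
                h1, h2 = p1, p2
        # Imputation step: treat the current haplotypes as correct and copy
        # alleles at U from the closest reference haplotype.
        for hap in (h1, h2):
            ref = nearest(hap, reference_haps, typed_idx)[1]
            for j in untyped_idx:
                totals[j] += ref[j]
    return {j: totals[j] / n_iter for j in untyped_idx}


# Hypothetical panel of two haplotypes over four SNP positions.
panel = [(0, 0, 0, 0), (1, 1, 1, 1)]
dosage_het = toy_phase_and_impute({0: 1, 2: 1}, panel, [0, 2], [1, 3])
dosage_hom = toy_phase_and_impute({0: 2, 2: 2}, panel, [0, 2], [1, 3])
print(dosage_het, dosage_hom)
```

Averaging the imputed alleles over iterations is what lets the phasing uncertainty propagate into the final dosages, while each individual imputation step stays cheap because it works on haplotypes rather than unphased genotypes.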
This approach can improve imputation accuracy in Scenario A, as we show in the Results section, but another major motivation of this work is to extend IMPUTE to handle “next-generation” association datasets. By this, we refer to studies in the near future that will have access to additional reference data that could inform imputation. Next-generation reference panels will present new challenges for imputation, including larger sample sizes; unphased and incomplete genotypes; and multiple reference panels containing different SNP sets. Our method aims to use the principles outlined above to address these challenges and improve imputation accuracy in next-generation studies.
One new data configuration, which we call Scenario B and explore in detail in the current study, is presented in the Scenario B schematic; we will address other next-generation reference panels in the Discussion. In Scenario B, there are different amounts of genotype data in different cohorts of a study. For example, the Wellcome Trust Case Control Consortium (WTCCC) is currently performing an association study in which 6,000 controls will be genotyped on both the Affymetrix 6.0 and Illumina 1M SNP chips, whereas disease cohorts will be genotyped only on either the Affymetrix 6.0 chip or the Illumina 670k chip. In other words, a large set of controls will be genotyped at a subset of HapMap SNPs, and each case cohort will be genotyped at a subset of the SNPs typed in the controls. Published studies have already employed this design, and it may become more prevalent in the future as common sets of population controls become more widely available.
Schematic drawing of imputation Scenario B.
In Scenario B, the study individuals genotyped on a larger number of SNPs can be used as an unphased, or “diploid”, reference panel for imputation in the remaining samples (which do not necessarily have to be cases). As before, we approach such a dataset by partitioning the SNPs into disjoint sets, named with reference to the study sample: a set U1 that is untyped in the study sample and typed only in the haploid reference panel, a set U2 that is untyped in the study sample and typed in both the haploid and diploid reference panels, and a set T that is typed in all samples.
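The three-way partition can be sketched as follows, assuming the SNP sets are nested (study sample within diploid panel within haploid panel) as the text describes. The function name and SNP identifiers are hypothetical.

```python
def partition_scenario_b(haploid_ref_snps, diploid_ref_snps, study_snps):
    """Return (U1, U2, T) for hierarchical SNP sets.

    U1: typed only in the haploid reference panel.
    U2: typed in both reference panels but not in the study sample.
    T:  typed in all samples.
    """
    haploid = set(haploid_ref_snps)
    diploid = set(diploid_ref_snps)
    study = set(study_snps)
    # Assumption: the sets are nested, as in the hierarchy described above.
    if not (study <= diploid <= haploid):
        raise ValueError("SNP sets are not hierarchical")
    u1 = [s for s in haploid_ref_snps if s not in diploid]
    u2 = [s for s in haploid_ref_snps if s in diploid and s not in study]
    t = [s for s in haploid_ref_snps if s in study]
    return u1, u2, t


# Hypothetical SNP sets for the haploid panel, diploid panel, and study sample.
u1, u2, t = partition_scenario_b(
    ["rs1", "rs2", "rs3", "rs4", "rs5", "rs6"],
    ["rs1", "rs2", "rs4", "rs5"],
    ["rs1", "rs4"],
)
print(u1, u2, t)
```

The explicit nesting check reflects the hierarchical design; datasets whose panels overlap only partially would need a different partition.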
We apply the same inference principles to Scenario B as to Scenario A: at each MCMC iteration we phase all of the observed data, pooling information across samples typed on common sets of SNPs to estimate each haplotype pair, then perform haploid imputation assuming that all of the haplotype guesses are correct. One novelty of this scenario is that, at SNPs in U2, the reference panel may contain thousands of chromosomes, in contrast to HapMap Phase II panels that contain only 120–180 chromosomes each. In principle, this added depth should improve imputation accuracy at SNPs in U2, with notable gains at rare SNPs. The latter point is especially relevant because rare SNPs are an important source of power in imputation analyses. Scenario B also introduces the problem of multiple reference panels genotyped on different, hierarchical sets of SNPs. Many next-generation imputation datasets will follow this paradigm, which presents modeling challenges that remain largely unexplored.
In the sections that follow, we describe the details of our new method as applied to the scenarios depicted in the two schematics. We then compare the method with other imputation approaches on real datasets from the United Kingdom that emulate Scenarios A and B. We show that our method can attain higher accuracy than existing methods in Scenario A, but that the absolute gains are small, which we attribute to the inherent limitations of a small set of reference haplotypes. In an example of Scenario B, we demonstrate that our method can use a large unphased reference panel to achieve higher accuracy than imputation based on the HapMap alone. We also show that our method can impute genotypes more accurately than other sophisticated and simpler methods applied to the same dataset, and that our approach has higher sensitivity and specificity to detect copies of the minor allele at rare SNPs. In addition, we present results that highlight important practical advantages of our imputation modeling strategy over the one used by BEAGLE.
We have implemented our new imputation method as an update to our existing software package IMPUTE; the new program is called “IMPUTE version 2” (IMPUTE v2). We refer to our previously published method as “IMPUTE version 1” (IMPUTE v1).