Modeling genetic variation in admixed populations

Our approach to inferring ancestry segments, implemented in HAPMIX, is based on extending a Hidden Markov Model (HMM) previously developed by Li and Stephens to model linkage disequilibrium in population genetic data [14]. This model has been employed in recent years in various population genetic and disease mapping settings [20], [21]. Informally, given a previously sampled collection of “parental” haplotypes from a reference population, a new “offspring” haplotype drawn from the same population is modeled as a mosaic of these existing haplotypes. This offers a flexible means to account for local linkage disequilibrium (LD), because over short distances, the haplotype that an individual chromosome copies from is unlikely to change.

We extend the Li and Stephens model to allow inference on ancestry segments for individuals drawn from an admixed population. We begin by supposing that we have two previously sampled collections of phased haplotypes, *P*_{1} and *P*_{2}, taken from two reference populations. For example, HapMap provides phased haplotypes from the CEU, YRI and JPT+CHB populations genotyped at over 3 million markers [16]. We further assume that *P*_{1} and *P*_{2} have valid data at all sites of interest, with no missing data. In practice, small amounts of missing data in the reference populations can be filled in by a pre-processing imputation step, as has been done for the publicly available phased HapMap data. We label *P*_{1} and *P*_{2} as “parental” haplotypes. Next, we sample a new “offspring” haplotype from an admixed population. We assume that this population is created from a single admixture event between two populations which are genetically similar to the two reference populations from which *P*_{1} and *P*_{2} are drawn. (The reference populations do not need to *exactly* match the true ancestral populations, because we allow for some genetic divergence in our approach.) We will initially consider the case where we have haploid chromosomes from the admixed population, and subsequently generalize to the more typical case involving unphased genotype data from the admixed population. Throughout this section, we operate in units of genetic (not physical) distance.

We begin by modeling the ancestry segments. Assume the admixture event occurred at a single time *T* generations ago, with a fraction *μ*_{1} of the haplotype's ancestry drawn from population 1, and *μ*_{2} = 1 − *μ*_{1} from population 2. Because recombination occurs at each generation, it is natural to model ancestry switches as a Poisson process along the genome [22], at a rate *T* per unit of genetic distance (i.e. *T* per Morgan). Conditional on the positions of such switches, each segment is independently drawn from population 1 or 2 with probabilities *μ*_{1}, *μ*_{2} respectively. In particular, this implies that not all ancestry switch points will actually change the underlying ancestry. This model has been previously used by other authors [1], [22]. Since ancestry cannot be directly observed, it is natural to view underlying ancestry status as the “hidden” information in an HMM. Our approach probabilistically infers this hidden state at each position along a chromosome.
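To make the segment process concrete, the following sketch simulates ancestry along one haploid chromosome under this model. It is a minimal illustration only; the function name and the parameter values in the example are ours, not part of HAPMIX.

```python
import numpy as np

def simulate_ancestry_segments(length_morgans, T, mu1, rng):
    """Simulate ancestry segments along one haploid chromosome.

    Switch points arise as a Poisson process of rate T per Morgan; each
    resulting segment independently takes ancestry 1 with probability mu1
    (else ancestry 2), so some switch points leave the ancestry unchanged.
    Returns a list of (start, end, ancestry) tuples in Morgans.
    """
    n_switches = rng.poisson(T * length_morgans)
    cuts = np.sort(rng.uniform(0.0, length_morgans, size=n_switches))
    starts = np.concatenate(([0.0], cuts))
    ends = np.concatenate((cuts, [length_morgans]))
    ancestries = rng.choice([1, 2], size=len(starts), p=[mu1, 1.0 - mu1])
    return list(zip(starts, ends, ancestries))

# Example: a 2.8-Morgan chromosome, admixture T = 6 generations ago, mu1 = 0.8.
rng = np.random.default_rng(0)
segments = simulate_ancestry_segments(2.8, T=6, mu1=0.8, rng=rng)
```

Averaged over many chromosomes, the fraction of the genome with ancestry 1 approaches *μ*_{1}, while individual chromosomes fluctuate around it.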

To fully specify our model, we must consider the structure of variation conditional on these admixture segments. Our model remains computationally tractable while accommodating important features typical of real data, such as mutation, recombination, genotyping error, reference populations that have drifted from the true ancestral populations, and the incomplete sampling of diversity reflected in the finite samples drawn from the reference populations. We assume that all mutant sites take the form of single nucleotide polymorphisms (SNPs) with two alleles that can be represented as 0 and 1 (however, our approach could be extended to more complex mutation models).

We suppose that sections of the genome with true ancestry from population 1 are formed as mosaics of the haplotypes in the two parental groups. Specifically, at any given position with this ancestry, an individual from *P*_{1} is copied with probability 1 − *p*_{1}, and an individual from *P*_{2} is copied with probability *p*_{1} (we call *p*_{1} the “miscopying” parameter for population 1). Conditional on the parental group chosen, individuals to copy from are chosen uniformly from the *n*_{1}, *n*_{2} respective individuals in that group. Switches between individuals occur as a Poisson process with rate *ρ*_{1}, the “recombination” parameter, and at each switch point a new individual to copy is chosen randomly using the above scheme. Finally, at genotyped SNPs, if the “offspring” copies a “parent” from population 1, the offspring carries an identical type to the particular parent it copies from with probability 1 − *θ*_{1}, and carries the other type with probability *θ*_{1}, the “mutation” parameter. If the offspring instead copies an individual from the other population (population 2), the corresponding mutation parameter is *θ*_{3}. In total this approach leads to 4 additional parameters: *p*_{1}, *ρ*_{1}, *θ*_{1} and *θ*_{3}.

For sections of the genome with ancestry from population 2, we formulate our model in an analogous way, with corresponding parameters *p*_{2}, *ρ*_{2}, *θ*_{2} and *θ*_{3}. We note that *θ*_{3} is shared for both populations, a choice that is motivated by a genealogical argument, and has the aim of keeping the total number of parameters manageable. In total, our model has 9 independent parameters: *T*, *μ*_{1}, *p*_{1}, *p*_{2}, *ρ*_{1}, *ρ*_{2}, *θ*_{1}, *θ*_{2} and *θ*_{3}.

Some additional remarks about the interpretation of these parameters may be useful. As in the original Li and Stephens implementation, *ρ*_{1} and *ρ*_{2} relate to historical recombination parameters. In our parameterization, these parameters depend on both the effective population sizes of the relevant populations, and the sample sizes *n*_{1} and *n*_{2} drawn from these populations. Although they are not merely a simple function of these quantities, informal coalescent-based arguments suggest that they will decrease roughly linearly with *n*_{1} and *n*_{2}, and increase roughly linearly with the effective population sizes of the reference populations [14]. In general, because the amount of historical recombination depends on effective population size, we do not expect *ρ*_{1} = *ρ*_{2}, even if *n*_{1} = *n*_{2}. The mutation parameters *θ*_{1}, *θ*_{2} and *θ*_{3} allow for both historical mutation and genotyping error. The miscopying parameters *p*_{1} and *p*_{2} allow similar “fuzziness” in the group copied from within ancestry segments. If *p*_{1} = *p*_{2} = 0, ancestry segments corresponding to population 1 must copy individuals from population 1, and similarly for population 2. However, setting these parameters equal to zero is likely to lead to spurious ancestry breaks, and therefore misestimation of ancestry segments, for at least two reasons. First, because we only sample a finite number of parental chromosomes, incomplete lineage sorting can occur. In some parts of the genome, the offspring chromosome is expected to have a deep coalescence time with the ancestors of the “correct” parental sample, and may instead coalesce first with an ancestor of the other parental sample, therefore choosing a descendant of this ancestor, in the “wrong” parental sample, to copy from. Second, if our reference populations are somewhat inaccurate relative to the true ancestral populations, incomplete lineage sorting is again likely to occur, even if our “parental” samples are both large. For these reasons, in practice we believe that incorporating non-zero miscopying parameters is important, and in both real data and simulation we find that it greatly improves our ancestry estimation procedure. Because the miscopying parameter is designed to allow for regions of the genome where the offspring chromosome has an unusually deep coalescence time with the members of the “correct” parental sample, miscopied regions are likely to have unusually deep genealogies. Therefore, we allow a different mutation rate *θ*_{3} for such segments, which is typically expected to be higher than *θ*_{1} or *θ*_{2}. It might also be desirable to allow a higher recombination rate in such cases; however, this would introduce computational complexities, and we have chosen not to allow such an additional parameter.

For a typical application of HAPMIX, we expect to have data from a collection of discrete typed sites. Suppose we have *S* such sites, and in addition a map giving the genetic distances *r*_{1}, *r*_{2}, …, *r*_{S−1} between adjacent pairs of sites. In practice, we interpolate these distances from the genome-wide recombination rates estimated using Phase II HapMap [16]. Given the above parameters, and for a haploid admixed chromosome, we formalize the transition probabilities as follows. A (hidden) state for position *s* is represented by a triplet (*i*, *j*, *k*) where *i* = 1 or 2 represents ancestry drawn from population 1 or population 2, *j* = 1 or 2 records the population the chromosome copies from at position *s* (*j* may be different from *i* due to miscopying) and *k* represents the individual from which the chromosomal segment is copied. There are 2(*n*_{1} + *n*_{2}) possible states. Let *P*_{s}((*i*, *j*, *k*) → (*l*, *m*, *n*)) be the probability of transitioning from state (*i*, *j*, *k*) to state (*l*, *m*, *n*) between adjacent sites *s* and *s* + 1. Then we have the following:

*P*_{s}((*i*, *j*, *k*) → (*l*, *m*, *n*)) = (1 − e^{−*r*_{s}*T*}) *μ*_{l} *q*_{lm}/*n*_{m} + e^{−*r*_{s}*T*} **1**[*i* = *l*] { (1 − e^{−*r*_{s}*ρ*_{i}}) *q*_{im}/*n*_{m} + e^{−*r*_{s}*ρ*_{i}} **1**[*j* = *m*, *k* = *n*] }    (0.1)

where *q*_{lm} = 1 − *p*_{l} if *m* = *l*, *q*_{lm} = *p*_{l} if *m* ≠ *l*, and **1**[·] denotes an indicator function.
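As a sanity check, this transition structure can be written out directly in code. The sketch below uses our own notation and function names, not HAPMIX's implementation; the key property is that the probabilities out of any state sum to one.

```python
import math

def copy_weight(l, m, p):
    """Probability of copying from parental group m given ancestry l,
    with miscopying parameters p = (p1, p2)."""
    return 1.0 - p[l - 1] if m == l else p[l - 1]

def transition_prob(src, dst, r, T, mu, p, rho, n_panel):
    """P((i,j,k) -> (l,m,n)) over genetic distance r (in Morgans).

    mu = (mu1, mu2) are admixture proportions, rho = (rho1, rho2) the
    "recombination" parameters, n_panel = (n1, n2) the panel sizes.
    """
    i, j, k = src
    l, m, n = dst
    # Ancestry switch: new ancestry, copied group, and individual drawn afresh.
    prob = ((1.0 - math.exp(-r * T)) * mu[l - 1]
            * copy_weight(l, m, p) / n_panel[m - 1])
    if l == i:  # without an ancestry switch, the ancestry must be unchanged
        stay = math.exp(-r * T)
        # Within-ancestry switch of copied haplotype at rate rho_i ...
        prob += (stay * (1.0 - math.exp(-r * rho[i - 1]))
                 * copy_weight(i, m, p) / n_panel[m - 1])
        # ... or no switch at all, remaining on the same haplotype.
        if (m, n) == (j, k):
            prob += stay * math.exp(-r * rho[i - 1])
    return prob

# All 2*(n1+n2) states for small illustrative panels n1 = 3, n2 = 4.
n_panel = (3, 4)
states = [(i, j, k) for i in (1, 2) for j in (1, 2)
          for k in range(n_panel[j - 1])]
```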

Conditional on the underlying hidden state, let *λ*_{s}(*i*, *j*, *k*) denote the probability of the offspring chromosome being of type 1 at site *s*, and *t*_{jk} be the type of parental individual *k* in reference population *j*. Then

*λ*_{s}(*i*, *j*, *k*) = *t*_{jk}(1 − *θ*′) + (1 − *t*_{jk})*θ*′, where *θ*′ = *θ*_{i} if *j* = *i* and *θ*′ = *θ*_{3} if *j* ≠ *i*.    (0.2)

This probability allows us to calculate the likelihood of the observed data in the offspring for each possible underlying state. At sites with missing data in the offspring chromosome, the appropriate likelihood contribution is simply 1.0.
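In code, this per-site emission probability reduces to a small helper; the sketch below uses our own names, with the tuple `theta` collecting *θ*_{1}, *θ*_{2}, *θ*_{3}.

```python
def prob_type1(i, j, t_jk, theta):
    """Probability that the offspring carries allele 1 at a site, given
    hidden state (i, j, k), where t_jk is the copied parental allele (0/1).

    theta = (theta1, theta2, theta3); the "miscopying" mutation parameter
    theta3 applies whenever the copied population differs from the
    ancestry (j != i).
    """
    th = theta[i - 1] if j == i else theta[2]
    return 1.0 - th if t_jk == 1 else th
```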

Inferring probabilistic ancestry segments and sampling from the posterior with HAPMIX

It is easy to see that equations (0.1) and (0.2) describe an HMM for the underlying state (which includes information on ancestry) as we move along the genome, and that the underlying Markov process is reversible. Given a set of parameters, HAPMIX exploits these properties and implements standard HMM techniques to efficiently infer posterior probabilities of underlying states via the forward-backward algorithm, or to sample random state paths from the correct joint posterior distribution using a standard modification of this algorithm. In addition to parameter values, the software takes as input a recombination map for the regions to be analyzed, phased “parental” chromosomes from the two reference populations, and “offspring” data from the admixed population being analyzed.
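The forward-backward machinery itself is standard. The generic sketch below operates on a toy discrete HMM, not HAPMIX's state space or parameterization, and shows how scaled forward and backward passes combine into per-site posterior state probabilities.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Posterior state probabilities for a generic discrete HMM.

    pi: (K,) initial distribution; A: (K, K) transition matrix;
    B: (K, M) emission matrix; obs: sequence of observed symbol indices.
    Uses per-site scaling for numerical stability.
    """
    S, K = len(obs), len(pi)
    fwd = np.zeros((S, K))
    scale = np.zeros(S)
    fwd[0] = pi * B[:, obs[0]]
    scale[0] = fwd[0].sum()
    fwd[0] /= scale[0]
    for s in range(1, S):
        fwd[s] = (fwd[s - 1] @ A) * B[:, obs[s]]
        scale[s] = fwd[s].sum()
        fwd[s] /= scale[s]
    bwd = np.ones((S, K))
    for s in range(S - 2, -1, -1):
        bwd[s] = (A @ (B[:, obs[s + 1]] * bwd[s + 1])) / scale[s + 1]
    # With this scaling, fwd * bwd is already the per-site posterior.
    return fwd * bwd
```

Sampling a state path from the joint posterior uses the same forward pass, drawing states backwards from the stored quantities; we omit that variant here.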

A naïve implementation of the forward/backward algorithm would require computation time proportional to 4*S*(*n*_{1} + *n*_{2})^{2}, in the above notation. For the original Li and Stephens model, it is possible to reduce computation time substantially by using the fact that many pairs of transition probabilities between states are identical, which allows terms in the forward (or backward) algorithm to be collapsed into expressions involving a single term that is shared among all destination states. Calculating this shared term just once per pair of adjacent sites, and then storing it, saves substantial computational effort [14]. Analogously, in our somewhat more complicated setting we can exploit a similar phenomenon: by calculating and storing a somewhat larger number of shared terms, one for each group of states of the form (*i*, *j*), giving four in total, HAPMIX can complete the forward/backward algorithm in time proportional to 2*S*(*n*_{1} + *n*_{2}) (with an additional scaling constant).
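The collapse can be sketched as follows: because every destination state in a given group (ancestry, copied population) shares the same switch-in probability, the forward transition update needs only four shared scalars per pair of adjacent sites, plus constant work per state. The notation below is ours, and HAPMIX's actual implementation may differ in detail.

```python
import math

def forward_step_fast(alpha, r, T, mu, p, rho, n_panel):
    """One forward-algorithm transition update in O(n1 + n2) time.

    alpha maps each haploid state (i, j, k) to its current normalized
    forward probability; returns the state probabilities after moving
    genetic distance r, before multiplying in the next site's emissions.
    """
    A_all = sum(alpha.values())
    A_anc = {l: sum(v for (i, _, _), v in alpha.items() if i == l)
             for l in (1, 2)}
    stay = math.exp(-r * T)
    # One shared scalar per destination group (l, m): four in total.
    shared = {}
    for l in (1, 2):
        hap_switch = 1.0 - math.exp(-r * rho[l - 1])
        for m in (1, 2):
            q = 1.0 - p[l - 1] if m == l else p[l - 1]
            shared[(l, m)] = q / n_panel[m - 1] * (
                (1.0 - stay) * mu[l - 1] * A_all
                + stay * hap_switch * A_anc[l])
    # Each state then needs only O(1) extra work.
    return {(l, m, k): shared[(l, m)] + stay * math.exp(-r * rho[l - 1]) * v
            for (l, m, k), v in alpha.items()}
```

Since the transition matrix is stochastic, the update conserves total probability mass, which provides a convenient correctness check.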

It is straightforward to extend our approach to allow imputation of missing data, while simultaneously labeling underlying ancestry, in an analogous manner to methods employed in several existing approaches to imputation for samples drawn from panmictic populations [20], [21]. We will describe this extension, and its application to disease mapping, in a separate paper.

Diploid genotype data from the admixed population

Typically, real data consist of unphased genotypes for individuals drawn from a population, with haplotypic phase unknown. Many approaches already exist to infer phase from such data [17], [18]. However, phase switch errors that inevitably result from applying such algorithms are likely to produce spurious ancestry switches within regions of the genome where an individual is heterozygous for ancestry. This would likely lead to considerable overestimation of the time since admixture and a reduction in the accuracy of ancestral inference. To avoid such issues, we have extended our approach to directly analyze diploid genotype data from the admixed population.

The phasing is implemented using a HMM adapted from that described above (0.1) and employing a composite hidden state at each location, of the form (*i*_{1},*j*_{1},*k*_{1},*i*_{2},*j*_{2},*k*_{2}) where (*i*_{1},*j*_{1},*k*_{1}) represents the previously defined “haploid” hidden state for the first chromosome, and (*i*_{2},*j*_{2},*k*_{2}) represents the hidden state for the second chromosome. The state space therefore now has dimension 4(*n*_{1}+*n*_{2})^{2}. Allowing independent transitions between the marginal states for each chromosome, the terms in (0.1) now naturally define an HMM for these composite states (for reasons of space, we do not explicitly list all of the transition probabilities in the model here). This model could have up to 18 parameters – in our implementation, for natural biological reasons we assume all parameters are shared between chromosomes, apart from time since admixture *T* and admixture proportion *μ*_{1}, resulting in 11 parameters in total. Further, although our software allows these two parameters to differ, in all applications considered here we specify *T* and *μ*_{1} to be the same for each chromosome.

Emission probabilities are also adapted from the haploid case. For genotype data, there are 3 possible emissions at typed sites, which we denote as genotypes *g* = 0, 1, or 2, with *g* counting copies of the “1” allele. Conditional on the underlying hidden state, let *γ*_{s}(*g*) denote the probability of observing genotype *g* given underlying state (*i*, *j*, *k*, *l*, *m*, *n*), and define *t*_{jk} as before to be the type of parental individual *k* in reference population *j*. Then using (0.2)

*γ*_{s}(0) = (1 − *λ*_{s}(*i*, *j*, *k*))(1 − *λ*_{s}(*l*, *m*, *n*)),
*γ*_{s}(1) = *λ*_{s}(*i*, *j*, *k*)(1 − *λ*_{s}(*l*, *m*, *n*)) + (1 − *λ*_{s}(*i*, *j*, *k*)) *λ*_{s}(*l*, *m*, *n*),
*γ*_{s}(2) = *λ*_{s}(*i*, *j*, *k*) *λ*_{s}(*l*, *m*, *n*),

where *λ*_{s}(*i*, *j*, *k*) and *λ*_{s}(*l*, *m*, *n*) are as defined above.
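Assuming the two chromosomes' alleles are conditionally independent given their hidden states, the genotype emission simply combines the two per-chromosome allele-1 probabilities; a sketch with our own names:

```python
def genotype_emission(g, lam_a, lam_b):
    """P(genotype g | composite state), where lam_a and lam_b are the two
    chromosomes' probabilities of carrying allele 1 at this site.

    g counts copies of the "1" allele: 0, 1, or 2; the heterozygote case
    sums over which chromosome carries the "1" allele.
    """
    if g == 0:
        return (1.0 - lam_a) * (1.0 - lam_b)
    if g == 1:
        return lam_a * (1.0 - lam_b) + (1.0 - lam_a) * lam_b
    return lam_a * lam_b
```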

Having defined the HMM for this setting, we again use standard techniques to obtain posterior probabilities on (joint) ancestry for the two chromosomes, and then sample states from this posterior distribution. We note that as a by-product of sampling complete states jointly for the two chromosomes together, we are phasing the original data with respect to the underlying ancestry. This may help reduce phasing error rates in admixed populations compared to methods that ignore local ancestry, although we do not pursue this issue here.

We can adapt the computational speedups described above to the diploid setting, so that while a naïve implementation of the forward algorithm would take time proportional to 16*S*(*n*_{1} + *n*_{2})^{4}, we can complete the forward/backward algorithm in time proportional to 4*S*(*n*_{1} + *n*_{2})^{2}. A further speedup for the diploid setting is described in Text S2. With these speedups implemented, the running time of HAPMIX is roughly 30 minutes on a single processor per diploid genome analyzed (519,248 sites). Because the computations can be parallelized across admixed individuals (they can also be parallelized across chromosomes), HAPMIX is computationally tractable even for very large data sets if a cluster of computing nodes is available. For example, the running time for a data set of 1,000 admixed individuals on a cluster of 100 nodes is roughly 5 hours.