Latino populations of the Americas, such as Mexicans or Puerto Ricans, arose through the influx of Europeans into existing Native American populations. Subsequently, African individuals were introduced into the population (Morales Carrión, 1983; Tang et al., 2007). Thus, most of the genomes of current Latino populations can be modeled as an admixture of chromosomes from three ancestral populations with various global proportions of European, Native American and West African ancestries [e.g. 0.45:0.5:0.05 for Mexicans and 0.67:0.13:0.2 for Puerto Ricans (Burchard et al., 2005; Mao et al., 2007; Price et al., 2007; Tian et al., 2007)]. Correspondingly, we simulated Latino admixed haplotypes as mosaics of segments taken from three of the HapMap phase 3 haplotype panels (The International HapMap Consortium, 2005). Unless otherwise noted, we used the phased haplotypes from the CEU (117 haplotypes), CHB+CHD (169) and YRI (115) panels in our simulations of admixed haplotypes, and phased haplotypes from the TSI (117), JPT (169) and LWK (115) panels as proxy reference data in local ancestry inference. The haplotype sets used for generating the simulation data and the reference data are therefore disjoint. Our use of East Asian haplotypes to represent the Native American haplotypes was motivated by the small sample sizes of existing Native American panels and by the presence of European gene flow into some of these populations. It is likely, however, that the use of East Asian haplotypes will overestimate the accuracy of local ancestry inference.
We performed the analyses on Chromosome 10, restricted to the SNPs present on the Illumina Human 1M SNP array so as to obtain a realistic SNP density and a typical genomic LD pattern. Following standard approaches (Price et al., 2009), we simulated admixed chromosomes by performing a random walk over the HapMap haplotypes. The distance to the next crossover was sampled from an exponential distribution with parameter $1/(\theta g)$ (the mean distance in bases), where $\theta = 10^{-8}$ is the average recombination probability along the genome per base per generation, and $g = 15$ is the approximate number of generations since admixture for Latinos. At a crossover event, the new ancestry is chosen according to the mixture-specific proportions, and a specific haplotype is drawn uniformly from the corresponding HapMap panel. This procedure was used to generate 400 haplotypes, which were then joined in pairs to form 200 diplotypes.
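The random walk described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used for the article: the function names, the dictionary layout of the panels and the toy data are assumptions made for the example, and the rate passed to `random.expovariate` is $\theta g$, so that the mean distance between crossovers is $1/(\theta g)$ bases.

```python
import random

THETA = 1e-8       # recombination probability per base per generation (from the text)
GENERATIONS = 15   # approximate generations since admixture (from the text)


def simulate_admixed_haplotype(panels, proportions, positions, rng):
    """Return (alleles, ancestries) for one mosaic haplotype.

    panels:      dict mapping ancestry name -> list of haplotypes (hypothetical layout)
    proportions: dict mapping ancestry name -> mixture proportion
    positions:   sorted base-pair positions of the SNPs
    """
    names = list(panels)
    weights = [proportions[name] for name in names]
    alleles, ancestries = [], []
    next_crossover = positions[0]  # forces an initial draw at the first SNP
    for i, pos in enumerate(positions):
        while pos >= next_crossover:
            # Crossover: pick an ancestry by mixture proportion,
            # then a haplotype uniformly from that ancestry's panel.
            ancestry = rng.choices(names, weights=weights)[0]
            source = rng.choice(panels[ancestry])
            next_crossover += rng.expovariate(THETA * GENERATIONS)
        alleles.append(source[i])
        ancestries.append(ancestry)
    return alleles, ancestries


def simulate_diplotypes(panels, proportions, positions, n_diplotypes, seed=0):
    """Generate 2 * n_diplotypes haplotypes and join them in pairs."""
    rng = random.Random(seed)
    haps = [simulate_admixed_haplotype(panels, proportions, positions, rng)
            for _ in range(2 * n_diplotypes)]
    return [(haps[2 * k], haps[2 * k + 1]) for k in range(n_diplotypes)]
```

Calling `simulate_diplotypes(panels, {"EUR": 0.45, "NAM": 0.5, "AFR": 0.05}, positions, 200)` would mirror the Mexican setting, with `panels` holding the three simulation panels; the ancestry labels here are placeholders.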
Several metrics have been proposed to measure the performance of local ancestry inference methods (Seldin et al., 2011). Here we use the squared Pearson's correlation coefficient $r^2$ between the true and the inferred number of alleles from each of the ancestries, averaged over the three ancestries. The squared correlation is directly related to the power achieved in case-only admixture mapping, i.e. $N/r^2$ cases are required to achieve the same power as a study with $N$ cases where the local ancestries are known without error (see Supplementary Material). The second measure we use is the percentage of all SNP loci whose diploid ancestry was incorrectly inferred, which we refer to as the Diploid Error.
3.1 Comparison with other methods
Several methods have been developed for inferring local ancestry (Bryc et al., 2010; Pasaniuc et al., 2009a; Patterson et al., 2004; Price et al., 2009; Sankararaman et al., 2008; Sundquist et al., 2008; Tang et al., 2006) and have been shown to attain very high accuracy in admixtures of two genetically diverged ancestral populations such as African Americans (Pasaniuc et al., 2009b; Price et al., 2009). Only a few of these methods have been extended to admixtures of three populations such as Latinos (Bercovici et al., 2012; Henn et al., 2012; Johnson et al., 2011; Pasaniuc et al., 2009a), and we compared LAMP-LD with two of them. The first is WINPOP, a method shown to attain high accuracy in simulated data (Pasaniuc et al., 2009b), which has been used in a number of recent empirical studies of Latinos (Bryc et al., 2010; Yang et al., 2011). WINPOP treats the observed genotypes as independent given the local ancestry, thereby ignoring the haplotype structure within each population. The second is GEDI-ADMX, which is similar to our approach in using fixed-size HMMs to model haplotype diversity, but uses a completely different framework for inferring ancestries at each locus in the genome. We also compared LAMP-LD with HAPMIX (Price et al., 2009). Although LAMP-LD and HAPMIX are similar in that they require reference haplotypes from each of the ancestral populations, the HMMs employed by the two models have different structure. In addition, LAMP-LD traverses the chromosome using the window-based framework, whereas HAPMIX employs a ‘miscopying’ parameter to account for imperfections in the reference panels.
As a sanity check, we first simulated two-way mixtures representing African Americans using 0.8:0.2 proportions for YRI and CEU, respectively, with six generations of admixture. On these data LAMP-LD attained an average $r^2 = 0.99$, very similar to (and not significantly different from) the $r^2 = 0.98$ attained by HAPMIX, thus confirming the high accuracy of local ancestry inference in African Americans (Price et al., 2009; Seldin et al., 2011).
Since HAPMIX was not designed to directly process multi-way mixtures, we adapted it to the task by running it twice on each genotype. The first run aimed at discerning the African segments from the rest of the segments: one reference panel included the TSI and the JPT haplotypes, and the other comprised the LWK haplotypes, with the mixture proportion set to the proportion of the African ancestry in the mixture. The second run aimed at discerning between the European and the Native American segments. For the Mexican simulations, the first reference panel included the TSI haplotypes, the second panel included the JPT haplotypes, and the mixture proportion was set to the relative share of the European and Native American ancestries in the non-African segments. For the Puerto Rican simulations, the first reference panel included TSI+LWK haplotypes, the second panel included the JPT haplotypes, and the mixture proportion was set to the proportion of the Native American ancestry in the mixture. The different schemes were designed to account for the fact that the proportion of African ancestry is small in the Mexican data (5%) but considerably higher in the Puerto Rican data (20%), and were matched to the datasets so as to yield more accurate results. Throughout the article we refer to these schemes for running HAPMIX jointly as HAPMIX*.
Table 1 compares LAMP-LD to WINPOP, GEDI-ADMX and HAPMIX* on the Mexican and on the Puerto Rican datasets. LAMP-LD achieves the highest accuracy under both the $r^2$ and the diploid error on both datasets, showing a considerable improvement compared with WINPOP and thus reflecting the utility of the LD information. HAPMIX* attains accuracy comparable to WINPOP in the Mexican simulations and much worse accuracy on the Puerto Rican data; this could be because the parameters of the HAPMIX model were not optimized for Latinos. For example, it is not obvious how to set the effective population size parameter for HAPMIX in these scenarios. We should note, however, that HAPMIX was not designed for multi-way mixtures, and it could potentially be improved by a more principled extension to such mixtures.
Table 1. Accuracy (standard error of the mean) attained by the compared methods averaged over 200 simulated Latino genotypes
In addition to its high accuracy, LAMP-LD runs an order of magnitude faster than HAPMIX. Each run of LAMP-LD consists of a preliminary stage in which the HMMs are constructed from the reference panels and a second stage of actual inference on the given genotypes. In the experiments above the first stage took 56 min, and the processing of each genotype took 6.5 s (all running times were measured on a single AMD Opteron 1.1 GHz processor). Extrapolating these numbers to 1000 genotypes gives a running time of ~3 h for Chromosome 10. HAPMIX's runtime, on the other hand, is linear in the number of genotypes, requiring 89 s for each; running it on 1000 genotypes would therefore require over 24 h for one chromosome. This leads to a runtime of ~3 days for a full genome scan with LAMP-LD, as compared with more than 22 days for HAPMIX, on a single CPU.
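The Chromosome 10 extrapolation can be reproduced with simple arithmetic. The sketch below uses only the timings quoted in the text; the function and constant names are illustrative, and the scaling from one chromosome to the whole genome is not spelled out in the text, so it is left out.

```python
LAMP_LD_SETUP_S = 56 * 60        # one-off HMM construction, 56 min (from the text)
LAMP_LD_PER_GENOTYPE_S = 6.5     # inference time per genotype (from the text)
HAPMIX_PER_GENOTYPE_S = 89.0     # HAPMIX time per genotype (from the text)


def lamp_ld_hours(n_genotypes):
    """Fixed setup cost plus a linear per-genotype cost, in hours."""
    return (LAMP_LD_SETUP_S + n_genotypes * LAMP_LD_PER_GENOTYPE_S) / 3600


def hapmix_hours(n_genotypes):
    """Purely linear cost in the number of genotypes, in hours."""
    return n_genotypes * HAPMIX_PER_GENOTYPE_S / 3600


if __name__ == "__main__":
    n = 1000
    print(f"LAMP-LD: {lamp_ld_hours(n):.1f} h")   # about 2.7 h, i.e. ~3 h
    print(f"HAPMIX:  {hapmix_hours(n):.1f} h")    # about 24.7 h, i.e. over 24 h
    print(f"ratio:   {hapmix_hours(n) / lamp_ld_hours(n):.1f}x")
```

Because the setup cost of LAMP-LD is paid once, the ratio between the two methods grows with the number of genotypes and approaches 89/6.5 for very large samples.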
3.2 Assessment of model parameters
The only two parameters required by LAMP-LD are the number of states $S$ and the window length $L$. We assessed the performance of our method as a function of these parameters. Figure 2 (left) shows that the accuracy is maximized at $L = 50$–$100$ SNPs, corresponding to ~200 to 400 Kb on average in our simulated chromosomes. Interestingly, the optimal value for $L$ is fairly stable across the two different populations, suggesting that this parameter can be set independently of the specific mixture proportions. We note that although these results are likely to be specific to the SNP density in our datasets (one SNP every 4350 bases on average), increasing $L$ to more than 500 to accommodate denser SNP panels has only a minor effect on the running time.
Fig. 2. Effect of the window length (left) and the number of states parameter (right) on accuracy of LAMP-LD. We observe that a window length of 50–100 SNPs (200–400 Kb) minimizes the error rate for both simulations. Accuracy increases with the [...]
Next, we assessed the robustness of our method to different values of $S$, the number of states per SNP ($L$ is set to 50). The results are presented in Figure 2 (right). As expected, the diploid error decreases as $S$ increases; however, increasing $S$ beyond 10 provides only marginal improvement in accuracy, reflecting the fact that most of the haplotypic diversity within the reference panels necessary for accurate local ancestry inference is captured by 10 states per population. This is especially important because the running time of HMM-based methods increases quadratically with the number of states. This advantage of LAMP-LD is reflected in the large differences in running time between LAMP-LD and HAPMIX presented in Section 3.1, since in order to utilize the entire reference set HAPMIX employed ~400 states, each modeling a single reference haplotype. Based on the results of this section, unless explicitly noted otherwise, all results of this article for LAMP-LD use parameters $L = 50$ and $S$ [...]
3.3 Advantage of incorporating trio information in local ancestry inference
We simulated nuclear family trios by generating one offspring haplotype from each of the 200 simulated admixed genotypes, followed by grouping the offspring haplotypes into 100 pairs, each forming the genotype of a single progeny. An offspring haplotype was generated by recombining the two parental haplotypes according to the average genomic recombination rate. We then compared the performance of LAMP-LD and LAMP-HAP when inferring local ancestry in the Mexican and Puerto Rican datasets assuming different amounts of information in the inference. For consistency the accuracy was assessed only on the parental genotypes for both methods. Additionally, we measured the accuracy of LAMP-HAP when the haplotype phase is known (i.e. the method receives as an input the true phasing for the simulated trio data) so as to provide an upper bound on the achievable accuracy using trio data.
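The trio construction can be sketched as follows. This is an illustrative outline only: the per-base recombination rate is taken to be the $10^{-8}$ average used elsewhere in the simulations, parents are paired in consecutive order (the pairing scheme is an assumption), and all names are hypothetical.

```python
import random

THETA = 1e-8  # average recombination probability per base per generation


def make_gamete(hap_a, hap_b, positions, rng):
    """Recombine the two haplotypes of one parent into a single offspring haplotype."""
    parents = (hap_a, hap_b)
    current = rng.randrange(2)  # start on a randomly chosen parental haplotype
    next_crossover = positions[0] + rng.expovariate(THETA)
    gamete = []
    for i, pos in enumerate(positions):
        while pos >= next_crossover:
            current = 1 - current  # switch to the other parental haplotype
            next_crossover += rng.expovariate(THETA)
        gamete.append(parents[current][i])
    return gamete


def make_trios(parent_genotypes, positions, rng):
    """Pair consecutive parents and build one child per pair.

    parent_genotypes: list of (hap_a, hap_b) tuples, one per simulated individual.
    Returns a list of (mother, father, child) tuples.
    """
    trios = []
    for k in range(0, len(parent_genotypes) - 1, 2):
        mother, father = parent_genotypes[k], parent_genotypes[k + 1]
        child = (make_gamete(*mother, positions, rng),
                 make_gamete(*father, positions, rng))
        trios.append((mother, father, child))
    return trios
```

With 200 parental genotypes as input, `make_trios` returns 100 trios, matching the counts given above.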
The results in Table 2 show a considerable increase in accuracy, as measured by the diploid error as well as by the squared correlation, with the incorporation of pedigree information. Interestingly, only a marginal improvement was obtained when we provided the true haplotypes to LAMP-HAP, demonstrating that the unambiguously phased positions are sufficient for highly accurate ancestry inference.
Table 2. Error rate (standard error of the mean) of methods for local ancestry inference as a function of the amount of information taken into account. LAMP-HAP* is provided the true haplotypes used to simulate the trio, so as to provide an upper bound on the accuracy.
3.4 The effect of size and precision of reference sets on accuracy
Most local ancestry inference methods require some information about the mixing populations: haplotype-based methods, such as LAMP-LD and HAPMIX, require sample haplotypes, while other methods, such as WINPOP, require only SNP allele frequencies. With the growing availability of genetic data, it is important to examine the effect of the reference datasets (genotypes or haplotypes) on the performance of the methods. In particular, since LAMP-LD is able to efficiently process large reference datasets, an interesting question is whether it can utilize the additional information provided in sets of growing sizes, given that it uses only 10 prototype haplotypes (states) per ancestry.
This question was tested by providing LAMP-LD with reference sets of varying sizes: we compared the results obtained on the full set used in the previous sections (117 TSI haplotypes, 169 JPTs and 115 LWKs) to those obtained on a partial reference, which contained only two-thirds of the haplotypes in each of the three ancestral panels. We did the same with WINPOP, to examine how a non-haplotype-based method would be affected. In Figure 3a, LAMP-LD can be seen to improve considerably when provided with the larger reference. In contrast, WINPOP does not improve, presumably because the allele frequencies can be estimated well enough from small panels. On the other hand, LAMP-LD's performance also deteriorates more rapidly as the reference size decreases, and WINPOP's accuracy becomes superior when using 0.4 and 0.5 of each panel for the Mexican and Puerto Rican datasets, respectively (these fractions correspond to panel sizes of 46/58 TSI haplotypes, 67/84 JPTs and 46/57 LWKs).
Fig. 3. Effect of reference panel size and divergence on the accuracy of WINPOP and LAMP-LD. Both methods show increased performance with sample size, with LAMP-LD showing the highest gain in accuracy when more accurate reference haplotypes are provided as proxy [...]
It has been shown that the genetic divergence between the haplotypes used as proxy and the true unknown ancestral population greatly impacts local ancestry performance (Pasaniuc et al., 2009b; Price et al., 2009). We quantified this effect in Latinos by running LAMP-LD and WINPOP using the proxy reference set (which included haplotypes from the TSI, JPT and LWK panels; the same populations were used to obtain the previous results presented in this article) and a true reference set. The true reference in this experiment included the same number of haplotypes in each ancestral panel as the proxy set, but taken from the CEU, (CHB+CHD) and YRI panels; we note that the haplotypes in this set are different from those used for generating the simulated haplotypes.
Figure 3b demonstrates the anticipated deterioration in the performance of both methods on both datasets when data from the proxy populations are used as reference instead of the true ancestral populations. This decrease is smaller on the Puerto Rican dataset, presumably because it contains a larger proportion of African ancestry, which is more easily differentiated from the rest even when the proxy LWK haplotypes are used. The deterioration in accuracy is on the same scale as the improvement resulting from increasing the reference size, suggesting that a large enough reference would compensate for the divergence.
3.5 The effect of European gene flow into Native American reference haplotypes
Present-day Native American haplotypes used as a proxy for the Native American component of Latinos are presumed to contain European gene flow. To test the effect of this phenomenon on ancestry inference, we introduced TSI segments into the Asian haplotypes of a reference set composed of 117 CEU, 169 (CHB+CHD) and 115 YRI haplotypes. We performed 10 experiments, in each choosing at random a 5 Mb region along the chromosome and replacing a percentage of the (CHB+CHD) haplotypes with TSI haplotypes along the chosen region.
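One way to set up such an experiment is sketched below. The details are assumptions made for illustration: the region start is drawn uniformly so that the 5 Mb window fits on the chromosome, each modified haplotype copies from a donor chosen uniformly at random, and the function and argument names are hypothetical.

```python
import random

REGION_BP = 5_000_000  # length of the modified region (from the text)


def inject_gene_flow(target_panel, donor_panel, positions, fraction, rng):
    """Overwrite one region in a fraction of the target haplotypes with donor alleles.

    target_panel: haplotypes standing in for the Native American reference
    donor_panel:  haplotypes supplying the European segments
    Returns (modified_panel, (start, end), indices_of_modified_haplotypes).
    """
    last_start = max(positions[0], positions[-1] - REGION_BP)
    start = rng.uniform(positions[0], last_start)
    end = start + REGION_BP
    in_region = [i for i, pos in enumerate(positions) if start <= pos < end]

    n_modified = round(fraction * len(target_panel))
    chosen = set(rng.sample(range(len(target_panel)), n_modified))

    modified_panel = []
    for h, hap in enumerate(target_panel):
        new_hap = list(hap)  # copy, so the input panel is left untouched
        if h in chosen:
            donor = rng.choice(donor_panel)
            for i in in_region:
                new_hap[i] = donor[i]
        modified_panel.append(new_hap)
    return modified_panel, (start, end), sorted(chosen)
```

For a panel of 169 haplotypes, `fraction=0.18` modifies 30 of them.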
We observed that the typical effect of increasing the number of TSI segments present in the Native American reference panels is an increase in the estimated proportion of the Native American ancestry along the modified region, at the expense of the estimated European proportion. Presumably, the Native American reference haplotypes in the modified region are now able to approximate reasonably well windows containing both Native American and European haplotypes; in some cases, they will approximate the European windows better than the CEU haplotypes, and hence the increase in the estimated Native American proportion.
Figure 4 shows, for each region and for each fraction of modified (CHB+CHD) haplotypes, the maximal deviation of the average estimated European ancestry from the true average European ancestry $p$ obtained across all loci in the modified region. More precisely, we provide the maximal value of the statistic [...], which can be used to obtain the scale of the p-value for testing the null hypothesis of the modified region having the average genome-wide fraction of European ancestry: given a sample of size $N$, the p-value is computed as [...]. For example, when the number of modified haplotypes is 30 (0.18 of the Native American panel), the resulting p-value of the most severely affected region in our simulations ($N = 200$) is $2 \times 10^{-2}$, whereas for a sample of size 1000 a similar effect would yield a p-value of $2 \times 10^{-6}$. We note that we observe a similar but smaller effect when modifying shorter segments; ultimately, for large enough samples and under the assumption of a small finite reference panel, these local biases would appear as statistically significant local deviations in the ancestral proportions. However, for low levels of gene flow (≤6%), Figure 4 shows that the biases in local ancestry are unlikely to produce large deviations, and would be statistically significant only at very large sample sizes.
Fig. 4. The simulated effect of European gene flow into Native American haplotypes on estimation accuracy. Different fractions of the 169 (CHB+CHD) reference haplotypes were replaced by TSI haplotypes along 10 different regions, each of length 5 Mb. The plot [...]
3.6 Assessment of local ancestry performance in real Latinos
In order to estimate the precision of local ancestry inference methods in real data, for which the true local ancestry is unknown, we leverage the fact that local ancestry must follow Mendelian inheritance rules. For example, if the father has African local ancestry on both chromosomes whereas the mother has European ancestry on both, the child must carry one African and one European chromosome at that locus. Therefore, pedigree relationships can be used to identify errors in local ancestry estimation by simply testing whether the inferred ancestral status of the child's chromosomes can arise through Mendelian inheritance from the ancestral status of the parents' chromosomes. This is done by estimating the local ancestry of each individual in the pedigree separately, and then integrating the trio information to test each genomic position for inconsistency. Any such inconsistency indicates at least one erroneous call in the local ancestry assignments of the trio, so that the MILANC counts give a direct lower bound on the local ancestry inference error rate. A critical feature of MILANC is that it is computed without knowing the true ancestry in real data; for this reason LAMP-HAP, which is designed to produce MILANC = 0, is not tested in this section.
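The per-locus consistency test can be written compactly. The sketch below assumes that each individual's call at a locus is an unordered pair of ancestry labels and reports the inconsistency rate as a percentage; the function names and the label strings are illustrative.

```python
def is_mendelian_consistent(mother, father, child):
    """True if the child's two ancestries can come one from each parent."""
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def mendelian_inconsistency_rate(mother_calls, father_calls, child_calls):
    """Percentage of loci at which the trio's ancestry calls are inconsistent."""
    inconsistent = sum(
        not is_mendelian_consistent(m, f, c)
        for m, f, c in zip(mother_calls, father_calls, child_calls)
    )
    return 100.0 * inconsistent / len(child_calls)


if __name__ == "__main__":
    # The example from the text: an all-African father and an all-European mother.
    father = ("AFR", "AFR")
    mother = ("EUR", "EUR")
    print(is_mendelian_consistent(mother, father, ("AFR", "EUR")))  # True
    print(is_mendelian_consistent(mother, father, ("AFR", "AFR")))  # False
```

A consistent trio is not necessarily error-free: two compensating errors, or an error that still yields a valid inheritance pattern, pass the check, which is why the inconsistency rate is only a lower bound on the error rate.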
We first investigated the relation between MILANC and the true underlying error rate. When introducing erroneous calls in the local ancestry of our simulated trios using a random uniform error model, we observed that roughly one-third of inserted errors lead to Mendelian inconsistencies, thus indicating that MILANC captures only one component of the true error rate.
Next, we assessed the accuracy of LAMP-LD and WINPOP in empirical data using 232 Mexican and 257 Puerto Rican nuclear mother–father–child families. These trios were collected as part of the GALA Study (Burchard et al., 2004); GALA is a multi-center, international effort designed to identify and directly compare clinical, genetic and environmental risk factors associated with asthma, asthma severity and drug responsiveness among Latino ethnic groups. The trios were ascertained on an asthmatic proband. When running local ancestry inference, as proxy for the African (European) ancestry we used the 226 (224) haplotypes of the HapMap 3 phase 2 YRI (CEU) population, whereas for the Native American ancestry we used 88 Native American samples (25 Bolivian Aymara, 24 Peruvian Quechua and 39 Mesoamericans; Bigham et al., 2010). We intersected all SNP sets to achieve a combined panel of 588 595 SNPs.
Table 3 shows the average MILANC attained by WINPOP and LAMP-LD in the GALA trios. The empirical accuracy metric (MILANC) indicates that the accuracy in real data roughly matches the results of our simulations, given that we expect one-third of the errors to yield Mendelian inconsistencies. We also note that modeling LD in the form of ancestral haplotypes appears to have a bigger effect for Puerto Ricans than for Mexicans.
Table 3. Average genomic MILANC (standard error of the mean) in % attained by best performing methods that model or ignore ancestral population LD in the Mexican and Puerto Rican trios of the GALA study.