Recent advances in genotyping technologies have opened up unprecedented opportunities to improve our understanding of complex diseases through disease association studies. Most of these association studies have been performed on Caucasian populations of cases and controls. To gain additional insight, studies are often replicated on other populations, some of which are recently admixed. Recently admixed populations are formed by the mixing of two or more ancestral populations for a small number of generations. For instance, African Americans are a recently admixed population, where the ancestral populations are West Africans and Caucasians. Even the Caucasian population in the USA is in fact a recently admixed population, where the original ancestral populations are different European populations that immigrated to the USA over the last few centuries.
Admixed populations have been extensively used to detect associations in diseases that differ in prevalence across populations through admixture mapping (Reich et al.; Zhu et al.). The technique of admixture mapping is based on the observation that the cases in such an admixed population will have enhanced ancestry from the higher risk population near loci associated with the disease. In order to perform such studies successfully, it is crucial to be able to accurately infer the locus-specific ancestry of each individual. Moreover, accurate estimates of the locus-specific ancestry may reveal patterns of selection (Tang et al.) as well as recent recombination events (Sankararaman et al.). In particular, in this work we demonstrate that locus-specific ancestry may also play an important role in the problem of genotype imputation, in which genotypes left untyped in case–control studies are reliably inferred by leveraging the single nucleotide polymorphism (SNP) correlation information from large repositories of human SNP variation such as the HapMap project (The International HapMap Consortium, 2005).
While many methods have been proposed for the inference of locus-specific ancestry (Hoggart et al.; Patterson et al.; Pritchard et al.; Sankararaman et al.; Sundquist et al.; Tang et al.), more recent works have focused on developing methods that are scalable to whole-genome datasets (Sankararaman et al.; Sundquist et al.; Tang et al.). These methods have been shown to incur low error rates in admixtures that originated from ancestral populations with a high fixation index (Fst), such as African Americans. However, when the ancestral populations are closely related (e.g. the Japanese and Chinese populations), their accuracies have been shown to be quite low (<70% for populations that have been mixing for seven generations or more) (Sankararaman et al.).
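To make the role of the fixation index concrete, the following is a minimal sketch of Wright's Fst for a single biallelic SNP, assuming two equally weighted ancestral populations. The function name and the allele frequencies are hypothetical and chosen only for illustration; this is not the estimator used in the works cited above.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's Fst for one biallelic SNP in two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the
    pooled population and H_S is the mean within-population heterozygosity.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")

    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # The SNP is monomorphic in the pooled population: no differentiation.
        return 0.0

    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical frequencies: a strongly differentiated SNP versus a weakly
# differentiated one.
distant = fst_two_populations(0.1, 0.9)
close = fst_two_populations(0.45, 0.55)
```

With these invented frequencies, the strongly differentiated SNP gives an Fst of 0.64 and the weakly differentiated one gives 0.01. This illustrates why individual SNPs carry little ancestry information when the ancestral populations are closely related.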
In contrast to locus-specific ancestry, when considering the averaged genome-wide ancestry of each individual, it has recently been shown that principal component analysis can be used to detect differences between populations that are only a few hundred kilometers apart (Novembre et al.). However, it is not clear that such high resolution can be achieved by methods that seek to infer the locus-specific ancestry. In particular, it is an open question whether locus-specific ancestry can be accurately inferred on very close populations such as mixtures of Asians, or mixtures of Europeans (e.g. Americans of European descent).
We present here an efficient and accurate method for the inference of locus-specific ancestry. Our method, called WINPOP, is unique in that it achieves high accuracy on admixtures of closely related populations, including mixtures of European populations or mixtures of Asian populations (e.g. JPT-CHB from the HapMap populations). To achieve this, we partition the genome into overlapping, contiguous windows of SNPs, and we optimize a likelihood model over each of the windows. We then glue the solutions together by casting a majority vote for each SNP.
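The windowing and voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WINPOP implementation: the window size and step are fixed, and `toy_window_caller` is a hypothetical stand-in for the per-window likelihood optimization, which is not specified in this section. All function and parameter names are invented for the example.

```python
from collections import Counter
from typing import Callable, List, Sequence


def overlapping_windows(n_snps: int, window: int, step: int) -> List[range]:
    """Contiguous SNP index windows; consecutive windows overlap when step < window."""
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("require 0 < step <= window")
    if n_snps <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window, n_snps)
        windows.append(range(start, end))
        if end == n_snps:
            break
        start += step
    return windows


def majority_vote_ancestry(
    snps: Sequence,
    window: int,
    step: int,
    call_window: Callable[[Sequence], Sequence],
) -> list:
    """Combine per-window ancestry calls into one label per SNP by majority vote.

    call_window receives the SNPs of one window and returns one ancestry label
    per SNP in that window. Ties go to the label that was voted for first.
    """
    votes = [Counter() for _ in snps]
    for idx in overlapping_windows(len(snps), window, step):
        labels = call_window(snps[idx.start:idx.stop])
        for snp, label in zip(idx, labels):
            votes[snp][label] += 1
    return [v.most_common(1)[0][0] for v in votes]


def toy_window_caller(window_snps: Sequence) -> list:
    """Hypothetical per-window model: assign the whole window its most common hint."""
    label = Counter(window_snps).most_common(1)[0][0]
    return [label] * len(window_snps)


if __name__ == "__main__":
    # Toy input: each SNP carries a noiseless ancestry hint, with one switch point.
    hints = list("AAAAAABBBBBB")
    print(majority_vote_ancestry(hints, window=6, step=2, call_window=toy_window_caller))
```

Because each SNP is covered by several windows, a window that straddles an ancestry switch is outvoted by the neighboring windows that lie entirely on one side of it.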
The basic framework, in which overlapping windows are used for the inference of local ancestry, was introduced in our previously reported method LAMP (Sankararaman et al.). LAMP is a highly efficient method that has been shown to be accurate on admixtures of distant populations. The basic idea behind LAMP lies in making predictions in each window using a likelihood model that assumes no recombination. In contrast to LAMP, our method uses an improved model of recombination events, and it chooses the window size adaptively at each location in the genome, according to the local genetic structure of the ancestral populations. These two new ideas result in a substantial improvement in accuracy.
Extensive simulation results demonstrate that WINPOP achieves improved inference of locus-specific ancestries on both distant and closely related admixtures. The improvements in accuracy across the closely related populations range from 13% to 35%. Further, we examined the utility of locus-specific ancestry on the task of imputing missing genotypes. We show that exploiting accurate methods for locus-specific ancestry leads to lower error in imputation, and that the imputation accuracy critically depends on the accuracy of the ancestral inference.