We describe an analysis of genome variation in 825 Plasmodium falciparum samples from Asia and Africa that reveals an unusual pattern of parasite population structure at the epicentre of artemisinin resistance in western Cambodia. Within this relatively small geographical area we have discovered several distinct but apparently sympatric parasite subpopulations with extremely high levels of genetic differentiation. Of particular interest are three subpopulations, all associated with clinical resistance to artemisinin, which have skewed allele frequency spectra and remarkably high levels of haplotype homozygosity, indicative of founder effects and recent population expansion. We provide a catalogue of SNPs that show high levels of differentiation in the artemisinin-resistant subpopulations, including codon variants in various transporter proteins and DNA mismatch repair proteins. These data provide a population genetic framework for investigating the biological origins of artemisinin resistance and for defining molecular markers to assist its elimination.
The malaria parasite Plasmodium falciparum has a remarkable capacity to develop resistance to antimalarial drugs by evolutionary adaptation. For reasons that remain poorly understood, successive global waves of antimalarial drug resistance have originated in western Cambodia.1 The most common form of chloroquine resistance was observed there in the late 1950s before it spread around the world 2 and the most common forms of clinically significant pyrimethamine resistance3 and sulfadoxine resistance4 are thought to have originated in the same region. Clinical resistance to artemisinin and its derivatives is now well established in the P. falciparum population of western Cambodia 5-8 and appears to be emerging in neighbouring regions.9,10 These recent developments have grave implications for public health, since artemisinin derivatives are the mainstay of malaria treatment worldwide. There is an urgent need to discover the parasite genetic factors that cause artemisinin resistance 11 and to identify effective markers to monitor its spread.12
A central question is whether the genetic epidemiology of the P. falciparum population in Cambodia offers insights into the emergence of drug resistance, and of artemisinin resistance in particular. To address this question, we analysed patterns of genome variation in 825 P. falciparum samples collected at 10 locations in West Africa and South-East Asia. These include Ghana, Mali, Burkina Faso, The Gambia, Thailand, Vietnam and 4 locations in Cambodia (Table 1). An estimated median of 2 Gbp of sequence read data were obtained for each sample using the Illumina Genome Analyzer platform, and SNP genotype calls were made using a well-validated set of algorithms and quality control procedures whose development and evaluation are described in detail elsewhere.13 Here we focus on a set of 86,158 coding SNPs that could be genotyped with confidence in the majority of samples, as detailed at http://www.malariagen.net/resource/10. For purposes of quality assurance, 733/825 (89%) of the samples analysed were independently assayed for 148 SNPs using the Sequenom platform: the genotype calls produced by Sequenom showed 99.4% concordance with the genotype calls obtained from Illumina sequencing (see Supplementary Note).
As a starting point for the analysis of population structure, we constructed a neighbour-joining tree (see Supplementary Note) of all 825 samples based on a pairwise distance matrix (Figure 1a). Cambodian and Vietnamese samples formed a complex branching cluster of ancestrally related individuals, which could be clearly distinguished from Thai samples and, to a much greater extent, from West African samples. Principal components analysis (PCA, see Supplementary Note) revealed a surprising pattern of population structure within the total sample set: the first principal component (PC1) separated samples from different continents, but the second and third components (PC2 and PC3 respectively) represented variation in western Cambodia alone (Figure 1b). Although we expected to find high levels of population structure in South-East Asia, consistent with relatively low levels of malaria transmission14, the strength of population structure observed in western Cambodia was remarkable when compared to the other populations in this sample set. To put this in geographical context, the western Cambodian sites sampled are located up to ~200 km apart, while the other South-East Asian locations were separated by up to ~1,000 km, and the West African locations by up to ~1,500 km (Supplementary Figure 1). Thus the population structure observed in western Cambodia cannot be explained simply by geographical distance. We considered the possibility that it might be an artefact of increased variance introduced by the relatively large number of samples from Cambodia, but the signal was found to be robust when correcting for sample size by sub-sampling (Supplementary Figure 2).
The existence of multiple parasite subpopulations in western Cambodia became evident when PCA was restricted to South-East Asia (Figure 1c, d). A core group of western Cambodian samples clustered with samples from north-eastern Cambodia and Vietnam, close to samples from Thailand, while three distinct groups of western Cambodian samples were identified as outliers from this core group by the first, second and third principal components, respectively. An inspection of principal components beyond PC3 did not reveal further outlier groups. Other samples were intermediate between the core and outlier groups, suggesting admixture between the groups. To characterise this in more detail we used a process of chromosome painting, a probabilistic method of reconstructing the chromosomes of each individual sample from homologous segments of DNA in samples representative of the core group and three outlier groups (see Figure 2, Methods). This yielded a matrix of ancestral similarity in which the three outlier clusters could be clearly distinguished from other samples, and complex admixture patterns were apparent (Supplementary Figure 3). Using the ancestral proportions determined by chromosome painting, each Cambodian sample was classified as belonging to one of the following subpopulations: KH1 representing the core group that clustered with samples from neighbouring countries; KH2, KH3 and KH4 representing the three outlier groups; and KHA representing samples with apparently mixed ancestry, defined here as having <80% ancestral content from any one of the main groups (Figure 3a). Using these definitions, 63 samples were classified as KH1, 55 as KH2, 23 as KH3, 17 as KH4 and 135 as KHA.
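The classification rule described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `classify_sample` and the dictionary input format are assumptions, and the ancestral proportions themselves would come from chromosome painting. Only the 80% cut-off is taken from the text.

```python
from typing import Dict

# Threshold from the text: a sample with <80% ancestry from every main group
# is treated as admixed (KHA).
ANCESTRY_THRESHOLD = 0.80


def classify_sample(proportions: Dict[str, float],
                    threshold: float = ANCESTRY_THRESHOLD) -> str:
    """Return the group contributing >= threshold of the ancestry, else 'KHA'.

    `proportions` maps a group label (e.g. 'KH1') to its ancestral share.
    Shares are normalised by their sum, so they need not add up to exactly 1.
    """
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("ancestry proportions must sum to a positive value")
    group, share = max(proportions.items(), key=lambda item: item[1])
    return group if share / total >= threshold else "KHA"
```

For example, a sample painted 90% from KH2 would be labelled KH2, whereas a sample split evenly between KH1 and KH3 would be labelled KHA.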
Other methods of population structure analysis corroborated these findings. A neighbour-joining tree of Cambodian and Vietnamese samples showed clusters corresponding to KH1, KH2, KH3 and KH4, interspersed with KHA samples (Supplementary Figure 4). When the ADMIXTURE program was used to model the Cambodian samples as admixture of 4 putative subpopulations, there was 82% concordance between the classification of samples by ADMIXTURE and by the chromosome painting method described above (see Methods, Supplementary Figure 5). Both analyses confirmed that most samples from north-eastern Cambodia and Vietnam are very similar to KH1 samples in western Cambodia. It was evident from all methods that our classification of samples into KH1/2/3/4/A should be regarded as a first approximation, since the population structure is clearly complex and will require further epidemiological sampling to understand in detail.
Table 2 shows the frequency of subpopulations at different locations in Cambodia. All of the subpopulations were present in western Cambodia, while KH1 was predominant in north-eastern Cambodia. The samples came from four independent field studies conducted over the past 5 years: there was clear evidence of population structure within each study, and it was not limited to a particular site or sampling year (Supplementary Tables 1-3). The discovery of several distinct subpopulations of P. falciparum co-existing at the same location is unexpected, and requires a biological explanation.
An obvious question arising from these findings is whether different parasite subpopulations in Cambodia have different levels of resistance to artemisinin and its derivatives. To address this question we analysed parasite clearance rates in patients with acute falciparum malaria following artesunate treatment.6,8 In vivo estimation of parasite clearance rate from frequent peripheral blood parasite counts is becoming the standard method used for surveillance of artemisinin resistance, since no laboratory method has yet been shown to correlate reliably with clinical resistance. In vivo clearance data were available for 212 of the samples in this study: 30 from Pailin, 47 from Tasanh, 89 from Pursat and 46 from Ratanakiri (Table 2). For each sample we calculated the parasite clearance half-life based on the slope of the linear portion of the parasite clearance curve15. This revealed a significant prolongation of parasite clearance half-life in the KH2, KH3 and KH4 subpopulations compared to the KH1 subpopulation (Figure 3b, c). The difference was evident from analysis of all 212 Cambodian samples, and also when restricted to the 166 samples from western Cambodia. The median parasite clearance half-life in 35 KH2 samples (all from western Cambodia) was 6.8 hours compared to 4.1 hours in 6 KH1 samples from western Cambodia (P=2×10−4 by Mann-Whitney test) and 2.7 hours in all 50 KH1 samples from Cambodia (P=2×10−14). More detailed results for all subpopulations are given in Table 3. KHA samples showed wide variance in parasite clearance half-life, consistent with their genetic makeup as an admixture of artemisinin-sensitive and artemisinin-resistant (ART-R) types. The long median clearance half-life in this group is consistent with selection of recombinants with an ART-R genetic background.
However, we observed no direct correlation between admixture proportions and parasite clearance half-life (Supplementary Figure 6), which suggests that the acquisition of ART-R phenotypes depends on receiving a limited set of specific DNA segments from resistant ancestors.
To put this in an epidemiological context, recent clinical studies show that parasite clearance rates following artesunate treatment are significantly faster in north-eastern Cambodia (Fairhurst et al., manuscript in preparation) and in Vietnam10 than in western Cambodia. It appears likely that these geographical differences in parasite clearance rate can be largely attributed to the predominance of KH2, KH3, KH4 and KHA in western Cambodia, whereas KH1 predominates in north-eastern Cambodia and Vietnam.
A striking feature of the KH2, KH3 and KH4 subpopulations is their strong genetic differentiation relative to KH1, and also relative to each other. This was evident from genome-wide estimates of the fixation index (FST) between subpopulations: e.g., the pairwise FST for KH2 vs KH4 (0.38) was considerably higher than that for Thailand vs Ghana (0.16) (Supplementary Table 4). It was also evident from counting the number of SNPs showing high levels of FST between subpopulations: e.g. the number of SNPs with FST >0.5 was 11 for Thailand vs KH1, 562 for Thailand vs Ghana and 1200 for KH2 vs KH4 (Supplementary Table 5). In summary, we found higher levels of genetic differentiation between subpopulations within western Cambodia than between typical parasite populations on different continents.
Several features of the KH2, KH3 and KH4 subpopulations indicate that their strong genetic differentiation is due to founder effects. Consistent with this explanation, we observed a marked reduction of low frequency alleles in KH2, KH3 and KH4, compared to KH1 and to parasites from other parts of South-East Asia and West Africa (Figure 4a). When SNPs with high levels of differentiation between subpopulations were compared to SNPs with low levels of differentiation, both groups had a similar ratio of non-synonymous to synonymous polymorphisms (overall pN/pS was 2.4 for SNPs with FST <0.2, and 2.1 for SNPs with FST >0.8; Supplementary Table 6), suggesting that differentiation is likely to have arisen from founder effects.
As further evidence for founder effects, the KH2, KH3 and KH4 subpopulations were observed to have markedly reduced haplotype diversity. This was evident from higher levels of linkage disequilibrium in KH2, KH3 and KH4 compared to KH1 and other populations (Figure 4b). The loss of haplotypic diversity was also quantified more directly, e.g. levels of haplotype homozygosity, measured in a sliding window across the genome (see Supplementary Note), were considerably higher for KH2, KH3 and KH4 than for KH1 (Figure 4c, Supplementary Figure 7). The level of haplotype diversity varied considerably across the genome, and in some cases a single haplotype was found to predominate across a large chromosomal region: KH4 has essentially a single haplotype across most of chromosome 9, and KH2 has essentially a single haplotype across approximately half of chromosome 13 (1.4-3.4 Mb), in marked contrast with KH1 (Supplementary Figure 8).
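As a rough illustration of a sliding-window haplotype homozygosity scan, the sketch below assumes that homozygosity in a window is the probability that two samples drawn at random (with replacement) carry the same haplotype, i.e. the sum of squared haplotype frequencies. The exact definition used in the study is given in the Supplementary Note and may differ. The function names and the string encoding of haplotypes are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def haplotype_homozygosity(haplotypes: Sequence[str]) -> float:
    """Sum of squared haplotype frequencies (assumed definition)."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return sum((count / n) ** 2 for count in counts.values())


def sliding_homozygosity(samples: Sequence[str], window: int,
                         step: int) -> List[Tuple[int, float]]:
    """Scan equal-length allele strings (one per sample, one character per SNP).

    Returns (window start index, homozygosity) for each window position.
    """
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must cover the same SNPs")
    results = []
    for start in range(0, length - window + 1, step):
        segment = [s[start:start + window] for s in samples]
        results.append((start, haplotype_homozygosity(segment)))
    return results
```

A window in which every sample carries the same haplotype scores 1.0, which is the pattern described for KH4 on chromosome 9; a window with two equally common haplotypes scores 0.5.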
An analysis of within-host diversity, quantified in terms of the FWS metric13, shows that the vast majority of samples in KH2, KH3 and KH4 are essentially clonal, as are those in the recombinant KHA population (Supplementary Figure 9). Although within-sample diversity was found to be generally lower in Southeast Asian populations than in West Africa, the artemisinin-resistant samples are clearly more likely to be clonal than those in KH1 or in Thailand. This suggests a high level of inbreeding in the KH2/3/4 populations.
We conclude that western Cambodia is home to at least three distinct founder populations of P. falciparum, each of which is artemisinin resistant, co-existing with a more diverse and widely-distributed subpopulation to which they are ancestrally related (Figure 1a). Although KH2, KH3 and KH4 populations maintain extreme genetic differentiation, analyses of population structure and chromosome painting indicate that there is some gene flow between these populations (Figure 2). This is most evident in KHA, which appears as a mosaic of the other populations. This appearance might have arisen from recent admixture or from past evolutionary events, or some combination of the two. Future studies may be able to elucidate the evolutionary and demographic processes that underlie these observations.
These data provide a set of genetic markers to map the geographic distribution of the artemisinin-resistant founder populations and to monitor their spread outside western Cambodia. Supplementary Tables 7-9 list SNPs that differentiate the founder populations from other populations analysed here.
To illustrate the potential utility of this dataset, it has recently been reported that low levels of artemisinin-resistance observed in North-Western Thailand are associated with a genomic region of low haplotype diversity on chromosome 13, around the 1.7-1.8 Mb position7. We find that KH2 has essentially a single haplotype extending across half of chromosome 13, from the 1.4Mb to the 3.4Mb position (Supplementary Figure 8). At position 13:2075035 we observe a codon variant (Y175S) in an ABC transporter gene (PF13_0271) for which all samples from West Africa carry the ancestral allele, whereas almost all KH2 samples have the derived (non-ancestral) allele. Since this gene has been experimentally associated with chloroquine and quinine resistance 16 and has previously been reported to lie within a genomic region of positive selection17, it would appear to be an important candidate for further investigation.
Supplementary Table 10 lists transporter genes that are highly differentiated in the founder populations, as these are prime candidates for the causation of antimalarial drug resistance.16,18 Apart from PF13_0271, they include PFE1150w (pfmdr1, an ABC transporter known to be associated with drug resistance), PFC0725c (a putative formate-nitrate transporter), PFE0805w (a cation-transporting ATPase 1) and PF13_0252 (a nucleoside transporter).
The founder populations also serve as a reservoir for a variety of known antimalarial resistance alleles (Supplementary Table 11). In pfcrt19, almost all samples in KH2, KH3 and KH4 have the resistant haplotype CVIET, whereas KH1 has a wider variety of haplotypes.20,21 A comparison of pfmdr1 22-24 sequence coverage suggests that most samples in KH4 and KH3 have higher copy numbers than those in KH1 and KH2 (Supplementary Table 12). In pfdhps 25, KH2 has a high frequency of the triple mutant haplotype comprising K540N, A581G and A437G 26,27, whereas KH3 and KH4 have a high frequency of the triple mutant haplotype comprising S436A, K540E and A437G28. In pfdhfr 25,29 the codon variant I164L, which characterizes a highly resistant triple or quadruple mutant haplotype30, is absent in KH1 and KH4, but near fixation in KH2 and at an intermediate frequency in KH3.
It has been proposed that antimalarial resistance emerges rapidly in South-East Asia because of the existence of hypermutable parasites, based on experimental evidence that a P. falciparum strain originating from Indochina shows accelerated resistance to multiple structurally unrelated drugs 31. It will therefore be of interest to researchers in this area that a number of the most differentiated nonsynonymous SNPs in the KH2 subpopulation are located in genes involved in DNA mismatch repair pathways (Supplementary Table 13). These include PMS1 and MLH, the two proteins that make up MutL, a DNA mismatch repair heterodimer that is highly conserved throughout evolution32; and UvrD, a helicase that is critical for methyl-directed mismatch repair and interacts physically with MutL, which loads it onto DNA.33-35 This could be highly relevant, since MutL variants in bacteria have been shown to cause hypermutation and rapid evolution of antibiotic resistance. However, uncertainty over the evolutionary timeline of the KH2 population means that further studies will be needed to establish whether these parasites are indeed hypermutable.
P. falciparum population structure is known to be related to the intensity of malaria transmission. The parasites are transmitted to humans by Anopheles mosquitoes, and a crucial step in this process is sexual recombination of parasites within the mosquito vector. Previous studies have shown that higher rates of inbreeding and greater population structure occur in regions of low transmission intensity, such as Southeast Asia, than in high transmission regions such as West Africa.36,37
The population structure observed in Western Cambodia is unusual in several respects. Within this relatively small geographical area we find multiple distinct, but apparently sympatric, parasite subpopulations with exceptionally high levels of genetic differentiation. Three of the subpopulations have skewed allele frequency spectra and remarkably high levels of haplotype homozygosity, indicative of founder effects and recent population expansion. These subpopulations are associated with clinical resistance to artemisinin. Two key questions arise from these findings: what caused the founder effects and what is the biological significance of their association with artemisinin resistance?
Here we offer a hypothesis to explain these unusual findings. Our starting assumption is that a parasite population under drug pressure will eventually acquire a number of genetic variants with the potential to cause drug resistance. However, the majority of such variants will have relatively small effects when acting individually. Moreover, many potential resistance alleles will be associated with a biological fitness cost in the absence of the drug. We postulate that the process of sexual recombination occasionally produces a parasite possessing a ‘winning combination’ of alleles that jointly confer high levels of drug resistance together with high levels of biological fitness. The inbred progeny of such a parasite, which carry the full set of alleles required for optimum biological fitness, are at a selective advantage compared to the rest of the parasite population. When such a parasite outcrosses with other parasite lines, the progeny are at risk of losing the winning combination of alleles and thereby losing biological fitness. In brief, our hypothesis is that the founder effects observed in Western Cambodia represent recent expansion of artemisinin-resistant parasite lines, whose biological fitness requires a particular combination of alleles to be maintained at multiple loci across the genome.
If this hypothesis is broadly correct, it follows that multiple environmental and ecological factors will affect the emergence and spread of artemisinin resistance. A multigenic resistance phenotype is more likely to emerge if the individual component alleles are already prevalent in the local parasite population, as determined by the history of previous antimalarial drug usage. Low transmission intensity and other factors that favour inbreeding could assist the propagation of a multigenic resistance phenotype once it has emerged. High rates of inbreeding might also arise due to physical isolation, e.g. if a group of parasites became isolated in a remote area of the jungle, or due to some form of reproductive isolation, e.g. it is conceivable that different parasite subpopulations are preferentially transmitted by different Anopheles vector species.38
These observations provide a population genetic framework for revisiting the longstanding debate about why Western Cambodia is a global hotspot for antimalarial drug resistance. Drug pressure was particularly high in Pailin in the late 1950s and early 1960s as a result of mass administration of chloroquine and pyrimethamine.39,40 This may explain why these subpopulations possess a variety of haplotypes that are highly resistant to antimalarials other than artemisinin - in other words it is possible that we are witnessing the most recent episode in a longer and more complex chain of events. Human demographic factors relevant to Western Cambodia should also be considered, such as physical isolation of human settlements due to poor road infrastructure in forested mountain areas, or to restricted human movement in the period of Khmer Rouge resistance (1979-1998), which might have provided a favourable environment for parasite inbreeding.
A major objective of the WHO Global Plan for Artemisinin Resistance Containment is to stop the spread of resistant parasites.41 The discovery of multiple subpopulations of resistant parasites, with widely different genetic characteristics, suggests that there could be multiple forms of resistance, each of which will need to be controlled. The definition of genetic markers for these subpopulations will assist efforts to eliminate major foci of transmission through characterisation of their geographical distribution and the ecological niches that they occupy. This will require the acquisition of genomic epidemiological data at a high level of spatial and temporal resolution, and the data presented here should not be regarded as a definitive account of the current situation, which is clearly in flux.9
All samples in this study were derived from blood samples obtained from patients with P. falciparum malaria, collected with informed consent from the patient or a parent or guardian. At each location, sample collection was approved by the appropriate local ethics committee; a full list is given in the Supplementary Note.
In Pursat, patients aged 10-65 years had uncomplicated malaria with a P. falciparum density ≥10,000/μl whole blood.8 Patients were treated with 4 mg/kg artesunate given orally at 0, 24 and 48 h, followed by 15 mg/kg mefloquine given orally at 72 h, and 10 mg/kg mefloquine given orally at 96 h. During treatment, thick blood films were made from finger-prick blood samples every 6 h until the asexual parasite density was undetectable. In Ratanakiri, patients were aged 2-65 years and their parasite density was counted every 6 h until undetectable or 48 h elapsed, whichever occurred first. Further details can be obtained from the trial registrations (NCT00341003 for Pursat; NCT01240603 for Ratanakiri).
In Tasanh and Pailin, two separate but coordinated studies were conducted within the Artemisinin Resistance Confirmation, Characterisation and Containment (ARC3) multi-site study, with harmonized methodologies. Depending on the study arm, patients recruited were aged 18-65 years, presenting with acute P. falciparum malaria symptoms and a parasite density of 1,000-200,000/μL. In Tasanh, patients were treated with artesunate, given as a single oral dose of 2, 4 or 6 mg/kg/day for 7 days. Parasite counts based on blood smears were taken up to 8 times on Day 0, and then 4 times a day until 2 consecutive thick blood films were negative for asexual parasites. Further details of the Tasanh study are given in ref. 42. In Pailin, patients were treated with artesunate in a daily dose of 6 or 8 mg/kg administered as a single or split dose. Three-day regimens were followed by a split dose of mefloquine at 72 and 96 hours. Peripheral blood parasitemia was assessed at 0, 2, 4, 6, 8, 12, and 18 hours after enrolment and then 6-hourly until two consecutive slides were negative. More details can be obtained from the trial registration (ISRCTN15351875).
For all sites, parasite clearance half-life estimates were computed from the frequent parasite counts obtained during treatment. The statistical models used to estimate the parasite clearance rate constant and lag phase duration were fitted using the Parasite Clearance Estimator15 developed by the Worldwide Antimalarial Resistance Network (WWARN), accessible at https://www.wwarn.org/research/parasite-clearance-estimator.
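To make the relationship between the clearance curve and the half-life concrete, the sketch below fits a least-squares line to the natural log of parasite density against time and converts the slope to a half-life as ln(2)/k, where k is the clearance rate constant. This is a simplified illustration and not the WWARN Parasite Clearance Estimator, which additionally models a lag phase and handles censored counts. It assumes the caller has already restricted the data to the log-linear portion of the curve, and the function name is hypothetical.

```python
import math
from typing import Sequence


def clearance_half_life(hours: Sequence[float],
                        densities: Sequence[float]) -> float:
    """Half-life in hours from a log-linear fit of parasite density vs time.

    `densities` are parasite counts per microlitre and must all be positive.
    """
    if len(hours) != len(densities) or len(hours) < 2:
        raise ValueError("need at least two paired observations")
    logs = [math.log(d) for d in densities]
    n = len(hours)
    mean_t = sum(hours) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, logs))
    slope = sxy / sxx  # change in ln(density) per hour
    if slope >= 0:
        raise ValueError("parasite density is not declining")
    return math.log(2) / -slope
```

With counts taken every 6 h that halve at each time point (80,000, 40,000, 20,000 and 10,000 per μl), the function returns a half-life of 6 hours.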
The sequencing and genotyping methods applied to the clinical samples were previously published13 and are briefly summarized here; an extended description is provided in the Supplementary Note. For most samples, parasitized erythrocytes were obtained directly from the blood samples, after leukocyte depletion to remove the majority of human DNA, without culturing. A minority of samples from Thailand and Cambodia were adapted to in vitro culture prior to DNA extraction. Samples with >1 μg DNA and <60% human DNA contamination were sequenced on the Illumina Genome Analyzer II, with read lengths ranging from 37 to 105 bp. From short sequence read alignments against the P. falciparum 3D7 reference, we obtained a set of 86,158 quality-controlled SNPs which passed a series of stringent quality filters13 (the V1.0 SNP Catalogue). Each sample was assigned a single allele at each SNP, based on sequencing read counts.
Allele frequencies at any given SNP in any given population were determined by analysing all genotyped samples (i.e. samples with undetermined genotype at the SNP were excluded from the computation). The non-reference allele frequency (NRAF) was computed as the proportion of genotyped samples in the population whose genotype was not the reference allele. The minor allele frequency (MAF) was computed as the proportion of genotyped samples carrying the least common genotype in the population (i.e. MAF=NRAF if NRAF<0.5; MAF=(1-NRAF) otherwise).
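The NRAF and MAF definitions above translate directly into code. The following sketch assumes genotypes are held as single-allele calls per sample, with `None` marking an undetermined genotype; the function name and input encoding are illustrative choices, not part of the published pipeline.

```python
from typing import Optional, Sequence, Tuple


def allele_frequencies(genotypes: Sequence[Optional[str]],
                       ref_allele: str) -> Tuple[float, float]:
    """Return (NRAF, MAF) for one SNP in one population.

    Samples with an undetermined genotype (None) are excluded, as in the text.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no genotyped samples at this SNP")
    non_ref = sum(1 for g in called if g != ref_allele)
    nraf = non_ref / len(called)
    # MAF = NRAF if NRAF < 0.5, otherwise 1 - NRAF
    maf = nraf if nraf < 0.5 else 1.0 - nraf
    return nraf, maf
```

For six samples with calls A, A, G, missing, G, G and reference allele A, five samples are genotyped and three carry a non-reference allele, giving NRAF = 0.6 and MAF = 0.4.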
In analyses where derived allele frequency (DAF) was used, we determined the putative ancestral state alleles at each position by comparison of the allele in the P. falciparum 3D7 reference sequence with alleles in homologous sequences observed in P. reichenowi (Pr) as previously described13. The putative ancestral/derived alleles at each SNP are included in the online 86k SNP dataset. The derived allele frequency (DAF) was computed as the proportion of genotyped samples carrying the derived allele in a population; the DAF was undetermined at positions where the putative ancestral/derived state could not be established.
We determined gene reading frames and exon boundaries from the PlasmoDB 5.5 annotation (www.plasmodb.org) of the 3D7 genome43. Each SNP in our 86k set was classified as synonymous or non-synonymous according to whether an amino acid change occurs when substituting the reference allele with the non-reference allele at that SNP in the 3D7 reference genome sequence, without any other changes.
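The synonymous/non-synonymous call described above amounts to substituting one base in the reference codon and comparing the translated amino acids. The sketch below illustrates this with the standard genetic code. It assumes the codon is given on the coding strand (so genes annotated on the reverse strand would first need reverse-complementing), and the function name and arguments are hypothetical rather than taken from the study's pipeline.

```python
from itertools import product

# Standard genetic code, enumerated in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TNN codons
               "LLLLPPPPHHQQRRRR"   # CNN codons
               "IIIMTTTTNNKKSSRR"   # ANN codons
               "VVVVAAAADDEEGGGG")  # GNN codons
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Classify a single-base substitution within one reference codon.

    `pos_in_codon` is 0, 1 or 2; no other change to the codon is considered.
    """
    if len(ref_codon) != 3 or pos_in_codon not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and a position of 0, 1 or 2")
    alt_codon = (ref_codon[:pos_in_codon] + alt_base
                 + ref_codon[pos_in_codon + 1:])
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "non-synonymous"
```

As an illustrative case (the codons are made up, not read from the 3D7 sequence), changing TAT to TCT replaces tyrosine with serine and is non-synonymous, whereas changing GGT to GGC leaves glycine unchanged and is synonymous.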
Chromosome painting44 is a probabilistic method of reconstructing the ancestry of individuals in a recombinant population. While commonly used methods for discovering population structure (such as ADMIXTURE45) tend to consider SNPs independently, chromosome painting utilizes linkage information in the dataset in its probability computation, based on the assumption that groups of linked alleles tend to be exchanged together during recombination. Hence, given a set of candidate recombinant “donor” populations, chromosome painting assigns the most probable ancestral population to a chunk of DNA from a given individual, based on its similarity to corresponding chunks in individuals from the donor populations (rather than performing this analysis on each SNP independently). This results in visualizations of the recombination patterns, which assign a colour to each donor population, and use these colours to paint each DNA chunk according to the most probable assignment.
We performed chromosome painting using the Chromopainter software package (http://www.maths.bris.ac.uk/~madjl/finestructure/chromopainter.html). The parameters Ne (effective population size) and μ (global mutation rate) were estimated in Cambodian samples by running the expectation-maximization (E-M) algorithm for 40 iterations, for each sample and chromosome. For this analysis, a small number of samples were excluded to ensure that no two samples in the subset were more than 99% similar. The final values for the parameters (Ne=8824, μ=0.000492) were computed as a weighted mean, according to chromosome length. These parameter values were used in conjunction with a uniform recombination map, to infer linkage information. We painted the chromosomes of each sample by running Chromopainter for 40 E-M iterations, maximizing over copying proportions, with the restriction that only individuals from KH1, KH2, KH3 and KH4 were considered as donors (i.e. KHA was assumed to be an admixture population). The plots were generated by assigning a colour to each donor population (KH1=purple, KH2=blue, KH3=red and KH4=green), with each segment being painted with the colour of the population to which its donor sample belonged.
We used ADMIXTURE45 (http://www.genetics.ucla.edu/software/admixture/) to estimate ancestry of the Cambodian samples. ADMIXTURE is a high-performance program for estimating ancestry in a model-based manner from large autosomal SNP genotype datasets. The ADMIXTURE model assumes low linkage disequilibrium between markers (SNPs). Accordingly, we filtered the SNP set in two steps. First, we removed all SNPs with extremely low MAF (MAF ≤ 0.01), as these SNPs are less informative for the inference process. Second, we excluded SNPs that appeared to be linked, according to the observed sample correlation coefficients. Using the plink tool-set (http://pngu.mgh.harvard.edu/~purcell/plink/), we scanned the genome with a sliding window of size 100 SNPs, advanced in steps of 10 SNPs, and removed any SNP with correlation coefficient ≥0.02 with any other SNP within the window. The 5484 SNPs that remained after this filtering (available upon request) were used to run ADMIXTURE with 5-fold cross validation, 1000 replicates for bootstrapping and K=4.
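The window-based filtering step can be sketched as follows. This is a simplified greedy version written for illustration: it is not plink's implementation, whose rules for choosing which SNP of a correlated pair to drop differ. It assumes genotypes are coded numerically (e.g. 0 for reference, 1 for non-reference) with no missing calls, and it uses the squared correlation r² as the linkage measure; both are assumptions, as are the function names. The window, step and threshold defaults mirror the values quoted in the text.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two SNPs across samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no linkage signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def prune_linked_snps(genotypes: Sequence[Sequence[float]], window: int = 100,
                      step: int = 10, threshold: float = 0.02) -> List[int]:
    """Return indices of SNPs kept after sliding-window pruning.

    `genotypes[i]` holds the calls for SNP i across all samples. Within each
    window, the later SNP of any pair with r^2 >= threshold is dropped.
    """
    n = len(genotypes)
    keep = [True] * n
    for start in range(0, n, step):
        idx = [i for i in range(start, min(start + window, n)) if keep[i]]
        for pos, a in enumerate(idx):
            if not keep[a]:
                continue
            for b in idx[pos + 1:]:
                if keep[b] and r_squared(genotypes[a], genotypes[b]) >= threshold:
                    keep[b] = False
    return [i for i, kept in enumerate(keep) if kept]
```

Dropping the later SNP of each pair keeps the sketch deterministic; a production tool would typically use a more considered rule, such as retaining the SNP with the higher minor allele frequency.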
To estimate the FST between two populations at a given SNP, we used

$$F_{ST} = 1 - \frac{\pi_s}{\pi_t}$$

where $\pi_s$ is the average probability that two samples chosen at random from the same population will carry different alleles at the SNP, and $\pi_t$ is the average probability that two samples chosen at random from the joint population will carry different alleles. If one of the alleles is observed with frequencies $p_1$ and $p_2$ in the two populations, we estimate

$$\pi_s = p_1(1 - p_1) + p_2(1 - p_2), \qquad \pi_t = 2\,\bar{p}\,(1 - \bar{p}), \qquad \bar{p} = \frac{p_1 + p_2}{2}$$

To estimate genome-wide FST between two populations, we used

$$GWF_{ST} = 1 - \frac{\Pi_s}{\Pi_t}$$

where $\Pi_s$ is the average of $\pi_s$ across SNPs within the two populations, and $\Pi_t$ is the average of $\pi_t$ across SNPs in the joint population. Given a genome-wide set of $M$ SNPs, the above can be expressed as

$$GWF_{ST} = 1 - \frac{\sum_{i=1}^{M} \pi_{s,i}}{\sum_{i=1}^{M} \pi_{t,i}}$$
From applying the estimators for πs and πt at each of the 86,158 SNPs, we derive genome-wide FST estimates.
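The per-SNP and genome-wide calculations can be sketched as below. The code assumes the standard two-population estimators, πs = p1(1 − p1) + p2(1 − p2) and πt = 2p̄(1 − p̄) with p̄ = (p1 + p2)/2, which follow from the definitions of πs and πt as probabilities of drawing two different alleles when the two populations are weighted equally. The function names are illustrative and not taken from the study's code.

```python
from typing import Iterable, Optional, Tuple


def pi_within(p1: float, p2: float) -> float:
    """Mean probability that two samples from the same population differ."""
    return p1 * (1.0 - p1) + p2 * (1.0 - p2)


def pi_total(p1: float, p2: float) -> float:
    """Probability that two samples from the joint population differ."""
    p_bar = (p1 + p2) / 2.0
    return 2.0 * p_bar * (1.0 - p_bar)


def fst_snp(p1: float, p2: float) -> Optional[float]:
    """Per-SNP FST = 1 - pi_s / pi_t; None if the SNP is monomorphic overall."""
    total = pi_total(p1, p2)
    if total == 0:
        return None
    return 1.0 - pi_within(p1, p2) / total


def fst_genome_wide(freq_pairs: Iterable[Tuple[float, float]]) -> float:
    """Genome-wide FST as 1 - (sum of pi_s) / (sum of pi_t) over all SNPs."""
    sum_s = 0.0
    sum_t = 0.0
    for p1, p2 in freq_pairs:
        sum_s += pi_within(p1, p2)
        sum_t += pi_total(p1, p2)
    if sum_t == 0:
        raise ValueError("no polymorphic SNPs in the input")
    return 1.0 - sum_s / sum_t
```

A SNP fixed for different alleles in the two populations (p1 = 1, p2 = 0) gives FST = 1, and a SNP with identical frequencies gives FST = 0. Because the genome-wide value is a ratio of sums rather than a mean of per-SNP ratios, SNPs with little overall diversity contribute little to it.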
We would like to thank Sambunny Uk (National Center for Parasitology, Entomology and Malaria Control, Cambodia) and Erika S. Phelps (NIAID, NIH, USA) for their contributions in the Cambodian studies. We are grateful to Tim Anderson for helpful review comments. The sequencing, genotyping and analysis components of this study were supported by The Wellcome Trust through core funding of the Wellcome Trust Sanger Institute (077012/Z/05/Z; 098051), core funding of the Wellcome Trust Centre for Human Genetics (075491/Z/04; 090532/Z/09/Z) and a Strategic Award (090770/Z/09/Z); and the Medical Research Council through the MRC Centre for Genomics and Global Health (G0600230) and an MRC Professorship to Dominic Kwiatkowski (G19/9). Other parts of this study were partly supported by the Wellcome Trust; the Medical Research Council; the Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health; and a Howard Hughes Medical Institute International Scholarship (55005502) to Abdoulaye Djimde. PR is a staff member of the World Health Organization; he alone is responsible for the views expressed in this publication and they do not necessarily represent the decisions, policy or views of the World Health Organization.
URLs
Information page for this study: http://www.malariagen.net/resource/12
Information page for the SNP list used: http://www.malariagen.net/resource/10
ENA archive records: http://www.ebi.ac.uk/ena/data/search/?query=plasmodium
WWARN Parasite Clearance Estimator: https://www.wwarn.org/research/parasite-clearance-estimator
R language: http://www.r-project.org/
R ape package: http://ape.mpl.ird.fr/
Accession Codes
All sequence data are available online at the European Nucleotide Archive (ENA), accessible at http://www.ebi.ac.uk/ena/data/search/?query=plasmodium. ENA accession numbers for all samples used in this study are listed in Supplementary Table 14.
Supplementary Information
The following Supplementary Information files are provided: a document containing the Supplementary Note, Supplementary Tables and Supplementary Figures referenced by the paper; and two Excel spreadsheets containing the electronic forms of Supplementary Table 7 and Supplementary Table 8.
The authors declare no competing financial interests.