Signatures of balancing selection operating on specific gene loci in endemic pathogens can identify candidate targets of naturally acquired immunity. In malaria parasites, several leading vaccine candidates convincingly show such signatures when subjected to several tests of neutrality, but the discovery of new targets affected by selection to a similar extent has been slow. A small minority of all genes are under such selection, as indicated by a recent study of 26 Plasmodium falciparum merozoite-stage genes that were not previously prioritized as vaccine candidates, of which only one (locus PF10_0348) showed a strong signature. Therefore, to focus discovery efforts on genes that are polymorphic, we scanned all available shotgun genome sequence data from laboratory lines of P. falciparum and chose six loci with more than five single nucleotide polymorphisms per kilobase (including PF10_0348) for in-depth frequency–based analyses in a Kenyan population (allele sample sizes >50 for each locus) and comparison of Hudson–Kreitman–Aguade (HKA) ratios of population diversity (π) to interspecific divergence (K) from the chimpanzee parasite Plasmodium reichenowi. Three of these (the msp3/6-like genes PF10_0348 and PF10_0355 and the surf4.2 gene PFD1160w) showed exceptionally high positive values of Tajima's D and Fu and Li's F indices and have the highest HKA ratios, indicating that they are under balancing selection and should be prioritized for studies of their protein products as candidate targets of immunity. Combined with earlier results, there is now strong evidence that a high HKA ratio (as well as the frequency-independent ratio of Watterson's θ/K) is predictive of high values of Tajima's D. Thus, the former offers value for use in genome-wide screening when numbers of genome sequences within a species are low or in combination with Tajima's D as a 2D test on large population genomic samples.
Identifying the most important targets of immunity expressed by large and complex eukaryotic pathogens is difficult but can benefit from studies of pathogen genetics and evolution. Experimental blood-stage infection and challenge studies on genetically crossed rodent malaria parasites in mice have mapped important targets of parasite strain-specific immunity to the loci encoding merozoite surface protein 1 (MSP1) (Martinelli et al. 2005; Cheesman et al. 2009) and apical membrane antigen 1 (AMA1) (Pattaradilokrat et al. 2007), with results indicating there may be few other such important polymorphic targets in that experimental system. However, human malaria parasites such as Plasmodium falciparum encode many proteins that are not present in rodent malaria parasites, the importance of which must be largely investigated without genetic crossing and infection experiments, as a relatively inaccessible experimental primate model makes such approaches difficult (Hayton et al. 2008).
A population genetic approach is to consider the effects of frequency-dependent selection on pathogens due to the memory component of acquired immune responses, which will generally lead to balancing selection maintaining polymorphism of the genes encoding immune targets in endemic pathogens (Conway and Polley 2002). A series of initial studies on the statistical distribution of DNA sequence polymorphism within and among populations has indicated strong signatures of balancing selection on particular vaccine candidate antigens of the malaria parasite P. falciparum, including AMA1 (Polley and Conway 2001; Cortes et al. 2003; Polley et al. 2003), MSP1 (Conway et al. 2000), MSP2 (Conway 1997; Ferreira and Hartl 2007), MSP3 (Polley et al. 2007), erythrocyte-binding antigen 175 (Baum et al. 2003; Verra et al. 2006), and thrombospondin-related adhesive protein (Weedall et al. 2007).
To identify new candidate targets of naturally acquired immunity in P. falciparum, we previously made a prospective search for signatures of balancing selection in less-studied merozoite-stage protein genes (Tetteh et al. 2009). Out of a panel of 26 genes screened, 1 had a strong signature (PF10_0348, a gene predicted to encode an MSP3/6-like protein) (Tetteh et al. 2009), for which the protein has been recently characterized independently and named DBLMSP (merozoite surface protein containing a duffy binding–like [DBL] domain) (Wickramarachchi et al. 2009). This hit rate is low, consistent with expectations that balancing selection generally operates on only a small minority of genes (Bubb et al. 2006; Charlesworth 2006; Andres et al. 2009), and prompts consideration of modifications that could increase efficiency in screening for such genes. We had selected that initial panel for study regardless of any preexisting information on polymorphism, as such data were scant, and had taken a two-step approach to 1) first identify genes with a high ratio of polymorphism (π) to interspecific divergence (K, from the chimpanzee parasite Plasmodium reichenowi) in the Hudson–Kreitman–Aguade (HKA) test or a skew in the intraspecific versus interspecific nonsynonymous (NS) to synonymous (S) ratios in the McDonald–Kreitman (MK) test, by sequencing the genes from a global sample of P. falciparum laboratory isolates and P. reichenowi, and 2) select the genes with the highest positive HKA or MK ratios for endemic population–based analysis of allele frequency distributions with Tajima's D and Fu and Li's F indices.
The expanding availability of data from shotgun genome sequences of different P. falciparum isolates (Jeffares et al. 2007; Mu et al. 2007; Volkman et al. 2007) together with highly accessible tools for browsing such data (Aurrecoechea et al. 2009) now allows easier and more rapid screening for genes showing an unusually high level of polymorphism, so these can be immediately prioritized for population-based studies to test for signatures of selection. Here, to select candidate loci for population-based analysis in an endemic Kenyan site, we first identified six genes with an exceptionally high number of single nucleotide polymorphisms (SNPs) per kilobase, including PF10_0348 and five others that had not been previously tested (one other msp3/6-like paralogue and four unrelated genes expressed at the schizont and merozoite erythrocytic stages of infection). In the population analysis, the PF10_0348 gene gave a highly positive value of Tajima's D, as did two of the other five genes (the msp3/6-like PF10_0355 and the surf4.2 gene PFD1160w), whereas the remaining three were negative. Thus, the positive screening for polymorphic genes gave an enhanced hit rate, enabling us to confirm one and identify two new genes that have signatures of balancing selection. The protein products of allelic forms of these three genes are now prioritized for study as candidate targets of immunity. Using these and previous data on other genes, we show that allele frequency–based and polymorphism-versus-divergence analyses give independent signatures that are useful in screening for balancing selection, either separately or in 2D tests, and recommend the application of both in large-scale discovery approaches.
We accessed available SNP and stage-specific transcript profile data on the PlasmoDB website (www.plasmodb.org) (Aurrecoechea et al. 2009) in November 2007 (PlasmoDB release 5.4) to screen for highly polymorphic genes expressed in the replicating asexual blood stage of P. falciparum. We found 490 annotated genes with evidence of maximum relative transcription at the later stages of the ~48-h asexual cycle (schizonts and merozoites or >30 h into the cycle up until the stage of newly invaded “ring” stages) (Bozdech et al. 2003; Llinas et al. 2006), and on each of these we performed SNP density screening for all possible pairwise comparisons of alleles in available data from 13 laboratory P. falciparum isolates of diverse geographical sources: 3D7, HB3, Dd2, D10, V1_S, 7G8, RO33, K1, D6, FCC2, FCB, FCR3, and IT. Genes greater than 1.0 kb in length with a density of at least around five SNPs per kilobase were identified as suitable for analysis, excluding those that had been the focus of previous studies elsewhere because they encode vaccine candidates or are members of the well-studied eba, Rh, and RhopH1/clag gene families. Six highly polymorphic genes with peak transcription during the schizont/merozoite stage were thereby selected for population-based analysis: the surf4.1 gene PFD0100c (chromosome 4; chr 4), the surf4.2 gene PFD1160w (chr 4), PF07_0004 (chr 7), PF10_0342 (chr 10), the msp3/6-like PF10_0348 (chr 10), and the surf13.1 gene PF13_0075 (chr 13). As we encountered inconsistent amplification of PF13_0075 from field isolates in initial attempts, for the purpose of this study we replaced it with another highly polymorphic gene, PF10_0355 (chr 10), which is more constitutively transcribed but might function in schizonts and merozoites, as it is MSP3/6-like in structure and paralogous to PF10_0348.
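The pairwise SNP density screen can be illustrated with a minimal Python sketch. It is not the PlasmoDB procedure, whose exact calculation is not described here; it assumes aligned sequences of equal length for one gene, and the function names, the use of the maximum pairwise density as the screening value, and the default thresholds (taken from the "greater than 1.0 kb" and "around five SNPs per kilobase" criteria) are illustrative assumptions.

```python
from itertools import combinations


def pairwise_snp_density(seqs):
    """Return {(name_a, name_b): SNPs per kb} for every pair of aligned sequences.

    seqs maps isolate name -> aligned nucleotide sequence (hypothetical input
    format). Positions where either sequence is not A, C, G or T (gaps,
    ambiguous bases) are skipped and do not count toward the compared length.
    """
    densities = {}
    for (name_a, a), (name_b, b) in combinations(seqs.items(), 2):
        if len(a) != len(b):
            raise ValueError("sequences must be aligned to equal length")
        compared = diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                diffs += x != y
        densities[(name_a, name_b)] = 1000.0 * diffs / compared if compared else 0.0
    return densities


def passes_screen(seqs, min_length=1000, min_density=5.0):
    """Assumed rule: gene longer than min_length bp and at least one pair of
    isolates differing at min_density or more SNPs per kb."""
    length = len(next(iter(seqs.values())))
    densities = pairwise_snp_density(seqs)
    return length > min_length and max(densities.values(), default=0.0) >= min_density
```

Other summaries across pairs (for example, the mean density, or the total number of segregating sites per kilobase) would be equally defensible choices for the screening value.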
Cross-sectional venous blood samples were obtained in August–September 1998 from a broad sample of children and adults (age range 0.5–80 years, median 8 years) resident in Ngerenya village in Kilifi District, Kenya, in a study of malaria reviewed and approved by the Scientific Steering Committee and the Ethics Committee of the Kenya Medical Research Institute. This population had low to moderate endemic malaria transmission at the time of sampling (Mwangi et al. 2008). Parasite DNA was extracted from frozen heparinized venous blood samples from individuals that were slide positive for P. falciparum, using the QIAamp DNA Blood Mini Kit (QIAGEN, Crawley, UK). A population sample of at least 50 allele sequences of each gene was sought as optimal for frequency-based tests, so the six genes selected for study were each amplified from 90 of these individual parasite-positive DNA isolates, using oligonucleotide primers and amplification conditions listed in supplementary table S1 (part A) (Supplementary Material online). The P. reichenowi orthologues of these genes were amplified from DNA from the blood of a laboratory chimpanzee infected with P. reichenowi (CDC-1 “Oscar” strain), using primers and amplification conditions listed in supplementary table S1 (part B) (Supplementary Material online); the PF10_0348 orthologue sequence in P. reichenowi had been obtained previously (Tetteh et al. 2009). Polymerase chain reaction (PCR) products were purified with the QIAquick PCR Purification Kit (QIAGEN) and sequenced using the amplification primers and several internal sequencing primers, with ABI BIGDYE terminator v3.1 chemistry and electrophoresis on ABI 3130xl and ABI 3730 capillary sequencers (Applied Biosystems, Warrington, UK). Sequences were assembled, edited, and aligned using SeqMan and MegAlign software (Lasergene 7; DNASTAR, Madison, WI).
For each locus, each isolate giving clear single-allele sequence representing the sole or predominant haploid parasite allelic type within the blood was analyzed, whereas isolates that showed mixed and electrophoretically superimposed allele sequences were not analyzed. All singleton nucleotide polymorphisms were confirmed by independent reamplification and resequencing from the relevant samples.
Tests for departures from neutrality were based on allele frequency distribution indices (Tajima's D, and Fu and Li's F) and comparisons of variation within and between species (HKA and MK ratios), using DnaSP5.0 (Rozas 2009). Tajima's D test takes into account the difference between average pairwise nucleotide diversity between sequences (π) and Watterson's population nucleotide diversity parameter theta (θ) expected under neutrality from the total number of segregating sites (S) (Tajima 1989c). Fu and Li's F test statistic is based on the difference between the observed number of singleton nucleotide polymorphisms and the number expected under neutrality given the total number of segregating sites and Watterson's estimate of nucleotide diversity (θ) (Fu and Li 1993). The HKA ratio is used to identify genes with exceptionally high ratios of polymorphism (π) over divergence (K) from a closely related species (in this case P. reichenowi) (Hudson et al. 1987; Innan 2006). The MK test counts the numbers of NS and S polymorphic sites within species and fixed differences between closely related species, using a Fisher's exact test on the 2 × 2 contingency table (McDonald and Kreitman 1991).
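As an illustration of the frequency-based index, the following Python sketch computes Tajima's D from an alignment using the standard constants of Tajima (1989). The analyses reported here were run in DnaSP; this sketch is not that implementation, the function names are hypothetical, and it assumes equal-length aligned haploid sequences, skipping any alignment column that contains a gap or ambiguous base.

```python
import math


def summarize(seqs):
    """Return (n, S, k): sample size, number of segregating sites, and mean
    number of pairwise differences, for equal-length aligned sequences."""
    n = len(seqs)
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pair_diffs = 0.0
    for column in zip(*(s.upper() for s in seqs)):
        if any(base not in "ACGT" for base in column):
            continue  # skip columns with gaps or ambiguous bases
        counts = [column.count(b) for b in "ACGT"]
        if max(counts) < n:  # more than one allele present at this site
            seg_sites += 1
            # pairs that differ = all pairs minus pairs sharing the same base
            pair_diffs += n_pairs - sum(c * (c - 1) / 2 for c in counts)
    return n, seg_sites, pair_diffs / n_pairs


def tajimas_d(n, seg_sites, k):
    """Tajima's D = (k - S/a1) / sqrt(e1*S + e2*S*(S-1)).

    Returns NaN when there are no segregating sites or n < 4, where the
    variance term is zero and the statistic is undefined."""
    if seg_sites == 0 or n < 4:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k - seg_sites / a1) / math.sqrt(variance)
```

Calling `tajimas_d(*summarize(alignment))` on a list of aligned allele sequences gives a single locus-wide value. Positive values arise when the mean pairwise diversity exceeds the diversity expected from the number of segregating sites, that is, when intermediate-frequency variants are in excess.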
Linkage disequilibrium (LD), the association of nucleotide variants at different polymorphic sites, was assayed for all informative pairs of polymorphic sites within the genes, using DnaSP5.0. The r² indices (square of the correlation coefficient of allelic states at each pair of sites) were calculated and tested for departures from randomness by Fisher's exact test. The value of r² ranges from 0 to 1, although its values are constrained by the underlying allele frequencies (Hill and Robertson 1968).
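The LD index for one pair of sites can be sketched in Python as follows. This is a minimal illustration of the Hill and Robertson statistic for two biallelic sites scored in haploid allele sequences, not the DnaSP implementation; the function name is hypothetical and the significance test (Fisher's exact test) is not included.

```python
def r_squared(site_a, site_b):
    """Squared correlation between allelic states at two biallelic sites.

    site_a and site_b hold the nucleotide observed at each site in the same
    ordered set of haploid allele sequences (strings or lists)."""
    if len(site_a) != len(site_b):
        raise ValueError("both sites must be scored in the same sequences")
    alleles_a = sorted(set(site_a))
    alleles_b = sorted(set(site_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic")
    n = len(site_a)
    ref_a, ref_b = alleles_a[0], alleles_b[0]  # arbitrary reference alleles
    p_a = site_a.count(ref_a) / n
    p_b = site_b.count(ref_b) / n
    p_ab = sum(1 for x, y in zip(site_a, site_b) if x == ref_a and y == ref_b) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

The frequency constraint is visible in small cases: two sites whose alleles each occur at frequency 0.5 can reach a value of 1, whereas a site with allele frequencies 0.75 and 0.25 paired with a site at 0.5 and 0.5 cannot exceed 1/3 even when the association is as strong as the frequencies allow.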
For each of the six loci, a majority of the 90 Kenyan isolates each yielded a clear single-allele sequence, representing the P. falciparum allelic type that was dominant within each blood sample. Mixed P. falciparum genotype infections in a proportion of isolates yielded electrophoretically superimposed allelic sequences, excluded from analysis here (numbers excluded generally differ among genes due to the relative amounts of sequence polymorphism, particularly disruptive effects of repeat polymorphisms, and stochastic effects of PCR). Thus, the sample size of alleles for each locus was in excess of the target number of 50 for statistical power, ranging from 51 (for PFD0100c) to 79 (for PF10_0342) (table 1). Three of the genes (PF07_0004, PF10_0348, and PF10_0355) contained regions of repeat sequences that were excluded from the alignment-based tests below (the amino acid translations of these repeats are shown separately in supplementary fig. S1, Supplementary Material online). The positions of nucleotide polymorphisms in the genes in this population and fixed differences from P. reichenowi are shown schematically in figure 1. Full alignments of all alleles for each of the genes are given in supplementary figure S2 (Supplementary Material online) and in EMBL Nucleotide Sequence Database alignment files (accession numbers are listed in supplementary table S2, Supplementary Material online).
Of the six genes in this Kenyan population sample, PF10_0342 showed the lowest nucleotide polymorphism (π = 6.5 × 10⁻³) and PFD1160w showed the highest (π = 43.0 × 10⁻³) (table 1). The HKA ratio of polymorphism (π) divided by interspecific divergence from P. reichenowi (K) ranged from 0.11 for PF07_0004 up to 0.52 for PF10_0348 (table 1). These values are higher than those indicated for the genes from a screen of shotgun genome sequence data (accessible at www.plasmodb.org), as some of the gene regions with low levels of polymorphism were not studied here (e.g., the first exon of PF07_0004, the second exon of PFD1160w, and the second and third exons of PFD0100c). Only one locus (PFD1160w) had a significant MK test result showing an excess of NS versus S polymorphisms compared with fixed differences (P = 0.02). Three of the genes (PFD1160w, PF10_0348, and PF10_0355) had highly positive values of Tajima's D and Fu and Li's F indices, indicative of balancing selection (these positive values, respectively, indicate fewer rare alleles and fewer singletons than expected under neutrality) (table 1). The PF10_0348 gene had previously been studied in a Gambian population in which similar indices were obtained, but the positive result for the other two genes is new. The remaining three genes analyzed showed negative values of these indices.
There is evidence of extensive recombination in all these genes within the population, as LD indices are maximal between sites that are close together, and significant values are mostly among sites separated by <0.5 kb (fig. 2). The PF10_0355 gene has a marked dimorphic structure throughout much of its sequence (supplementary fig. S2, Supplementary Material online), so there are more extended strong LD values in this gene (fig. 2). A high recombination rate indicates that signatures of selection on one part of a gene are not likely to be reflected in the pattern of nucleotide polymorphism throughout the gene, so sliding window analyses were performed for Tajima's D (fig. 3) and Fu and Li's F (supplementary fig. S3, Supplementary Material online). These show exceptionally high values of both indices for aligned nucleotides 1000–1200 and 1500–1900 in PFD1160w and aligned nucleotides 500–1200 in PF10_0348, with PF10_0355 having very high values throughout most of its dimorphic sequence and lower values only near the 5′- and 3′-ends of the alignment. The PFD1160w gene additionally shows an unusual signature of strong LD between a limited number of widely separated sites (~1500 bp apart) in the gene, indicating that there may be epistatic interactions between polymorphisms in different parts of the protein. The other three genes did not show any sliding windows of significantly elevated Tajima's D or Fu and Li's F values, consistent with their lack of overall departure from neutrality. In contrast, two windows of significantly negative values for PFD0100c are likely to reflect directional selection on polymorphisms in that gene.
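A sliding window analysis of this kind can be sketched generically in Python: a per-window statistic is applied to successive overlapping slices of the alignment. The window and step sizes used for figure 3 are not stated in the text, so the defaults below (100 and 25 aligned nucleotides) are placeholders, and the function names are hypothetical. Any statistic that accepts a list of aligned sequences, such as a Tajima's D function, can be passed in; a simple count of segregating sites is used here to keep the example self-contained.

```python
def count_segregating(seqs):
    """Number of alignment columns with more than one distinct character."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def sliding_statistic(seqs, stat, window=100, step=25):
    """Apply stat to each window of the alignment.

    Returns a list of (start, end, value) tuples with 0-based, end-exclusive
    coordinates. Only full-length windows are evaluated."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        end = start + window
        results.append((start, end, stat([s[start:end] for s in seqs])))
    return results
```

Because adjacent windows overlap, their values are not independent, so a run of elevated windows is better read as one region than as several separate signals.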
These three genes with the highest values of Tajima's D (and Fu and Li's F) also had the highest values of the polymorphism-versus-divergence HKA ratio (π/K), a correlation among the indices similar to that seen for a set of genes previously studied in a Gambian population (fig. 4A). This supports the suggestion that the HKA ratio (which requires fewer allelic sequences to derive an accurate estimate than Tajima's D) may be a useful screen for genes under balancing selection when population-based data are limited. As both these indices involve the pairwise nucleotide diversity parameter π in the numerator, despite being based otherwise on very different data, they are not fully independent statistically, as some correlation may be due to variance in π among loci. However, by substituting Watterson's θ as the nucleotide diversity parameter (based on numbers of polymorphic sites and independent of allele frequencies) in place of π, a modified polymorphism-versus-divergence ratio (θ/K) is obtained that can be tested for correlation with Tajima's D. A positive correlation between these indices would not be expected under neutrality, as θ enters the calculation of Tajima's D as a negative term: D = (π − θ)/SD(π − θ) (Tajima 1989c). This correlation is significantly positive for the polymorphic genes studied (fig. 4B), which reflects independently concordant signals of selection from the frequency-based and polymorphism-versus-divergence indices.
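The two polymorphism-versus-divergence ratios can be computed side by side, as in the following Python sketch. It assumes that the summary quantities for a locus are already available (sample size n, number of segregating sites S, mean number of pairwise differences k, number of aligned sites analyzed, and per-site divergence K from the outgroup); the function name and argument names are hypothetical, and K is taken as given rather than estimated here.

```python
def diversity_to_divergence(n, seg_sites, k, length, divergence):
    """Return per-site pi and Watterson's theta, and their ratios to K.

    pi    = k / length                (depends on allele frequencies)
    theta = S / (a1 * length)         (depends only on the number of
                                       segregating sites and sample size)
    where a1 = sum of 1/i for i = 1 .. n-1.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    pi = k / length
    theta = seg_sites / (a1 * length)
    return {
        "pi": pi,
        "theta": theta,
        "pi_over_K": pi / divergence,
        "theta_over_K": theta / divergence,
    }
```

For a fixed K, a locus with many segregating sites has a high θ/K, but a high θ by itself pushes the numerator of Tajima's D (π − θ) downward. A locus that scores high on both θ/K and D must therefore carry an excess of intermediate-frequency variants on top of its excess of polymorphic sites.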
It remains vital to identify important targets of immunity among the many proteins expressed by malaria parasites, apart from the relatively small set that have been studied already as vaccine candidates. This study identifies three genes to be under strong balancing selection (the surf4.2 gene PFD1160w, the msp3/6-like dblmsp gene PF10_0348, and the msp3/6-like PF10_0355), out of six highly polymorphic genes subjected to allele frequency–based analyses on a Kenyan population and tests of diversity versus divergence from P. reichenowi. One of these (PF10_0348) was previously identified as the only gene under strong balancing selection after analysis of 26 merozoite-stage genes that were not chosen on the basis of prior polymorphism data. As expected, focusing analysis on genes with high nucleotide diversity yields a higher hit rate of genes with signatures of balancing selection. The potential extra benefit of using a polymorphism-versus-divergence ratio in such a screen should encourage more sequencing of related species such as P. reichenowi (Jeffares et al. 2007; Krief et al. 2010; Prugnolle et al. 2010).
Population genomic studies on P. falciparum should identify further loci under balancing selection, and these should ideally be based on large samples from each population, such as the single endemic village community in Kenya studied here. Admixture or pooling of samples between two divergent populations could lead to elevated Tajima's D values generally (Tajima 1989a), and this can be reduced by sampling individual well-defined populations that are not derived by secondary contact between two separate populations. In contrast, a historical population expansion would tend to lead to reduced Tajima's D values (Tajima 1989b), and this appears to be a feature of P. falciparum populations in Africa (Joy et al. 2003; Verra et al. 2006). Importantly, inferences of nonneutrality should be based on comparisons across a set of loci, as we have demonstrated here and as should be realized fully in genome-scale population analyses (Nielsen 2001).
Immunological analyses of allelic protein products of each of the three hits from the current study can now be prioritized. Initial studies on the proteins expressed by these genes support the idea that they could be targets of immunity. The SURFIN4.2 protein accumulates in the parasitophorous vacuole (the compartment between the intraerythrocytic parasite and the erythrocyte cytoplasm) and appears to be transported to the infected erythrocyte membrane as well as being associated with released merozoites. High concentrations of rabbit antibodies raised to SURFIN4.2 had an inhibitory effect on erythrocyte invasion, suggesting that the protein may be accessible to inhibitory antibodies in vivo (Winter et al. 2005). The DBLMSP is specifically located on the merozoite surface, and a recombinant protein fragment incorporating the DBL domain showed binding to human erythrocyte surface that was neuraminidase and trypsin sensitive and that could be inhibited by murine antibodies raised to the protein (Wickramarachchi et al. 2009). Although expression of PF10_0355 transcript is not narrowly stage specific, one report indicates that antibodies raised to a recombinant protein fragment recognize the merozoite surface (Singh et al. 2009), and other data indicate it is present on a minority of merozoites within a laboratory cloned line (Knuepfer E and Holder AA, personal communication). This raises the important question of whether these proteins are variantly expressed in abundance or cellular location within or on the parasite surface, as well as being structurally polymorphic among allelic forms.
Although each of these genes occurs at a single locus, it is possible that some polymorphisms are derived from more complex evolutionary history than would classically occur at a single-copy gene. Gene copy number variation has been identified to be widespread in the P. falciparum genome, not only in subtelomeric regions where it is concentrated but also at many loci on different chromosomes (Kidgell et al. 2006; Ribacke et al. 2007; Cheeseman et al. 2009; Mackinnon et al. 2009). For example, the dblmsp gene (PF10_0348) has a paralogous copy in 3 of 14 laboratory isolates studied (Tetteh et al. 2009), and one of the polymorphic genes without a selective signature in this study, the surfin4.1 gene (PFD0100c), has six copies in laboratory isolate FCR3 and one copy in other isolates (Mphande et al. 2008).
Importantly for future studies, we show that the allele frequency–based Tajima's D and the polymorphism-versus-divergence ratio (π/K or the allele frequency–independent Watterson's θ/K) are significantly positively correlated, indicating that signatures of selection are discernible from each type of index with some concordance. Thus, 2D tests based on allele frequency distribution spectra and ratios of polymorphism to interspecific divergence can be recommended (Innan 2006; Zhai et al. 2009) for large-scale genome-wide screens to identify further signatures of balancing selection on malaria parasites.
We are grateful to Tabitha Mwangi for leading the epidemiological study that provided the population sample of Plasmodium falciparum parasites, Brett Lowe for encouragement of this investigation and coordination of laboratory management, Alan Thomas and Clemens Kocken for provision of Plasmodium reichenowi DNA, and all colleagues who have discussed ideas on this investigation.
This work is published with the permission of the director of the Kenya Medical Research Institute (KEMRI). This work was supported by Wellcome Trust (074695/Z/04/B).