|Home | About | Journals | Submit | Contact Us | Français|
DJ processed SSAHA data, produced diversity and evolutionary measures, analysed the data and wrote the manuscript.
ETD and MB directed the project and assisted with analysis of the data and writing of the manuscript.
AP and AB assisted with analysis/processing of the data and biological interpretation of the data
AT collected the P. reichenowi sample and extracted DNA
KS assisted with data processing/analysis
AC provided SSAHA mapping
JS assisted with data processing
CI re-sequenced genes and manually verified SNPs
A-CU assisted with with parasite DNA extraction
SKr assisted biological interpretation of the data and manuscript, and with A-CU parasitology
CN shaped some of the initial ideas for the project, assisted in biological interpretation of the data, and with parasite DNA extraction
SKy grew the IT parasite, purified and extracted DNA from parasites
Infections with the malaria parasite Plasmodium falciparum result in more than one million deaths each year worldwide1. The evolutionary history and genetic variation of P. falciparum is of critical importance to the understanding of the evolution of drug resistance, the identification of potential vaccine candidates and an appreciation of the effect of parasite variation upon prevalence and severity of malaria in humans. Most studies of natural variation in P. falciparum have been either in depth over small genomic regions (up to the size of a small chromosome2) or genome-wide but only at low resolution3. In an effort to complement these studies with genome-wide data, we undertook shotgun sequencing of a Ghanaian clinical isolate (5x coverage), the IT laboratory isolate (1x) and the chimpanzee parasite P. reichenowi (2x). These genomes were compared to the fully sequenced clone4. We describe the most salient features of P. falciparum polymorphism and adaptive evolution with relation to gene function, transcript and protein expression and cellular localisation. This analysis reveals the primary evolutionary changes that have occurred since the P. falciparum – P. reichenowi speciation, and those that are occurring within P. falciparum.
In this study we compare the completely sequenced genome of the P. falciparum laboratory-cultured clone 3D7 (Ref 4) to whole genome sequence data from the chimpanzee malaria parasite P. reichenowi, the closest relative of P. falciparum; a non-cultured clinical P. falciparum isolate, directly isolated from a patient infected in Ghana (PFCLIN); and a P. falciparum laboratory model, clone IT (IT). Sequence data were produced by whole genome shotgun sequencing to five-, two- and one-fold coverage for PFCLIN, P. reichenowi and IT, respectively. We used SSAHA2 (Ref 5) to align sequence reads to the reference P. falciparum genome 3D7. The fractions of the genome that could be reliably analysed (see methods) were; 74% using PFCLIN, 25% with IT and 42% with P. reichenowi (excluding the majority of var genes). Single nucleotide differences (SNPs or fixed substitutions) and indels were identified from these alignments based on standard calling strategies and additional filters. We identified 27,169 non-redundant SNPs between 3D7, PFCLIN and IT (nucleotide diversity: 3d7 vs. PFCLIN π = 0.00131; 3D7 and IT π = 0.0011). We called 216,619 fixed differences between 3D7 and P. reichenowi (divergence, d = 0.0203). Based on coverage by multiple reads, the false discovery rate for both SNPs and fixed differences calls is estimated to be 10% (90% of the calls are correct – see methods). We also identified many insertions/deletion events (indels) (see Table 1). This data indicates that there is substantial variation in P. falciparum genomes and that the divergence between P. reichenowi and P. falciparum is approximately 10-fold greater than the within-species polymorphism.
In regions where the prevalence of malaria is high, many individuals harbour multiple P. falciparum genotypes that survive at low parasitemia during asymptomatic infection6. Clinical episodes of malaria are caused by a new infection, the majority of which are of a single genotype. To estimate the multiplicity of infection in the patient from whom PFCLIN was isolated, we estimated heterozygosity based on overlapping high quality reads. Approximately 7-10% of the SNP sites called between 3D7 and PFCLIN have good support for both allelic states in PFCLIN, while this is only about 5% for the IT, which is expected to be monomorphic. By subtracting the noise inferred from IT we estimate that about 2-5% of the PFCLIN sample (genome-wide) corresponds to other genotypes.
Culture-adapted P. falciparum strains may be imperfect models for the genome due to extensive accumulation or loss of sequences in culture. We investigated whether this had occurred in the culture-adapted 3D7 isolate by examining the indel events we detected in the PFCLIN and P. reichenowi comparisons, neither of which has been adapted to in vitro culture. Large numbers of small indels were identified in this analysis (Table 1). In both comparisons we observed an excess of “insertions” in 3D7 (ancestral state not known), indicating that the 3D7 isolate had accumulated DNA in culture at the rate of approximately +1nt/kb (see Supplementary Methods). Indels were underrepresented in coding sequences (0.07 indels/kb, compared to 3.75 indels/kb in other regions), without any bias towards gain or loss of DNA relative to the reference 3D7 and frequently multiples of three, thus maintaining open reading frames. This study excludes much of the subtelomeric regions that have been the focus of prior observations of sequence loss/gain7,8, and did not detect any indels greater than ~100 nt long due to the short length of read alignments. It is therefore possible that accumulation of sequences in interstitial regions occurs simultaneously with loss and rearrangement events towards the chromosome ends7,8.
To estimate the selective effects acting on individual genes we calculated the ratio of non-synonymous to synonymous substitution rates (dN/dS) for the three pair-wise comparisons for each gene with at least 100 codons of read coverage (n = 4,686, 2,437 and 3,674 genes for PFCLIN, IT and P. reichenowi respectively). This produced 762, 207, and 3024 genes for which dN/dS estimates could be calculated for PFCLIN, IT and P. reichenowi. All dN/dS distributions were skewed toward zero (median = 0.27, 0.30 and 0.17, for PFCLIN, IT, and P. reichenowi), as expected, since most genes are under purifying selection (Fig. S1). However, some genes exhibited elevated dN/dS values (see Supplementary Table 1) that in some cases were above 1. Many genes that are under positive selection may have dN/dS values below 1 if only a fraction of the amino acids are under positive selection9. We also performed McDonald-Kreitman tests (MK test10) using coding sequences with read coverage for all three P. falciparum isolates (3D7, PFCLIN and IT) and P. reichenowi (see methods). MK tests indicated categories of genes that exhibited either an excess of divergence (likely explained by directional selection) or an excess of polymorphism (likely diversifying selection) (see Supplementary Table 1) in the isolates we examined.
Studies of other organisms have shown the evolutionary rates (of protein-coding genes, dN/dS) are often correlated with mRNA and protein expression levels11. We used P. falciparum expression studies to look for similar correlations. Both total protein mass spectrometry counts (corrected for protein length, see methods)12, and total transcript levels13 were highly significantly correlated with P. reichenowi and PFCLIN dN/dS estimates (Fig. 1a,b, Figure S2). In all cases rapidly evolving genes were expressed at low levels, and abundantly expressed genes were conserved during evolution. The duration of gene expression (measured by the number of developmental stages in which a gene is expressed, data from Ref 13) was also negatively correlated with the 3D7-P. reichenowi dN/dS values (Spearman rank correlations, protein r = −0.24, P < 2.2×10−16, transcript r = −0.22, P < 2.2×10−16) (Fig. 2a, b). These observations were very robust even after correction for total expression effects (Supplementary Methods) or with the use of different expression and genetic variation data sets (Fig. S2). Although the incomplete coverage and relatively high SNP false detection rate preclude detailed analysis of individual genes, these results indicate that global analysis of our dN/dS distributions provides biologically meaningful information. We also observed heterogeneity between dN/dS estimates of genes that were primarily expressed primarily in a single developmental stage of the Plasmodium life cycle (Fig. 2c). Therefore, highly specialized genes in the Plasmodium genome are more likely to undergo accelerated selective processes either due to relaxed constraint or due to directional selection.
The events that occur between the invasion of the host erythrocyte to its eventual rupture to release daughter merozoites determine the clinical manifestations of malaria. We used a detailed study of the intra-erythrocytic developmental cycle (IDC)14 to examine this stage in more detail. In that study it was found that ~80% of the nuclear-encoded IDC-expressed genes could be clustered into 13 groups that were both co-expressed, and were functionally related. When we compared the dN/dS values split in these clusters, we found that two of these clusters, “merozoite invasion” and “early ring transcripts”, showed significantly elevated rates of nonsynonymous change (high dN/dS) in the 3D7 - P. reichenowi comparison (Fig. 3a). MK tests indicated that both these clusters contained significant excess of nonsynonymous polymorphisms in P. falciparum relative to fixed differences (merozoite NI = 2.5, Fisher's exact test P = 0.009; ring NI = 10.4, P = 1.59×10−9), indicating that these proteins were also highly variable within P. falciparum isolates.
To test whether proteins that were directly interacting with host cells were under adaptive evolution we examined the evolutionary rates of genes grouped by their intracellular localisation using the cellular component aspect of the Gene Ontology (GO) classification15, and definitions of predicted exported proteins16. Proteins localised to the nucleus, cytoplasm and the mitochondria were generally conserved, while apicoplast-localised proteins, predicted membrane-spanning proteins, and predicted exported proteins had evolved significantly more rapidly (Fig. 3b). Exported and transmembrane proteins also show elevated evolutionary rates in rodent parasites17. These observations, and previous studies indicating that the strongest evidence for positive selection between human and chimpanzee is related to immunity and defense18, are consistent with the model of an ‘evolutionary arms race’ between the mammalian immune system and the exposed proteins of Plasmodium parasites. As expected given this ‘arms race’ model, genes annotated as ‘antigens’ have significantly higher dN/dS distributions from the P. reichenowi comparison (Mann-Whitney, P = 2.67×10−3, antigens median = 0.29, all others median = 0.17).
We also analyzed the evolutionary rates of genes with respect to their Gene Ontology (GO) biological process annotations. 2098 genes with detailed GO annotation were mapped to a cut-down version of the full ontology (GO-slim) with 19 broad categories (Supplementary Table 1). Of the 19 categories, 7 did not have significantly different P. reichenowi –3D7 dN/dS values from other GO-defined genes, 10 had significantly lower dN/dS values, mainly involved in metabolism and cell cycle (data not shown), and 2 GO-slim categories (“cell communication” and “entry into host cell” ) had significantly higher dN/dS estimates (Fig. 3c). The “cell communication” category also showed a significant excess of nonsynonymous polymorphisms with the MK test (NI = 2.8, Fisher's exact P = 7.1×10−3).
Collectively, our evolutionary analysis indicates that the most significant functional changes between P. falciparum and P. reichenowi have been membrane and exported proteins, and proteins involved in merozoite invasion and subsequent ring stage parasite growth. Although our data do not allow us to distinguish which lineage has undergone positive selection, adaptive changes in these functions in one or both of these species are consistent with differences in the biology of the parasites. P. falciparum produces 10 – 12 merozoites, in contrast to 2 or 3 times this number in P. falciparum19 and some data suggests that P. reichenowi is less pathogenic to chimpanzees than P. falciparum is to humans19. Both P. falciparum and P. reichenowi have a strong preference for their host species19,20, which may be due to differences in the sialic acid modifications to erythrocyte surface proteins between human and chimpanzee, which are major determinants of the invasion of Plasmodium parasites20.
Our analysis describes genome-wide patterns of variation but can also indicate specific genes of interest. The list of the genes with the highest number of nonsynonymous polymorphisms in the PFCLIN isolate is comprised largely of uncharacterised genes (Supplementary Table 1). However it also contains examples of well characterised surface antigens, such as 2 VAR-like genes (PFL0030c and PFE0040c), PfEMP2 and R45. Thus, the other members of the list may include additional uncharacterised antigens. Of these, gene PF10_0355 is particularly striking; it has the second highest total SNP number and indicating that is may be under immune selection and, like VAR genes, it contains a DBL (duffy binding like) domain. The chromosomal location of PF10_0355 contains a cluster of genes that have previously received much attention: the vaccine candidate, liver stage antigen 1 (PF10_0356) is encoded downstream and several genes encoding merozoite surface proteins are found upstream. An additional 15 genes were identified as having a significant excess of nonsynonymous polymorphisms (MK test: P < 0.05, Supplementary Table 1). These 15 genes encode for two known polymorphic antigens; surf13.1 (PF13_0074) a member of the Surfin protein family, which have been identified on the cell membrane fractions of merozoites and infected erythrocytes21, and a CSP and TRAP-related protein (CTRP, PFC0640w), a transmission-blocking vaccine candidate22. The samples of genes mentioned above highlight the usefulness of genome-wide surveys of nucleotide variation for the identification of medically important targets for vaccine and drug development.
The present analysis provides an unbiased view of natural variation in the Plasmodium falciparum genome and its divergence from P. reichenowi. This study and its companion papers23,24 are invaluable assets to our further understanding of the biology of this complex organism. The increasing efficiency of gene manipulation strategies for P. falciparum means that hypotheses about gene function derived from evolutionary data can rapidly be tested in the future. More extensive exploration of P. falciparum polymorphism will be required for fine-scale analysis of the evolution in the genome. This information, combined with accurate data on geographical origin and parasite phenotype, should allow us to understand both the complex population structure of this important pathogen, and to carry out genotype/phenotype association studies. The latter will be vital for elucidating mechanisms underlying processes such as virulence and drug resistance within human populations and is likely to lead to novel medical intervention strategies.
We thank the Pathogen Sequencing teams for producing the sequence data used in this study, Paul Horrocks and Bob Pinches for the supply of IT isolate DNA and Matias Marti for the list of PEXEL motif-containing genes. This study was funded by the Wellcome Trust through their support of the Pathogen Sequencing Unit and ETD's group at the Wellcome Trust Sanger Institute.
Detailed methods are available in Supplementary Methods.
The clinical parasite sample (PFCLIN) was obtained by erythrocytapheresis from a non immune woman who had recently returned from Ghana, (a detailed case report is available in Supplementary Methods). Molecular studies were approved by the Wandsworth Local Research Ethics Committee (UK). DNA was extracted from about 400 ml of enriched infected red blood cells (see Supplementary Methods). Laboratory isolate IT DNA was prepared from trophozoite stage culture of IT subclone P1B5 (Ref 26), using standard overnight proteinase K digestion followed by phenol and chloroform extractions. The P. reichenowi study was approved by the Institutional Ethics Committee according to Dutch law prior to commencement.
Randomly sheared DNA was sequenced from small-insert (1.4 - 5.0 kb) pUC19 clone libraries by whole genome shotgun sequencing as previously described27. From the PFCLIN, IT and P. reichenowi libraries 201,068, 36,138 and 78,442 reads were produced, respectively.
Paired shotgun clone reads were aligned to the P. falciparum 3D7 genome4 (3D7.version2.0, ftp://ftp.sanger.ac.uk/pub/pathogens/malaria2/3D7/) with SSAHA2 http://www.sanger.ac.uk/Software/analysis/SSAHA2/ (Ref 5). Only alignments that mapped to a single location on the reference genome and were opposed by their read pair were used in further analysis (single location paired reads). Single nucleotide differences (SNPs), and polynucleotide differences (POLYs, insertion/deletion events, differences of two or nucleotides within five nt of each other, and more complex polynucleotide differences) were identified from these alignments using neighbourhood quality standard28.
We excluded all read alignments and difference calls (SNP/POLY) that mapped to the non-unique regions of the genome4; for each project (PFCLIN, IT, P. reichenowi) we excluded any10 kb block of the genome where the ‘uniqueness’ U = [(single location paired reads mapped to block)/(all reads mapped to the block)] ≥ 0.5.
SNP locations with multiple read coverage were used to determine the minimum base phred score for which 90% of SNP calls were validated (minimum validated score, MVS). All subsequent analysis used only SNPs and ‘multiple SNP’ POLYs with a phred score ≥ MVS or a cumulative phred score (from all reads covering the location) greater than 60 (1/106 probability of error). All indel POLYs were retained. Estimates of the total rate of genetic change, dN/dS and MK tests were calculated from a predicted variant sequence generated using SNP and multiple SNP POLYs with SSAHA2 scores ≥ MVS.
We manually scrutinised 50 SNPs called from PFCLIN over 12 genes (44124 nt in length) by examining Gap4 (v4.10b.3) aligned electrophoretograms from all sequence reads from PFCLIN covering these genes, and further reads were generated by sequencing PCR products obtained from the initial PFCLIN DNA extraction. 48 SNPs were located and verified as correct, including one heterozygous SNP. One was identified but not found to be a SNP in this alignment, one SNP was not aligned by Gap4.
Sites with ≥ 1 MVS SNP call and at least N (N ≥ 2) reads of high quality read length (See Supplementary Methods) were used to estimate the proportion of heterozygous sites. A site qualified as heterozygous if the cumulative phred score from all reads covering that site was ≥ H (where H = 60, 70 or 80) for each of two alleles. For various values of N and H, estimates of the percentages of SNP locations that were heterozygous were: IT 4% - 6%, PFCLIN 7 - 10%.
Pair wise alignments were generated using the predicted variant sequences for each comparison (IT, PFCLIN, P. reichenowi) to the reference for each protein-coding genes with ≥ 100 codons of unique read coverage, excluding regions containing indels. dN/dS was calculated with the yn00 program of the PAML package29, using the Yang and Nielsen algorithm. Counts of the number of synonymous and non-synonymous changes in codons used the same method, except that all codons that were completely included in unique read coverage were included.
We carried out MK tests according to10 using alignments as above. The neutrality index (NI) was calculated from the equation30: NI = (Pn/Ps)/(Fn/Fs); Pn nonsynonymous polymorphisms, Ps synonymous polymorphisms, Fn nonsynonymous fixed differences, Fs synonymous fixed differences. The NI's for groups of genes were calculated by concatenating all coding sequences for the group of genes. Significance was evaluated from concatenated gene sequences using Fisher's exact test.
All statistics and plots were produced in R v1.11 (The R Foundation for Statistical Computing http://www.R-project.org). Tests for differences in evolutionary rate between groups of genes or correlations used only those genes included in the classification or detected in the expression study. For grouped data, if Kruskal-Wallis rank sum null hypothesis (that all groups were the same dN/dS distribution) was rejected (P < 0.05) then Mann-Whitney tests were performed on each group to examine whether the dN/dS distribution of the genes in that group differed from the dN/dS distribution of all other genes that were detected/classified in the expression study/classification.
SNP data have been deposited at dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/ accession numbers; PFCLIN 65863114-65917062, IT 65849337-65863113. Sequence trace files are available at The Wellcome Trust Sanger Institute http://www.sanger.ac.uk/Projects/P_falciparum/.
The polymorphism data will be available for browsing and querying in PlasmoDB http://www.plasmodb.org.
GFF format SSAHA outputs are available upon request.
Competing interest statement
The authors declare that they do not have any competing financial interests.