In this study we compare the completely sequenced genome of the
P. falciparum laboratory-cultured clone 3D7 (Ref
4) to whole genome sequence data from the chimpanzee malaria parasite
P. reichenowi, the closest relative of
P. falciparum; a non-cultured clinical
P. falciparum isolate, directly isolated from a patient infected in Ghana (PFCLIN); and a
P. falciparum laboratory model, clone IT (IT). Sequence data were produced by whole genome shotgun sequencing to five-, two- and one-fold coverage for PFCLIN,
P. reichenowi and IT, respectively. We used SSAHA2 (Ref
5) to align sequence reads to the reference
P. falciparum genome 3D7. The fractions of the genome that could be reliably analysed (see methods) were; 74% using PFCLIN, 25% with IT and 42% with
P. reichenowi (excluding the majority of
var genes). Single nucleotide differences (SNPs or fixed substitutions) and indels were identified from these alignments based on standard calling strategies and additional filters. We identified 27,169 non-redundant SNPs between 3D7, PFCLIN and IT (nucleotide diversity: 3d7 vs. PFCLIN π = 0.00131; 3D7 and IT π = 0.0011). We called 216,619 fixed differences between 3D7 and
P. reichenowi (divergence, d = 0.0203). Based on coverage by multiple reads, the false discovery rate for both SNPs and fixed differences calls is estimated to be 10% (90% of the calls are correct – see methods). We also identified many insertions/deletion events (indels) (see ). This data indicates that there is substantial variation in
P. falciparum genomes and that the divergence between
P. reichenowi and
P. falciparum is approximately 10-fold greater than the within-species polymorphism.
In regions where the prevalence of malaria is high, many individuals harbour multiple
P. falciparum genotypes that survive at low parasitemia during asymptomatic infection
6. Clinical episodes of malaria are caused by a new infection, the majority of which are of a single genotype. To estimate the multiplicity of infection in the patient from whom PFCLIN was isolated, we estimated heterozygosity based on overlapping high quality reads. Approximately 7-10% of the SNP sites called between 3D7 and PFCLIN have good support for both allelic states in PFCLIN, while this is only about 5% for the IT, which is expected to be monomorphic. By subtracting the noise inferred from IT we estimate that about 2-5% of the PFCLIN sample (genome-wide) corresponds to other genotypes.
Culture-adapted
P. falciparum strains may be imperfect models for the genome due to extensive accumulation or loss of sequences in culture. We investigated whether this had occurred in the culture-adapted 3D7 isolate by examining the indel events we detected in the PFCLIN and
P. reichenowi comparisons, neither of which has been adapted to in vitro culture. Large numbers of small indels were identified in this analysis (). In both comparisons we observed an excess of “insertions” in 3D7 (ancestral state not known), indicating that the 3D7 isolate had accumulated DNA in culture at the rate of approximately +1nt/kb (see
Supplementary Methods). Indels were underrepresented in coding sequences (0.07 indels/kb, compared to 3.75 indels/kb in other regions), without any bias towards gain or loss of DNA relative to the reference 3D7 and frequently multiples of three, thus maintaining open reading frames. This study excludes much of the subtelomeric regions that have been the focus of prior observations of sequence loss/gain
7,8, and did not detect any indels greater than ~100 nt long due to the short length of read alignments. It is therefore possible that accumulation of sequences in interstitial regions occurs simultaneously with loss and rearrangement events towards the chromosome ends
7,8.
To estimate the selective effects acting on individual genes we calculated the ratio of non-synonymous to synonymous substitution rates (dN/dS) for the three pair-wise comparisons for each gene with at least 100 codons of read coverage (n = 4,686, 2,437 and 3,674 genes for PFCLIN, IT and
P. reichenowi respectively). This produced 762, 207, and 3024 genes for which dN/dS estimates could be calculated for PFCLIN, IT and
P. reichenowi. All dN/dS distributions were skewed toward zero (median = 0.27, 0.30 and 0.17, for PFCLIN, IT, and
P. reichenowi), as expected, since most genes are under purifying selection (
Fig. S1). However, some genes exhibited elevated dN/dS values (see
Supplementary Table 1) that in some cases were above 1. Many genes that are under positive selection may have dN/dS values below 1 if only a fraction of the amino acids are under positive selection
9. We also performed McDonald-Kreitman tests (MK test
10) using coding sequences with read coverage for all three
P. falciparum isolates (3D7, PFCLIN and IT) and
P. reichenowi (see methods). MK tests indicated categories of genes that exhibited either an excess of divergence (likely explained by directional selection) or an excess of polymorphism (likely diversifying selection) (see
Supplementary Table 1) in the isolates we examined.
Studies of other organisms have shown the evolutionary rates (of protein-coding genes, dN/dS) are often correlated with mRNA and protein expression levels
11. We used
P. falciparum expression studies to look for similar correlations. Both total protein mass spectrometry counts (corrected for protein length, see methods)
12, and total transcript levels
13 were highly significantly correlated with
P. reichenowi and PFCLIN dN/dS estimates (,
Figure S2). In all cases rapidly evolving genes were expressed at low levels, and abundantly expressed genes were conserved during evolution. The duration of gene expression (measured by the number of developmental stages in which a gene is expressed, data from Ref
13) was also negatively correlated with the 3D7-
P. reichenowi dN/dS values (Spearman rank correlations, protein r = −0.24, P < 2.2×10
−16, transcript r = −0.22, P < 2.2×10
−16) (). These observations were very robust even after correction for total expression effects (
Supplementary Methods) or with the use of different expression and genetic variation data sets (
Fig. S2). Although the incomplete coverage and relatively high SNP false detection rate preclude detailed analysis of individual genes, these results indicate that global analysis of our dN/dS distributions provides biologically meaningful information. We also observed heterogeneity between dN/dS estimates of genes that were primarily expressed primarily in a single developmental stage of the
Plasmodium life cycle (). Therefore, highly specialized genes in the
Plasmodium genome are more likely to undergo accelerated selective processes either due to relaxed constraint or due to directional selection.
The events that occur between the invasion of the host erythrocyte to its eventual rupture to release daughter merozoites determine the clinical manifestations of malaria. We used a detailed study of the intra-erythrocytic developmental cycle (IDC)
14 to examine this stage in more detail. In that study it was found that ~80% of the nuclear-encoded IDC-expressed genes could be clustered into 13 groups that were both co-expressed, and were functionally related. When we compared the dN/dS values split in these clusters, we found that two of these clusters, “merozoite invasion” and “early ring transcripts”, showed significantly elevated rates of nonsynonymous change (high dN/dS) in the 3D7 -
P. reichenowi comparison (). MK tests indicated that both these clusters contained significant excess of nonsynonymous polymorphisms in
P. falciparum relative to fixed differences (merozoite NI = 2.5, Fisher's exact test P = 0.009; ring NI = 10.4, P = 1.59×10
−9), indicating that these proteins were also highly variable within
P. falciparum isolates.
To test whether proteins that were directly interacting with host cells were under adaptive evolution we examined the evolutionary rates of genes grouped by their intracellular localisation using the cellular component aspect of the Gene Ontology (GO) classification
15, and definitions of predicted exported proteins
16. Proteins localised to the nucleus, cytoplasm and the mitochondria were generally conserved, while apicoplast-localised proteins, predicted membrane-spanning proteins, and predicted exported proteins had evolved significantly more rapidly (). Exported and transmembrane proteins also show elevated evolutionary rates in rodent parasites
17. These observations, and previous studies indicating that the strongest evidence for positive selection between human and chimpanzee is related to immunity and defense
18, are consistent with the model of an ‘evolutionary arms race’ between the mammalian immune system and the exposed proteins of
Plasmodium parasites. As expected given this ‘arms race’ model, genes annotated as ‘antigens’ have significantly higher dN/dS distributions from the
P. reichenowi comparison (Mann-Whitney, P = 2.67×10
−3, antigens median = 0.29, all others median = 0.17).
We also analyzed the evolutionary rates of genes with respect to their Gene Ontology (GO) biological process annotations. 2098 genes with detailed GO annotation were mapped to a cut-down version of the full ontology (GO-slim) with 19 broad categories (
Supplementary Table 1). Of the 19 categories, 7 did not have significantly different
P. reichenowi –3D7 dN/dS values from other GO-defined genes, 10 had significantly lower dN/dS values, mainly involved in metabolism and cell cycle (data not shown), and 2 GO-slim categories (“cell communication” and “entry into host cell” ) had significantly higher dN/dS estimates (). The “cell communication” category also showed a significant excess of nonsynonymous polymorphisms with the MK test (NI = 2.8, Fisher's exact P = 7.1×10
−3).
Collectively, our evolutionary analysis indicates that the most significant functional changes between
P. falciparum and
P. reichenowi have been membrane and exported proteins, and proteins involved in merozoite invasion and subsequent ring stage parasite growth. Although our data do not allow us to distinguish which lineage has undergone positive selection, adaptive changes in these functions in one or both of these species are consistent with differences in the biology of the parasites.
P. falciparum produces 10 – 12 merozoites, in contrast to 2 or 3 times this number in
P. falciparum
19 and some data suggests that
P. reichenowi is less pathogenic to chimpanzees than
P. falciparum is to humans
19. Both
P. falciparum and
P. reichenowi have a strong preference for their host species
19,20, which may be due to differences in the sialic acid modifications to erythrocyte surface proteins between human and chimpanzee, which are major determinants of the invasion of
Plasmodium parasites
20.
Our analysis describes genome-wide patterns of variation but can also indicate specific genes of interest. The list of the genes with the highest number of nonsynonymous polymorphisms in the PFCLIN isolate is comprised largely of uncharacterised genes (
Supplementary Table 1). However it also contains examples of well characterised surface antigens, such as 2 VAR-like genes (PFL0030c and PFE0040c), PfEMP2 and R45. Thus, the other members of the list may include additional uncharacterised antigens. Of these, gene PF10_0355 is particularly striking; it has the second highest total SNP number and indicating that is may be under immune selection and, like VAR genes, it contains a DBL (duffy binding like) domain. The chromosomal location of PF10_0355 contains a cluster of genes that have previously received much attention: the vaccine candidate,
liver stage antigen 1 (PF10_0356) is encoded downstream and several genes encoding merozoite surface proteins are found upstream. An additional 15 genes were identified as having a significant excess of nonsynonymous polymorphisms (MK test: P < 0.05,
Supplementary Table 1). These 15 genes encode for two known polymorphic antigens;
surf13.1 (PF13_0074) a member of the Surfin protein family, which have been identified on the cell membrane fractions of merozoites and infected erythrocytes
21, and a CSP and TRAP-related protein (CTRP, PFC0640w), a transmission-blocking vaccine candidate
22. The samples of genes mentioned above highlight the usefulness of genome-wide surveys of nucleotide variation for the identification of medically important targets for vaccine and drug development.
The present analysis provides an unbiased view of natural variation in the
Plasmodium falciparum genome and its divergence from
P. reichenowi. This study and its companion papers
23,24 are invaluable assets to our further understanding of the biology of this complex organism. The increasing efficiency of gene manipulation strategies for
P. falciparum means that hypotheses about gene function derived from evolutionary data can rapidly be tested in the future. More extensive exploration of
P. falciparum polymorphism will be required for fine-scale analysis of the evolution in the genome. This information, combined with accurate data on geographical origin and parasite phenotype, should allow us to understand both the complex population structure of this important pathogen, and to carry out genotype/phenotype association studies. The latter will be vital for elucidating mechanisms underlying processes such as virulence and drug resistance within human populations and is likely to lead to novel medical intervention strategies.