|Home | About | Journals | Submit | Contact Us | Français|
We sequenced and annotated the genomes of four Plasmodium vivax strains collected from disparate geographical locations, tripling the number of genome sequences available for this understudied parasite and providing the first genome-wide perspective of global variability within this species. We observe approximately twice as much SNP diversity among these isolates as we do among a comparable collection of isolates of Plasmodium falciparum, a malaria parasite that causes higher mortality. This indicates a distinct history of global colonization and/or a more stable demographic history for P. vivax than P. falciparum, which is thought to have undergone a recent population bottleneck. The SNP diversity, as well as additional microsatellite and gene family variability, suggests the capacity for greater functional variation within the global population of P. vivax. These findings warrant a deeper survey of variation in P. vivax to equip disease interventions targeting the distinctive biology of this neglected but major pathogen.
Half the world’s population is estimated to be at risk of P. vivax malaria1, owing to this parasite’s unique potential for lengthy remission and tolerance of cooler climates than those preferred by strictly tropical Plasmodium species. Although the P. falciparum parasite is responsible for the majority of contemporary malaria-related mortality, there is evidence that P. vivax may have been a more virulent parasite prior to the advent of modern medicine. As far north as England, death records indicate that P. vivax likely reduced the average life span from 58 to 33 years during the 19th century2. More recently, studies have shown P. vivax to be capable of causing severe malaria syndromes long attributed only to P. falciparum3.
In spite of the past and present significance of P. vivax to human health, it remains chronically understudied relative to P. falciparum. The ability to continuously culture P. falciparum but not P. vivax in the laboratory, compounded with the differential mortality imposed by the two species, has led to vast discrepancies in the state of our knowledge regarding almost all aspects of the biology of these species. The recently renewed push for malaria eradication may remain incomplete unless this disparity in knowledge is addressed and applied to disease control programs4.
Genetic diversity is important to characterize in order to understand the history of our association with a disease, to evaluate the direct impacts of diversity on clinical disease, and also because it may directly or indirectly reduce the efficacy of therapeutics such as drugs and vaccines. In contrast to P. falciparum, where the genomes of many hundreds of isolates have now been sequenced or genotyped5, only the P. vivax genomic reference strain (Salvador I)6 and unassembled shotgun sequencing of a Peruvian isolate (IQ07)7 have been completed. We sequenced, assembled, and annotated the genomes of four geographically disparate isolates of P. vivax to remedy the lack of genetic diversity data available for this species relative to P. falciparum. The designations and geographic origins of each P. vivax strain are indicated in Table 1, which also lists the names and origins of a concurrently-sequenced comparator panel of P. falciparum isolates hailing from similarly disparate geographic locales. Templates of both species were sequenced using the same next-generation sequencing platform (Illumina GAIIx/HiSeq) and evaluated for diversity using the same bioinformatic tools. All four of the newly sequenced P. vivax reference strains (North Korean, India VII, Mauritania I and Brazil I) are clonal infections adapted for growth in monkeys, are publically available via the Malaria Research and Reference Reagent Resource Center (see URLs), and were sequenced from genomic DNA derived from leukocyte-depleted monkey blood (Online methods). The P. falciparum isolates (from Honduras, India, Indochina and Senegal) were also clonal, and were sequenced using template derived from in vitro cultures. We generated de novo assemblies for each of the four P. vivax isolates (Table 2) using the ALLPATHS LG assembly algorithm8. Assembly quality was significantly higher for P. vivax than for a P. falciparum assembly generated using the same approach (Supplementary Table 1), likely due to the more moderate A/T nucleotide composition of the P. vivax genome (P. vivax 57.7% A/T vs. P. falciparum 80.6% A/T). Synteny between the new P. vivax assemblies and the Salvador I reference assembly was found to be highly conserved (≥ 97.9 %; Supplementary Fig. 1).
We next used the sequencing data to evaluate genetic diversity within each species. The pairwise SNP rate for each sequenced isolate relative to the reference assembly for each species (P. vivax: Salvador I; P. falciparum: 3D7) is indicated in Figure 1a as a function of inferred SNP quality. Pairwise SNP rate relative to a reference assembly is expected to be a function of the evolutionary or geographic distance, but despite the varying geographic origin of isolates from both species, we observed the P. vivax SNP rates to be uniformly higher than the P. falciparum SNP rates regardless of SNP quality threshold. This finding suggests globally higher genetic diversity in P. vivax relative to P. falciparum, a genome-wide result confirming previous pilot surveys9, 10. We explored this finding by comparing mean SNP rates according to sequence class, using SNPs with a minimum PHRED-style quality score of 30 (estimated accuracy of at least 99.9%)11. We find P. vivax to exhibit a significantly higher mean SNP rate than P. falciparum at intergenic, and intronic, fourfold-degenerate (4D) synonymous coding sites, within coding sequence overall, and across all sequence classes (t-test, P = 0.0087; Fig. 1b), suggesting that the difference in diversity is pervasive and genome-wide. To control for variation in the degree of functional constraint among genes within each genome, we next evaluated mean pairwise SNP diversity (π) in a collection of 3,401 genes for which we could confidently identify 1:1 orthologs between the species using a reciprocal best BLAST hits (RBH) criterion. For this comparison we observed approximately twice as much SNP diversity in P. vivax compared to P. falciparum (paired t-test, P = 2.2E-16; Fig. 1c), confirming the ubiquity and magnitude of SNP diversity disparity between the species (Supplementary Fig. 2).
To test whether a differing SNP mutation rate, rather than a different effective population size and/or demographic history, can account for the differences in SNP diversity observed, we compared the genome-wide diversity of microsatellites, a different class of mutation. Because microsatellite length variants are caused by replication slippage rather than point substitution12, we would expect their relative diversity profile to be different from that of SNPs under the null hypothesis that a different point substitution mutation rate explains the SNP diversity disparity. We applied a novel method of evaluating microsatellite length variation from Illumina data, and observed, as we had with the SNPs, significantly higher diversity in P. vivax than in P. falciparum (bootstrapping, 279 and 22,713 microsatellites with at least eight repeats in P. vivax and P. falciparum, respectively; P = 0.021, Fig. 1d; Supplementary Fig. 3). Given the population genetic evidence that P. falciparum underwent multiple drug-induced selective sweeps and at least one significant bottleneck13, these results indicate that P. vivax may exhibit a comparatively large effective population size due to an absence of such demographic events in recent history. Even in the face of common drug pressure, P. vivax may exhibit disproportional demographic stability due to its unique capacity for dormancy within infected hosts.
Our sample is smaller than ideal for evaluating the time to the most recent common ancestor (TMRCA), but by comparing the deepest pairwise nucleotide divergence observed for each species at 4D synonymous sites (P. vivax: Mauritania vs. Brazil I, 1.628E-3 substitutions/site; P. falciparum: 3D7 vs. Dd2, 9.59E-4 substitutions/site) we can predict that the lower bound for TMRCA in P. vivax is approximately 70% larger than that for P. falciparum. The calculation of absolute TMRCA dates is dependent on the accuracy of the inferred mutation rate, and therefore results must be interpreted with caution. Nevertheless, if we assume a commonly accepted eukaryotic genome-wide mutation rate for 4D sites (2.2 E-9 subs/site/yr14; similar to a Plasmodium rate estimated with less precision15), we can estimate the TMRCA as 768 KYA (thousands of years ago) and 452 KYA for P. vivax and P. falciparum, respectively. This P. vivax estimate is deeper than previous TMRCA estimates produced using a small number of loci in a larger population sample16, but reconciliation of the absolute estimates is difficult without knowledge of the true mutation rate. Assuming mutation rates are similar in both parasite lineages, however, the result stands that P. vivax exhibits a much deeper TMRCA.
Other departures in the global population history of these two species are indicated by the topology and branch lengths of their respective phylograms (Figure 2). The relatively large degree of divergence between the IQ07 Peruvian isolate and the Brazil I and Salvador I strains of P. vivax suggests a distinct history in the New World (NW) relative to P. falciparum, which exhibits low diversity in the NW and is thought to have been introduced within the last 500 years via the African slave trade17. The high NW diversity, combined with the significantly closer phylogenetic affinity of the three NW P. vivax isolates with the East Asian (North Korean) rather than the African (Mauritania I) or South Asian (India VII) strains, could suggest the pre-colonial arrival of P. vivax in the NW accompanying human dispersal from Asia by sea, or, less likely, by the Bering land bridge during the last glacial maximum. Alternatively, this profile could be explained by recent but very large-scale (relative to P. falciparum) post-colonial introductions to the NW via trans-oceanic trade from East Asia, the Pacific, or Europe, the last of which presumably harbored a genetically distinct parasite clade prior to disease elimination18. Deeper genomic sampling of P. vivax populations will be required to explain these patterns in the diversity data.
We next explored the profile of variation within individual genes and gene families to evaluate the potential functional consequences of the extraordinarily high genomic diversity we observe in P. vivax. As Figure 3a indicates, mean pairwise divergence among the sequenced P. vivax isolates is highest in gene families associated with red blood cell invasion and immune evasion. Functional enrichment analysis of diversity in individual genes also finds nonsynonymous SNPs to be concentrated in invasion-related motility genes (Supplementary Table 2). This extremely high sequence diversity suggests that vaccines targeting polymorphic antigens may encounter an even greater hurdle in eliciting cross-protective immune responses than they do in P. falciparum, where strain-specific immunity has been recently observed to limit vaccine efficacy19.
Differences in the distribution of nonsynymous SNPs among genes with orthologs in both species are potentially reflective of differences in disease biology between P. vivax and P. falciparum (Supplementary Table 3). Genes expressed in the pre-erythrocytic stages are the most enriched group among those exhibiting higher ratios of nonsynonymous to synonymous diversity (πNS/πS) in P. vivax relative to P. falciparum (Mann Whitney U test; Z score = 2.2); whereas genes associated with host/parasite interactions are the most enriched group exhibiting higher πNS/πS in P. falciparum relative to P. vivax (Mann Whitney U test; Z score = 2.4; Supplementary Table 4). While neither of these enrichments is statistically significant after correction for multiple testing, this pattern bears further exploration when sequence data from more P. vivax genomes become available.
As expected, we observed enormous diversity in the vir gene family, the members of which are variably expressed and encode proteins that are exported to the host cell surface for the purpose of evasion of the host adaptive immune response20. Of the 313 vir genes included in the Salvador I reference assembly, Figure 3b indicates that less than a third are also observed in the four new assemblies. Unexpectedly, we encountered 15 ‘ultra-conserved’ vir genes that were present in all assemblies and exhibited very low SNP diversity, in particular one locus (PVX_113230) that was invariant and exhibited the highest similarity (70% protein sequence identity) of any vir to the homologous kir gene family in Plasmodium knowlesi, a zoonotic parasite (Fig. 3c). Unlike most vir genes, this locus also exhibits conserved synteny in more distantly related rodent malaria parasites. These attributes suggest that PVX_113230 is likely the founder of the vir family in the P. vivax lineage, and the lack of polymorphism suggests that the protein it encodes performs an ancestral role rather than host immune modulation. The molecular function of PVX_113230 could be related to erythrocyte invasion, as suggested by its distinct intra-erythrocytic expression profile relative to most virs21 (Supplementary Fig. 4).
This global census of genomic diversity in P. vivax has uncovered an unexpected degree of genetic polymorphism, much of which may translate into important functional variation. Our data stop short of suggesting the existence of distinct sub-species of P. vivax, similar to the sub-species P. vivax vivax and P. vivax hibernans proposed on the basis of relapse phenotype22. However, the extreme diversity we observe among these new reference strains suggests a more stable and older association of this parasite with humans than for P. falciparum, and serves as a warning that P. vivax could present a qualitatively different eradication task.
We chose four strains of P. vivax for whole genome sequencing based upon geographical origin and phenotype, to provide a resource of high-quality assembled and annotated sequences for the malaria research community. The North Korean strain has a long relapse phenotype that enables it to survive in the primate host through periods of drought and long winters when mosquito vectors for transmission are unavailable26. The Brazil I strain is highly resistant to the anti-relapse drug primaquine23. The India VII strain is the first Plasmodium species to be sequenced from India. The Mauritania I strain is a rare example of an African P. vivax strain that occurs among West Africans with Berber and Arab genetic backgrounds28. Genomic DNA for P. vivax sequencing was obtained from the leukocyte-depleted blood of infected, splenectomized Saimiri monkeys as described previously6. Genome-wide fragment analysis of 15 microsatellites prior to sequencing confirmed their independent origins. DNA and frozen stabilates are available upon request at MR4 (see URLs).
Genomic DNA was used to construct two Illumina sequencing libraries for each P. vivax isolate, with library insert sizes of 180 bp and 3 kb. Each library was sequenced to a depth of at least 150 fold coverage (to account for contaminating monkey host DNA) using 76 bp paired end Illumina reads on Illumina GAIIx sequencers. After filtering out reads with sequence similarity to monkey/primate sequence each genome was assembled using the ALLPATHS-LG algorithm8. Assembly quality was quantified using the N50 statistic for contigs and scaffolds, which describes the minimum contig or scaffold size such that the sum of the lengths of all contigs or scaffolds of equal or greater size accounts for at least half of the total assembly length. Synteny with the P. vivax Salvador I reference assembly was quantified for each de novo assembly by checking the concordance of approximately 60,000 randomly chosen pairs of 100 bp sequences, each with 100 kb of intervening sequence in the new assemblies. This method identifies breaks in synteny when the sequence pairs do not map to locations separated by 100 kb in the reference assembly, or when only one member of a sequence pair is successfully mapped to the reference. Finally, for three of the four P. vivax isolates (North Korean, Mauritania I, Brazil I) we were able to recover a scaffold representing the complete apicoplast genome.
P. falciparum analyses were based on comparable Illumina read data generated from a single library for each isolate (insert size 180 bp). Coverage for each of the isolates was as follows: Indian isolate 87_239, 42X; Indochina isolate Dd2, 196X; Honduran isolate HB3, 30X; Indian isolate ML-14, 41X; Senegalese isolate Th231.08, 59X.
The protein-coding genes in the P. vivax nuclear genomes were annotated using a combination of reference gene mapping, homology-based gene models (GeneWise)31, EST-based gene models, and ab initio gene predictions. Ribosomal RNAs (rRNAs) were identified with RNAmmer32. The tRNA features were identified using tRNAScanSE33. Other common RNA features were identified with RFAM34.
The MUMmer algorithm35 was used to align draft assemblies to the P. vivax Salvador I reference genome assembly. Neighboring syntenic alignment blocks were joined to form longer alignments, which were then used to map the gene coordinates from the reference genome to the draft assemblies. Homology-based gene models were created using tblastn to search against the draft genome assemblies with the UniRef90 protein database, a Plasmodium protein database created from the annotated proteins of P. falciparum 3D7 and P. vivax SaIvador I. The resultant BLAST hits were used to create GeneWise gene models. Gene models were also built using 31,777 P. vivax ESTs available on GenBank, with an ORF cutoff length of 300 bp (“EST ORFs”). Ab initio gene models were predicted using self-training GeneMark-ES36 and Geneld37 with parameters trained on genes from the P. vivax Salvador I genome.
The final merged gene set for each of the four sequenced P. vivax strains was created using the following workflow: if a mapped reference gene in a given assembly had intact start and stop codons, no frame-shift or in-frame stop, and no exons in the contig gaps, then the mapped reference gene was used directly as the gene model. If a mapped gene had a frame-shift or in-frame stop, then the corresponding GeneMarkES gene model was selected. If a GeneWise gene model had no overlap to gene models from the previous two sources, but had a non-generic gene product name or overlap with non-repeat PFAM domains, then the GeneWise feature was added as a gene model. Finally, if an EST ORF was at least 600 bp in length and exhibited no overlap with models identified from the previous three sources, then the EST ORF was added as a gene model. The initial gene set was checked against tRNA and rRNA features and filtered where appropriate. Additional gene filtering was performed by removing genes with 30% or more coding sequence overlap with TransposonPSI (see URLs) hits (e < 1E-10).
Sequencing reads from each isolate were aligned using BWA38 to the references assemblies of P. falciparum 3D7 (build v7.1) and P. vivax Salvador I (build v7.0), both downloaded from PlasmoDB (http://plasmodb.org). SNPs were called using the Unified Genotyper11 in the GATK package39. SNPs with an estimated PHRED-style quality score of Q30 or greater were used for diversity analyses.
The mreps program40 was used to identify microsatellites in the references assemblies of P. falciparum and P. vivax. The following mreps parameters were used for the microsatellite search: (1) minp=1, (2) maxp = 9. Searching under these conditions identified 538,794 and 95,990 microsatellite loci in P. falciparum and P. vivax, respectively. For P. falciparum, the allow-small parameter was employed to allow identification of small microsatellites that the mreps algorithm might otherwise flag as biologically insignificant due to the high AT content (~80%) of that genome.
Illumina sequencing reads were mapped to the P. falciparum 3D7 or P. vivax reference assemblies using BWA38 with the “−q 5” and “ l 32” options. Base quality score recalibration and local realignment around microsatellites11 was applied using GATK39. Custom Python scripts were used to filter out reads that did not span the entirety of a microsatellite interval in the genome. A final local realignment around microsatellite sequences, followed by indel (insertion/deletion) calling with standard hard filtering parameters was applied to using GATK, and accepted indels were converted into microsatellite length genotypes. Illumina-based microsatellite calls were validated by comparison with calls made by aligning the Sanger-sequence based Dd2 assembly (downloaded from http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/Downloads.html) to the 3D7 reference assembly to generate a truth set of indels for genotyped microsatellites. Comparison of the Illumina-based calls to the Sanger calls indicates that our Illumina calling method exhibits a specificity of 100% and a sensitivity of 97%, on the basis of 81,569 AT dinucleotide microsatellites callable in both datasets. P values for evaluating the significance of the difference in microsatellite diversity between species were generating by resampling π values for each species 10,000 times and noting the frequency with which a difference in mean π occurred that was equal to or greater than the observed difference in mean π. The overall comparison of microsatellite diversity across motif unit sizes was performed using only microsatellites with eight or more repeat units given that, as previously observed for P. vivax41, these longer microsatellites were observed to be more polymorphic in both species (Supplementary Fig. 5).
Mean pairwise diversity at all nucleotide sites (π), synonymous sites (πS), and nonsynonymous (πN) sites was calculated for each gene in the reference annotations for P. falciparum and P. vivax using the method described in Table 1 of Ina (1995)42. P. knowlesi orthologs were employed to calculate interspecific dN/dS ratios using PAML v4.543. with sequences from the Salvador I isolate of P. vivax. A matrix of pairwise nucleotide distances between isolates (Supplementary Table 5) was constructed for each species using SNPs identified in fourfold-degenerate (4D) synonymous coding sites of genes with orthologs in each species. To control for differences in the nucleotide substitution profile between species and enable direct comparison of branch lengths, pairwise distances were normalized by empirical nucleotide transition matrices (Supplementary Table 6) constructed for each species by rooting 4D polymorphisms using an outgroup species (P. knowlesi for P. vivax, P. reichenowi for P. falciparum). Phylograms were constructed from pairwise distance matrices using the Neighbor-Joining method. The lower bound of the TMRCA was estimated for each species as the deepest pairwise divergence. SNP functional enrichment analyses were carried out using a Mann Whitney U test and the Z scores were interpreted using a Bonferroni correction for multiple testing.
This work has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contracts HHSN266200400001C and HHSN2722009000018C. We gratefully acknowledge the Indian Council of Medical Research for financial support of the Malaria Parasite Bank at the National Institute of Malaria Research, New Delhi. We thank the NIAID/NHGRI Eukaryotic Pathogens and Disease Vectors Working Group, and the Broad Institute Genome Sequencing Platform for its significant contributions to this project. KG and LY are supported by a Global Health Program grant from the Bill and Melinda Gates Foundation. AAE is supported by a grant from the US National Institutes of Health, GM084320, and JMC and PLS by U19AI089676, an NIH International Center of Excellence for Malaria Research.
Illumina sequencing reads have been submitted to the NCBI short read archive (accession numbers: SRP007883, SRP007923, SRP000493, SRP000316). SNP calls have been submitted to dbSNP (handle: BROAD-GENOMEBIO; SS-IDs: 522951448-525286517; 526103049-526165183; 526166468-526189295; 526189799-526214992; 526889101-526983923; 526985108-527023853; 527024052-527114619; 527114764-527224909; 527225105-527336921). P. vivax nuclear and apicoplast genome assemblies and gene calls may be downloaded from the Broad Institute website (http://www.broadinstitute.org/annotation/genome/plasmodium_vivax/MultiDownloads.html) or GenBank (accession numbers: AFMK01000000, AFBK01000000, AFNI01000000, AFNJ01000000, JQ437257, JQ437258, JQ437259).
AUTHOR CONTRIBUTIONSJWB provided P. vivax strains and ARA and APD provided Indian P. falciparum material. SMS, SS, and SY performed genome assembly. SG, JMG, and QZ performed genome annotation. KG, RJ, LY, and DEN performed analyses. SBC performed project management. PLS undertook experimental validation. DEN, AAE, BWB, and JMC directed the analyses. DEN, AAE, JWB and JMC wrote the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Editorial Summary (AOP and Month, same)
Jane Carlton and colleagues report the genome sequencing, de novo assembly and annotation of four Plasmodium vivax reference strains from diverse geographical locations. Their cross species comparisons show that P. vivax has greater genetic diversity than P. falciparum.