The
P. knowlesi genome sequence was produced by whole-genome shotgun sequencing to eightfold coverage, with targeted gap closure and finishing (
Supplementary Table 1). The 23.5-megabase (Mb) nuclear genome is composed of 14 chromosomes and contains the expected complement of non-coding RNA (ncRNA) genes with known function (
Supplementary Table 2) and a large number of novel structured ncRNA candidate genes (Supplementary Figs
1-
5 and Supplementary Tables
3 and
4). The presumed centromeres are similar to those found in other
Plasmodium species
4,6, and are positionally conserved within regions sharing synteny with
P. vivax (see ref.
4). The overall G+C base composition is 37.5%. A total of 5,188 protein-encoding genes were identified, which is slightly lower than the predicted proteome size of
P. falciparum and
P. vivax
4,6.
Unusually for
Plasmodium species, (G+C)-rich repeat regions containing intrachromosomal telomeric sequences (ITSs, containing the heptad sequence GGGTT[T/C]A) are found at multiple internal sites in the
P. knowlesi chromosomes, arrayed tandemly or as components of larger repeat units. These sequences appear infrequently in
P. vivax and
P. falciparum at internal chromosome sites (Supplementary Figs
6 and
7). In the protozoan parasite
Trypanosoma brucei
10, ITSs may be the templates for recombination events that result in gene conversion among variant antigen
VSG genes
11. In mammalian genomes
12, ITSs are common and may represent the ‘scars’ of double-stranded DNA break repair
12. Alternatively, ITSs may have a role in transcriptional control.
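As an illustration of how tandem arrays of the heptad GGGTT[T/C]A might be located in a chromosome sequence, the following minimal Python sketch scans both strands with a regular expression. The function name `find_its_arrays` and the `min_copies` threshold are assumptions made for this example; the search procedure actually used for the genome is not described in this section.

```python
import re

# Heptad quoted in the text: GGGTT[T/C]A.
HEPTAD = "GGGTT[TC]A"
# Reverse complement of the heptad, to detect arrays on the opposite strand.
HEPTAD_RC = "T[AG]AACCC"


def find_its_arrays(seq, min_copies=2):
    """Return sorted (start, end, strand, copies) tuples for tandem heptad arrays.

    min_copies is an illustrative threshold, not a value from the paper.
    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for strand, unit in (("+", HEPTAD), ("-", HEPTAD_RC)):
        # Match min_copies or more back-to-back copies of the 7-nt unit.
        pattern = re.compile("(?:%s){%d,}" % (unit, min_copies))
        for m in pattern.finditer(seq):
            copies = (m.end() - m.start()) // 7
            hits.append((m.start(), m.end(), strand, copies))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence, not real P. knowlesi data.
    toy = "ACGT" + "GGGTTTA" + "GGGTTCA" + "GGGTTTA" + "CCCC"
    print(find_its_arrays(toy))
```

Only perfect, uninterrupted tandem copies are reported; degenerate or interrupted arrays inside larger repeat units would need a more tolerant search.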
For approximately 80% (4,156 out of 5,185) of predicted genes in
P. knowlesi, orthologues could be identified in both
P. falciparum and
P. vivax (for details, see ref.
4). The
P. knowlesi-specific variant antigen gene families,
SICAvar genes
13 and
kir genes
9, form the largest groups of
P. knowlesi-specific expansions (Supplementary Tables
5 and
6). Five distinct gene families of unknown function, with 4-15 paralogous members, are unique to
P. knowlesi (referred to as Pk-fam-a to Pk-fam-e in
Supplementary Table 7). Pk-fam-a and Pk-fam-b each have more than nine paralogous members (
Supplementary Fig. 8), which have a two-exon gene structure with a signal peptide and a carboxy-terminal transmembrane region, but lack typical export motifs
14,15. Pk-fam-c and Pk-fam-e represent two new protein families with putative protein export signals (
Supplementary Fig. 8 and
Supplementary Table 8).
A comparison of Pfam domains
16 between the predicted proteomes of
P. knowlesi, P. vivax and
P. falciparum (
Supplementary Table 9, Supplementary Information) revealed major differences in domains that distinguish species-specific protein families involved in antigenic variation. The remainder of the proteome was relatively conserved albeit with some interesting copy number variations of a few key housekeeping enzymes (
Supplementary Fig. 9 and
Supplementary Table 9).
In other
Plasmodium genomes sequenced so far, variant gene families involved in antigenic variation (Supplementary Figs
6 and
7) are typically arranged in the subtelomeres, and only a few members of these families have hitherto been found at intrachromosomal sites. Notably, the
P. knowlesi genome sequence has revealed that the major variant gene families (that is,
SICAvar
13 and
kir
9) are randomly distributed across all 14 chromosomes and often co-localize with ITS-containing repeats (Supplementary Information). Although not all of the telomeres were fully assembled, we know that in the case of chromosome 7,
P. knowlesi and
P. vivax have atypical gene content—the subtelomere encodes proteins associated with merozoite invasion (for example, MAEBL and members of the reticulocyte-binding-like (RBL) family) (
Supplementary Fig. 10).
Variant SICA (schizont-infected cell agglutination) antigens on the surface of infected red blood cells
5 are associated with parasite virulence
17 and are encoded by the
SICAvar gene family
13—the largest variant antigen gene family in
P. knowlesi. Switching of variant types underlies the establishment of a chronic infection in the vertebrate host, a process that is essential in all species, to ensure mosquito transmission and the completion of the life cycle. Full-length
SICAvar genes have 3-14 exons (
Supplementary Table 5 and
Supplementary Fig. 11), resulting in a range of sizes for the predicted proteins of 53-247 kDa. Although many of the
SICAvar genes are present only as fragments, we estimate that there are up to 107 members in the H strain of
P. knowlesi based on the number of conserved final exons.
Twenty-nine predicted SICAvar genes have complete gene structures and were divided into two subtypes. The type I SICAvar genes with 7-14 exons predominate, with a few containing unusually long introns. The type II subgroup represents small SICAvar genes with 3-4 exon structures. Unusually large introns (5.8-13.6 kb) are a unique feature of SICAvar genes and have not previously been seen in any other sequenced apicomplexan gene.
SICA antigens have a modular structure (
Supplementary Fig. 12) comprising a variable number of highly diverged cysteine-rich domains (CRDs) encoded by multiple exons, a transmembrane domain and a cytoplasmic domain. A high level of sequence diversity was observed, with the exception of the 3′ terminal exon
13. We investigated the domain organization of the CRDs using profile hidden Markov models (HMMs;
Supplementary Fig. 13). The full-length SICA proteins contain a distinct five-cysteine CRD (termed SICA-α) at the amino terminus, which occurs once or twice and may have a stabilizing role analogous to the cysteine-rich N-terminal capping motifs of extracellular leucine-rich repeat proteins
18. There are 1-8 CRDs (referred to as SICA-β) with 7-10 conserved cysteine residues. The transmembrane domain and a conserved domain follow at the C terminus (termed SICA_C in Supplementary Figs
12 and
13).
Although
P. knowlesi and
P. falciparum are phylogenetically distant, the SICA and
P. falciparum erythrocyte membrane protein 1 (PfEMP1) variant antigens share many fundamental biological characteristics (reviewed in ref.
19). Common regulatory mechanisms involving post-transcriptional gene silencing have been proposed between the
var gene family in
P. falciparum and the
SICAvar family in
P. knowlesi
19. We have identified conserved sequence motifs between the single
var intron and
SICAvar introns (Supplementary Figs
14-
18) in the region thought to be the origin of a ncRNA transcript involved in the silencing of
var genes
20, indicating possible commonality in regulatory mechanisms.
We searched for evidence of gene conversion within the
SICAvar family, using the predicted sequences of 20 type I full-length
SICAvar genes (Supplementary Information). It is clear that exon shuffling has an important role in
SICAvar evolution
13. The low-complexity repeat regions found within introns might facilitate recombination through misalignment during mitosis; this could explain the presence of
SICAvar fragments found throughout the genome and/or
SICAvar gene models with partial intron/exon structures. These comprise whole, and apparently intact, exons that might provide a reservoir for diversification analogous to that seen with
VSG genes in
Trypanosoma brucei
11 (Supplementary Information).
Kirs represent the second largest variant gene family. They encode predicted proteins of 36-97 kDa that are hypothesized to be expressed at the surface of infected erythrocytes and undergo antigenic variation
9. There are 68 predicted
kir genes, 4 of which have incomplete structures (
Supplementary Table 6). They were divided into four types depending on the number of exons (
Supplementary Fig. 19). Most (58 out of 64)
kir genes belong to types I and II. The domain organization of all predicted KIR proteins was also determined using profile HMMs (
Supplementary Fig. 20). They contain 1-3 domains, followed by a transmembrane domain at the C terminus (referred to as KIR TM in
Supplementary Fig. 20). A BLAST analysis of KIR proteins revealed stretches of up to 36 amino acids within the predicted extracellular domain that have 100% identity to host proteins, the most striking of which is to CD99. These matches were evident in several KIR proteins. Interestingly, different family members contain matches to different regions of CD99, such that together they represent over one-half of the CD99 extracellular domain. Tests were performed to assess the possibility that such matches could occur by chance (
Supplementary Table 10). We compared the sequences with those of
Macaca mulatta, African green monkey and human. The matches exclude conserved cysteine regions and the degree of sequence identity decreases noticeably as the evolutionary distance to the natural host increases (
Supplementary Table 10). CD99 has a critical role as an immunoregulatory molecule in T-cell function (see
http://www.ncbi.nlm.nih.gov/omim/). These exact matches may interfere with recognition of parasitized erythrocytes by the host immune system or act as CD99 analogues that interfere by competing with T cells for CD99 partner molecules.
We undertook a more systematic search for other such instances of parasite proteins containing extensive stretches of identical host sequences, using the PMATCH algorithm (Supplementary Information). Unsurprisingly, a large number of matches to highly conserved housekeeping genes were observed, but in addition regions of perfect identity to another host protein (known as AHNAK, see
http://www.ncbi.nlm.nih.gov/omim/) were detected in two KIRs and one SICA-like protein (
Supplementary Fig. 21 and
Supplementary Table 10). Analogous searches using the predicted exported protein repertoires (exportome) of
P. vivax and
P. falciparum found no such matches to host proteins (
Supplementary Table 11). The identity to host proteins is maintained at the amino acid sequence rather than DNA sequence level (data not shown).
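The kind of search described above, for stretches of perfect amino acid identity between a parasite protein and a host protein, can be sketched in Python as follows. This is not the PMATCH algorithm, whose details are given only in the Supplementary Information; it is a simple seed-and-extend illustration, and the function names, the `min_len` threshold and the toy sequences are all assumptions made for the example.

```python
def shared_exact_stretches(query, target, min_len=8):
    """Return maximal exact matches as (query_start, target_start, length).

    A simple seed-and-extend sketch, not the PMATCH algorithm.
    min_len is an illustrative threshold, not a value from the paper.
    """
    # Index every min_len-mer of the target by its start positions.
    index = {}
    for j in range(len(target) - min_len + 1):
        index.setdefault(target[j:j + min_len], []).append(j)

    matches = set()
    for i in range(len(query) - min_len + 1):
        for j in index.get(query[i:i + min_len], ()):
            # A seed whose preceding residues also agree lies inside a
            # match that starts further left, so it is skipped here.
            if i > 0 and j > 0 and query[i - 1] == target[j - 1]:
                continue
            # Extend to the right while the residues keep agreeing.
            length = min_len
            while (i + length < len(query) and j + length < len(target)
                   and query[i + length] == target[j + length]):
                length += 1
            matches.add((i, j, length))
    # Longest matches first.
    return sorted(matches, key=lambda m: -m[2])


def covered_fraction(matches, target_len):
    """Fraction of target positions covered by at least one match."""
    covered = set()
    for _, t_start, length in matches:
        covered.update(range(t_start, t_start + length))
    return len(covered) / target_len


if __name__ == "__main__":
    # Toy sequences, not real KIR or CD99 sequences.
    parasite = "MKTAAAWLVDEPSGGQRSTNNN"
    host = "XXWLVDEPSGGQYY"
    hits = shared_exact_stretches(parasite, host)
    print(hits, covered_fraction(hits, len(host)))
```

Pooling the matches from several family members before calling `covered_fraction` gives the combined coverage of a host protein, in the spirit of the observation that different KIR proteins match different regions of CD99.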
Acquisition of host proteins, and thus the ability to mimic their function, has been observed in many bacterial and viral pathogens
21. In parasitic protozoa there are known cases where stretches of amino acids present on a parasite-encoded cell surface protein match perfectly to regions of host proteins
22. However, in all such cases, the matches correspond to a common amino acid repeat that is shared between them
22-24. Malaria parasites are known to have a potential immunomodulatory role either by secreting functional homologues of host molecules or by binding to host antigen-presenting cells
25,26. This is the first observation of its kind in a malaria protein that shows acquisition of host peptide sequences that are likely to be on the infected cell surface and thus may interact with the host. The mechanism by which these host sequences have arisen remains to be clarified. Possible explanations include convergent evolution or horizontal transfer followed by gene degeneration events.
During the intraerythrocytic life cycle, malaria parasites significantly remodel the erythrocyte by exporting numerous proteins
14,15. This depends on a short motif, termed the plasmodium export element (PEXEL) or vacuolar transport signal (VTS), which is present in over 300
P. falciparum proteins and is common to all
Plasmodium species sequenced so far
27. In addition to the members of the PHIST family
27, a further 100 proteins in
P. knowlesi have typical PEXEL-like motifs (
Supplementary Table 8 and
Supplementary Fig. 22).
Like the PfEMP1 protein in
P. falciparum, the SICAs and KIRs lack a signal peptide and a typical PEXEL motif. We have identified a novel motif in the N-terminal region of SICA-α domains with a positionally conserved tryptophan residue surrounded by hydrophilic residues (
Supplementary Fig. 22) that may be the export signal. Similarly, 75% of KIR proteins have a conserved Z-L-P-S motif (where Z denotes a hydrophilic residue) at the beginning of the KIR domain that may also facilitate export (
Supplementary Fig. 22). In summary, approximately 280 predicted
P. knowlesi proteins may be exported to the infected erythrocyte surface via the PEXEL-dependent or PEXEL-independent pathways. By comparison, the exportome of
P. vivax is considerably larger than that of
P. knowlesi and seems to be much bigger than previously thought
27. About 145
P. vivax proteins contain typical PEXEL motifs including the members of the PHIST family and a small subgroup of 12 VIRs.
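A short-motif scan of the kind implied above can be sketched in Python with regular expressions. The set of residues treated as hydrophilic (Z) is an assumption, because the text does not list them; the PEXEL pattern uses the commonly cited RxLxE/Q/D consensus, which is likewise not stated in this section. The function names and toy sequences are hypothetical, and the sketch does not restrict the Z-L-P-S search to the start of the KIR domain, which would require domain coordinates.

```python
import re

# Assumption: residues counted as hydrophilic (Z); the source does not list them.
HYDROPHILIC = "DEKRNQSTH"
# Z-L-P-S motif described for KIR proteins.
ZLPS = re.compile("[%s]LPS" % HYDROPHILIC)
# Commonly cited PEXEL consensus RxLxE/Q/D; not taken from this section.
PEXEL = re.compile("R.L.[EQD]")


def motif_positions(protein, pattern):
    """Return 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in pattern.finditer(protein.upper())]


def fraction_with_motif(proteins, pattern):
    """Fraction of the given protein sequences containing the motif at least once."""
    if not proteins:
        return 0.0
    with_motif = sum(1 for p in proteins if pattern.search(p.upper()))
    return with_motif / len(proteins)


if __name__ == "__main__":
    # Toy sequences, not real KIR proteins.
    toy_family = ["MAAKLPSGG", "MTELPSAAA", "MGGSLPSKK", "MAALLPSGG"]
    print(motif_positions(toy_family[0], ZLPS))
    print(fraction_with_motif(toy_family, ZLPS))
```

A pattern match alone does not show that a protein is exported; it only flags candidates for the kind of positional and experimental assessment the text describes.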
Genome sequencing of P. knowlesi and its comparison with other malaria genomes has highlighted several novel features of this emerging and potentially life-threatening human malaria parasite, and underscores the importance of full genome sequencing of new Plasmodium species. Major differences in both content and organization of its genome were revealed that involve the host-parasite interface, reinforcing the notion that malaria species have evolved specific mechanisms for enhancing their survival within their respective hosts. The P. knowlesi genome will also greatly enhance the utility of this human-infective species as a model for addressing questions pertinent to all Plasmodium species.