Leishmania parasites cause a broad spectrum of clinical disease. Here we report the sequencing of the genomes of two species of Leishmania: Leishmania infantum and Leishmania braziliensis. The comparison of these sequences with the published genome of Leishmania major reveals marked conservation of synteny and identifies only ~200 genes with a differential distribution between the three species. L. braziliensis, contrary to Leishmania species examined so far, possesses components of a putative RNA-mediated interference pathway, telomere-associated transposable elements and spliced leader–associated SLACS retrotransposons. We show that pseudogene formation and gene loss are the principal forces shaping the different genomes. Genes that are differentially distributed between the species encode proteins implicated in host-pathogen interactions and parasite survival in the macrophage.
Leishmaniasis is an infectious disease that is prevalent in Europe, Africa, Asia and the Americas, killing thousands and debilitating millions of people each year. With 2 million new cases reported annually and 350 million people at risk, infection by the insect-transmitted Leishmania parasite represents an important global health problem for which there is no vaccine and few effective drugs (see TDR Leishmaniasis URL in Methods). At least 20 Leishmania species infect humans, and the spectrum of diseases that they cause can be categorized broadly into three types: (i) visceral Leishmaniasis, the most serious form in which parasites leave the inoculation site and proliferate in liver, spleen and bone marrow, resulting in host immunosuppression and ultimately death in the absence of treatment; (ii) cutaneous Leishmaniasis, in which parasites remain at the site of infection and cause localized long-term ulceration; and (iii) mucocutaneous Leishmaniasis, a chronic destruction of mucosal tissue that develops from the cutaneous disease in less than 5% of affected individuals1. Infections, particularly those caused by visceralizing species, do not necessarily lead to clinical disease: despite the annual incidence of 0.5 million cases of life-threatening disease, most infections remain asymptomatic. Although host genetic variability and specific immune responses, together with the transmitting sandfly vector and environmental factors, are known to influence the outcome of infections2, the main factor that determines clinical presentation is thought to be the species of infecting parasite. For example, the New World parasite L. braziliensis is the causative agent of mucocutaneous Leishmaniasis, whereas the Old World species L. major and L. infantum, which are present in Africa, Europe and Asia, are parasites that cause cutaneous and visceral Leishmaniasis, respectively.
Sequencing the genomes of three kinetoplastid parasitic protozoa, L. major3, Trypanosoma brucei4 (the causative agent of African trypanosomiasis) and Trypanosoma cruzi5 (the causative agent of Chagas disease), previously revealed the preservation of large-scale gene synteny over 200–500 million years6. Despite a conserved core of ~6,200 trypanosomatid genes, more than 1,000 Leishmania-specific genes have been found, many of which remain uncharacterized. Architecturally, the chromosomes of Leishmania differ from those of the trypanosome species in not having extended subtelomeric regions containing species-specific genes.
Here we have extended these studies to the genomes of two other species, L. infantum (of the subgenus Leishmania Leishmania) and L. braziliensis (of the subgenus Leishmania Viannia), and we compare these genomes with that of L. major. Against a background of conserved gene content, synteny and architecture, we have identified roughly 200 differences at the gene or pseudogene content level, including 78 genes that are restricted to individual species. In particular, the genomes show significant differences to the only other Leishmania genome published (L. major), and there is evidence of the existence of RNA-mediated interference (RNAi) machinery and transposable elements in the genome of the most divergent species, L. braziliensis. These findings suggest that a few species-specific parasite genes are important in pathogenesis, that parasite gene expression levels differ considerably between species (perhaps as a consequence of variation in gene copy number) or that, contrary to expectation, the parasite genome plays only a small part in determining clinical presentation. This study therefore provides a framework for experimentally tractable investigations into the role of a few genes that might influence the tissue-specific expression of disease associated with different Leishmania species.
The L. infantum and L. braziliensis genome sequences were produced by whole-genome shotgun sequencing to five- and sixfold coverage, respectively. Comparative-grade finished sequences were produced by aligning contigs against the reference L. major sequence3 and by using PCR amplification between adjacent contig ends to confirm joins. The resulting assemblies of L. infantum and L. braziliensis contain 470 (N50 contig size of 150,519 bases) and 1,031 contigs (N50 contig size of 57,784 bases), respectively, corresponding to ~98% of the reference 33-Mb haploid genome size (Table 1). As compared with 8,395 annotated genes in the L. major genome3, we found 8,195 and 8,314 genes in the genomes of L. infantum and L. braziliensis, respectively. Genes were manually annotated systematically, facilitated by the strong codon bias of Leishmania species7, conservation of synteny, and the absence of a significant amount of cis splicing. Thus, despite the lack of functional information for more than 50% of the genes identified, these numbers are likely to reflect closely the true gene complement in these species.
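The N50 contig sizes quoted above summarize assembly contiguity. As a minimal sketch of the usual definition (the largest contig length L such that contigs of length L or greater hold at least half of the assembled bases), the calculation can be written as follows. The function name and the toy contig lengths are hypothetical and are not data from this study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")


# Hypothetical toy assembly (illustration only, not data from this study).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

A larger N50 for a similar total length, as reported for L. infantum relative to L. braziliensis, indicates that the assembly is split into fewer, longer contigs.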
About 3–4% of the predicted proteomes of Leishmania spp. comprise conserved amino acid repeats8, which could potentially have a role in pathogenicity. For example, leucine-rich repeats comprise the largest class and can mediate interactions between the parasite surface and macrophage complement receptor9. DNA repeats comprise ~9–10% of the three Leishmania spp. genomes, and L. braziliensis has the largest number of these repeats (data not shown).
Despite an estimated 20–100 million years of separation between the L. Viannia spp. and the L. Leishmania spp. (depending on whether the Leishmania genus was separated by migration events or the breakup of the supercontinent Gondwana10,11), synteny is conserved for more than 99% of genes between the three genomes. Conservation within coding sequences is also high: the average amino acid identity between L. major and L. infantum is 92%, and the average nucleotide identity is 94% (L. major versus L. braziliensis, 77% and 82%, respectively; L. infantum versus L. braziliensis, 77% and 81%, respectively). On the basis of sequence similarity and chromosome architecture, the New World L. braziliensis is clearly an outlier, consistent with its subgenus classification. L. major and L. infantum both have 36 chromosomes, whereas L. braziliensis, consistent with previous linkage analysis, has only 35 chromosomes owing to an apparent fusion of chromosomes 20 and 34 (ref. 12). Unlike many pathogenic protozoa in which subtelomeres play a central part in generating diversity, directional clusters of ‘housekeeping’ genes extend to within 5 kb of the telomeres.
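The identity figures above are averages over pairwise alignments of orthologous sequences. The text does not state exactly how identity was scored, so the following is only a sketch under one common convention: columns containing a gap in either sequence are excluded from the comparison. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Assumption: columns with a gap in either sequence are ignored,
    which is one of several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments: 4 of 5 ungapped columns match.
print(percent_identity("ACGT-A", "ACCTTA"))
```

The same function applies to amino acid or nucleotide alignments, because it only compares characters column by column.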
Sexual reproduction is not an obligatory part of the Leishmania life cycle and may occur only rarely13. Nevertheless, strong selection clearly maintains both the organization and sequence of the Leishmania genomes. A plausible explanation is that there is a spatial constraint on the organization of genes into directional clusters, which are either polycistrons or groups of genes sharing uncharacterized regulatory elements.
In addition to selection pressure acting against chromosomal rearrangements, Leishmania may lack some of the machinery that generates diversity in other eukaryotes. A lack of transposable elements would favor chromosome stability and is seen in the genomes of L. major and L. infantum. In other kinetoplastid parasites, namely T. brucei and T. cruzi, several classes of transposable elements are present (the non–long terminal repeat (LTR) retrotransposons, ingi/L1Tc and SLACS/CZAR and the LTR retrotransposon VIPER), but the L. major genome has only remnants of ingi/L1Tc-related elements (DIREs), suggesting their loss during evolution of the Leishmania lineage14. Similarly, L. infantum and L. braziliensis also contain the ingi/L1Tc DIREs.
Unexpectedly, we found evidence in L. braziliensis for the site-specific non-LTR retrotransposon SLACS/CZAR, which is associated with tandemly repeated spliced leader sequences in an arrangement similar to that of the SLACS or CZAR element in T. brucei or T. cruzi, respectively15,16. In addition, the telomeres of L. braziliensis contain a family of 20–30 previously unknown DNA transposable elements, each including putative reverse transcriptase, phage integrase (site-specific recombinase) and DNA and/or RNA polymerase domains, which we have called ‘telomere-associated transposable elements’ (TATEs; Supplementary Fig. 1 online). The TATEs and their bordering regions are highly conserved and are inserted only in the telomeric hexamer repeats at the same relative position (GGG↑TTA). As observed for most mobile elements, a duplicated motif (TT), present on either side of the transposable element, seems to correspond to a target site duplication. Unlike non-LTR retrotransposons, the TATEs do not contain an APE-like endonuclease domain but they do contain a putative integrase-like domain (site-specific recombinase), related to the transposase domains of other transposable elements, that may contribute to the observed telomeric site specificity. The telomeres seem to contain clusters of tandemly arranged TATEs, including short elements probably derived from full-length elements by internal deletions. It has not been possible to determine the precise organization of the TATEs owing to their repetitive nature.
In many eukaryotes, the effects of retrotransposable elements can be regulated through an RNA silencing mechanism such as RNAi. Despite its demonstration and utility in T. brucei17, RNAi has not been detected in other kinetoplastid species including L. major and T. cruzi6,18. Our comparison revealed genes in L. braziliensis that may be involved in the RNAi pathway (Supplementary Fig. 2 online). A hallmark of this pathway in other eukaryotes is Dicer activity, which converts double-stranded RNA (dsRNA) into small interfering RNA (siRNA). A divergent gene (Tb927.8.2370) encoding a Dicer-like protein (TbDcl1) has been described in T. brucei19. The TbDcl1 protein bears the two RNase III–like domains typical of Dicer and is required for generating siRNA-sized molecules, and its downregulation results in a less efficient RNAi response19. An ortholog of TbDcl1 has not been found in T. cruzi or L. major, trypanosomatids that lack a functional RNAi pathway. L. braziliensis, however, contains a similar gene (LbrM23_V2.0390) that is endowed with two conserved RNase III domains. Dicer activity could also be carried out by a combination of independent proteins carrying the relevant dsRNA-binding domain, DEAD/H box RNA helicase and RNase III domains. The RNase genes implicated in this complex19 are missing in L. major and L. infantum, but present in the L. braziliensis genome at regions of otherwise conserved synteny between the Leishmania species (Supplementary Table 1 online).
Argonaute, an endonuclease involved in the dsRNA-triggered cleavage of mRNA, is another crucial component of the RNAi machinery and, unlike L. major, L. braziliensis contains an ortholog of the functional argonaute gene (TbAGO1) present in T. brucei. A second gene containing an argonaute PIWI domain (TbPWI1), which was originally identified in T. brucei and has orthologs in both Leishmania and T. cruzi, has been shown not to be involved in the RNAi pathway20. TbAGO1 can be functionally replaced by the human gene encoding Argonaute2, suggesting that TbAGO1 encodes the endonuclease activity required for mRNA target degradation in the trypanosome RNAi pathway21. The L. braziliensis gene contains the typical argonaute domains PAZ and PIWI, the latter of which contains key amino acids essential for TbAGO1 activity22. In addition, the L. braziliensis AGO1 gene encodes an amino-terminal RGG domain, which is present in TbAGO1 and shown to be essential for association with polyribosomes22.
Examination of the syntenic regions on chromosome 11 in L. major and L. infantum revealed remnants of AGO1, suggesting that the RNAi machinery has been lost from the Leishmania subgenus to which they both belong (Supplementary Table 1). In the alternative subgenus L. Viannia (which includes L. braziliensis), however, RNA viruses have been characterized23, suggesting that this lineage could have retained RNAi as an antiviral defense mechanism. The RNAi machinery may also have a role in regulating the functions of transposable elements.
So far, only one gene locus has been directly implicated in Leishmania disease tropism. In Leishmania donovani, the causative agent of visceral Leishmaniasis, A2 gene products are required for parasite survival in visceral organs; by contrast, L. major contains only A2 pseudogenes24. Given this precedent, we systematically searched the three genomes in parallel (using ACT software25) for species-specific genes that might contribute to differences in disease presentation, immune response and pathogenicity. Despite the broad differences in disease phenotype, we found that few genes are specific to individual Leishmania species. Table 2 lists those genes that have been ascribed a putative function (the full list is given in Supplementary Table 2 online). We found 5 L. major–specific genes, 26 L. infantum–specific genes and ~47 L. braziliensis–specific genes, which were distributed throughout the genome (Fig. 1) rather than concentrated in subtelomeric regions or breakpoints of directional gene clusters, as previously observed across kinetoplastid species6. In addition to the 47 genes specific to L. braziliensis, an almost equivalent number of genes are present in L. major and L. infantum but absent or degenerate in L. braziliensis.
Given 20–100 million years of divergence within the Leishmania genus, the small number of species-specific differences in gene content is unexpected. For example, more than 1,000 genes differ between the human infective Plasmodium falciparum and the rodent malarial species26, which may have diverged over a similar timescale because the mouse and human lineages diverged from their common ancestor 75 million years ago27.
We found no obvious breaks in synteny or evidence that translocations or segmental duplications have served to generate lineage-specific diversity in Leishmania. We did, however, find clear instances where tandem duplication, followed by diversification, accounts for species-specific differences; for example, copies of a hydrolase gene in L. infantum (LinJ31.3030) and an adenine phosphoribosyltransferase gene in L. braziliensis (LbrM26_V2.0120) seem to have arisen and diverged from an adjacent gene. Larger tandem gene arrays are a characteristic feature of all kinetoplastid parasite genomes6, facilitating increased protein expression in the absence of gene regulation by transcription initiation. Although correctly assembling highly repetitive regions is technically difficult from randomly sequenced DNA, the depth of assembled reads provides an indication of the number of repeat units present in specific regions. The largest family of surface-expressed protein genes in Leishmania, the amastins, are specifically expressed by intracellular parasites in the host28. In L. major, the largest amastin array (comprising 21 out of 54 amastin genes) is interspersed with repeat units of the unrelated tuzin genes that encode proteins of unknown function. Although similar in organization, the amastin-tuzin array seems to be reduced in size by at least half in L. braziliensis (on the basis of the depth of coverage of reads across this repeat region). By contrast, the surface-expressed GP63 zinc metalloproteinases, which function in host cell binding and parasite protection from complement-mediated lysis29, are encoded by a repeated gene cluster that seems to be enlarged fourfold in L. braziliensis as compared with L. major or L. infantum.
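The use of read depth as a proxy for repeat copy number can be illustrated with a short sketch. It assumes that reads from all copies of a tandem repeat have collapsed onto a single assembled unit, so that the ratio of depth over that unit to the typical single-copy depth approximates the number of copies. The function name and the depth values are hypothetical and are not measurements from this study.

```python
from statistics import median


def estimate_copy_number(repeat_depths, background_depths):
    """Estimate repeat copy number from per-base read depth.

    Assumption: the repeat is collapsed to one unit in the assembly,
    so depth scales linearly with the true number of copies.
    """
    background = median(background_depths)
    if background <= 0:
        raise ValueError("background depth must be positive")
    return median(repeat_depths) / background


# Hypothetical depths: about 24x over the collapsed repeat, 6x elsewhere.
print(estimate_copy_number([23, 24, 25], [5, 6, 7]))
```

Medians are used here rather than means so that a few positions with unusually high or low coverage have less influence on the estimate. Comparing such estimates between species is one way to express relative array sizes such as those described for the amastin-tuzin and GP63 clusters.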
A major determinant of lineage-specific differences in gene content seems to be pseudogene formation. The species specificity of ~80% of the genes listed in Table 2 and Supplementary Table 2 can be attributed to the deterioration of an existing coding sequence in the two other species: in each case, there is a degenerate sequence in the corresponding region of synteny in the species that lacks the ‘functional’ gene. This observation contrasts with an analysis of other kinetoplastid species, where gene insertions or substitutions were found more commonly to generate genus-specific sequences6.
We identified 23 pseudogenes, present in all three species, that show little degeneracy, suggesting that they have become pseudogenes recently or are under positive selection (Supplementary Table 2). In addition, they are interrupted by both frameshifts and in-frame stop codons in different positions across the three species (Fig. 2), indicating that they have arisen independently three times in the Leishmania lineage. Strong codon bias, a feature of Leishmania coding sequences, and sequence similarity are maintained in each pseudogene, and in-frame UAG or UAA stop codons are present in almost all, thereby ruling out translation through selenocysteine incorporation, a process that has been described in Leishmania30. For several pseudogenes, non-degenerate orthologs were identified in T. brucei and T. cruzi. Functions could be conceptually ascribed, on the basis of sequence similarity, to 12 pseudogenes, and in many cases relate to housekeeping (for example, carboxypeptidase, phosphoglycerate kinase, oxidoreductase, glutamyl carboxypeptidase, aminoacylase, epsilon-adaptin and beta-adaptin).
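The reasoning about stop codons can be made concrete with a small sketch. Selenocysteine is inserted at UGA codons, so an in-frame UGA alone does not prove that a reading frame is disrupted, whereas an in-frame UAG or UAA does. The code below scans a coding sequence in frame, skips the final complete codon (taken to be the natural terminator), and reports internal stops. The function names and the example sequence are hypothetical and are not the pipeline used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds):
    """Return (codon_index, codon) for each in-frame stop codon,
    excluding the last complete codon of the sequence."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA letters
    n_codons = len(seq) // 3
    stops = []
    for k in range(n_codons - 1):  # skip the terminal codon
        codon = seq[3 * k:3 * k + 3]
        if codon in STOP_CODONS:
            stops.append((k, codon))
    return stops


def rules_out_selenocysteine(stops):
    """True if any internal stop is UAG or UAA, which cannot be
    read through by selenocysteine incorporation at UGA."""
    return any(codon in {"TAG", "TAA"} for _, codon in stops)


# Hypothetical sequence: ATG TAG GCC TGA TAA
example = "ATGTAGGCCTGATAA"
found = internal_stops(example)
print(found)
print(rules_out_selenocysteine(found))
```

Frameshifts are not detected by this scan; they would need a comparison against an intact ortholog, for example by aligning the translated sequence in all three frames.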
Of ~200 genes with a differential distribution between Leishmania species, the functions of only 34% could be annotated on the basis of sequence similarity or protein domain searches (Table 2 and Supplementary Table 2). Some gene products have similarity to proteins of unknown function in different organisms, whereas others are unique to the Leishmania species analyzed. Not surprisingly, a single candidate that might explain the different disease tropisms of the individual species did not emerge; however, many significant gene differences were identified.
One gene in L. infantum, which has become a pseudogene in L. braziliensis but seems to be absent from L. major, encodes a putative phosphatidylinositol or phosphatidylcholine transfer protein (PITP), SEC14 cytosolic factor. An apparently intact ortholog is present in T. cruzi but not in T. brucei. Although the precise role of this protein is unknown, it has been implicated in the budding of secretory vesicles from the trans-Golgi network31 and could therefore influence cell-surface molecule expression in L. infantum, affecting host-parasite interactions as a result.
Another L. infantum gene, which is a pseudogene in the other Leishmania species and T. brucei (but not in T. cruzi), encodes a putative phosphatidylinositol 3-kinase (PI3K). This PI3K has the remnants of a Ras-binding domain, a C2 lipid-binding domain, and accessory and catalytic domains reminiscent of class I PI3Ks present in other eukaryotes, including Dictyostelium discoideum, yeast and mammals. The only true PI3K identified in trypanosomatids so far is VPS34, a class III PI3K present in T. brucei32. Orthologs of VPS34 are present in all Leishmania species, but the L. infantum–specific class I PI3K is novel. Evidence suggests that PI3Ks and PITPs can work synergistically at the trans-Golgi to facilitate vesicle budding33 but, given the properties of class I PI3Ks in other systems and the large number of downstream effectors, the L. infantum PI3K might influence as yet unidentified processes that may have an impact on parasite tropism.
Another L. infantum–specific gene encodes glutathionylspermidine synthase (GspS), which is required for synthesis of the unusual thiol trypanothione that functions in protecting the parasite against oxidative stress. Although both GspS and trypanothione synthetase (TryS) are required to generate trypanothione in the related organism Crithidia fasciculata, a broad specificity trypanothione synthetase substitutes for both GspS and TryS in T. brucei and T. cruzi34. The gene encoding TryS in L. major is also sufficient to generate trypanothione, although a GspS pseudogene is also present in the genome35 and, with only four mutations, could be the result of a recent acquisition. Despite a much greater predicted period of separation, the L. braziliensis genome also has a clearly identifiable GspS pseudogene (with approximately ten mutations) with highly conserved domains.
Cyclopropane fatty acids (CFAs), although rare in eukaryotes, are common plasma membrane components in some bacteria and have been previously detected in lipid extracts from some but not all Leishmania species36. Consistent with that analysis, a single gene encoding cyclopropane fatty acyl phospholipid synthase (CFAS) is present in both L. infantum and L. braziliensis but not in L. major. In bacteria, cyclopropanation by CFAS requires S-adenosyl methionine (as a methylene donor) in a modification predicted to maintain the integrity of the plasma membrane, an important factor in the innate immune response to Mycobacterium tuberculosis infection37. The Leishmania CFAS gene is most similar to its bacterial homologs, and strong phylogenetic evidence (Supplementary Fig. 3 online) suggests that the Leishmania lineage acquired this gene by horizontal transfer (and secondary loss from L. major). Given that neither the enzyme nor its fatty acid modification is present in humans, CFAS is a putative chemotherapeutic target for the most severe form of leishmaniasis. In addition, the presence of this gene in some species but not others may explain published experimental data38 on the effects of the S-adenosyl methionine analog sinefungin, a compound with known antiparasitic properties. This drug inhibits the growth of L. donovani parasites (which are closely related to L. infantum and also have a CFAS gene) but has little effect on L. major38.
A notable absence from the L. braziliensis genome is the multigene HASP/SHERP locus, which encodes the HASP family of hydrophilic acylated surface proteins (expressed exclusively in infective stages of L. major and L. donovani) and the vector-stage–specific SHERP protein39. Although deletion of this region in L. major does not influence virulence, its overexpression causes increased sensitivity to complement-mediated parasite lysis and reduced viability in host macrophages40.
In addition to the small number of species-specific and differentially distributed genes, other genetic factors are likely to define the differences between the species. We therefore searched for genes with signatures of positive selection as an indicator that they may be involved in host-pathogen interactions (Methods). Those genes with the highest ratios of non-synonymous to synonymous mutations (dN/dS) were, for the most part, involved in undefined biological processes (Supplementary Table 3 online). We found, however, that ~8% of genes seem to be evolving at different rates between the three Leishmania species (Supplementary Table 4 online) and are involved in a spectrum of core processes (including transport, biopolymer metabolism, cellular metabolism, lipid metabolism and RNA metabolism), which might influence parasite survival in the host and disease outcome (Supplementary Table 5 online).
Comparisons of the complete genomes of three species of Leishmania have revealed a greater extent of synteny and similarity than would be expected, given their predicted period of separation. Contrary to previous comparisons of distantly related kinetoplastid genomes, gene loss and pseudogene formation are the principal factors shaping the Leishmania genomes. We have found little evidence of lineage-specific genetic acquisition accounting for differences between these parasite species.
Given our poor understanding of the way in which different human-infective species of the Leishmania genus cause diverse clinical disease, the identification of only a few differentially distributed parasite genes should facilitate timely experimental verification of their role in disease development. In addition, the unexpected identification of a putative RNAi pathway increases the likelihood that the findings from the three genome projects can be translated into insights into gene function. The potential to manipulate gene expression by RNAi, perhaps by using a tetracycline-inducible promoter system (as demonstrated in L. donovani41), may be especially useful to complement the classical ‘two-step gene knockout’ strategy for disruption of Leishmania gene function42. Identification of a few genes that are either species-specific or under positive selective pressure provides a comprehensive and manageable resource to target efforts in identifying parasite factors that influence infection. Conversely, factors that are unique to the Leishmania genus but common to all species may be used as potential drug targets or vaccine candidates.
Details of the sequenced L. major strain have been published3. L. infantum JPCM5 (MCAN/ES/98/LLM-877)43 and L. (Viannia) braziliensis M2904 (MHOM/BR/75M2904)44 were the strains selected for analysis here. The L. infantum JPC (MCAN/ES/98/LLM-724) strain, from which the JPCM5 clone used in the sequencing project was derived, was isolated in the WHO Collaborating Centre for Leishmaniasis, ISCIII, Madrid, Spain, from the spleen of a naturally infected dog residing in the area in 1998 (ref. 43). The parasites were tested for virulence by inoculation into hamsters: parasites were recovered from the spleen 15 weeks after infection. The parasites also infected the human U937 macrophage cell line and the dog DH82 macrophage cell line43.
L. (Viannia) braziliensis clone LB2904 (MHOM/BR/75M2904) is a reference strain from Evandro Chagas Institute, Belém, Brazil. This strain was isolated by direct culture from a lesion on the right side of the thorax of a man who had been performing survey work in Serra dos Carajás, Brazilian Amazonia. The LB2904 clone is infective in hamsters and BALB/c mice and can be genetically transfected and cloned on plates. The L. infantum and L. braziliensis strains used are available on request from D.F.S. or J.C.M., and A.K.C., respectively.
The following methodology for sequencing, assembly, finishing and annotation applies to both L. infantum and L. braziliensis. A whole-genome shotgun strategy was used and produced roughly sixfold coverage of the whole genome from plasmid clones containing small fragments of up to 4 kb inserted into the pUC19 vector (Sanger Institute). Problems associated with high G+C sequence were addressed by optimizing the sequencing mixture (a 4:1 ratio of standard Big Dye terminator mix and dGTP Big Dye mix with the addition of dimethylsulfoxide). Sequence reads were assembled with PHRED/PHRAP on the basis of overlapping sequence and were edited in a GAP4 database45. The quality of the reads for both projects was similar: 91.5% of L. infantum and 92.7% of L. braziliensis bases had a quality score (derived from the PHRED score generated by GAP4; ref. 45) >70 (P = 1.0 × 10^−7). In comparison, in the finished genome of L. major 96.8% of bases exceeded this value.
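The relationship between a quality score of 70 and the quoted error probability follows from the standard Phred scale, Q = −10 log10(P). The sketch below assumes that the GAP4-derived score follows this scale; the function names are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


# A score of 70 corresponds to about one expected error in ten million bases.
print(phred_to_error_probability(70))
print(error_probability_to_phred(1e-7))
```

Each increase of 10 in the score corresponds to a tenfold drop in the estimated probability that a base is wrong.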
Regions containing repeat sequences or with an unexpected read depth were manually inspected. We used positional information from sequenced read-pairs to help to resolve the orientation and position of contigs. Pre-finishing used an automated in-house software program (Auto-Prefinish) to identify primers and clones for additional sequencing to close physical and sequence gaps by oligo-walking. In addition, end sequences from an L. braziliensis fosmid library (4–5-fold clone coverage) were produced to provide paired-read information from 40-kb inserts. The assembled contigs were iteratively ordered and orientated by alignment to the L. major genome sequence and by manual checking. In particular, we re-examined regions with apparent breaks in synteny for potential mis-assembly errors or genuine breaks. Information from orientated read-pairs, together with additional sequencing from selected large insert clones, was used to resolve potential mis-assemblies. Version 2 of the L. infantum and L. braziliensis genomes was used for the subsequent analyses reported here.
Manual annotation of the L. major genome3 was transferred to the assembled genomes of both L. infantum and L. braziliensis on the basis of BLASTp matches and positional information by using an in-house Perl script. Gene models were manually inspected and further edited, where appropriate, with Artemis software46. New gene models were identified by using a combination of CodonUsage47 and Hexamer48, and by visualizing tBLASTx comparisons of regions with conserved synteny using ACT software25. We compared protein sequences against the non-redundant protein database UniProt and an in-house kinetoplastid-only database. Repetitive regions can largely account for small discrepancies in apparent sequence coverage and gene number.
For the dN/dS analysis, three-way positional orthologs were identified by a combination of reciprocal BLAST and manual curation of conserved synteny regions. Codon-based alignments were produced by using codeml from the PAML package49 and the settings: model = 0 (one dN/dS estimate over whole tree) for the dN/dS_tree estimates, and model = 1 (one dN/dS estimate for each branch of tree) for the dN/dS_branch estimates, with the assumption that orthologous rates were equivalent. dN/dS estimates were considered significantly different between species if 2(lnL_model1 – lnL_model0) > 5.991 (5% χ2 critical value with 2 d.f.). Genes with dN/dS > 5, or 2(lnL_model1 – lnL_model0) ≤ 0 were excluded from further analysis. Mann-Whitney tests were used to determine whether groups of genes had significantly higher or lower dN/dS values as compared with all other genes. A Kruskal-Wallis test was used to determine whether differences in dN/dS_branch values were significant between species for genes grouped by gene ontology category.
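The likelihood ratio test described above compares the one-ratio model (model = 0) with the free-ratio model (model = 1). For a χ2 distribution with 2 d.f. the survival function has the closed form exp(−x/2), so the critical value and P value can be computed without a statistics library; at the 5% level the critical value is about 5.99. The sketch below is an illustration only: the function name and the log-likelihood values are hypothetical, and it does not reproduce the authors' scripts.

```python
import math


def likelihood_ratio_test_2df(lnl_model0, lnl_model1, alpha=0.05):
    """Likelihood ratio test against a chi-square with 2 d.f.

    Returns (statistic, critical_value, p_value, significant).
    For 2 d.f. the chi-square survival function is exp(-x / 2).
    """
    statistic = 2 * (lnl_model1 - lnl_model0)
    critical_value = -2 * math.log(alpha)
    p_value = math.exp(-statistic / 2) if statistic > 0 else 1.0
    return statistic, critical_value, p_value, statistic > critical_value


# Hypothetical log-likelihoods for one ortholog trio.
stat, crit, p, significant = likelihood_ratio_test_2df(-1000.0, -996.0)
print(round(stat, 3), round(crit, 3), round(p, 4), significant)
```

Genes for which the statistic is zero or negative are not informative, which matches the exclusion rule stated in the text.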
The CFAS gene was identified as a potential lateral transfer by similarity searching (BLASTp) against the GenBank non-redundant protein database using the L. infantum CFAS sequence as query. To assemble the data set for phylogenetic analysis, all sequences with an e-value of <10^−30 were downloaded. Note that, although eukaryotes were not specifically excluded from this process, none of the eukaryotic sequences in GenBank, which includes the completely sequenced genomes of Trypanosoma cruzi and Trypanosoma brucei, met the e-value cut-off criterion.
Sequences were aligned with MUSCLE using default parameters. Regions of poor alignment where homology could not be ascertained with confidence were identified by eye and excluded. We conducted preliminary analyses of all sequences by unweighted parsimony using PAUP. The data set was narrowed down through successive rounds of analysis and sequence removal to obtain a final subset of sequences that were broadly representative of the full data set.
The final tree was derived by Bayesian inference using a mixture of amino acid models. Alignment positions were weighted according to evolutionary rate by using a four-category γ-distribution with the shape parameter α calculated by the program on the basis of a neighbor-joining tree. Analyses consisted of two sets of four chains run for 600,000 generations with results saved every 1,000 generations. Analyses were run until both sets of chains converged (split frequency = 0.007), and tree topology and posterior probabilities were calculated after discarding a 25% burn-in (150 trees). The tree topology was further tested with 100 replicates of maximum likelihood bootstrapping by the program PhyML using a JTT substitution model with a four-category γ-distribution and with the shape parameter α calculated by the program.
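The burn-in figure quoted above follows from simple arithmetic: 600,000 generations sampled every 1,000 generations give 600 saved trees per run, and 25% of 600 is 150. The short sketch below reproduces this, ignoring any extra sample that a program may store at generation zero; the function name is hypothetical.

```python
def burn_in_samples(generations, sample_every, burn_in_fraction):
    """Return (saved_samples, discarded_samples) for an MCMC run.

    Assumption: one sample is counted per sampling interval and any
    sample stored at generation zero is ignored.
    """
    saved = generations // sample_every
    discarded = int(saved * burn_in_fraction)
    return saved, discarded


saved, discarded = burn_in_samples(600_000, 1_000, 0.25)
print(saved, discarded, saved - discarded)
```

Under these assumptions, 450 trees per run remain for estimating the topology and posterior probabilities.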
We acknowledge the support of the Wellcome Trust Sanger Institute core sequencing and informatics groups. We thank N. Goldman (European Bioinformatics Institute) for advice on the evolutionary analysis, C. Hertz-Fowler for help in constructing the figures, J. Shaw for his help in selecting the strain for the L. braziliensis genome sequencing project and D. Harper for quality scores on the sequencing projects. This study was funded by the Wellcome Trust through its support of the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute. L.O.B. and J.C.R. were recipients of Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) fellowships. D.P.D. was supported by a postgraduate studentship from the Biotechnology and Biological Sciences Research Council. J.C.R. received financial support from the UNICEF/UNDP/WORLD BANK/WHO Special Programme for Research and Training in Tropical Diseases (TDR).
Accession codes. European Molecular Biology Laboratory (EMBL): L. infantum chromosomes 1–36, AM502219 to AM502254; L. braziliensis chromosomes 1–35, AM494938 to AM494972.
URLs. The L. infantum and L. braziliensis genome sequencing reads, quality files and annotated consensus sequences can be accessed from the following FTP sites: ftp://ftp.sanger.ac.uk/pub/pathogens/L_infantum/, ftp://ftp.sanger.ac.uk/pub/pathogens/L_braziliensis/. The fully annotated genomes for all three species of Leishmania are also available for searching, viewing and downloading at the GeneDB database (http://www.genedb.org). Other URLs: MUSCLE, http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py; PAUP, http://www.molecularevolution.org/software/paup*/; PhyML, http://atgc.lirmm.fr/phyml/; pUC19 vector information, http://www.sanger.ac.uk/Teams/Team53/psub/sequences/pUC19.shtml; RepeatMasker, http://www.repeatmasker.org/; TDR Leishmaniasis URL, http://www.who.int/tdr/diseases/leish.
Supplementary information is available on the Nature Genetics website.
COMPETING INTERESTS STATEMENT The authors declare no competing financial interests.
Published online at http://www.nature.com/naturegenetics