Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. We report here analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and novel families of micro-exon genes that undergo frequent alternative splicing. As the first sequenced flatworm, and a representative of the lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, while the identification of membrane receptors, ion channels and more than 300 proteases provides new insights into the biology of the life cycle and novel targets. Bioinformatics approaches have identified metabolic chokepoints while a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
Schistosomiasis is a Neglected Tropical Disease that ranks with malaria and tuberculosis as a major source of morbidity, affecting approximately 210 million people in 76 countries of the world despite strenuous control efforts1. It is caused by blood flukes of the genus Schistosoma (Phylum Platyhelminthes), which exhibit dioecy and have complex life cycles comprising multiple morphologically distinct phenotypes in definitive human and intermediate snail hosts. S. mansoni, one of the three major human species, occurs across much of sub-Saharan Africa, parts of the Middle East, Brazil, Venezuela and some West Indian islands. The mature flukes dwell in the human portal vasculature, depositing eggs in the intestinal wall that either pass to the gut lumen and are voided in the faeces, or travel to the liver where they trigger immune-mediated granuloma formation and peri-portal fibrosis2. Approximately 280,000 deaths per annum are attributable to schistosomiasis in sub-Saharan Africa alone3. However, the disease is better known for its chronicity and debilitating morbidity4. A single drug, praziquantel, is almost exclusively used to treat the infection, but this does not prevent reinfection and, with the large-scale control programs in place, there is concern about the development of drug resistance. Indeed, resistance can be selected for in the laboratory and there are reports of increased drug tolerance in the field5.
In this study we present the sequence and analysis of the S. mansoni genome. Previous metazoan projects have been restricted to Deuterostomia (e.g. Homo, Mus, Ciona) and the ecdysozoan clade of the Protostomia (e.g. Drosophila, Caenorhabditis, Brugia). Together with the accompanying article on S. japonicum, we present the first descriptions of metazoan genomes from the lophotrochozoan clade. The genome reveals features that aid our understanding of the evolution of complex body plans. We have mined the genome to predict new drug targets, based on searches involving traditional areas for drug discovery, metabolic reconstruction, and bioinformatics screens that exploit shared pharmacology. It is hoped that these, and other, targets will accelerate drug discovery, generating the much needed new treatments for control and eradication of schistosomiasis.
The nuclear genome sequence of S. mansoni was determined by whole genome shotgun and assembled into 5,745 scaffolds greater than 2 kb (Supplementary Table 1), totalling 363 megabases (Mb). Although 40% of the genome is repetitive, 50% is assembled into scaffolds of at least 824.5 kb. Furthermore, 43% of the genome assembly (distributed over 153 scaffolds) was unambiguously assigned to chromosomes (seven autosomal pairs plus the ZW sex-determination pair) using Fluorescence In Situ Hybridization (Fig. 1, Supplementary Fig. 1 and Supplementary Table 2).
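The statement that half of the assembly sits in scaffolds of at least 824.5 kb is what is commonly called an N50 value. As a minimal sketch of how such a figure can be computed, assuming only a list of scaffold lengths (the function name `n50` and the toy lengths are illustrative and are not taken from the assembly itself):

```python
def n50(lengths):
    """Return the N50 of a set of scaffold lengths.

    The N50 is the length L such that scaffolds of length >= L
    together contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy scaffold lengths in kb (hypothetical values, not the real assembly).
toy_scaffolds = [900, 800, 400, 300, 200, 100]
print(n50(toy_scaffolds))
```

Sorting from longest to shortest and stopping at the first scaffold that pushes the running total past half of the assembly gives the value directly, without needing any sequence data.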
We identified 72 families of both LTR and non-LTR transposons, comprising 5% and 15% of the genome, respectively, and containing 63 and 60 new families each (Supplementary Table 3). The LTR transposons are from the Gypsy/Ty3 and BEL clades while the non-LTR transposons are restricted to the RTE, CR1 and R2 clades. Two previously described non-LTR retrotransposon families from the RTE clade (SR2 and perere-3)6,7 appear to have undergone a burst of transposition events after divergence of S. mansoni and S. japonicum and contribute to an overall higher representation of non-LTR retrotransposons in S. mansoni (15%, cf. 8% in S. japonicum). A novel DNA transposon belonging to the Mu family was also found, which represents the first instance in a flatworm. The presence of target site duplications in some copies implies recent transposition and suggests that active copies may still exist in the genome. A lack of terminal inverted repeats, a feature of Mu family members, suggests a peculiar mechanism for recognition of this element by the transposition apparatus.
We identified 11,809 putative genes encoding 13,197 transcripts. Considering genes that do not span a gap, the average gene size is 4.7 kb, typically with large introns (average is 1692 bp) and much smaller exons (average is 217 bp). Moreover, the introns display a strikingly skewed size distribution that has not been observed in other eukaryotes, whereby 5′ introns are smaller than 3′ introns (Fig. 2, Supplementary Information, Supplementary Table 5). In multi-exon genes the first few introns can be as small as 26 bp, whereas introns towards the 3′ end are typically kilobases in length (largest is 33.8 kb). The reason for this is unclear but suggests unusual transcriptional control. However, a survey of conserved transcription factor domains reveals S. mansoni to be broadly similar to other eukaryotes (Supplementary Information, Supplementary Fig. 2 and Supplementary Table 6). It is noteworthy that 43% of transcription factor families with schistosome representatives also contained vertebrate sequences, nearly twice the number that matched nematode worms, emphasising their evolutionary distance.
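The skew in intron size can be made concrete by averaging intron length at each ordinal position counted from the 5′ end. The following is a minimal sketch under the assumption that each gene is available as a list of intron lengths in 5′ to 3′ order; the function name and the toy values are hypothetical and do not reproduce the analysis behind Fig. 2.

```python
from collections import defaultdict
from statistics import mean


def mean_intron_length_by_position(genes):
    """Average intron length at each position, counted from the 5' end.

    `genes` is an iterable of lists; each list holds the intron lengths
    (in bp) of one gene, ordered 5' to 3'.
    """
    by_position = defaultdict(list)
    for introns in genes:
        for position, length in enumerate(introns, start=1):
            by_position[position].append(length)
    return {pos: mean(vals) for pos, vals in sorted(by_position.items())}


# Toy genes (hypothetical lengths) with small 5' introns and large 3' introns.
toy_genes = [
    [30, 40, 2500],
    [26, 1800],
    [35, 900, 3000, 5000],
]
for position, avg in mean_intron_length_by_position(toy_genes).items():
    print(position, round(avg))
```

A profile that rises steadily with position, as in the toy data, is the pattern the text describes.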
At least 45 genes have an unusual micro-exon structure. Individual micro-exons have been described in other genomes, dispersed among numbers of normal exons8. However, S. mansoni is remarkable in containing micro-exon genes (MEGs) where micro-exons comprise 75% of the coding sequence, are flanked at the 5′ and 3′ extremes by conventional exons and have lengths that are multiples of three bases (from 6 to 36).
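The structural criteria for a micro-exon gene given above can be written as a simple rule. This is a sketch only: it takes the thresholds from the text (micro-exons of 6 to 36 bases in multiples of three, making up 75% of the coding sequence, flanked by conventional exons), uses total exon length as a stand-in for coding sequence length, and uses hypothetical function names. It is not presented as the procedure actually used to find the MEGs.

```python
def is_micro_exon(length, min_len=6, max_len=36):
    """True if an exon length is a multiple of three between 6 and 36 bases."""
    return min_len <= length <= max_len and length % 3 == 0


def looks_like_meg(exon_lengths, min_fraction=0.75):
    """Apply the structural MEG criteria to exon lengths ordered 5' to 3'.

    Assumption: the summed exon length approximates the coding sequence.
    """
    if len(exon_lengths) < 3:
        return False
    first, *internal, last = exon_lengths
    # The 5' and 3' extremes must be conventional (non-micro) exons.
    if is_micro_exon(first) or is_micro_exon(last):
        return False
    micro_bases = sum(n for n in internal if is_micro_exon(n))
    return micro_bases / sum(exon_lengths) >= min_fraction


# Hypothetical gene: two conventional flanking exons around twelve 27-base micro-exons.
print(looks_like_meg([40] + [27] * 12 + [50]))
# Hypothetical gene built only from conventional exons.
print(looks_like_meg([217, 217, 217]))
```

Because every micro-exon length is a multiple of three, skipping any subset of them leaves the reading frame intact, which is consistent with the exon-skipping variants reported for these genes.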
Other than having shared gene structure, no similarity could be detected between 14 MEG families (each with up to 23 members; Fig. 3 and Supplementary Table 7). Moreover, they displayed no similarity with annotated genes from outside Schistosoma spp, nor any identifiable motifs or functional domains. Comparisons between MEG family members and related proteins from S. japonicum suggest that some gene duplication events preceded the divergence of the two species. Almost all encode a signal peptide at the 5′ end and three have membrane anchors, so most are probably secreted. Examination of the large EST data set from across the life cycle reveals that genes from all MEG families are transcribed in the intramammalian stages of the life cycle, and the germ balls of daughter sporocysts that develop into infective cercariae, but probably not in miracidia that infect the snail intermediate host (Fig. 3).
Sequencing of transcripts from three MEG families revealed the occurrence of numerous alternative splice variants formed by exon skipping. In one of the families analyzed, all internal exons except those coding for the signal peptide were missing in at least one transcript sampled and a gene from a second family presented different transcripts with extended exons produced by the use of alternative splicing sites. These observations suggest a ‘pick and mix’ strategy is used to create protein variation.
Schistosomes are the first Platyhelminthes to be fully sequenced and provide insights into the evolution of ‘simple’ animals. Using Treefam to make comparisons with the sea anemone Nematostella vectensis, a representative of the Radiata, we sought gene families restricted to, or expanded in the Bilateria (Supplementary Table 8). The advent of a third germ layer in flatworms is paralleled by the expansion of genes encoding cell adhesion molecules such as cadherins. Similarly, tissue-patterning developmental cues (e.g. Notch/Delta) and histone-modifying enzymes (e.g. histone acetyltransferases) have proliferated. Some genes such as the Tetraspanins that encode membrane structural proteins have greatly proliferated in schistosomes suggesting a critical role in worm physiology/parasitism. The large array of paralogues for fucosyl and xylosyltransferases involved in the generation of novel glycans expressed at the host-parasite interface, may be important for subverting the immune system. The expansion of proteases in schistosomes also appears directly related to parasitism, as it includes families involved in host invasion (Invadolysins) and blood feeding (Cathepsins). Finally, G-Protein-coupled receptors show varying levels of contraction in schistosomes whereas several classes (e.g. peropsins) are greatly expanded in Nematostella implying functions associated with the free-living lifestyle.
Although schistosomes are acoelomate, they possess tissues approaching the sophistication of organs such as gut, nephridia, nerve and muscle, concerned with discrete physiological processes such as feeding, excretion and locomotion. However, as lophotrochozoans they are evolutionarily distant from the previously sequenced parasitic nematodes, Brugia9 and Meloidogyne10,11 (both ecdysozoans). Compartmentalisation of schistosome tissues and the formation of epithelial barriers are crucial for life in the hostile environment of the host bloodstream. Schistosomes possess the typical machinery of higher metazoa to interact with the cytoskeleton and control cell polarity (Supplementary Information, Supplementary Table 9), organise epithelia and denote tissue boundary lines.
S. mansoni possesses a nervous system that includes an anterior brain and longitudinal nerve cords, which extend from the brain to run the length of the worm body. In addition, a variety of sensory structures (at least six types in the cercaria12) are able to transduce a wide range of stimuli that assist in host location, penetration and navigation through the vasculature. In common with more complex organisms, schistosomes possess the tools needed to mediate neurogenesis and control axon growth cones and migration of neural cells (Supplementary Information, Supplementary Table 9), supporting the ancient origins of neural complexity.
Historically, anti-schistosomiasis agents were identified by in vivo screening in animal models. The S. mansoni genome project makes a more target-based approach to drug discovery feasible and some promising leads have already emerged. These include a family of nuclear receptors13 (Supplementary Information) and a redox enzyme, thioredoxin glutathione reductase, recently validated as a drug target14. The condensed redox biochemistry of S. mansoni, relative to its human host, may offer further drug development targets (Supplementary Information). In the context of drug discovery, we have explored other potential areas of vulnerability: lipid metabolism; G-protein coupled receptors; ligand- and voltage-gated ion channels; kinases; proteases; and neuropeptides. We also undertook two bioinformatics-led approaches: metabolic reconstruction to identify chokepoints, and sequence searches for structures related to known drug targets.
S. mansoni contains a full complement of genes required for most core metabolic processes, such as glycolysis, tricarboxylic acid cycle and the pentose phosphate pathway. However, schistosomes are incapable of de novo synthesis of sterols or free fatty acids and must utilise complex precursors from the host15. An extensive lipid carrying protein repertoire could be identified but, despite producing precursors for fatty acid synthesis, fatty acid synthase could not be identified. An inability to utilize isoprene products of the mevalonate pathway most likely accounts for the lack of sterol biosynthesis (Supplementary Table 11, Supplementary Information). The genes necessary for a complete beta oxidation pathway are present, and this usually inactive pathway might operate in reverse to perform syntheses16. Despite constituting 40% or more of the lipid content of adult worms15, triacylglycerols play an uncertain role in the schistosome's life cycle; they are slow to turn over, do not contribute to the formation of other lipids15 and their use as an energy store is doubtful16. Nevertheless, S. mansoni possesses lipases capable of breaking down triacylglycerols, so they may have functions beyond preventing excessively high concentrations of intracellular fatty acids15. Pathways responsible for synthesizing the phospholipid components of membranes are well represented except that phosphatidylcholine must be derived from diacylglycerol17 and the parasite must depend on its host as a source of inositol.
G-protein coupled receptors (GPCRs), ligand-gated ion channels (LGICs) and voltage-gated ion channels (VGICs) are targets for 50% of all current pharmaceuticals18. At least 92 putative GPCR-encoding genes are present (Supplementary Table 12), the bulk (82) from the rhodopsin family. The largest groups are the alpha subfamily (30), which includes amine receptors, and the beta subfamily (24), which contains neuropeptide and hormone receptors. The diversity of the former subfamily underlines the wide range of potential amine/neurotransmitter reactivities of schistosomes but the tentative identities assigned need to be confirmed by functional studies, as has already been performed for a histamine receptor19. Schistosomes detect chemosensory cues but a large, unique clade of the mediating receptors was not found. However, the 26 “orphan” rhodopsin family GPCRs may include proteins with this role. Outside the large rhodopsin family, representatives from each of the smaller families of GPCRs, glutamate family (2), frizzled family (3), and the secretin/adhesion family (4) are present.
Each of the three major LGIC families, the Cys-loop family, glutamate-activated cation channels, and ATP-gated ion channels, is represented in the schistosome genome. Of the 13 Cys-loop family LGICs, nine encode nicotinic acetylcholine receptor subunits (Supplementary Fig. 4 and Supplementary Table 13). The remaining four anion channel subunits group amongst GABA, glycine and glutamate receptors but it is not possible to assign precise identities. The seven schistosome glutamate-activated cation channels comprise at least two sequences from each of the three common sub-groupings. The presence of a functional P2X receptor for ATP-mediated signalling in schistosomes was already known20, and the data here reveal at least four more.
VGICs generate and control membrane potential in excitable cells and are central to ionic homeostasis. There are examples of successful drugs targeting voltage-gated sodium, potassium and calcium channels21. Although voltage-gated sodium channels were not found, at least 41 members from each of the major six-transmembrane (6TM) and four-transmembrane (4TM) families of potassium channel (Supplementary Table 14) are present. The 6TM voltage-gated potassium channel family (20 members) is the largest, including the well-characterized Kv1.1 channel found in nerve and muscle of adult schistosomes22. Other classes of 6TM potassium channels include the KQT channels, large calcium-activated channels, small calcium-activated channels, and cyclic-nucleotide-gated groups. This last group, comprising 8 members, is most often associated with signal transduction in primary olfactory and visual sensory cells (C. elegans has only five23). S. mansoni possesses six 4TM inward-rectifying TWIK-related potassium channels (cf. 46 in C. elegans). There are four alpha and two beta subunits of voltage-gated calcium channels in schistosomes and a beta subunit is implicated as a molecular target of the antischistosomal praziquantel24.
Protein kinases are important regulators of many different cellular functions. Both they and their inhibitors have entered the drug development pipeline in recent years25 but few schistosome kinases have been characterized to date. The S. mansoni genome encodes 249 kinases, including 22 genes with alternative splicing (Supplementary Information). This corresponds to 1.9% of the total coding proteins in the genome, a figure comparable to that found in other species26 (Supplementary Fig. 6). S. mansoni possesses representatives of all of the main kinase groups (Supplementary Fig. 7), the largest of which is the CMGC (cyclin-dependent, MAP-, glycogen synthase kinase 3 and CK2-related kinases) group, in contrast to other analysed eukaryotic genomes. However, a single class (RCK) is absent from the CMGC family, a deficiency shared with yeast but not nematodes or mammals.
The least represented groups are the Casein Kinase (CK1) and Receptor Guanylate Cyclase (RGC) families with only 7 and 3 members, respectively, contrasting with C. elegans where CK1 is the largest group and RGC has 27 members. CK1 (and CMGC) group members that are expressed in sperm or during spermatogenesis in C. elegans, are missing in S. mansoni.
Proteolytic enzymes (proteases), making up an organism's ‘degradome’27, operate in virtually every biological and pathological phenomenon28 and are proven drug targets in diverse biomedical contexts29,30. All five major classes of proteases (aspartic, cysteine, metallo-, serine and threonine) are represented as various clans (mechanistically related groups) in the parasite genome (Supplementary Table 17). The percentage distribution of the major clans is generally similar to that of the human host with some notable exceptions, mainly due to the expansion of constituent protease families in humans. Sixty-one of the 73 protease families found in humans are also in S. mansoni and 60 families are shared. With 335 sequences, proteases comprise 2.5% of the putative proteome (Supplementary Table 18), consistent with the proportion in other organisms (1-5%), but only one-third that in humans (945 sequences, if A2 family retrovirus and retrotransposon proteases are included).
The greatest difference between host and parasite is in the paucity of Family S1 chymotrypsin-like enzymes in the latter (22 vs. 135 human sequences). This reflects the evolution and diversification of Family S1 for complex and highly regulated proteolysis cascades in vertebrates and some invertebrates such as innate immunity, development, blood coagulation and complement activation31-33. From a therapeutic standpoint, the reduced complexity may prove valuable with fewer parasite proteases available for essential life-sustaining functions. For example, robust drug discovery programs are in place for Families S134 and C14 (caspases)35, upon which anti-schistosomal drug discovery could ‘piggy-back’36. It is also notable that a smaller number of schistosome protease families (e.g., C1, M8 and M13) have more members than the respective families in humans. C1 proteases are involved in nutrient digestion by the parasite, which contrasts with the S1 enzymes employed in the host. This disparity has already been exploited for a promising anti-schistosome therapy37. One protease family (C83) is apparently unique to S. mansoni.
Apart from the degradome, but involved in its modulation, 34 protease inhibitors were found (Supplementary Table 19). The majority of these are serine protease inhibitors belonging to Families I2 (Kunitz-type) and I4 (serpins). Two inhibitors of cysteine proteases (cystatins38,39) and two alpha-2-macroglobulin homologues (I39) were also identified, as were three Inhibitor of Apoptosis proteins (I32), one of which is highly expressed in adults, where it may function to regulate one or more of the four schistosome caspases.
Thirteen putative neuropeptides were identified (Supplementary Table 20), indicating that schistosomes may display much greater diversity than the two described previously. Apart from the neuropeptide Fs (NPFs), most are apparently restricted to the Platyhelminthes, their absence from humans making them a credible source of anthelmintic drug leads. The predicted product of npp-6 (AVRLMRLamide) resembles molluscan myomodulin, while the two NPP-13 peptides display 100% C-terminal identity with vertebrate neuropeptide-FF-like peptides (PQRFamides); neither of these has previously been reported in any non-vertebrate organism. The discovery of a second NPF (Sm-NPP-21b) additional to the known Sm-NPP-21a40 is reminiscent of the vertebrate neuropeptide Y (NPY) superfamily, and strengthens the argument that NPFs and NPYs have a common ancestry.
A chokepoint analysis of metabolic pathways reconstructed from the S. mansoni genome was used to identify additional targets. A total of 607 enzymatic reactions could be placed in pathways and 120 of these enzymes were identified as chokepoints (Supplementary Table 21). The list of chokepoints includes many that are drug targets in other organisms as well as target reactions already characterized in S. mansoni, validating the approach (Supplementary Information). The list also contains new candidate targets and comprises approximately 1% of the S. mansoni proteome.
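For readers unfamiliar with the term, a chokepoint reaction is commonly defined as one that is the only consumer of some substrate or the only producer of some product in the reconstructed network. The sketch below applies that common definition to a toy network; it assumes this is the definition in use here, and the function name, reaction labels and metabolites are all hypothetical.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return reactions that uniquely consume or uniquely produce a metabolite.

    `reactions` maps a reaction name to a (substrates, products) pair of sets.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = set()
    for users in list(consumers.values()) + list(producers.values()):
        if len(users) == 1:
            chokepoints |= users
    return chokepoints


# Toy network (hypothetical): R4 shares both its substrate and its product
# with other reactions, so it is the only non-chokepoint.
toy_network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"C"}),
    "R3": ({"B", "C"}, {"D"}),
    "R4": ({"C"}, {"D"}),
}
print(sorted(find_chokepoints(toy_network)))
```

Inhibiting a chokepoint enzyme is expected either to starve the cell of a product or to let a substrate accumulate, which is why such reactions are attractive as candidate targets. Published analyses usually also exclude dead-end metabolites, a refinement omitted from this sketch.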
In the context of Neglected Tropical Diseases, with constrained investment in drug discovery, ‘piggy-backing’36 or ‘drug-repositioning’ strategies41 that re-use existing drugs offer potential time savings and cost benefits. We adopted a two-fold strategy to find significant matches between proteins from the parasite and known ‘druggable’ protein targets of the human host and human-infective pathogens. Using conservative parameters of > 50% sequence identity over > 80% of the target, we first performed a similarity search against a database of targets curated from medicinal chemistry literature. This revealed 240 distinct S. mansoni transcripts with matches to targets against which there are high quality compounds (Supplementary Table 22). Given the need for short-course, oral therapies against schistosomiasis, this list was further reduced to 94 S. mansoni targets by filtering for potency and predicted bioavailability. A second search, against a database of the targets for human-directed drugs, revealed 66 significant matches with currently marketed pharmaceuticals (Supplementary Table 23), corresponding to 34 S. mansoni targets (26, after representing multicopy genes as a single instance; Table 1). For instance, disulfiram, for controlling substance abuse, was highlighted as a potential anti-schistosomal drug; its anti-parasite properties have already been investigated42. Manual inspection of the list for compounds with side effects and toxicity can further refine choices, e.g., by eliminating the immunosuppressants cyclosporin and rapamycin. The remaining known drugs could be directly tested in animal models, and either applied unmodified in anti-schistosomal therapy, or could serve as leads for further optimisation. Widening the search beyond the initial strict criteria would expand opportunities, e.g. Topoisomerase 1 is retrieved below our initial threshold, at 71% identity but only 58% overlap.
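The two thresholds used in the similarity search (> 50% identity over > 80% of the target) amount to a simple filter on alignment hits. The following sketch shows that filter under the assumption that each hit has already been summarised as a percent identity and a percent coverage of the target; the `Hit` class, the query labels and the first hit's values are hypothetical, while the Topoisomerase 1 figures are those quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str               # parasite protein (hypothetical label)
    target: str              # known drug target
    percent_identity: float  # identity across the aligned region
    target_coverage: float   # percent of the target covered by the alignment


def passes_strict_filter(hit, min_identity=50.0, min_coverage=80.0):
    """Keep hits exceeding both thresholds stated in the text."""
    return hit.percent_identity > min_identity and hit.target_coverage > min_coverage


hits = [
    Hit("query_a", "target_X", 62.0, 91.0),         # hypothetical values
    Hit("query_b", "Topoisomerase 1", 71.0, 58.0),  # values quoted in the text
]
kept = [h.target for h in hits if passes_strict_filter(h)]
print(kept)
```

With these settings the Topoisomerase 1 match is dropped on coverage alone, despite its high identity. Lowering `min_coverage` is one way to widen the search as the text suggests.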
A century after Louis Sambon first named the species in 1907, the sequencing of the S. mansoni genome is a landmark event. The sequence provides the scientific community with multiple avenues to study this under-researched human pathogen and will drive future evolutionary, genetic and functional genomic research. Not least, given that just one drug is widely available to treat schistosomiasis, the genome sequence, including the genome-mining analysis presented, offers the possibility that new drug candidates will be identified soon.
Mixed sex cercariae from the Puerto Rico isolate of S. mansoni43, released from infected Biomphalaria glabrata snails, were placed in low-melting agarose plugs and genomic DNA prepared by standard methods. Approximately six-fold coverage of the nuclear genome was obtained using a whole genome shotgun sequencing approach where libraries of different cloned insert sizes (in plasmid, fosmid and BAC vectors) were randomly sequenced by Sanger technology from either end. Sequence reads were assembled and scaffolds were FISH mapped to individual chromosomes where possible (Supplementary Table 2). The output of several gene prediction algorithms, trained using 409 manually curated gene structures, was integrated into a single set of gene predictions (v4), which was used for subsequent analyses. Data were accessed via GeneDB (http://www.genedb.org) and Artemis was used for manual annotation and curation of a further 958 genes during these analyses (as described previously44).
Full methods and all associated references are available in the online version of the paper at www.nature.com/nature.
Further details for additional methods used in this study are provided in Supplementary Information.
This genome sequencing and annotation work was funded by the Wellcome Trust [grant number WT085775/Z/08/Z] and NIH-NIAID grant AI48828 to NES. We thank Neil D. Rawlings of the MEROPS database team at the Wellcome Trust Sanger Institute for his invaluable help, Dr. Jean C. Illes for discussions on polarity complexes, and Fransisco Prosdocimi and Maria Rosa Domingo Sananes for early discussions and analyses in the project. FISH chromosome mappings were partially supported by Oyama Health Foundation (H.H.), JSPS (13557021) (H.H.), 21st century COE and global COE of MEXT. Additional support was provided by The Sandler Foundation (C.R.C., M.S.), NIH R01 GM60595 (P.C.B.), NIH-Fogarty 5D43TW006580 (P.T.L.), NIH-Fogarty 5D43TW007012-03, NIH Grant AI054711-01A2 (R.A.W. and G.P.D.), FAPEMIG REDE-281/05 (G.O.), the PhRMA Foundation (Postdoctoral Fellowship in Informatics to S.T.M.), The Burroughs Wellcome Fund (P.T.L.) and The UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases (TDR) (P.T.L.). R.D.M. was a recipient of CAPES and FAPESP fellowships.
Author Contributions: AI, AW, CMFL, DAJ, NMES and PTLV initiated the project; MQ constructed DNA libraries and JP and JR directed sequencing; YG and ZN assembled the genome sequence data; HH, PTLV, RP and YH produced the mapping data; ADe, ADj, ART, BH, DB, DL, GCC, JW, MAA, MAR, MSa, OW, PDA, RH, SLS and TE provided computational and bioinformatic support; ART, MAA and RH setup and maintained the genome database; CM, DB, GB, GCC and JG produced the gene finding training set; BH, MP and MSt trained genefinding software; AP, BB and BH annotated the genome data; AC, AZ, BAL, CRC, DLW, GO, GPD, JPO, LFA, MSa, MZ, PMV, RC, RDM, STM, TAD and WW contributed specific analysis topics presented in this manuscript; CHF and EG contributed to general project and sequencing management; BB, CHF, CRC and JP commented on the manuscript drafts; GCC performed data submission to Genbank; AW, MB, NMES, PTLV drafted and edited the paper; AW, DAJ and PTLV provided DNA resources for the sequencing; MB and NMES directed the project and assembled the manuscript.
Author Information: The annotated genome sequence has been submitted to EMBL with the accession numbers FN357292-FN376313. All data are also available for browsing in the GeneDB database (http://www.genedb.org/genedb/smansoni/). The CHORI BAC clones used in this study are available from http://bacpac.chori.org/.
The authors have no competing financial interests.