(ncRNA) plays a crucial role in gene expression regulation, cellular function and defense, and disease. Indeed, in higher eukaryotes, most of the genomic DNA sequence encodes non-protein-coding transcripts [1]. In contrast to protein-coding mRNAs, ncRNAs do not form a homogeneous class. The best-characterized subclasses form stable basepairing patterns (secondary structures) that are crucial for their function. This group includes the well-known tRNAs, catalytically active RNAs such as rRNA, snRNAs, RNase P RNA, and other ribozymes, and regulatory RNAs such as microRNAs and snoRNAs that direct protein complexes to specific RNA targets. Much less is known about long mRNA-like ncRNAs, which are typically poorly conserved at the level of both sequence and structure.
Most non-vertebrate genome projects have put little emphasis on a comprehensive annotation of ncRNAs. Indeed, most non-coding RNAs, with the notable exception of tRNAs and rRNAs, are difficult or impossible to detect with BLAST in phylogenetically distant organisms. Hence, ncRNA annotation is not part of generic genome annotation pipelines. Dedicated computational searches for particular ncRNAs, for example, RNase P and MRP [2], 7SK RNAs [4], or telomerase RNA [6], are veritable research projects in their own right. Despite best efforts, ncRNAs across the animal phylogeny remain to a large extent uncharted territory.
The main difficulties in ncRNA annotation are poor sequence conservation and indel patterns that often correspond to large additional "expansion domains". In many cases, the secondary structure is much better conserved than the primary sequence, providing a means of confirming candidate ncRNAs even in cases where sequence conservation is confined to a few characteristic motifs. Secondary structure conservation can also be used to detect homologs of some ncRNAs based on characteristic combinations of sequence and structure motifs, using dedicated software tools.
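The idea that structure can be conserved where sequence is not can be illustrated with a small sketch. The code below is not the protocol used in this work; it is a minimal illustration that assumes a candidate sequence has already been aligned, without gaps, to a consensus secondary structure written in dot-bracket notation. The function names (`paired_positions`, `structure_support`, `sequence_identity`) and the toy hairpin sequences are hypothetical and chosen only for this example.

```python
# Minimal sketch (illustrative only): compare sequence identity with the
# fraction of consensus base pairs that a candidate sequence can still form.

# Pairs that can form in RNA helices: Watson-Crick pairs plus the G-U wobble.
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def paired_positions(structure):
    """Return (i, j) index pairs for matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def structure_support(sequence, structure):
    """Fraction of consensus base pairs that the sequence can form."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    pairs = paired_positions(structure)
    if not pairs:
        return 0.0
    seq = sequence.upper().replace("T", "U")  # accept DNA or RNA alphabet
    formed = sum((seq[i], seq[j]) in CANONICAL_PAIRS for i, j in pairs)
    return formed / len(pairs)


def sequence_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy hairpin: a 4-bp stem closing a 4-nt loop (hypothetical sequences).
consensus = "((((....))))"
reference = "GGCAUUCGUGCC"
candidate = "AUGCGAAAGCAU"

print(sequence_identity(reference, candidate))   # 0.0: no position is identical
print(structure_support(candidate, consensus))   # 1.0: every stem pair still forms
```

In this toy case the candidate shares no identical position with the reference, yet compensatory substitutions keep every base pair of the stem intact. This is the kind of signal that sequence-only searches miss. Real homology searches score sequence and structure jointly and must handle gaps and expansion domains, which this sketch deliberately ignores.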
We previously described a protocol for a more detailed homology-based ncRNA annotation than can be achieved with currently available automatic pipelines. Here, we apply this scheme to the genome of S. mansoni and, by comparison with the newly sequenced S. japonicum genome, identify ncRNAs in both of these clinically important schistosomes.
Schistosomes belong to an early-diverging group within the Digenea, but are clearly themselves highly derived [9]. The flatworms are a long-branch group, suggesting rapid mutation rates (see [12]). Schistosome genomes are comparatively large, estimated to be over 350 megabase pairs, and perhaps as high as 400 megabase pairs, for the haploid genome of S. mansoni and S. japonicum. The other major schistosome species parasitizing humans probably have a genome of similar size, based on the similarity in appearance of their karyotypes [16]. These large sizes may be characteristic of platyhelminth genomes in general: the genome of Schmidtea mediterranea is even larger, with the current genome sequencing project reporting a size of ~480 million base pairs [17].
Genome sequencing of the seven autosomes and the pair of sex chromosomes of S. mansoni with about 8× coverage has led to a genome assembly comprising 5,745 scaffolds (> 2 kb) covering 363 Mb [13]. Similarly, shotgun sequencing of S. japonicum with coverage of 5.4× decoded 397 Mb of sequence [15]. These form about 25,000 scaffolds. Although neither genome project produced a complete, finished genome, we now know at least 90-95% of the genomic DNA sequences of S. japonicum and S. mansoni.
The protein-coding portion of the Schistosoma genomes has received much attention in recent years. Published work includes transcriptome databases for both S. japonicum and S. mansoni, microarray-based expression analysis [21], characterization of promoters [22], and physical mapping and annotation of protein-coding genes from both the S. mansoni and S. japonicum genome projects [18]. Recently, a systematic annotation of protein-coding genes in S. japonicum was reported [24]. In contrast to other, better-understood parasites such as Plasmodium, however, not much is known about the non-coding RNA complement of schistosomes. Only the spliced leader RNA (SL RNA) of S. mansoni, the hammerhead ribozymes encoded by the SINE-like retrotransposons Sm-α, and secondary structure elements in the LTR retrotransposon Boudicca have received closer attention. Ribosomal RNA sequences have been available mostly for phylogenetic purposes [30], and tRNAs have been studied to a limited degree [31].
The wealth of available ESTs, in principle, provides a valuable resource for ncRNA detection. Since mostly poly-A ESTs have been generated, it is not surprising that most ESTs have been attributed to protein-coding genes [32]. The large evolutionary distance, with 55% of the genes without homologs outside the genus [13], makes it hard or even impossible to reliably distinguish ESTs of putative mRNA-like ncRNAs from non-coding portions of protein-coding transcripts.
In this contribution we therefore focus on a comprehensive overview of the evolutionarily conserved non-coding RNAs in the genomes of S. mansoni and S. japonicum. We discuss representatives of 23 types of ncRNAs that were detected based on both sequence and secondary structure homology.