In this study, we utilized a strategy of analyzing millions of short reads from next generation sequencing experiments for the prediction of novel ncRNAs of both known and unknown classes. Although the deep-sequencing analyses used in this study focus on identifying shorter ncRNAs such as miRNAs, siRNAs and piRNAs by limiting the lengths of the RNA samples to the sizes of such small ncRNAs, assemblages of contiguously overlapping tags also overlap with longer ncRNAs such as snoRNAs, snRNAs and tRNAs.
TCs derived from two different classes of snoRNAs showed distinct features in their length and tag-depth distributions, and the use of these characteristic features along with the their signature motifs predicted novel snoRNAs. Proof-of-principle of this approach is provided by the successful recall of two previously known but not FlyBase-annotated snoRNAs as well as the de novo
identification of three known ncRNAs (two tRNAs and 7SL RNA) and an endogenous-siRNA cluster. We also found that the majority of experimentally detected snmRNAs [27
] (excluding those that are related to His
clusters) are overlapped by TCs, another demonstration of the validity of the approach. In fact, one TC (chr3R_3300274_3300719) overlapping snmRNA:331
corresponds to the 7SK RNA recently identified in Drosophila
], the boundaries of which fit better to the 5' and 3' ends of the TC than those of snmRNA:331
Characteristic features of snoRNAs were extracted from TCs that cover the full-length of annotated snoRNAs. However, there are also many short TCs partially overlapping with annotated snoRNAs, with strong positional preference in both 5'/3' ends of snoRNAs, which is consistent with the positional preferences of snoRNA-derived small RNAs (sdRNAs) [22
]. These positional preferences were also observed and used for novel snoRNA predictions in the Arabidopsis genome [24
]. We also found that these short TCs within snoRNAs were closely juxtaposed. Thus, more accumulation of deep-sequencing data would be expected to connect these TCs and identify more novel snoRNAs. We also examined the potential of making a simple merge of closely located TCs but this approach was compromised by also merging adjacent snoRNAs. Chen et al
] bypassed this problem in their snoRNA predictions by first anchoring the 3' ends of the novel snoRNA transcripts and then looking for their 5' ends. However, this method cannot be easily generalized for ncRNAs of uncharacterized classes. Alternatively, carefully designed computational approaches using the distribution of short RNA tags across annotated snoRNAs may also increase the number of novel snoRNAs predictions. Our candidates were tested by the snoRNA prediction software SnoReport
, which also refuses to use the modification target information of snoRNAs [40
], but it identified (using the default options) only 3 box H/ACA and 5 box C/D snoRNAs from our 7 and 26 snoRNA candidates, respectively. However, when we tested the performance of SnoReport
on the 115 box H/ACA and 134 box C/D snoRNAs that are annotated in the Drosophila melanogaster
genome, only 59 box H/ACA and 51 box C/D snoRNAs were successfully recalled.
Unlike the prediction of box C/D snoRNAs and putative ncRNAs of uncharacterized classes, the box H/ACA snoRNA prediction incorporated another filter that excluded non-intronic TCs. This was based on the fact that 92% of the known box H/ACA snoRNAs reside in introns, and reduced the number of predictions from 18 (based on tag-contig size, tag depth and presence of the H/ACA motif) to 7. It is uncertain how many of the 10 discarded TCs (excluding one snRNA candidate) may be genuine box H/ACA snoRNAs, but the high validation rate of the intronic subset (4 out of 4 tested) indicates that the incorporation of the location filter improved the specificity of the prediction.
The length and tag-depth distributions of unannotated TCs are similar to those of exon-derived TCs (Additional file 1
, Fig. S9A, B), which may indicate that some unannotated TCs might be assemblages of degradation products of unknown exons. However, it is equally possible that they may also represent degraded or processed fragments of bona fide
ncRNAs that can also be re-assembled, as is evidently the case for snoRNAs. Moreover, the large amount of unannotated TCs located in introns and intergenic regions (Additional file 1
, Fig. S9C, D) indicates that there are many more unknown transcripts yet to be investigated. Considering that we used a conservative threshold of tag-depths (≥100) for uncharacterised ncRNA candidates as the vast majority of exon-derived TCs (99.9%) have tag-depths less than 100, the novel ncRNA candidates shown in this study are just the tip of the iceberg. We tested 10 of the 29 putative ncRNA candidates in group 3, focusing on those that were most highly conserved, 6 of which returned positive signals in a restricted range of cells (see below). However, considering that some ncRNAs evolve at high rate [41
], the untested 19 ncRNA candidates in group 3 could equally likely be novel ncRNAs. Indeed, among the total of 100,193 unannotated TCs, only 26,395 overlap phastCons elements, and, surprisingly, there is no apparent difference in the distributions of lengths and tag-depths between TCs that overlap conserved sequences and those do not (Additional file 1
, Fig. S9E, F). This suggests that while conservation may be used as a positive guide to likely ncRNAs, the relative lack of conservation is not necessarily an index of lack of relevance of others.
In the experimental validation, 15 out of 21 selected candidate TCs were positive - 8 out of 10 selected snRNA and snoRNA candidates (80%) and 7 of 11 (64%) putative unclassified transcripts of either group 1 or group 3. This is a high rate of validation, given that the likelihood of detecting a signal in Northern blots is dependent on the expression level of the candidate in the tissue concerned. In fact, the confirmed candidates have generally greater tag-depths than the unconfirmed (Table ), and they are also more contributed by tags obtained from late embryos or S2 cells (Fig. ) that were the source material for Northern blots. In contrast, among the confirmed candidates, the TCs with weaker signals have a lower number of sequence reads from late embryonic stages than other confirmed candidates (Fig. ).
Figure 5 Number of sequences per million from each experiment for confirmed TCs. (a) heads, (b) adult body, (c) imaginal discs, (d) very early embryo (0-1 h), (e) early embryo (2-6 h), (f) mid embryo (6-10 h), (g) late embryo (12-24 h), (h) larvae: 1st instar (more ...)
In addition, there remain a large set of ~ 27,000 TCs that overlap TCs on complementary strands, which is characteristic of TCs mapping to transposons. They are also closely located to each other (≤100 bp), similar to TCs covering known transposon-derived sequences, and different to the ~ 100,000 TCs which were used for this study. We also observed that a large portion (37%) of these 27,000 TCs is found within reported siRNA/piRNA clusters [8
]. Although some of siRNA or piRNA clusters are not associated with transposons [13
], these preliminary observations indicate that some of these complementary TCs may be derived from unidentified transposons. In fact, about five thousand TCs in this set either slightly overlap with or are located close to (≤100 bp) existing transposons have sequences homologous to transposons, suggesting they could be unannotated parts of existing transposons generating siRNAs or piRNAs.