|Home | About | Journals | Submit | Contact Us | Français|
Small nucleolar RNAs (snoRNAs) are a class of non-coding RNAs that guide the post-transcriptional processing of other non-coding RNAs (mostly ribosomal RNAs), but have also been implicated in processes ranging from microRNA-dependent gene silencing to alternative splicing. In order to construct an up-to-date catalog of human snoRNAs we have combined data from various databases, de novo prediction and extensive literature review. In total, we list more than 750 curated genomic loci that give rise to snoRNA and snoRNA-like genes. Utilizing small RNA-seq data from the ENCODE project, our study characterizes the plasticity of snoRNA expression identifying both constitutively as well as cell type specific expressed snoRNAs. Especially, the comparison of malignant to non-malignant tissues and cell types shows a dramatic perturbation of the snoRNA expression profile. Finally, we developed a high-throughput variant of the reverse-transcriptase-based method for identifying 2′-O-methyl modifications in RNAs termed RimSeq. Using the data from this and other high-throughput protocols together with previously reported modification sites and state-of-the-art target prediction methods we re-estimate the snoRNA target RNA interaction network. Our current results assign a reliable modification site to 83% of the canonical snoRNAs, leaving only 76 snoRNA sequences as orphan.
SnoRNAs form a specific class of small (60–170 nucleotides, with few exceptions (1)) non-protein coding RNAs that is best known for guiding post-transcriptional modification of other non-protein coding RNAs such as ribosomal and small nuclear RNAs (rRNAs, snRNAs respectively) (2–7). Based on defined sequence motifs and secondary structure elements, snoRNAs are classified as either C/D box or H/ACA box.
C/D box snoRNAs guide 2′-O-methylation and H/ACA snoRNAs pseudouridylation of nucleotides on target molecules. The C box (RUGAUGA, R = A or G) and D box (CUGA) sequence motifs of C/D box snoRNAs, are brought into close proximity when the 5′ and 3′ ends of the molecule fold into a stem structure, forming a kink-turn (8,9). Most C/D box snoRNAs have additional, less conserved, C and D box motifs, the C' and D' boxes, in the central region of the snoRNA. C/D box snoRNAs carry out their function within ribonucleoprotein (RNP) complexes that additionally contain the 15.5K, NOP56, NOP58 and fibrillarin proteins (10,11), the latter catalysing 2′-O-methylation of ribose molecules in the target RNAs (12). Which nucleotide undergoes this modification is determined by the complementarity to the 7 to 21 nucleotides (nt) guide region that is located upstream of the D or D' box: the 5th nucleotide upstream of the D/D' box will undergo the 2′-O-methylation (13–15).
H/ACA box snoRNAs adopt a well-defined secondary structure consisting of two hairpins that are joined by a single-stranded region known as the H box (ANANNA, N = A, C, G or U) and further have an ACA box (AYA, Y = C or U) motif at the 3′ end (16,17). The H/ACA snoRNPs contain the H/ACA snoRNA and a set of four proteins, Dyskerin, Nhp2, Nop10 and Gar1, with Dyskerin acting as the pseudouridine synthase (18). Target recognition by H/ACA box snoRNAs also involves RNA-RNA interactions, of single-stranded regions within interior loops of the two hairpin structures in the snoRNA with the target RNA (19,20).
Canonical snoRNAs accumulate in the nucleolus, the primary site of ribosome synthesis. ScaRNAs (small Cajal body-specific RNAs), are a specific subset of snoRNAs that guide spliceosomal RNA modifications. They are enriched in the Cajal bodies, where the last steps of of spliceosomal RNA biogenesis take place (3). The import of snoRNAs into Cajal bodies is guided by specific sequence motifs, which for H/ACA box snoRNAs are the CAB boxes (UGAG) located in the hairpin loops of the two stem structures (21), whereas C/D box snoRNAs have a long UG dinucleotide repeat element (22). There is evidence that both motifs are recognized by the WDR79 protein, which facilitates the transport to Cajal bodies (22,23). Aside from these snoRNAs that have canonical structures, some long scaRNAs with hybrid structures that are able to function in both methylation and pseudouridylation, have been characterized (1,3). Moreover, the primate-specific Alu repeat elements can give rise to H/ACA box-like snoRNAs; these were coined AluACA RNAs and seem to accumulate in the nucleoplasm (24). The RNA component of the animal (but not of fungal or of other groups of eukaryotes) telomerase RNP (TERC) contains an H/ACA box snoRNA-like domain (25–29), which harbors a CAB box (30) and is essential for telomerase activity (25).
SnoRNAs can guide other types of RNA processing, beyond methylation and pseudouridylation (see ref. (31) for a recent review). For example, SNORD22, SNORD14, SNORD13, SNORD3 and SNORD118 are involved in the processing of ribosomal RNA precursors (32). Even though they have C and D box motifs, these snoRNAs do not seem to undergo the terminal end trimming that is characteristic to C/D box snoRNAs (33). This suggests that additional proteins probably assist these snoRNAs in their function, at the same time preventing the usual C/D box-specific trimming. Some evidences suggest specific functions for snoRNAs encoded in the imprinted 15q11-q13 region: the brain-specific C/D box SNORD115 family regulates the alternative splicing of the serotonin receptor 5-HT(2C) mRNA (34,35), and SNORD116 family members are part of longer RNAs that sequester the Fox family of splicing regulators (36). Many C/D box as well as H/ACA box snoRNAs seem to undergo some kind of processing, yielding smaller fragments whose function remains elusive (33,37). An exception is the H/ACA box snoRNA SCARNA15 whose microRNA-like function has been well documented (38). Whether this function can be more generally carried out by other snoRNAs remains unknown. Recent high-throughput sequencing-based studies identified C/D box-like snoRNAs either as short as 27 nucleotides (33), barely able to host an antisense region. Further there are long non-coding RNAs with snoRNA ends (sno-lncRNAs) described in (36,39). A summary of the currently known structural types of snoRNA is shown in Figure Figure11.
Despite a few genome-wide surveys, recent studies (22,33) have clearly demonstrated that the catalog of human snoRNA loci is far from complete. The snoRNA data resources (40,41) that used to be standard in the field have either ceased to exist or to be updated, as the focus of the research community has moved towards characterization of snoRNA genes in species other than human (42–46). A recent attempt to improve the accuracy of snoRNA gene annotation (47) clearly demonstrated that a well designed, uniform analysis strategy is needed to expand the catalog of snoRNAs while maintaining annotation accuracy. In this study we have taken a comprehensive approach, combining both: analysis of large-scale data generated by the ENCODE consortium data, as well as developing novel experimental methodology to construct an up-to-date catalog of snoRNA loci in the human genome. Furthermore we characterize their processing patterns, expression profiles across tissues, as well as their potential targets. The data collected in this study is publicly accessible via http://www.bioinf.uni-leipzig.de/publications/supplements/15-065.
A list of snoRNA genes currently annotated by HGNC was obtained from www.genenames.org (3 March 2014) and the corresponding sequence entries were retrieved from the NCBI Nucleotide database via accession numbers as identifiers. Retrieved sequences were then mapped to the hg19 human genome with BLAT to infer their genomic loci. To annotate the genomic coordinates of mature snoRNA genes, we took advantage of the massive sRNA-seq data produced by the ENCODE Consortium (48). We retrieved the BAM files containing the genomic loci of the reads from 114 sRNA-seq data sets (read length of 101 nt) from the UCSC ENCODE analysis hub (http://genome.ucsc.edu/ENCODE).
To select reads that could support mature snoRNA genes, we used the following criteria: first, we required that either the sRNA-seq read covers at least 75% of a snoRNA gene or the sRNA-seq read was longer than 90 nt, for cases (especially H/ACA box snoRNAs) where the length of the snoRNA gene was presumably too long to be covered in full by the sRNA-seq reads. Second, we required that the first and last genomic positions where the sRNA-seq read mapped were at most 5 nt away from the start and end position of the annotated snoRNA gene to which the read mapped. After thus identifying sRNA-seq reads associated with individual snoRNA genes, we redefined the boundaries of the mature snoRNA forms as the positions where most of the sRNA-reads associated with the locus started or ended, respectively. For snoRNA loci with too few sRNA-seq supporting reads, we manually curated the genomic coordinates of the mature forms based on the sRNA-seq reads profile and inspection of box motifs and secondary structure (see Supplementary Dataset S1). To further validate this procedure, we examined the distance between the 5′ and 3′ ends and the C and D box motifs, respectively. We found that, as shown before (33), the 5′ end of C/D box snoRNA was located 4–5 nt upstream of the C box motif, and the 3′ end at most 5 nt downstream of the D box motif. In turn, we used this information as another indication for curating the 5′ and 3′ end coordinates of the mature snoRNAs for which the sRNA-seq data did not sufficiently or completely cover the loci.
To uncover additional snoRNA genes that have supporting expression evidence, we first collected predictions of two computational tools, snoSeeker (49) and snoReport (50), that have been specifically designed to predict snoRNA genes. Due to the high computational demand of these tools, we restricted the search space to genomic regions that were supported by at least five reads in the combined set of sRNA-seq samples and extended these loci by 20 nt from the 5′ end and 100 nt from the 3′ end. The predictions of snoSeeker and snoReport were pooled and candidate snoRNA genes overlapping with already annotated snoRNA genes were removed. This step yielded 820,835 putative C/D box snoRNA loci and 316,076 H/ACA box snoRNA loci.
Because the sequence and structure constraints on snoRNAs appear to be weaker compared to, for example, tRNAs, we expect a higher false-positive rate of prediction for snoRNAs compared to tRNAs. Here we used the observation that C/D box snoRNAs undergo precise processing which leaves only 4–5 nt upstream of the C box, and 2–5 nt downstream of the D box (33) to further validate the C/D box snoRNA prediction. Small RNA-seq reads that mapped to C/D box snoRNA loci were considered ‘supportive’ of a snoRNA mature form if the 5′ end of the read was located 4–5 nt upstream of the inferred C box and the 3′ end of the read was located 2–5 nt downstream of the D box. For C/D box snoRNA genes with a predicted length of more than 100 nt, we could only enforce that the 5′ end is processed as expected, but we required that the sRNA-seq reads cover at least 75% of the length of the predicted snoRNA gene or are at least 90 nt in length. For H/ACA box snoRNAs, a read was labelled as supportive if the 5′ end of the read was located ±5 nt around the predicted 5′ end of the snoRNA locus, and the read either covered at least 75% of the length of the snoRNA locus or was at least 90 nt in length. 8,000 predicted C/D box snoRNAs and 7772 predicted H/ACA box snoRNAs had at least one supportive read, but only 121 and 114, respectively, remained when we required at least 1,000 supportive reads (corresponding to 0.087 TPM) in the entire data set. We chose this cut-off because more than 98% of already annotated snoRNAs in HUGO pass this cut-off. In the next step, candidate snoRNA loci were filtered for redundancy and loci overlapping with predictions obtained from deepBase, a survey of the human genome using snoStrip with known vertebrate snoRNAs as query, and GENCODE were removed. Finally, we removed candidates that overlapped with repeat annotation with more than 25% of their length and discarded those that did not have support by uniquely mapped reads. In the end, our de novo prediction yielded 17, 41 and 21 H/ACA box, C/D box and SNORD-like snoRNA loci, respectively. SNORD-like snoRNAs are non-canonical type of C/D box snoRNAs which are shorter than 50 nt in length and hence lack a functional C' and D' box. These putative snoRNAs can be found in Supplementary Dataset S1, filed as ‘de novo’.
In previous work (33), we found that core snoRNP proteins bind snoRNA-like RNAs, that are not reported in snoRNA databases. To capture these cases, we carried out a genome-wide scan for C/D box-like molecules that are supported by sRNA-seq evidence. We started from genomic regions defined by a degenerate C box (TGATGA, TGGTGA, TGATGT, TGATGC or TGTTGA) and a D box (CTGA or ATGA) separated by 10–90 nts. After applying filtering steps as done for canonical C/D box snoRNAs, we obtained 77 C/D-box like candidates that have at least 1,000 supportive reads in the sRNA-seq data, from which 38 are SNORD-like (shorter than 50 nt). These are marked as ‘C/D-box-like’ in Supplementary Dataset S1.
SnoRNA target prediction was performed using following data sources: the set of human snoRNA sequences recovered in this study, human ribosomal RNAs sequences (18S (X03205), 28S (U13369 nts 7935–12969), 5.8S (U13369 nts 6623–6779)) (40,41) and sequences of spliceosomal RNAs (U1, U2, U4, U4atac, U6, U6atac, U11, U12) (51) as target RNAs. Experimentally confirmed modification sites were obtained from literature (40,41,52–58), and from a recent high throughput study (59) for pseudouridine sites and from the newly developed RimSeq method for 2′-O-methylation sites.
At first, we predicted putative targets on human rRNAs and snRNAs using RNAsnoop (60) and Plexy (61) for human H/ACA box and human C/D box snoRNAs, respectively. Precomputed RNAup structure profiles of target RNAs (62) were provided to refine interaction predictions with RNAsnoop. Additionally, we used signs of evolutionary conservation as supporting evidence for putative snoRNA–target interactions. To that aim a set of annotated homologous snoRNA sequences and their predicted interactions in deuterostomian species, which were computed with the snoRNA annotation pipeline snoStrip (63), was used. To avoid contamination with repetitive sequences we excluded snoRNA genes overlapping with regions of the UCSC-Repeat-masker track (9 January 2015) from conservation analysis. Subsequently, the Interaction Conservation Index (ICI) (64) was computed for all snoRNA–target RNA interactions.
Information about target sites was gathered with respect to three categories for each snoRNA anti-sense element. First, any previously reported target site (r). Second, the best scoring human target prediction (h1) within the set of human target predictions considering the minimum free energy of the snoRNA–target RNA interaction duplex. And third, the best scoring conserved target prediction (c1) within the set of conserved interactions evaluated by the Interaction Conservation Index. The final assignment of an snoRNA anti-sense element to a target site was based on following rules:
Selected interactions are accepted, if the interaction is well conserved in deuterostomes with an ICI > 1.0 for box C/D box snoRNAs and an ICI > 0.8 for H/ACA box snoRNAs (see (64) for information on these thresholds). A predicted interaction is classified as highly confident if the resulting modification overlaps a confirmed modified position, that has been identified by a high-throughput approach, or has been reported in literature.
To identify 2′-O-methylated residues transcriptome-wide we adapted a well-established, low-throughput reverse transcriptase-based protocol (65), which is usually coupled with polyacrylamide gel analysis, and modified it to a high-throughput sequencing protocol. The method is based on the observation that cDNA synthesis is noticeably impaired in the presence of a 2′-O-methyl when deoxynucleotide triphosphate fragments (dNTPs) are limiting (65,66), giving rise to a characteristic pattern of gel banding immediately preceding the 2′-O-methyls, with strong bands at low dNTP concentrations (0.004 mM) (66), becoming weaker with increasing concentrations of dNTPs. These stoppages, which correspond to the position of a 2′-O-methylation site, will generate read ends when RNA fragments are reverse-transcribed under different dNTP concentrations, ligated to adapters and sequenced. 2′-O-methyl positions can be subsequently identified by calculating the ratio of reads starting at given position (5′ ends) to the reads covering it (readthrough reads + 5′ ends) and comparing this ratio to the control. The exact procedure can be found in Supplementary Text S1.
The expression level of a given snoRNA in a sample was calculated based on the total number of reads (uniquely and multi-mapping) from that sample that overlapped with the snoRNA locus. The normalization of read counts was done relative to the total number of reads obtained in the sample. The ENCODE project generated sRNA-seq samples from a range of cell types, both normal and malignant, as well as from distinct sub-cellular compartments (‘Cell’, ‘Cytosol’, ‘Chromatin’, ‘Nucleus’ and ‘Nucleolus’). Furthermore, to capture various types of small RNAs, the RNA was subjected to various treatments (tobacco acid phosphatase (‘TAP’) to remove cap structures, calf intestinal phosphatase and TAP (‘CIP-TAP’) to further remove 5′ and 3′ phosphates, as well as left untreated ‘No treatment’)). Unsurprisingly, hierarchical clustering of expression levels of snoRNA in the ENCODE samples revealed a strong dependency on the cellular department and the library preparation procedure (Supplementary Figure S1). Consequently, we restricted our analysis of snoRNA expression to samples that were obtained from the cellular compartment ‘cell’ with the TAP-only treatment, as these two factors covered the largest variety of cell types.
SnoRNAs that were more than 80% identical over their entire length to each other were grouped into a snoRNA ‘family’ (see Supplementary Dataset S2 for a list of snoRNAs and their corresponding cluster representatives). The expression level of a cluster representative was defined as the average expression level of all snoRNAs associated with that cluster. When replicates were available, we further averaged expression over replicates. The specificity of expression and the specificity of processing of individual snoRNAs was calculated as follows. We first computed the relative frequency of each snoRNA in the pool of snoRNAs in a given sample. Next, the specificity score S defined as
was calculated where pi is the normalized frequency of the snoRNA in sample i. The specificity score is maximal when the snoRNA is expressed in a single sample and minimal when the relative frequency of the snoRNA is the same across all samples (Supplementary Figure S2).
To directly compare snoRNA expression between normal and malignant cells, we averaged the snoRNA expression separately over normal and malignant cell types. The ratio of these quantities gives the fold-change of expression between normal and malignant cells.
To determine whether processed fragments are generated in a cell type-specific manner, we first separated the reads into those that correspond to the mature snoRNA and to shorter processed products. Because the sRNA-seq samples should in principle contain only full-length RNAs and based on the length distribution of snoRNAs (Supplementary Figure S3), we chose a maximum length of 40 nt for a read to be considered as corresponding to a processed RNA. This is consistent with the length of snoRNA-derived fragments that was reported before (33,37,67,68). Next, we calculated the proportion of processed reads among all reads associated with the snoRNA. Finally, we calculated a specificity score of snoRNA expression or of processing across tissues as described above for the specificity of snoRNA expression (Supplementary Dataset S3 and Supplementary Figure S4).
In contrast to other types of molecules such as mRNAs and microRNAs, relatively few studies attempted to sequence the full complement of mature human snoRNAs. Thus, the annotation of human snoRNA genes frequently started from computational predictions. Especially in the case of C/D box snoRNAs a consistent procedure for defining the 5′ and 3′ ends of their mature forms is lacking, and different pragmatic definitions such as the longest terminal stem, the longest evolutionarily conserved terminal stem, or the experimentally determined ends were frequently used. However, the data that we obtained in a recent study indicated that C/D box snoRNAs undergo uniform trimming at both the 5′ and the 3′ end (33), irrespective of the length of the terminal stem. Here, we use this observation to provide a complete catalog of curated mature snoRNA 5′ and 3′ ends based on small RNA sequencing data sets.
We first retrieved the 289 C/D box snoRNA, 136 H/ACA box snoRNA and 27 scaRNA genes that were annotated by the HUGO Gene Nomenclature Committee (HGNC) at the time when our study was initiated and mapped them to the human genome (hg19 assembly version from the University of California, Santa Cruz). We further obtained the genomic coordinates of small RNA sequencing reads from 114 data sets that were generated by the ENCODE consortium (48). Intersecting the loci of sequenced small RNAs with those of the HGNC snoRNAs, we identified, for each snoRNA gene, the 5′ and 3′ ends that were most represented among the small RNA sequencing (sRNA-seq) reads (see Materials and Methods for details). We found that these data confirmed the processing pattern that we described previously (33,69), namely that the 5′ end of the mature C/D box snoRNAs is located 4–5 nt upstream of the C box motif and the 3′ end is located up to 5 nt downstream of the D box motif (see Supplementary Figure S5). The curated loci of the mature HGNC snoRNAs are compiled in Supplementary Dataset S1 and Supplementary Dataset S4. For some snoRNAs e.g. SCARNA21, SNORD11B or SNORA58 the sequence inferred from the small RNA sequencing data differed considerably from the sequence defined by the HGNC. Other snoRNAs for which the curated coordinates differed significantly from their known annotation are SNORD81, SNORD49B, SNORD126, SNORD125, SNORD123, SNORD121A, SNORD11B, SNORD127, SNORD58C, SNORD12B, SNORD111B, SNORD105B, SNORD124, SNORD90, SNORD105 and SNORD70. Supplementary Dataset S5 contains visualizations of snoRNA loci including the HGNC sequence, the sRNA-seq read profile along these loci and the 5′ and 3′ ends that were inferred based on the sRNA-seq data. Inspection of sRNA-seq read profiles of three snoRNAs that were annotated in HGNC as H/ACA box snoRNAs (SNORA85, SNORA96, SNORA97) revealed that they are in fact C/D box snoRNAs (named ZL68, ZL5 and ZL6 in (33)) with slightly altered positions. These revised snoRNA sequences are now assigned the gene symbols SNORD142, SNORD143 and SNORD144.
To further update the human snoRNA catalog, we integrated data from several sources including a de novo genome-wide search (outlined in detail in Figure Figure2).2). Specifically, we collected snoRNAs from RFAM-based predictions that were generated by the GENCODE consortium (70), from deepbase (71), and from a snoStrip (63) dataset in deuterostomes (64) (see Materials and Methods for details). Additionally, we performed a genome-wide de novo screen by the workflow summarized in Figure Figure2.2. Due to the high computational demand of snoRNA gene finding programs, we restricted our analysis to genomic regions that showed signs of expression in the sRNA-seq data set generated by the ENCODE consortium. Finally, we used the snoReport (50) and snoSeeker (49) software to screen the extracted genomic regions for potential snoRNA genes. Additionally, we implemented a search algorithm that screens for potential C/D box-like snoRNA genes (33) (see Materials and Methods for a detailed description). Due to the vast number of potential snoRNA candidates collected from all these sources, we consolidated these initial candidates to a non-redundant set of putative snoRNA loci and excluded those that overlapped with repeat-annotated genomic regions. Furthermore, we defined a set of strict rules to identify snoRNA candidates whose expression as mature forms was strongly supported by the sRNA-seq data (see Materials and Methods). Finally, we screened and added snoRNAs from recently published literature (24,33,39,72). This analysis yielded more than 160 canonical human snoRNAs that are currently not covered by the human gene annotation (Table (Table11 and Supplementary Dataset S1). In order to distinguish candidates which have relatively close homologs among the already known snoRNAs, we used the Infernal software and RFAM sequence-structure models (73,74) to assign each snoRNA to the family with the closest homology and a P-value lower than 10−5. Table Table11 summarizes the results of these analyses. Finally, we applied at the HGNC for gene names for those snoRNA candidates that showed evolutionary conservation in hominoids and beyond, contained all expected sequence motifs, were found to be expressed as full-length snoRNAs in human and that folded into a canonical structure (H box, ACA box and hairpin-hinge-hairpin-tail structure for H/ACA box snoRNAs, and C box, D box, the typical kink-turn formed by these boxes, and a terminal stem of at least 2 bps for C/D box snoRNAs). For most of the novel snoRNAs, conservation analysis performed with snoStrip (63) could only recover homologs in primates (93 C/D and 8 H/ACA). For 13 C/D box snoRNAs and 7 H/ACA box snoRNAs no homologs at all could be retrieved. Reliably determining if these snoRNAs are indeed evolutionary new inventions, specific to human and primates is beyond the current methodology.
The primary function of snoRNAs is to guide the modification of specific sites in ribosomal and spliceosomal RNAs. There are, however, some well documented snoRNAs with non-canonical function, like SNORD115 that has been reported to regulate alternative splicing (75) or SNORD116 that forms the ends of longer non-coding RNAs (36). To provide an up-to-date annotation of the targets in our snoRNA catalog, we here combined target predictions based on state-of-the-art computational methods (31) with experimental data on snoRNA-guided RNA modifications. The computational target prediction follows three main steps. First, RNAsnoop (60) and Plexy (61) are used to predict human targets based on primary sequence features, secondary structure of the snoRNA, the accessibility of the target region, and the predicted minimum free energy of the snoRNA–target duplex. In a second step the evolutionary conservation of the predicted interaction within vertebrates is evaluated using the Interaction Conservation Index (ICI) (64)). In brief, the ICI combines stability of an individual interaction between snoRNA and target RNA within a single species with the range of conservation of the equivalent interaction among species for which a homologous snoRNA exists. Roughly, an ICI score > 1 can be interpreted as the specific interaction being better than alternative predictions in all species where a snoRNA homolog is present. We also used a coarse-grained encoding of the conservation termed ‘levelC’ that indicates the depth of conservation in the phylogenetic tree of eukaryotes. Lastly, we identified the highest-confidence interactions among the predicted interactions as those interactions, for which a corresponding snoRNA-guided RNA modification has also been reported in human. Data on snoRNA-guided modifications was gathered from snoRNAbase and available literature, or from very recently conducted experiments that were designed to identify RNA modifications in high-throughput. In particular, we obtained data on pseudouridine modifications to validate predicted interactions of H/ACA box snoRNAs from two studies (59,76). For 2′-O-methylation sites, however, no such high-throughput data exists. To fill this gap, we developed a novel method termed RimSeq, which ports the principles of 2′-O-methylation site identification used in primer extension assays (77–79) to a high-throughput approach using next generation sequencing. A detailed description and evaluation of the RimSeq procedure is outlined in Supplementary Text S1 with inferred modifications sites being displayed in Supplementary Dataset S6. Using the computational predictions and the data obtained from high-throughput experiments and modifications reported in literature we could identify ten novel high confidence interactions between snoRNAs and target molecules. For two target sites whose methylation has been reported to be guided by a known snoRNA we predicted an additional guiding snoRNA: the D'-box ASE of SNORD136 for 18S-683, and snoID_0337 for 18S-1326. Additionally, the methylations that were experimentally identified at 18S-1606 and 18S-1410 could be assigned to previously considered orphan snoRNAs SNORD73A/B and to novel snoRNA snoID_0340, respectively. Guiding H/ACA box snoRNAs could be assigned to two previously mapped pseudouridylation sites, 18S-681 and 28S-4266. Concerning the pseudouridylation sites that emerged from high-throughput data, we could predict guiding snoRNAs in three (18S-1046, 18S-1232, and 28S-2619) out of the four cases; we could not identify a guiding snoRNA for the pseudouridine at position 1177 in human 18S rRNA reported by Carlilie et al. (59). Details of this analysis are summarized in Table Table22 (see Supplementary Dataset S6 for a full listing).
For C/D box snoRNA target prediction we excluded the SNORD3 and SNORD13 snoRNA families that have established non-canonical functions in pre-rRNA cleavage (80,81). Hence, we obtained a total of 393 snoRNA sequences, of which 275 are canonical C/D box snoRNAs and 118 are members of the multi-copy (mc) gene families SNORD113, SNORD114, SNORD115, and SNORD116. For the majority (~83%) of these sequences we could annotate both a D and a D' box sequence motif (Table (Table3).3). In contrast, only a few (~21%) of the C/D box-like snoRNAs appear to possess both D and D' boxes. In many cases the D' box could not be reliably annotated either due to the short length of these snoRNA like genes or the lack of evolutionary conservation or sequence motif signals.
In total, we applied target prediction methods to 863 = (25 + 113 + 216) × 2 + (91 + 5 + 59) anti-sense elements (ASEs) covering all cataloged C/D box and C/D box-like snoRNAs. The snoRNA target prediction results are listed in detail in Supplementary Dataset S1. Table Table22 depicts high-confidence interactions, for which additional experimental evidence of RNA modification is available. Summarizing results obtained from target prediction and reported interactions, we can currently associate more than two thirds (~70%) of the C/D box snoRNAs with a specific rRNA or snRNA target. However, 118 C/D box snoRNA genes remain classified as orphan. Interestingly, the vast majority of these (91) were found to have two ASEs. Here, the question remains if these snoRNAs interact with their target RNAs in a way that fails to be recognized by our computational target prediction methods or if these snoRNAs execute entirely biologically different functions than guiding modifications. Excluding the multi-copy (mc) snoRNA genes, 48 C/D box snoRNAs with canonical features remain without a predicted or known target in rRNA or snRNA (Figure (Figure3A).3A). Of these, SNORD97 is reported as enriched in chromatin-associated RNAs (caRNAs) (82). Because a detailed analysis of the mc snoRNA families did not reveal convincing target predictions, we excluded these families from further analyses. Although most of the known and novel C/D box snoRNAs have both D and D' boxes, only a minority of those indeed interact with targets at both anti-sense elements (Figure (Figure3A3A).
For H/ACA box snoRNA target prediction, we only used canonical genes and excluded those sequences encoded within Alu repeats (ALUACAs), for which evolutionary conservation information cannot be reliably obtained. A canonical H/ACA box snoRNA forms a double stem-loop structure each possessing a respective ASE in its interior loop. Therefore, a total of 380 (2 x 190) ASEs (Table (Table3)3) were considered for target prediction. In total, our analysis of reported and predicted targets associated ~85% of the H/ACA box snoRNA sequences with at least one target Uracil in an rRNA or snRNA. About 55% of the guiding snoRNAs have targets for both ASEs, while the remaining appear to guide pseudouridylation at one site only. In the majority of these cases, the target site is predicted to interact with the 5′ stem (Figure (Figure3B).3B). Among the 30 snoRNAs for which no canonical targets were reported or predicted is SNORA73A/B. This is in agreement with the reported non-canonical interaction of the yeast homolog snR30 with the 18S RNA (83). The conserved potential for base-pairing of these molecules suggests that the mechanism is well conserved to vertebrates. Furthermore, there is evidence that SNORA73A functions as a putative regulator of chromatin function (82).
Finally, we summarized the evidence and features used to infer snoRNA–target RNA interaction sites as heatmaps depicted in Figure Figure44 (see Supplementary Figure S6 for a high resolution version). Blue and red colors indicate low and high evidence for the interaction, respectively. It is apparent that after our analysis only a small fraction of snoRNAs remains orphan, which is indicated by the blue color in the column ‘reported’ and by the low value of the Interaction Conservation Index (ICI, see Materials and Methods) for both ASEs. Several interactions, mainly for the newly identified snoRNAs, seem to be primate specific (column ‘levelC’: blue and column ‘ICI’: white/red). Interestingly, C/D box snoRNAs seem to have a single-guide tendency (column ‘ICI’ is white/red for either D or D’ box, but relatively rarely for both). For the 59 snoRNAs for which we could not identify a D' box, the classification as single-, double-guide or orphan snoRNA remains preliminary (grey cells on D' box side). Although the majority of C/D box snoRNAs encode both a D and a D’ box and have associated ASEs, for only ~17% we predicted high-scoring interactions for both ASEs. Among single-guide C/D box snoRNAs, the predicted interaction preferentially involves the D’ box-associated ASE (96 cases vs. 56 with guiding at the D box-associated ASE). This is in strong contrast to the pattern of evolutionary conservation, since the D box generally shows stronger conservation. H/ACA box snoRNAs, however, are predominantly (56%) double guiding. For those with one guiding ASE, the ASE is preferentially located in the 5′ stem (45 of cases compared to 26 that have the single guiding ASE in the 3′ stem).
The human scaRNAs can be grouped into tandem C/D box (4), tandem H/ACA box (1), hybrids of C/D and H/ACA boxes (5), canonical C/D box (2) and canonical H/ACA box (17). Thus, the pool of scaRNAs can potentially interact with target RNAs at 78 = (4 + 1 + 5) × 4 + (2 + 17) × 2 sites. Due to their intricate structure, we could not reliably annotate all potential ASEs for six scaRNAs, leaving a total of 71 ASEs that were subjected to further analysis. An evolutionarily conserved target could be recovered for 43 cases including seven sites that are newly predicted. In particular, the elongated isoform of SCARNA21, an H/ACA box snoRNA embedded in a C/D box snoRNA, was found to harbor three additional functional ASEs (Figure (Figure55 and Supplementary Text S2). Most intriguingly, the snRNA U12 residue targeted by the 5′ ASE of the H/ACA domain of this scaRNA is directly adjacent to the newly predicted target of D' box-associated ASE. The 3′ ASE of the H/ACA part is predicted to guide a modification in the U6atac snRNA, which is part of the minor spliceosome, as is the U12 snRNA. Thus, our results suggest an important role of SCARNA21 in the maturation of snRNAs of the minor spliceosome.
The plasticity of snoRNA expression across cell types has been relatively poorly studied, although changes in snoRNA expression have been observed in cancers (84). Due to the diverse set of both normal and malignant cell types profiled by the ENCODE consortium, this data set constitutes an excellent source to study cell type specific expression of snoRNAs in detail. Our analysis revealed that the pool of both H/ACA box and the C/D box snoRNAs is dominated by a few abundantly expressed snoRNAs (Figure (Figure6A).6A). As an illustration, 21 C/D box and 18 H/ACA box snoRNAs account for more than 80% of sRNA-seq reads captured for the respective snoRNA class. Of these abundantly expressed snoRNAs, only two of the C/D box family (SNORD83A and SNORD64) and only two of the H/ACA family (SNORA11 and SNORA51) lack well confirmed target sites on ribosomal RNAs and snRNAs. However, we previously predicted that SNORD83A targets 18S-468 (64), a site also known to be modified, whereas here we further predicted that SNORD64 targets U1-53. A conserved interaction between SNORA51 and 28S-1849 was predicted in (64), and SNORA11 appears to target 18S-1350. (see also Supplementary Dataset S1). The abundantly expressed snoRNAs thus appear to target largely rRNAs, and may therefore be essential for ribosome biogenesis. Consistently, these snoRNAs also show little variation in expression across cell types (Figure (Figure6B6B denoted by red stars; high resolution versions of these figures including gene names can be found in Supplementary Figure S2). Some snoRNAs do exhibit cell type-specific, particularly neuronal expression. Among them are the neuron-specific orphan SNORD115 and SNORD116 families (34,75,85) as well as snoRNAs with canonical ribosomal targets such as SNORD100 and SNORD33. The H/ACA box SNORA35 (86), which has the strongest cell type specificity among the H/ACA box snoRNAs, is predicted to target 18S-566 through the 5′ ASE and U7-7 through the 3′ ASE. A comprehensive list of snoRNAs that show cell type specific expression can be found in Table Table44.
Hierarchical clustering of a subset of sRNA-seq samples that have been generated from decapped (tobacco acid phosphatase (TAP)-treated) RNAs isolated from whole cells (Supplementary Figure S7), revealed a striking separation of normal and malignant cell lines. Several snoRNAs seem to be differentially expressed in all cancer cell lines compared to cells of non-malignant origin, consistent with the results of prior studies that identified snoRNAs as putative cancer biomarkers (87–91). Our results also parallel a recent finding of increased expression of a specific set of tRNAs in cancers, with possible consequences on translation in these cells (92). As an entry point into investigations into cancer-associated snoRNAs we compiled the list of snoRNAs with the most significant differential expression in cancer cell lines (see Supplementary Tables S1, S2A and S2B).
Among non-malignant cells and tissues, we found that cells of neuronal origin form one cluster, due to a relatively large number of neuron-specific snoRNAs. Other cell types show more similar profiles, although the mammary gland and hematopoietic cell types tend to cluster closer together, as do the muscle and adipose tissue. The remaining cell types (melanocytes, fibroblasts, osteoblasts, chondrocytes and placental tissue) form one big cluster with no clear boundaries (Supplementary Figures S7 and S8).
Several studies described snoRNA-derived fragments and suggested that, with some exceptions, the pattern of processing is conserved across snoRNAs and tissues (37,67). Furthermore, various groups proposed that snoRNA-derived fragments may have non-canonical functions (37,38,68,75,93–97). We asked whether the relative proportion of short (less than 40 nt) snoRNA-derived fragments differs between snoRNAs and whether it differs across cell types (see Materials and Methods) for a given snoRNA. We found that the majority of C/D box snoRNAs (75%) are found predominantly as mature forms in the data. That is, the proportion of processing products is <50% of the reads associated with the snoRNA. The cumulative distribution of this proportion is shown in Supplementary Figure S9. Furthermore, we found only minor differences in this proportion across the tissues where the snoRNAs are expressed. Notable exceptions are the SNORD115, 116, 113 and 114 families. A group of snoRNA comprising SNORD50, SNORD19, SNORD32B, SNORD123, SNORD111, SNORD72, SNORD93, SNORD23 and SNORD85 gives rise to over 90% of the processed fragments. However, we did not find evidence that the frequency of shorter forms is cell type-specific (Supplementary Dataset S3 and Supplementary Figure S4).
Among the small RNAs, snoRNAs have a relatively long history, going back to the late 1960s (98), and several hundred snoRNA genes have been catalogued. snoRNA-LBME-db (https://www-snorna.biotoul.fr/) is an outstanding resource in this domain (41), providing detailed information to more than 361 snoRNA genes and their target RNAs. This database has, however, not been updated lately and is missing out on the technological advances of deep sequencing. Indeed, the wide availability of deep sequencing technologies has prompted thorough investigations into the processing and expression patterns of all types of RNA molecules including snoRNAs (33,69), and the improved understanding of the biogenesis of these molecules, in turn, allows to build more accurate identification protocols when scanning large-scale data sets. A recent controversy concerning the criteria that were used in identifying novel snoRNA genes (47), demonstrates again that only a thorough, well defined strategy can be used to map snoRNA genes on a genome-wide scale. To that aim, we here combined known sequence and structure properties of snoRNAs, as well as recently described characteristic patterns of processing and expression evidence to generate an updated catalog of human C/D box and H/ACA box snoRNAs.
Our analysis suggests that although many genomic regions may give rise to potential RNA molecules that are processed by the snoRNA-processing machinery and may even be bound by the core proteins of the snoRNP complex (33,69), it is only about 700 snoRNAs that are expressed at a significant level. Even more challenging than the identification of novel snoRNA genes is the task of finding the target RNAs. The main reason is that the interaction typically involves only a short region making it necessary to take additional signs of evidence such as evolutionary conservation into consideration. On the other hand, making this strategy fall short on species-specific modifications that have been reported as well (99,100). In this study, we extended the snoRNA interaction network in human being able to suggest functions for many of the novel snoRNAs as well as assign snoRNA guides to three previously reported ‘orphan’ modifications and five modifications identified by high-throughput methods during this study. In total, we were able to reduce the percentage of reported orphan snoRNAs from ~40% to ~20% compared to data currently listed in snoRNA-LBME-db. Our thorough target prediction strategy could, however, not identify reliable targets on ribosomal RNAs and snRNAs for the multicopy snoRNA families SNORD113, SNORD114, SNORD115 and SNORD116, which once more supports the hypothesis that these snoRNAs act in a non-canonical manner. Among canonical, evolutionarily conserved snoRNAs we currently annotate still 76 as orphan. Clearly, high throughput protocols such as RimSeq and Ψ-seq applied to RNA extracted from various tissues have the potential to uncover not yet recognized modification sites and further reduce the list of orphan snoRNAs. How many of the orphan snoRNAs are to execute non-canonical functions remains difficult to answer and will in most cases require detailed experiments for each snoRNA in question.
The C-D'-C'-D box architecture of C/D box snoRNAs seems to be crucial for correct formation and function of the snoRNP complex (101) equipping each C/D box snoRNA with two potential guide regions for modification of other RNAs. Our analysis clearly revealed that C/D box snoRNAs that have a predicted or reported guide for both ASEs are only a minority constituting about 15% of all catalogued snoRNAs. Among the snoRNAs with a single target, the D' box ASE is surprisingly preferred over the ASE located at the generally more conserved D box. The underlying reason for this observation is not clear, but it might be that the ASE at the D' box is catalytically more active. In contrast to C/D box snoRNAs most H/ACA box snoRNAs function as double guides. This in accordance with higher constraints on the sequences through the need of structure formation, resulting in higher evolutionary conservation of the sequences.
Finally, our analysis paints a new picture of the plasticity of cell or tissue specific expression of snoRNAs. Although it has been long known that neurons specifically express a large number of snoRNAs, we were also able to identify several snoRNAs that show specific expression in cells other than neurons. Especially, there is a striking difference in snoRNA expression between normal and malignant cells. The big question here is if the changes in snoRNA expression are reflected in the processing of the target molecules such as rRNAs and whether this has a consequence for mRNA translation. Our study facilitates new avenues into this direction by providing a carefully curated catalog of snoRNAs and their associated snRNA and rRNA modifications that serves as a basis for any study on this topic.
Supplementary Data are available here.
Swiss National Science Foundation (SNF) [31003A_147013 to H.J., M.Z.]; DFG-funded Collaborative Research Center ‘Obesity Mechanisms’ [CRC1052 to S.K.]; Marie Curie Initial Training Network, RNPnet [289007 to R.G.] from the European Commission. Funding for open access charge: SNF.
Conflict of interest statement. None declared.