The repertoire of RNAs found in eukaryotic cells is unexpectedly complex, with virtually the entire non-repeat portions of many genomes being transcribed (ref. 1). Genic regions are often populated by interleaved transcription units, which give rise to both protein-coding RNAs and long and short non-coding RNAs (ref. 1). Promoter-associated short RNAs (PASRs) and termini-associated short RNAs (TASRs) are recent additions to the pantheon of short RNAs (ref. 4). Although their functions are unknown, several of their characteristics support biological significance. For example, PASRs and TASRs cluster at 5′ and 3′ termini of annotated genes (ref. 4). Overall, the presence of PASRs correlates with the expression of a given locus, but not all expressed loci generate these species. Moreover, the production of PASRs and TASRs from particular loci is a conserved feature of the human and mouse genomes (ref. 4). As part of our ongoing effort to understand the full repertoire of small RNAs, their mechanisms of biogenesis and their biological impacts, we analysed the small RNAs (<200 nucleotides (nt)) of HepG2 and HeLa cell lines using next-generation sequencing (ref. 5).
Nearly 80 million short sequence reads (30–35 bases) were generated, representing RNAs <200 nt in both cell lines (Supplementary Fig. 1). Our sequencing protocols favoured RNAs with 5′ mono-, di- and tri-phosphate groups and capped RNAs. Nearly 30 million of these reads could be matched perfectly to the hg18 release of the human genome, with 9.5 million reads mapping to unique sites (Supplementary Fig. 1).
Sequences derived from mitochondria, chromosome Y, repeats, annotated small RNAs, predicted RNA genes (ref. 6), and known and predicted small nucleolar (sno)RNAs (ref. 7) were excluded from further analysis (Supplementary Fig. 1). This resulted in 232,805 sequences representing new small RNAs. Independent libraries from the same cell line overlap only modestly, indicating that our studies have not saturated small RNA content. Sequences were collapsed based on their mapping positions, assigned as the 5′ nucleotide of each read. This resulted in 102,159 distinct 5′ ends (‘rest’). A large fraction of sequences are derived from unannotated intergenic regions. Sequences also matched the sense and antisense strands of exonic and intronic regions of annotated genes. Notably, nearly half of all reads could be assigned to the sense strand of annotated exons, with a strong representation of first exons.
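The collapsing step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the tuple layout `(chrom, start, end, strand)`, the 1-based inclusive coordinates and the function names are all assumptions made for the example.

```python
from collections import Counter


def five_prime_end(start, end, strand):
    """Return the genomic position of a read's 5' nucleotide.

    Assumes 1-based inclusive coordinates with start <= end. On the minus
    strand the 5' nucleotide sits at the larger coordinate.
    """
    return start if strand == "+" else end


def collapse_reads(reads):
    """Count reads per distinct (chrom, strand, 5' position)."""
    counts = Counter()
    for chrom, start, end, strand in reads:
        counts[(chrom, strand, five_prime_end(start, end, strand))] += 1
    return counts


# Toy reads (hypothetical coordinates, not data from the paper).
reads = [
    ("chr1", 100, 131, "+"),
    ("chr1", 100, 134, "+"),  # same 5' end as the read above, different length
    ("chr1", 90, 121, "-"),   # minus strand: 5' end is position 121
    ("chr1", 95, 121, "-"),
    ("chr2", 500, 530, "+"),
]
collapsed = collapse_reads(reads)
```

Under these assumptions, reads that differ only in length or 3′ end fall into the same bin, which is why the number of distinct 5′ ends is smaller than the number of sequences.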
[Figure: Genomic distribution of small RNAs]
We had previously noted a class of small RNAs, namely PASRs, which associated with transcriptional start sites (TSSs) and mapped both to promoter regions and to annotated first exons. We therefore plotted the distribution of the unannotated RNA category (‘rest’) with respect to known TSSs. A clear pattern emerged with an enrichment of small RNAs on both strands directly adjacent to the TSS. The sense and antisense strands surrounding TSSs showed mirror-image profiles, with small RNAs on the sense strand accumulating most strongly downstream of the annotated TSS and small RNAs on the antisense strand accumulating mainly upstream of the annotated TSS. This is similar to what had been previously observed on high-resolution genomic tiling arrays (ref. 4). A gap of ~50 nt on the antisense strand separated the precise TSS and PASRs, an observation we have yet to understand.
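One way to build such a TSS-centred profile is to express each small RNA 5′ end as a signed offset in the direction of transcription and to tally sense and antisense ends separately. The sketch below assumes a single gene, a 500-nt window and hypothetical function names; it is an illustration rather than the analysis code behind the figure.

```python
from collections import Counter


def offset_from_tss(pos, tss, gene_strand):
    """Signed distance from the TSS; positive values lie downstream."""
    return (pos - tss) if gene_strand == "+" else (tss - pos)


def tss_profile(ends, tss, gene_strand, window=500):
    """Histogram 5' end offsets, split into sense and antisense.

    `ends` is an iterable of (position, strand) pairs; ends further than
    `window` nt from the TSS are ignored (window size is an assumption).
    """
    profile = {"sense": Counter(), "antisense": Counter()}
    for pos, strand in ends:
        off = offset_from_tss(pos, tss, gene_strand)
        if abs(off) <= window:
            key = "sense" if strand == gene_strand else "antisense"
            profile[key][off] += 1
    return profile


# Toy example: a minus-strand gene with its TSS at position 1000.
ends = [(950, "-"), (1080, "+"), (2000, "-")]
profile = tss_profile(ends, tss=1000, gene_strand="-")
```

In this toy case the minus-strand end at 950 is a sense RNA 50 nt downstream of the TSS, and the plus-strand end at 1080 is an antisense RNA 80 nt upstream, which matches the mirror-image arrangement described in the text.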
This PASR class (previously defined as small RNAs mapping within 500 nt of an annotated TSS) is found in both genic and intergenic regions of the genome and comprises 16.2% or 17.7% of filtered sequence tags (un-collapsed or collapsed, respectively). PASRs contribute to the small RNAs placed within both strands of exons and introns and small RNAs annotated as intergenic (shaded inner circles in each pie chart). On the basis of these definitions, PASRs form the most abundant individual class of defined small RNAs within the non-annotated fraction of our sequences.
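The 500-nt definition lends itself to a simple distance test against a sorted list of annotated TSS positions. The sketch below makes several assumptions that the text does not specify: a single chromosome, a strand-agnostic comparison (PASRs map to both strands), and an inclusive boundary at exactly 500 nt. The function names are hypothetical.

```python
from bisect import bisect_left


def is_pasr(pos, sorted_tss, max_dist=500):
    """True if `pos` lies within `max_dist` nt of any TSS in `sorted_tss`.

    Only the TSSs immediately left and right of `pos` need checking,
    because the list is sorted.
    """
    i = bisect_left(sorted_tss, pos)
    neighbours = sorted_tss[max(i - 1, 0):i + 1]
    return any(abs(pos - t) <= max_dist for t in neighbours)


def pasr_fraction(positions, tss_positions, max_dist=500):
    """Fraction of 5' ends classified as PASRs."""
    sorted_tss = sorted(tss_positions)
    hits = sum(1 for p in positions if is_pasr(p, sorted_tss, max_dist))
    return hits / len(positions)


# Toy coordinates (hypothetical, not data from the paper).
tss = [1000, 5000]
ends = [1400, 3000, 5500, 400]
fraction = pasr_fraction(ends, tss)
```

Applying the same function to collapsed and un-collapsed tags would give the two kinds of percentage quoted above, since collapsing changes how many times each position is counted.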
Because PASRs strongly associate with TSSs, we posited that transcription initiation per se might generate PASR 5′ ends. Thus, PASRs might contain cap structures. To probe this possibility, we prepared small RNA libraries using methods that require the presence of a 5′ phosphate and 3′ hydroxyl group (ref. 8). Capped RNAs should be refractory to capture by this cloning protocol, but could be made susceptible to cloning by removing cap structures. We therefore prepared three different small RNA libraries from HepG2 cells. One was from untreated RNA. The second was from RNA treated with tobacco acid pyrophosphatase (TAP) to leave clonable monophosphorylated ends on RNAs with caps, or with di- or tri-phosphate termini. The third library was from RNA treated with calf intestinal alkaline phosphatase (CIP) before TAP treatment. Pre-treating with CIP removes phosphates to leave unclonable 5′ OH termini on all uncapped RNAs. Sequence tags corresponding to the 5′ end of the capped U4 small nuclear (sn)RNA are enriched in libraries by TAP treatment and further enriched by CIP addition before TAP treatment. MicroRNA 21, which has a 5′ monophosphate terminus, is lost from the library on CIP treatment, as is 5S ribosomal RNA, which has a 5′ triphosphate terminus. Small RNAs defined as PASRs follow the pattern established by U4, consistent with them bearing some type of cap structure. The observation that PASRs are revealed to the cloning protocol by TAP alone indicates that they also contain 3′ OH termini. Considered together, these data indicate that PASRs are likely to arise either as independent capped transcripts emanating from annotated TSSs on both genomic strands or as processing products from longer capped RNAs. Candidates for the latter are PALRs (promoter-associated long RNAs), which often extend through the first exon and into the first intron (ref. 4).
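The logic of the three libraries can be written out as a small decision table. The sketch below is an idealized, all-or-nothing model built only from the enzyme activities stated above (TAP converts caps and di- or tri-phosphates to monophosphates; CIP converts uncapped phosphorylated ends to 5′ OH; only a 5′ monophosphate is clonable). Real enzymatic reactions are not fully efficient, and the function and label names are hypothetical.

```python
PHOSPHORYLATED = ("monophosphate", "diphosphate", "triphosphate")


def after_cip(end):
    """CIP removes 5' phosphates, leaving 5' OH; caps are left intact."""
    return "hydroxyl" if end in PHOSPHORYLATED else end


def after_tap(end):
    """TAP leaves a 5' monophosphate on capped, di- and tri-phosphate ends."""
    return "monophosphate" if end in ("cap", "diphosphate", "triphosphate") else end


def clonable(end, treatment):
    """Predict whether a 5' end is captured (idealized, binary model)."""
    if treatment == "untreated":
        state = end
    elif treatment == "TAP":
        state = after_tap(end)
    elif treatment == "CIP+TAP":
        state = after_tap(after_cip(end))
    else:
        raise ValueError(f"unknown treatment: {treatment}")
    return state == "monophosphate"


treatments = ("untreated", "TAP", "CIP+TAP")
table = {
    end: tuple(clonable(end, t) for t in treatments)
    for end in ("cap", "monophosphate", "triphosphate", "hydroxyl")
}
```

In this model only capped ends survive the CIP-then-TAP treatment, so their relative representation rises as uncapped species drop out. That is the pattern reported for U4 and for PASRs, whereas a monophosphate end (as in microRNA 21) or a triphosphate end (as in 5S rRNA) is predicted to be lost after CIP.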
CAGE tagging protocols take advantage of the 5′ cap structure to capture sequence reads from the 5′ ends of long RNAs (ref. 2). Substantial databases of such tags have been produced from long polyadenylated RNAs from more than 20 human tissues (refs 9, 10). Plotting CAGE tags with respect to annotated TSSs revealed patterns similar to those observed for PASR class small RNAs. In both genic and intergenic regions, we also observed a strong tendency for a precise identity between CAGE and PASR 5′ ends.
[Figure: Correlation of sRNAs and CAGE tags]
We also noted a substantial population of small RNAs mapping more than 500 nt away from annotated TSSs, which contributed to intronic, exonic and intergenic classes (outer portions of each pie). Similarly, we noted a large population of CAGE tags located >500 base pairs (bp) from the TSS (59.0%) that could be assigned to exons, introns and intergenic regions (11.36, 18.8 and 28.9% of total uncollapsed CAGE tags, respectively; Supplementary Table 1).
Certainly, a fraction of these could arise as products from unannotated TSSs, giving rise to both CAGE tags and PASRs. However, the correlation between the 5′ ends of non-PASR small RNAs and CAGE tags was less precise (± 10 nt) than was noted at annotated TSSs. This indicated that non-PASR classes of small RNAs might arise by mechanisms that differ from canonical PASRs.
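The comparison between exact coincidence and a ±10-nt tolerance can be sketched as a nearest-neighbour distance calculation between small RNA 5′ ends and CAGE tag 5′ ends. The example assumes a single chromosome and strand, a non-empty list of CAGE positions, and hypothetical function names; it does not reproduce the statistics used in the study.

```python
from bisect import bisect_left


def nearest_distance(pos, sorted_sites):
    """Distance from `pos` to the closest entry of a sorted, non-empty list."""
    i = bisect_left(sorted_sites, pos)
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(pos - s) for s in candidates)


def fraction_within(small_rna_ends, cage_ends, tolerance):
    """Fraction of small RNA 5' ends lying within `tolerance` nt of a CAGE 5' end."""
    sites = sorted(cage_ends)
    hits = sum(1 for p in small_rna_ends if nearest_distance(p, sites) <= tolerance)
    return hits / len(small_rna_ends)


# Toy coordinates (hypothetical, not data from the paper).
cage = [100, 250, 400]
small = [100, 255, 300, 409]
exact = fraction_within(small, cage, tolerance=0)
loose = fraction_within(small, cage, tolerance=10)
```

A class whose ends coincide precisely with CAGE tags gives similar values at both tolerances, whereas a class that only approaches CAGE tags shows a low exact fraction and a higher fraction at ±10 nt, as in this toy example.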
Both small RNAs and CAGE tags accumulate more strongly in internal exons than in introns or in intergenic space, if these regions are normalized by their cumulative length (not shown). By examining the distribution of both CAGE tags and small RNAs across annotated internal exons, we noted a strongly decreasing number of CAGE sequences beginning about 20 nt from the 3′ end of the average exon/intron boundary (splice donor site). If CAGE tags crossed splice junctions, they would not be co-linear with the genome at these sites and would, therefore, not have been mapped in the initial analyses (ref. 10), possibly giving rise to the observed pattern. We therefore extracted previously published CAGE tags, which had failed to map to the genome, and probed these against sequences of known exon–exon junctions. We uncovered a substantial population of CAGE tags that crossed splice junctions, and that therefore must have arisen from at least partially processed mRNAs.
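Probing unmapped tags against exon–exon junctions can be illustrated with exact string matching against synthetic junction sequences. The sketch below is a simplification under stated assumptions: it joins every upstream exon to every downstream exon of one transcript (the text only says "known" junctions), uses exact matches, and requires at least one base on each side of the junction. The names and the `flank` parameter are hypothetical.

```python
def build_junctions(exons, flank=20):
    """Join the last `flank` bases of each exon to the first `flank` bases
    of every downstream exon. Returns {(i, j): (sequence, boundary_index)}.
    """
    junctions = {}
    for i, upstream in enumerate(exons):
        left = upstream[-flank:]
        for j in range(i + 1, len(exons)):
            junctions[(i, j)] = (left + exons[j][:flank], len(left))
    return junctions


def junction_hits(tag, junctions, min_overhang=1):
    """Return junctions that `tag` matches exactly while spanning the boundary."""
    hits = []
    for key, (seq, boundary) in junctions.items():
        start = seq.find(tag)
        while start != -1:
            end = start + len(tag)
            if start <= boundary - min_overhang and end >= boundary + min_overhang:
                hits.append(key)
                break
            start = seq.find(tag, start + 1)
    return hits


# Toy exon sequences (hypothetical, not from any real gene).
exons = ["ACGTACGTAA", "GGCCTTAGCA", "TTTTGACCAG"]
junctions = build_junctions(exons, flank=6)
spanning = junction_hits("TAAGGC", junctions)   # covers the exon 0 / exon 1 join
internal = junction_hits("GGCCTT", junctions)   # lies entirely within exon 1
```

A tag that matches only within one exon is rejected by the overhang test, so the hits are limited to tags that are not co-linear with the genome at the splice site.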
[Figure: Correlation between CAGE tags, sRNAs and internal exons of annotated transcripts]
CAGE tags are well established as markers of capped 5′ ends. Certainly, internal exons might contain unknown sites of transcriptional initiation that could give rise to both CAGE tags and small RNAs, which would then be defined as PASRs. However, we observe numerous tags that both initiate less than 20 bases from exon boundaries and cross exon–exon junctions. Very short exons splice inefficiently (ref. 11), and naturally occurring 5′ exons less than 20 bases in length are rare (not shown). Thus, the CAGE tags that we observe probably represent cleaved products of mature mRNAs that somehow acquire a 5′ modification analogous to a cap structure that renders them sensitive to the CAGE tagging method. Such a reaction would represent a previously unrecognized RNA processing pathway and a previously unknown fate for spliced mRNAs. Although the CAGE tags used for comparison in this study were derived from polyadenylated RNAs, we cannot determine whether small RNAs originated from poly(A)+
Generation of CAGE tags from internal exons is not confined to a small number of genes. In fact, 49% of all human genes generate a CAGE tag mapping to an internal exon. For 2% of them, one or more CAGE tags are found in all of the internal exons. This exceeds the number expected by chance (P value <0.001). Prevalent and systematic generation of both small RNAs and CAGE tags from internal exons is illustrated by APOB, a gene encoding the apolipoprotein B protein that regulates cholesterol metabolism (ref. 12). In this instance, mapping of acetylated histone H3 (ref. 13) is consistent with a single prevalent TSS, which correlates with the presence of both CAGE tags and PASRs. However, both CAGE tags and small RNAs are even more abundant in internal APOB exons, and these often coincide at specific sites (figure insets). This suggests a model in which mature transcripts from the APOB gene are processed post-transcriptionally and in which processing products become modified by some type of cap structure. This possibility gains support from sequencing of small RNAs recovered by immunoprecipitation with a methylguanosine cap antibody (ref. 14). This approach not only recovers PASRs but also enriches small RNAs that map less than 10 bp away from a CAGE tag in internal exons (P value <0.001). In libraries prepared from unfractionated small RNAs, 10.0% of filtered sequences mapping within internal exons lie within 10 bp of a CAGE tag. This number increases to 27.0% in libraries prepared from RNAs immunoprecipitated with the anti-m7
As illustrated by APOB, genic CAGE tags and small RNAs are approximately ten times more likely to map within exonic than intronic regions. As with CAGE tags crossing exon–exon junctions, this result is consistent with a model in which CAGE tags can be derived from products of processed mRNAs.
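A comparison of this kind depends on normalizing tag counts by the cumulative length of each region class, as noted earlier for internal exons, introns and intergenic space. The sketch below shows that normalization. The counts and lengths are made-up toy values chosen only to give a tenfold ratio; they are not measurements from the study, and the function names are hypothetical.

```python
import math


def density(tag_count, total_length_nt):
    """Tags per nucleotide for one region class."""
    return tag_count / total_length_nt


def fold_enrichment(count_a, length_a, count_b, length_b):
    """Ratio of the length-normalized tag densities of class A over class B."""
    return density(count_a, length_a) / density(count_b, length_b)


# Toy values: introns hold more tags in absolute terms but are far longer.
exon_tags, exon_length = 300, 60_000
intron_tags, intron_length = 500, 1_000_000
ratio = fold_enrichment(exon_tags, exon_length, intron_tags, intron_length)
```

The toy case shows why raw counts alone would be misleading: the intronic class has more tags, yet the exonic class has the higher density once cumulative length is taken into account.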
Considering the prevalence of the PASR class, we sought to probe its potential biological function. As with many genes, PASRs are found at the annotated TSSs of MYC. We synthesized a collection of 30–35-nt, single-stranded RNAs that share their 5′ ends with three PASRs from the sense genomic strand and two from the antisense strand upstream of the annotated TSS. These were transfected individually into HeLa cells, and their effect on the abundance of MYC mRNA was measured. In each case, transfection of the PASR mimetic reduced the expression of c-MYC mRNA. The consequences of these effects were measured by co-transfection of a MYC-responsive luciferase reporter construct, which showed reduced activity in the presence of each PASR. Similar results were obtained for five PASR mimetics corresponding to the connective tissue growth factor (CTGF) gene (Supplementary Fig. 2). The presence of PASRs is associated with marks of active transcription, including association with RNA polymerase II, histone H3 and H4 acetylation, and H3K4 tri-methylation, as well as an increased susceptibility to DNase treatment (Supplementary Fig. 3). Our data indicate a causal connection between PASRs and active MYC expression, although we have not yet investigated the impact of delivering ectopic PASRs on the active marks with which the presence of endogenous species is correlated.
[Figure: Regulation of gene expression by PASRs]
Profiling of small RNAs, defined as those less than 200 nt in length, has revealed a substantial complexity in the output of both genic and intergenic regions of the genome. These studies have raised two possibilities for the origin of PASRs. First, they may be produced as capped, independent transcription products from promoters that also generate long RNAs. Second, they may be generated as post-transcriptional processing products of longer RNAs that initiate at annotated TSSs.
A notable outcome of these studies is the finding that both CAGE tags and small RNAs decorate not only intergenic spaces but also internal exons of protein-coding and non-coding transcripts. The existence of a large class of CAGE tags that are both adjacent to and cross splice junctions provides a prima facie case for the conclusion that long RNAs are metabolized into short RNAs that bear cap-like structures at their 5′ ends. The long RNAs, which ultimately give rise to short RNAs, could be primary transcription products or processing products themselves. Moreover, our studies indicate that CAGE tags are capturing not only TSSs but also the 5′ ends of post-transcriptionally processed RNAs.
[Figure: A proposed model for the metabolism of genic transcripts into a diversity of long and short RNAs]
A key question remains as to whether the group of small RNAs that arise from internal exons represents transition products from mature mRNAs into recyclable ribonucleotides. Several lines of evidence argue against these representing simple degradation intermediates. First, there is a strong correlation between the precise 5′ ends of CAGE tags, derived from long RNAs, and small RNAs identified in our study. These maps were produced from various RNA and tissue sources, and by different groups. The results from independent samples are consistent and indicative of discrete processing sites. Second, based on chemical modification in the CAGE procedure and affinity purification for small RNA libraries, both types of tags significantly enrich under conditions that favour capped RNAs. Third, CAGE tags and small RNA species arise only from a discrete, although substantial, subset of genes, and the abundance of the non-PASR class does not correlate simply with the expression level of their generative loci (see Supplementary Methods).
Several studies have indicated that RNA interference directed to promoter regions and apparently non-transcribed portions of genes can have a regulatory impact. In some cases that impact is silencing (refs 15, 16), whereas in others activation was surprisingly observed (ref. 17). Our analysis of PASRs indicates that providing their synthetic mimetics in trans can have a consistent, although modest, impact on gene expression. Although in the two cases tested, MYC and CTGF, increasing PASR levels decreased expression, it remains possible that the outcome of manipulating PASRs will be gene-specific, consistent with accumulating evidence that destroying promoter-associated RNA (PASR) species can have both positive and negative impacts (ref. 17).
The functions of small RNAs corresponding to intergenic and exonic regions remain obscure. Such species could have regulatory roles, per se, or they could participate more globally in a bookkeeping or quality control mechanism by which the cell records its transcriptional output and splicing patterns. This has been previously hypothesized as a role for non-protein-coding RNAs (refs 18, 19). What is clear is that the transcriptional product of cells is captured in small, stable RNA populations to a degree that was unanticipated, and that at least a subset of these can serve as components of regulatory circuits.