Large-scale sequencing projects have revealed an unexpected complexity in the origins, structures and functions of mammalian transcripts. Many loci are known to produce overlapping coding and non-coding RNAs with capped 5′ ends that vary in size. Methods that identify the 5′ ends of transcripts will facilitate the discovery of novel promoters and 5′ ends derived from secondary capping events. Such methods often require high input amounts of RNA not obtainable from highly refined samples such as tissue microdissections and subcellular fractions. Therefore, we have developed nanoCAGE (Cap Analysis of Gene Expression), a method that captures the 5′ ends of transcripts from as little as 10 nanograms of total RNA and CAGEscan, a mate-pair adaptation of nanoCAGE that captures the transcript 5′ ends linked to a downstream region. Both of these methods allow further annotation-agnostic studies of the complex human transcriptome.
Analysis of the mammalian transcriptome and transcriptional network in ex vivo cells requires technologies that provide a comprehensive and unbiased view of the tissue-specific promotome (the complete set of promoters) from small amounts of RNA and the intron-exon structure of the transcripts associated with different transcription start sites (TSSs), which are marked by a cap-site in most eukaryotic RNA polymerase II-derived RNAs.
Among sequencing-based techniques to measure gene expression, tag-based methods are common. They involve reading a short sequence of a transcript that is still long enough to be mapped onto the genome. We have used Cap Analysis Gene Expression (CAGE)1–3, a cap-trapping based method which allows for systematic 5′ end profiling of capped RNAs, for the first comprehensive single-base resolution maps of TSS and promoters from human and mouse tissues4 and for deciphering transcriptional networks in the human leukemia cell line THP-1 (ref. 5). Such large-scale characterization of TSSs showed an unprecedented complexity of the transcriptome. In contrast to classic gene models, the emerging view suggests that most genes have multiple TSSs differing by multiple bases4 and driven by various core promoters, and that newly capped 5′ ends can also be created post-transcriptionally6. Transcription can be initiated by promoters that are broad in shape, often associated with CpG islands, or by sharp promoters, which are narrow in shape and are often associated with TATA-boxes4. These promoter structures have functional implications: they are associated with tissue specificity (as sharp promoters are, for example), different exon usage, translation initiation sites or classes of non-coding RNAs (ncRNAs). Within the locus of a coding gene, transcription can start within and downstream of the open reading frame, as for the non-coding RNAs that can originate in genomic regions corresponding to the 3′ ends of protein-coding genes4. Additionally, the capped transcriptome includes non-coding RNAs that are associated with initiation and termination of transcription6,7.
However, there are outstanding problems that could not yet be addressed with the existing technologies. CAGE requires a large quantity of starting material (~50 μg of total RNA), precluding TSS transcriptome analysis of small samples, such as homogeneous cell preparations after microdissection or samples derived from cellular sub-fractionation.
Furthermore, newly identified promoters must be assigned to gene models. Although CAGE identifies new promoters, determining their connection to either downstream known gene structures or to independent novel RNAs is limited to low-throughput gene-by-gene validations. RNA shotgun sequencing approaches (RNA-seq) have been unable to distinguish multiple 5′ ends of a given gene, identifying only their most extreme boundaries at best. This constrains the functional annotation of promoters, on which accurate inference of transcriptional regulatory networks depends5, and limits the study of ncRNAs overlapping known genes. Paired-end sequencing of full-length cDNA, like the GIS (Gene Identification Signature) ditag approach8, allows for the determination of TSS and termination sites in polyadenylated mRNAs, but does not yield information on internal exons. In addition, it requires large quantities of purified mRNAs.
Here we present nanoCAGE and CAGEscan technologies, which provide a genome-wide profiling of TSSs from small quantities of RNA and link them to the anatomy of transcribed RNAs. nanoCAGE was carried out with as little as 10 ng of total RNA, the equivalent of the RNA content of a thousand cells. CAGEscan provided important insights on the complexity of the promotome-transcript structure, identifying among others, RNAs that originate from a given TSS but terminate in unrelated downstream genes. Our data also provide an estimate of RNA types that populate the various cell compartments, suggesting a nuclear role for intron- and intergenic regions-derived RNAs, as well as for retrotransposon elements and antisense RNAs.
The classic CAGE protocol consists of many biochemical processing steps2,9, whereas nanoCAGE takes advantage of a peculiar property of reverse transcriptase, called “template switching”10, to select 5′ ends of capped transcripts. Template switching (TS) exploits the ability of the reverse transcriptase to extend the cDNA using the mRNA’s cap as a template: the resulting synthesized first-strand cDNA carries one to three C nucleotides that correspond to the cap structure11,12. These Cs hybridize to the ribo-G at the 3′ end of a template-switching oligonucleotide (Fig. 1a). The reverse transcriptase extends cDNA polymerization using the TS oligonucleotide as template, providing extra 3′ sequence to the first-strand cDNA (Fig. 1a), which is used to prime second-strand cDNA synthesis. Although TS has been observed for blunt DNA/RNA hybrids10, we show that its efficiency on capped RNAs is far greater, therefore preferentially capturing capped, full-length transcripts (Fig. 2). TS does not require purification steps, and thus avoids loss of material.
To target non-coding, non-polyadenylated RNAs (poly-A−) and RNAs whose 3′ end has been truncated during the isolation of specific cells from the tissue of interest, we developed conditions to allow random priming of the reverse-transcription (RT) reaction in combination with 5′ template switching. Due to the DNA-dependent polymerase activity of the reverse transcriptase, annealing of TS oligonucleotides and RT primers to each other generates small artifactual DNA fragments that become the predominant PCR templates in subsequent steps, impairing library preparation (not shown). The prevalence of these artifacts has, so far, precluded the development of random primer-based cap-switch methods for whole transcriptome analysis with total RNA. To overcome this problem, we designed a “semi-suppressive PCR” method, in which the linkers at the 5′ and 3′ ends of the cDNA carry similar (but not identical) complementary sequences (Fig. 1b, Supplementary Fig. 1). Consequently, DNA templates bearing the same ending sequences (such as non-oriented cDNAs carrying two 5′ or two 3′ linkers) or small-size templates (such as primer-derived artifacts) are less efficiently amplified during PCR. As a result, the majority of PCR products consist of long cDNAs properly flanked by the adapter sequences present in the TS oligonucleotide and RT primer, as they are the most efficiently amplified templates (Fig. 1b). Additionally, the removal of primer dimers by the semi-suppressive PCR enables the use of TS primer in concentrations as high as 10 μM. This maximizes the efficiency of template switching since the concentration of TS oligonucleotides is the reaction-limiting factor (data not shown).
The ability to use random primers considerably extends the power of this technique since (i) poly-A− transcripts constitute at least one third of the transcriptome13, (ii) long polyadenylated RNAs are often damaged by ex vivo sample preparation, such as laser capture micro-dissection of fixed tissues and (iii) PCR of oligo-dT primed cDNAs introduces strong size and representational biases regardless of potential RNA degradation.
TS oligonucleotides and RT primers used in the nanoCAGE protocol contain EcoP15I restriction sites14 to systematically generate 25 bp fragments corresponding to the 5′ end of the template-switched captured cDNAs, thus producing nanoCAGE tag libraries (Fig. 1c). Although this enzymatic cleavage might be dispensable when reading short reads with several second-generation sequencers, the standardization of tag length overcomes biases during the second round PCR and simplifies DNA molar quantification and sequencing. Additionally, EcoP15I tagging allows the introduction of a DNA sequence “barcode” at the 3′ end (Fig. 1c) and thus pooling of different libraries prior to their sequencing, resulting in dramatic cost savings15.
CAGEscan was built upon nanoCAGE, but modified to accommodate paired-end sequencing for TSS determination at 5′ ends coupled with 3′ end sequencing of cDNAs at random priming sites. Rather than cleaving the cDNAs, in CAGEscan we added adapter sequences allowing for paired-end sequencing in the Illumina Genome Analyzers16 (Fig. 1d). Thus, CAGEscan yields collections of 3′ end reads “scanning” transcripts defined by their common 5′ end, as obtained by sequencing of both the 5′ end and the 3′ end of the template-switched captured cDNAs. Yet, unlike nanoCAGE libraries that contain uniformly short sequences, CAGEscan libraries show a broader size range including fragments longer than 1 kb, which perform poorly on currently available second-generation sequencing platforms17. This problem is minimized by exclusively using highly concentrated random primers and commercial reverse transcriptases, which show little strand displacement activity. Thus, CAGEscan sequencing templates are kept relatively short, regardless of the length of the original mRNA molecules.
Due to the selectivity of the template switch for capped molecules, both protocols were used on total RNA, without ribosomal RNA (rRNA) depletion. Notably, the use of 3′ random primers allows the detection of non-coding, non-polyadenylated RNAs13, which have so far been poorly characterized.
In order to validate the reproducibility, the efficiency and the precision of nanoCAGE, we prepared libraries from serially diluted total RNA from cultured hepatocellular carcinoma cells (Hep G2) and compared them to reference TSS data.
Two duplicate sets of nanoCAGE libraries from 10, 50, 250 and 1,250 nanograms of total RNA were synthesized and sequenced with an Illumina Genome Analyzer. Extracted tags were aligned to the human genome (NCBI Build 36.1)18. We clustered TSS from all the libraries that were located less than 20 bp apart on the same genome strand4. Clusters separated by less than 400 bp were grouped in promoter regions5. We then compared expression levels, measured as number of tags per promoter region in a given library, for each pair of replicates with the same quantity of RNA. The Pearson correlation coefficients between replicates were 0.97, 0.96, 0.97 and 0.99 for 10, 50, 250 and 1,250 ng, respectively. Sequencing depths of replicates were in some cases substantially different (Supplementary Table 1), and higher correlations were observed for more deeply sequenced libraries (0.99 for the 1,250 ng replicates). Satisfactory reproducibility was demonstrated for all RNA concentrations tested. We then pooled the tags into virtual libraries for each RNA quantity, and compared them with each other. Pearson correlation coefficients varied between 0.987 and 0.999 (Supplementary Table 2), showing that similar snapshots of the transcriptome are obtained from quantities of starting total RNA ranging from 10 to 1,250 ng.
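The two-level clustering and the replicate comparison described above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the 20 bp and 400 bp thresholds are interpreted as single-linkage gaps between neighbouring positions on one strand, and the coordinates in the usage note are toy values.

```python
from math import sqrt

def cluster_positions(positions, max_gap):
    """Single-linkage clustering of positions on one strand: a position
    joins the current cluster if it lies less than max_gap bp from the
    previous (sorted) position; otherwise it opens a new cluster."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def promoter_regions(tss_positions, tss_gap=20, region_gap=400):
    """Group TSSs into clusters (< tss_gap apart), then merge clusters
    separated by less than region_gap into promoter regions.
    Returns a list of (start, end) tuples."""
    regions = []
    for cluster in cluster_positions(tss_positions, tss_gap):
        start, end = cluster[0], cluster[-1]
        if regions and start - regions[-1][1] < region_gap:
            regions[-1] = (regions[-1][0], end)  # extend previous region
        else:
            regions.append((start, end))
    return regions

def pearson(x, y):
    """Pearson correlation between two equally long lists of tag counts
    (one value per promoter region)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

With toy TSS positions `[100, 105, 130, 600, 610]`, `cluster_positions(..., 20)` gives three clusters and `promoter_regions` merges the first two into one region because they lie less than 400 bp apart. In a real analysis each strand and chromosome would be processed separately.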
A similar template-switching approach has been used with fragmented, uncapped RNA molecules19, showing that TS could also be used on uncapped 5′ ends. However, this protocol required prior depletion of ribosomal RNA; otherwise, reverse transcription of total RNA with random hexamers yielded 90–94% of ribosomal sequences20. In our hands, only 11% of Hep G2 nanoCAGE tags matched rRNA sequences, showing an 8-fold reduction in non-capped rRNA content. This demonstrates the strong preference of template switching for capped over non-capped RNAs. To prove efficient capture of the 5′ ends of capped transcripts, we prepared nanoCAGE libraries from 100 ng of decapped, fragmented or both decapped and fragmented total RNA and analyzed the distribution of tags mapping to RefSeq transcript models21. In a library prepared with untreated RNA, 52% of the tags mapped to first exons or proximal promoters in RefSeq (defined as 500 bp to RefSeq TSS) and 31% detected potentially new promoters in intergenic regions (Fig. 2a, Supplementary Fig. 2 and Supplementary Table 3). We noted that tags mapping to internal exons and 3′ UTRs (15% in total) can also derive from genuinely capped transcripts co-localized within the boundaries of longer transcripts referenced by RefSeq4,6,9. The proportion of tags matching the 5′ end of known transcripts was halved to 23% after RNA decapping. Furthermore, this number dropped to 8.5% upon RNA fragmentation, suggesting that a large number of uncapped RNA molecules is needed to compete with capped molecules. Upon combining decapping and fragmentation, the preferential capture of 5′ ends was almost completely abolished, demonstrating that nanoCAGE distinguishes capped ends from other 5′ ends and preferentially captures the 5′ end of capped transcripts.
The semi-suppressive PCR did not impair the detection of relatively short transcripts, as we detected expression for 78% of RefSeq transcripts (23,512/29,996), including 44% (271/615) of the subset shorter than 250 bp. In that respect nanoCAGE tags outperform the FANTOM3 dataset, in which only 5% (28/615) of the short RefSeq transcripts are detected (Supplementary Table 1). Furthermore, the EcoP15I cleavage did not introduce any bias, as we found the CAGCAG EcoP15I restriction site in 81% of the detected transcripts for both the nanoCAGE and the FANTOM3 libraries, which were made with a different restriction enzyme, MmeI (ref. 1) (see also Supplementary Fig. 3).
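A check of this kind, scanning transcript sequences for the CAGCAG recognition site, could look like the following Python sketch. The function names are hypothetical, and searching both strands (that is, also accepting the reverse complement CTGCTG) is an assumption; the text does not state how the site was counted.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def has_ecop15i_site(seq, site="CAGCAG"):
    """True if the recognition site occurs on either strand
    (both-strand search is an assumption of this sketch)."""
    seq = seq.upper()
    return site in seq or site in reverse_complement(seq)

def fraction_with_site(sequences):
    """Fraction of sequences that contain at least one site."""
    hits = sum(has_ecop15i_site(s) for s in sequences)
    return hits / len(sequences)
```

Applying `fraction_with_site` separately to the transcripts detected by each library type would give the two proportions being compared; similar values would suggest that the presence of the site does not influence detection.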
To confirm the precision of template switching in detecting TSS, we compared promoters identified by nanoCAGE with those found by two methods that use different protocols for cap selection and for which Hep G2 libraries were available4,22: Deep-RACE22 (Rapid Amplification of cDNA Ends), based on oligo-capping23, and CAGE4. The Deep-RACE data was limited to 18 different promoters in 17 loci22. As exemplified with histone cluster 1, H3 (HIST1H3C, Fig. 2b), the main TSS was the same between nanoCAGE and Deep-RACE or FANTOM3 CAGE data in four and seven cases respectively. The HIST1H3C locus also exemplifies the ability of our random-primed approach to uncover TSSs of non-polyadenylated transcripts (HIST1H3C RefSeq model lacks any 5′ UTR information). When allowing only 4 bp discrepancy between the TSS uncovered by each methodology, the results of nanoCAGE were in agreement with Deep-RACE and FANTOM3 CAGE for 11 out of 18 promoters (Supplementary Table 4) and for 17 of the 18 Deep-RACE-validated TSS respectively. Interestingly, the two alternative promoters of PPP2R4 uncovered by Deep-RACE and CAGE were also detected by nanoCAGE and their relative differential expression levels were consistent between all three approaches (data not shown). To extend this result we compared the location of all the promoters detected by the two genome-wide libraries, nanoCAGE and CAGE. Although a large number of promoters are broad in size4, for 66% of the promoters the distance between TSS detected by both techniques was less than 5 bp (Supplementary Fig. 4, Supplementary Table 5).
Even for cells grown in culture, starting material becomes a limiting factor when cellular sub-compartments are selectively fractionated to explore specific RNA content. As part of the ENCODE project, attempts to produce CAGE libraries from nuclear RNA subfractions of the K562 myelogenous leukemia cell line (the nucleolus, the nucleoplasm, the chromatin-bound RNAs as well as from polysomal poly-A− RNA consisting mostly of rRNA) were unsuccessful due to the paucity of mRNA (not shown). Using nanoCAGE, four libraries were synthesized and between 9.5 and 13.8 million tags were sequenced for each of them. Compared to standard poly-A− CAGE libraries, which were sequenced at the same depth, the complexity of detected 5′ ends was consistent between the two technologies for each cellular compartment (Supplementary Fig. 5). We have also found differences in TSS specificity among different compartments (not shown).
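The comparison of main TSS positions between methods can be illustrated with a short Python sketch. It assumes, hypothetically, that each method yields a mapping from genomic position to tag count for a promoter, that the main TSS is the position with the most tags, and that "4 bp discrepancy" means an absolute distance of at most 4 bp; none of these names or conventions comes from the original analysis.

```python
def main_tss(tag_counts):
    """Position with the highest tag count in a promoter;
    ties are broken by taking the smallest coordinate."""
    return min(tag_counts, key=lambda pos: (-tag_counts[pos], pos))

def concordant(tss_a, tss_b, tolerance=4):
    """True if two TSS positions differ by at most `tolerance` bp."""
    return abs(tss_a - tss_b) <= tolerance

def count_concordant(tss_pairs, tolerance=4):
    """Number of promoters whose main TSSs agree between two methods.
    tss_pairs is a list of (tss_method_a, tss_method_b) tuples."""
    return sum(concordant(a, b, tolerance) for a, b in tss_pairs)
```

Setting `tolerance=0` counts only identical main TSSs, while the default of 4 corresponds to the more permissive comparison and matches the "less than 5 bp" criterion used for the genome-wide comparison.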
The functional significance of novel 5′ ends is limited by the lack of information on the entire transcript. To better understand the structure of the transcripts associated with novel TSS and to better characterize the differences between nuclear and cytoplasmic transcriptomes, CAGEscan libraries were prepared in technical duplicates from four different Hep G2 cultures. In a first series of experiments, nuclear and cytoplasmic fractions were analyzed either as total RNA or as a subfraction depleted of polyadenylated transcripts (poly-A−) (Supplementary Table 1).
CAGEscan mate pair sequences were aligned against the human genome (NCBI build 36.1). The poly-A− cytoplasmic, poly-A− nuclear, total cytoplasmic and total nuclear fractions yielded together a total of 2,109,392 unique paired-end tags (Supplementary Table 6). Each of these associated a TSS to a downstream sequence in a random location. Selecting mate pairs starting within 50 bp of a RefSeq transcript’s TSS and using their intron-exon structure, we estimated the length of the RNA from which mate pairs were derived. The resulting median length was 449 bp (1st and 3rd quartile: 304 and 693 bp, Supplementary Fig. 6). In comparison, the median length of RefSeq transcript models is 2,422 bp (1st and 3rd quartile: 1,509 and 3,799 bp), thus suggesting the majority of RNA Pol II transcripts are competent to produce CAGEscan mate pairs, although CAGEscan is not optimal to map the 3′ ends of transcripts.
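One way to estimate such lengths is to project both ends of a mate pair onto the spliced transcript and take the distance between them. The Python sketch below is an assumption-laden illustration rather than the method actually used: it handles plus-strand transcripts only, takes exons as sorted, non-overlapping half-open `(start, end)` intervals, and uses hypothetical function names.

```python
from statistics import median, quantiles

def spliced_offset(exons, pos):
    """Offset of genomic position pos in the spliced transcript
    (0-based), or None if pos does not fall inside any exon."""
    offset = 0
    for start, end in exons:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start  # whole exon lies upstream of pos
    return None

def fragment_length(exons, five_prime, three_prime):
    """Spliced length (introns excluded) of the cDNA spanning from the
    5' end position to the 3' end position, both inclusive."""
    a = spliced_offset(exons, five_prime)
    b = spliced_offset(exons, three_prime)
    if a is None or b is None:
        return None  # one end is intronic or outside the model
    return b - a + 1

def summarize(lengths):
    """Return (first quartile, median, third quartile)."""
    q1, _, q3 = quantiles(lengths, n=4)
    return q1, median(lengths), q3
```

For a toy model with exons `[(1000, 1100), (2000, 2200)]`, a pair starting at 1010 and ending at 2049 spans 90 bases of the first exon and 50 of the second, so `fragment_length` returns 140 instead of the 1,040 bp genomic distance.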
CAGEscan allows the association of TSSs detected by CAGE to otherwise orphan intergenic, intron or 3′ UTR regions. Mate pairs were then annotated with respect to RefSeq transcript models21 complemented with a proximal promoter, defined as the region comprising the 500 bp directly upstream of their 5′ end. This showed that an average of 4.24% of the transcripts matched RefSeq transcripts, with 1.85% of them being strictly consistent with current gene models (that is, starting within a RefSeq promoter or 5′ UTR exon and ending within a RefSeq exon), while the rest likely represented alternative mRNA splice forms and non-coding RNAs. These were located antisense of RefSeq, within their introns or in intergenic regions. Furthermore, the latter represented 87.5% of the total signal (Fig. 3 and Supplementary Table 7). We observed specific differences between sub-cellular fractions and between the poly-A− and total RNA (dominated by poly-A+ molecules). Relative to mate pairs ending within any RefSeq exon, paired-end tags starting and ending within RefSeq promoters or 5′ UTRs were twice as prevalent in the poly-A− libraries as in the total RNA libraries. Those may correspond in part to promoter-associated long and short RNAs (PALRs and PASRs)7. Nuclear fractions showed more paired-end tags starting and ending in intergenic and intronic regions than cytoplasmic fractions (Fig. 3a). By preparing 12 additional cytoplasmic and nuclear CAGEscan libraries from 6 independent HepG2 biological replicates, we confirmed the higher abundance of intronic (Fig. 3b) and intergenic transcripts in nuclear fractions (P = 0.019 and P = 0.004 respectively, paired Student’s t-test).
To better reconstruct transcript models, we grouped paired-end tags with overlapping 5′ ends into CAGEscan clusters (Fig. 4a). Alignment patterns of the corresponding 3′ end tags on the genome recapitulate the potential structure of the transcript resulting from a common promoter. By pooling together the poly-A− cytoplasmic, poly-A− nuclear, total cytoplasmic and total nuclear fractions libraries, we obtained 854,849 distinct CAGEscan clusters, with an average of 2.47 reads per cluster. Clustered independently, the cytoplasmic libraries produced in technical duplicates yielded between 72,666 and 34,822 clusters (with 3.5±0.6 reads per cluster) and the corresponding nuclear libraries yielded between 309,682 and 147,963 clusters (with 1.6±0.3 reads per cluster). 9% of the CAGEscan clusters (all libraries pooled) started upstream of the translation initiation site of RefSeq; of them, 76.5% reached into their 3′ UTR or into their downstream intergenic region, associating a promoter to a 3′ UTR for 11,131 RefSeq transcripts. Comparable ratios (9.5%±2.65 and 80%±11 respectively) were obtained when considering the four libraries separately (Supplementary Table 6). The region surrounding the FTL gene, for which the complete RefSeq model is tiled with paired-end sequences, illustrates such 5′ mate pair driven clustering (Fig. 4b). We observed subcellular compartment-specific antisense expression, with most of these antisense CAGEscan clusters ending in close proximity to the promoter of FTL (Fig. 4b and Supplementary Fig. 7). A total of 37,818 clusters were antisense to 9,638 RefSeqs (Supplementary Table 6). Antisense RNAs were generally more prevalent in the nuclear fractions.
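The grouping of paired-end tags by overlapping 5′ ends can be sketched as an interval-merging pass. The Python example below is a minimal illustration with assumed conventions: each pair is a hypothetical tuple `(chrom, strand, five_start, five_end, three_start, three_end)` with half-open coordinates, and two 5′ reads belong to the same cluster when their intervals overlap by at least one base on the same chromosome and strand.

```python
from collections import defaultdict

def cagescan_clusters(pairs):
    """Group mate pairs whose 5' reads overlap on the same strand.
    Returns a list of dicts holding the merged 5' interval and the
    3' read intervals that "scan" the transcript downstream."""
    by_strand = defaultdict(list)
    for chrom, strand, f_start, f_end, t_start, t_end in pairs:
        by_strand[(chrom, strand)].append((f_start, f_end, t_start, t_end))

    clusters = []
    for (chrom, strand), group in sorted(by_strand.items()):
        current = None
        for f_start, f_end, t_start, t_end in sorted(group):
            if current is not None and f_start < current["five"][1]:
                # 5' read overlaps the running cluster: extend it
                current["five"] = (current["five"][0],
                                   max(current["five"][1], f_end))
                current["three"].append((t_start, t_end))
            else:
                current = {"chrom": chrom, "strand": strand,
                           "five": (f_start, f_end),
                           "three": [(t_start, t_end)]}
                clusters.append(current)
    return clusters

def mean_reads_per_cluster(clusters):
    """Average number of mate pairs per cluster."""
    return sum(len(c["three"]) for c in clusters) / len(clusters)
```

Each resulting cluster keeps the list of 3′ read intervals, which is what allows the downstream structure of transcripts sharing a promoter to be inspected.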
By aligning mate pairs to all exon-exon junction combinations of each transcript, we uncovered 8,462 splice junctions linked to 11,964 TSS. This also revealed the existence of 312 exon skipping events amongst 297 independent transcripts. Furthermore, clustering paired-end tags uncovered 1,569 CAGEscan clusters that initiated within the 5′ UTR of a given transcript and reached into the next downstream independent gene model, in 1,198 pairs of distinct consecutive transcripts.
Pervasive and regulated transcription of retrotransposon elements (RE) has been observed in total RNA extracts using CAGE24. Using CAGEscan, we observed that expressed RE are more abundant in the nuclear RNA fractions than in the cytoplasmic ones. Globally, long and short interspersed nuclear elements (LINEs, SINEs) were the most highly expressed RE in HepG2, followed by long terminal repeats (LTRs). All three were strongly over-represented in the nuclear poly-A− fraction (Fig. 5a–c). As expected, srpRNA (signal recognition particle) repeats were enriched in the cytoplasmic compartment (Fig. 5d).
So far, an unbiased analysis of the transcriptome and promoter usage from challenging samples such as biopsies, homogeneous populations of ex vivo cell types or subcellular compartments has been hampered by the low quantity of RNA and by the use of fixatives that are detrimental to RNA integrity. The simplicity of both nanoCAGE and CAGEscan, combined with decreasing sequencing cost, opens the possibility for a truly high-throughput production of pooled, multiplexed libraries followed by parallel sequencing, with applications ranging from drug screening and biopsy analysis to whole-transcriptome association studies. Therefore, we expect nanoCAGE to become the technique of choice for micro-dissected samples in experimental biology and molecular pathology. Identification of novel 5′ ends that are compartment-specific demonstrates the need and usefulness of nanoCAGE. As an added advantage, the fixed-length tags generated by nanoCAGE can easily be turned into concatemers that will be advantageous when sequenced with long-read high-throughput sequencers.
By linking TSSs to downstream sequences, CAGEscan provides insights into the architecture of transcripts and thus into their possible functions. Although the ability to fully scan the 3′ end of long transcripts is currently limited by the paired-end read length range that can be simultaneously sequenced, we believe that further development of sequencing technology will overcome this limitation. CAGEscan profoundly differs from traditional inferences based on gene models such as RefSeq, which fail to grasp the complexity of the transcriptional landscape. CAGEscan analysis is data-driven and hypothesis-free, which allowed us to find non-coding RNAs, evidence of transcriptional read-through between neighboring loci, as well as novel forms of protein-coding genes. The expression level of CAGEscan promoters is indicated by the frequency of the 5′ read of the mate pairs. Furthermore, CAGEscan offers a unique perspective into the relations of non-coding RNAs to the neighboring genomic/transcriptomic elements. Such novel transcription maps will be instrumental in identifying the functions of novel ncRNAs that overlap regulatory regions and are likely to regulate transcription25 or processing and recapping of RNAs26.
This work was funded by a grant of the 6th Framework of the European Union commission to the Neuro Functional Genomics consortium, by a grant of the 7th Framework to PC and SG (Dopaminet), a Grant-in-Aid for Scientific Research (A) No.20241047 for PC and a Research Grant for RIKEN Omics Science Center from MEXT to YH. Work in this project is also partially supported by the National Human Genome Research Institute grant U54 HG004557. CP was supported by the Japanese Society for the Promotion of Science long-term fellowship number P05880. SG was funded by a career developmental award from “The Giovanni Armenise-Harvard Foundation”. We thank Alistair Forrest for critical discussions and Mylene Josserand for experimental assistance.
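Aligning reads to all exon-exon junction combinations can be illustrated by building a small junction library per transcript. The sketch below is a simplified Python illustration, not the alignment procedure used here: it uses exact substring matching instead of a read aligner, and the function names, the `flank` length and the `min_overhang` requirement are all assumptions.

```python
from itertools import combinations

def junction_library(exon_seqs, flank=20):
    """Build one junction sequence for every exon pair (i < j) of a
    transcript: the last `flank` bases of exon i joined to the first
    `flank` bases of exon j. Returns {(i, j): (sequence, offset)},
    where offset is the position of the junction in the sequence."""
    library = {}
    for i, j in combinations(range(len(exon_seqs)), 2):
        left = exon_seqs[i][-flank:]
        right = exon_seqs[j][:flank]
        library[(i, j)] = (left + right, len(left))
    return library

def supported_junctions(read, library, min_overhang=2):
    """Junctions that the read spans with at least `min_overhang`
    bases on each side (exact matching, for illustration only)."""
    hits = []
    for junction, (seq, offset) in library.items():
        start = seq.find(read)
        while start != -1:
            if (start + min_overhang <= offset
                    and start + len(read) >= offset + min_overhang):
                hits.append(junction)
                break
            start = seq.find(read, start + 1)
    return hits

def is_exon_skipping(junction):
    """A junction between non-consecutive exons skips at least one exon."""
    i, j = junction
    return j - i > 1
```

A read that matches the junction between exons 0 and 2 of a three-exon model, for example, would be reported by `supported_junctions` as `(0, 2)` and flagged by `is_exon_skipping`.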
Statement of competing interest
CP, PC and RS are inventors of the Japanese patent application held by RIKEN on the moderately suppressive PCR step of the nanoCAGE protocol.
Author contributions
CP, RS and PC conceived the nanoCAGE technology. CP and PC conceived the CAGEscan technology. CP, HT, RS, MS and SO designed and performed the experiments. CP, NB, HT, TL and MV analyzed the data and interpreted the results. CP, NB, SG and PC supervised the study. DL, NH, VO, IB, HG, JD, PK, HW, CAD and TRG provided material. JS provided software. JK, YH, SG and PC provided salary support. The text and figures were drafted by PC, NB, HT, and MV, and edited by CP, NB, TL, COD and PC.