The FANTOM5 project investigates transcription initiation activities in more than 1,000 human and mouse primary cells, cell lines and tissues using CAGE. Based on manual curation of sample information and development of an ontology for sample classification, we assemble the resulting data into a centralized data resource (http://fantom.gsc.riken.jp/5/). This resource contains web-based tools and data-access points for the research community to search and extract data related to samples, genes, promoter activities, transcription factors and enhancers across the FANTOM5 atlas.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0560-6) contains supplementary material, which is available to authorized users.
MicroRNAs are small non-coding RNAs that inhibit the translation of target mRNAs. In humans, most microRNAs are transcribed by RNA polymerase II as long primary transcripts and processed through sequential cleavage by the two RNase III enzymes, DROSHA and DICER, into precursor and mature microRNAs, respectively. Although the fundamental functions of microRNAs in RNA silencing have been gradually uncovered, less is known about the regulatory mechanisms of microRNA expression. Here, we report that telomerase reverse transcriptase (TERT) extensively affects the expression levels of mature microRNAs. Deep sequencing-based screens of short RNA populations revealed that the suppression of TERT resulted in the downregulation of microRNAs expressed in THP-1 cells and HeLa cells. Primary and precursor microRNA levels were also reduced under the suppression of TERT. Similar results were obtained with the suppression of either BRG1 (also called SMARCA4) or nucleostemin, which are proteins that interact with TERT and function beyond telomeres. These results suggest that TERT regulates microRNAs at the very early phases of their biogenesis, presumably through non-telomerase mechanism(s).
telomerase reverse transcriptase; microRNA; RNA-dependent RNA polymerase; cancer
The mesencephalic dopaminergic (mDA) cell system is composed of two major groups of projecting cells in the Substantia Nigra (SN) (A9 neurons) and the Ventral Tegmental Area (VTA) (A10 cells). Selective degeneration of A9 neurons occurs in Parkinson’s disease (PD) while abnormal function of A10 cells has been linked to schizophrenia, attention deficit and addiction. The molecular basis that underlies selective vulnerability of A9 and A10 neurons is presently unknown.
By taking advantage of transgenic labeling and laser capture microdissection coupled to nano Cap-Analysis of Gene Expression (nanoCAGE) technology on isolated A9 and A10 cells, we found that a subset of Olfactory Receptors (ORs) is expressed in mDA neurons. Gene expression analysis was integrated with the FANTOM5 Helicos CAGE sequencing datasets, showing the presence of these ORs in selected tissues and brain areas outside of the olfactory epithelium. OR expression in the mesencephalon was validated by RT-PCR and in situ hybridization. By screening 16 potential ligands on 5 mDA ORs recombinantly expressed in a heterologous in vitro system, we identified carvone enantiomers as agonists at Olfr287 that are able to evoke an intracellular Ca2+ increase in solitary mDA neurons. ORs were found to be expressed in human SN and down-regulated in PD post mortem brains.
Our study indicates that mDA neurons express ORs and respond to odor-like molecules, providing new opportunities for pharmacological intervention in disease.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-729) contains supplementary material, which is available to authorized users.
NanoCAGE; Odors; Odorant receptors; Dopaminergic neurons; Ventral midbrain
Most RNA molecules are co- or post-transcriptionally modified to alter their chemical and functional properties to assist in their ultimate biological function. Among these modifications, the addition of a 5′ cap structure has been found to regulate turnover and localization. Here we report a study of the cap structure of human short (<200 nt) RNAs (sRNAs), using sequencing of cDNA libraries prepared by enzymatic pretreatment of the sRNAs with cap-sensitive specificity, thin layer chromatographic (TLC) analyses of isolated cap structures and mass spectrometric analyses for validation of the TLC analyses. Processed versions of snoRNA and tRNA sequences of less than 50 nt were observed in capped sRNA libraries, indicating additional processing and recapping of these annotated sRNA biotypes. We report for the first time 2,7 dimethylguanosine in human sRNA cap structures and, surprisingly, we find multiple type 0 cap structures (mGpppC, 7mGpppG, GpppG, GpppA, and 7mGpppA) in RNA length fractions shorter than 50 nt. Finally, we find additional uncharacterized cap structures that await determination, which will require the creation of reference compounds for use in TLC analyses. These studies suggest the existence of novel biochemical pathways leading to the processing of primary RNAs and sRNAs and the modification of their 5′ ends with a spectrum of chemical modifications.
We provide here a protocol for the preparation of cap-analysis gene expression (CAGE) libraries, which allow the expression of eukaryotic capped RNAs to be measured and the promoter regions to be mapped simultaneously. The presented protocol simplifies the previously published ones and, moreover, produces tags that are 27 nucleotides long, which facilitates mapping to the genome. The protocol takes less than 5 days to complete and presents a notable improvement compared to previously published versions.
Cap-analysis gene expression; RNAseq; transcriptome; sequencing; RNA
Cap-Analysis gene expression (CAGE) provides accurate high-throughput measurement of RNA expression. CAGE allows mapping of all the initiation sites of both capped coding and noncoding RNAs. In addition, transcriptional start sites (TSSs) within promoters are characterized at single nucleotide resolution. The latter allows the regulatory inputs driving gene expression to be studied, which in turn enables the construction of transcriptional networks. Here we provide an optimized protocol for the construction of CAGE libraries based on the preparation of 27 nucleotide (nt) long tags corresponding to initial bases at the 5’ ends of capped RNAs. We have optimized the methods using simple steps based on filtration, which altogether takes 4 days to complete. The CAGE tags can be readily sequenced with Illumina sequencers and upon modification, they are also amenable to sequencing using other platforms.
Cap-analysis gene expression (CAGE); transcriptome; promoter; sequencing; RNA
RNAi pathways have evolved as important modulators of gene expression that act in the cytoplasm by degrading RNA target molecules via the activity of short (21-30 nt) RNAs1-6. RNAi components have been reported to play a role in the nucleus, as they are involved in epigenetic regulation and heterochromatin formation7-10. However, although RNAi-mediated post-transcriptional gene silencing (PTGS) is well documented, the mechanisms of RNAi-mediated transcriptional gene silencing (TGS), and in particular the role of RNAi components in chromatin, especially in higher eukaryotes, are still elusive. Here we show that the key RNAi components Dicer-2 (Dcr2) and Argonaute-2 (AGO2) associate with chromatin, with a strong preference for euchromatic, transcriptionally active loci, and interact with the core transcription machinery. Notably, Dcr2 and AGO2 loss of function shows that transcriptional defects are accompanied by perturbation of Pol II positioning on promoters. Further, both Dcr2 and AGO2 null mutations, as well as missense mutations compromising the RNAi activity, impair global Pol II dynamics upon heat shock. Finally, AGO2 RIP-seq experiments reveal that AGO2 is strongly enriched in small RNAs encompassing the promoter as well as other parts of heat shock and other gene loci on both the sense and antisense strands, with a strong bias for antisense, particularly after heat shock. Taken together, our results reveal a new scenario in which Dcr2 and AGO2 are globally associated with transcriptionally active loci and may play a pivotal role in shaping the transcriptome by controlling RNA Pol II processivity.
Cap analysis of gene expression (CAGE) is a sequencing-based technology to capture the 5’ ends of RNAs in a biological sample. After mapping, a CAGE peak on the genome indicates the position of an active transcriptional start site (TSS), and the number of reads corresponds to its expression level. CAGE is prominently used in both the FANTOM and ENCODE projects, but presently there is no software package to perform the essential data processing steps.
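The relationship between mapped reads, TSS positions and expression levels can be illustrated with a minimal Python sketch. It is not part of MOIRAI or any published pipeline; the BED-like tuple layout (0-based, half-open coordinates), the function names and the tags-per-million normalisation are assumptions made for illustration only.

```python
from collections import Counter


def ctss_counts(alignments):
    """Count mapped reads per 5' end position.

    `alignments` is an iterable of (chrom, start, end, strand) tuples in
    0-based, half-open coordinates (an assumed, BED-like layout).
    """
    counts = Counter()
    for chrom, start, end, strand in alignments:
        # The 5' end of a plus-strand read is its leftmost base; for a
        # minus-strand read it is the rightmost base (end - 1).
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime, strand)] += 1
    return counts


def tags_per_million(counts):
    """Scale raw counts by library size (one common, assumed normalisation)."""
    total = sum(counts.values())
    return {site: 1e6 * n / total for site, n in counts.items()}


# Toy input: four hypothetical 27-nt reads.
reads = [
    ("chr1", 100, 127, "+"),
    ("chr1", 100, 127, "+"),
    ("chr1", 101, 128, "+"),
    ("chr1", 473, 500, "-"),
]
counts = ctss_counts(reads)
tpm = tags_per_million(counts)
```

In this toy input, two reads share the same 5' end on the plus strand, so that position receives the highest count; the minus-strand read contributes to the position of its rightmost aligned base.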
Here we describe MOIRAI, a compact yet flexible workflow system designed to carry out the main steps in data processing and analysis of CAGE data. MOIRAI has a graphical interface allowing wet-lab researchers to create, modify and run analysis workflows. Embedded within the workflows are graphical quality control indicators allowing users to assess data quality and to quickly spot potential problems. We will describe three main workflows allowing users to map, annotate and perform an expression analysis over multiple samples.
Owing to its many built-in quality control features, MOIRAI is especially suitable for supporting the development of new sequencing-based protocols.
The MOIRAI source code is freely available at
CAGE; Pipeline; Next generation sequencing
Next-generation sequencing-based technologies are being extensively used to study transcriptomes. Among these, cap analysis of gene expression (CAGE) is specialized in detecting the extreme 5’ ends of RNA molecules. After mapping the sequenced reads back to a reference genome, CAGE data highlight the transcriptional start sites (TSSs) and their usage at single-nucleotide resolution.
We propose a pipeline to group the single-nucleotide TSSs into larger reproducible peaks and compare their usage across biological states. Importantly, our pipeline discovers broad peaks as well as the fine structure of individual transcriptional start sites embedded within them. We assess the performance of our approach on a large CAGE dataset including 156 primary cell types and two cell lines with biological replicates. We demonstrate that genes have complicated structures of transcription initiation events. In particular, we discover that narrow peaks embedded in broader regions of transcriptional activity can be differentially used even if the larger region is not.
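As a rough illustration of what grouping single-nucleotide TSSs into peaks means, the sketch below uses plain distance-based clustering. This is a deliberately simpler technique than the pipeline described above: it does not assess reproducibility and does not recover narrow peaks nested inside broad ones. The `max_gap` parameter, the input layout and the function name are assumptions.

```python
def cluster_tss(sites, max_gap=20):
    """Merge TSS positions lying within `max_gap` bp of each other.

    `sites` is a list of (position, read_count) pairs from a single
    chromosome and strand (an assumed input layout).
    """
    peaks = []
    for pos, count in sorted(sites):
        if peaks and pos - peaks[-1]["end"] <= max_gap:
            # Close enough to the current peak: extend it.
            peak = peaks[-1]
            peak["end"] = pos
            peak["count"] += count
            if count > peak["summit_count"]:
                peak["summit"], peak["summit_count"] = pos, count
        else:
            # Too far from the previous site: start a new peak.
            peaks.append({
                "start": pos, "end": pos, "count": count,
                "summit": pos, "summit_count": count,
            })
    return peaks


# Toy input: two groups of hypothetical TSS positions.
sites = [(100, 5), (103, 20), (110, 2), (500, 7), (505, 1)]
peaks = cluster_tss(sites)
```

Each resulting peak records its span, its total read count and its most-used position (the summit). Comparing usage across biological states would then operate on these per-peak counts.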
By examining the reproducible fine scaled organization of TSS we can detect many differentially regulated peaks undetected by previous approaches.
CAGE; Peak finding; Reproducibility; Hierarchical stability
Deciphering the most common modes by which chromatin regulates transcription, and how this is related to cellular status and processes is an important task for improving our understanding of human cellular biology. The FANTOM5 and ENCODE projects represent two independent large scale efforts to map regulatory and transcriptional features to the human genome. Here we investigate chromatin features around a comprehensive set of transcription start sites in four cell lines by integrating data from these two projects.
Transcription start sites can be distinguished by chromatin states defined by specific combinations of both chromatin mark enrichment and the profile shapes of these chromatin marks. The observed patterns can be associated with cellular functions and processes, and they also show association with expression level, location relative to nearby genes, and CpG content. In particular we find a substantial number of repressed inter- and intra-genic transcription start sites enriched for active chromatin marks and Pol II, and these sites are strongly associated with immediate-early response processes and cell signaling. Associations between start sites with similar chromatin patterns are validated by significant correlations in their global expression profiles.
The results confirm the link between chromatin state and cellular function for expressed transcripts, and also indicate that active chromatin states at repressed transcripts may poise transcripts for rapid activation during immune response.
Fantom; Encode; Cage; Transcription start sites; Chromatin states; Gene expression
By coupling laser capture microdissection to nanoCAGE technology and next-generation sequencing, we have identified the genome-wide collection of active promoters in the mouse Main Olfactory Epithelium (MOE). Transcription start sites (TSSs) for the large majority of Olfactory Receptors (ORs) have been previously mapped, increasing our understanding of their promoter architecture. Here we show that in our nanoCAGE libraries of the mouse MOE we detect a large number of tags mapped in loci hosting Type-1 and Type-2 Vomeronasal Receptor genes (V1Rs and V2Rs). These loci also show a massive expression of Long Interspersed Nuclear Elements (LINEs). We have validated the expression of selected receptors detected by nanoCAGE with in situ hybridization, RT-PCR and qRT-PCR. This work extends the repertoire of receptors capable of sensing chemical signals in the MOE, suggesting intriguing interplays between MOE and VNO for pheromone processing and positioning transcribed LINEs as candidate regulatory RNAs for VR expression.
vomeronasal receptors; main olfactory epithelium; vomeronasal organ; VNO; MOE; V1Rs; V2Rs
Brain function is shaped by postnatal experience and vulnerable to disruption of Methyl-CpG-binding protein, Mecp2, in multiple neurodevelopmental disorders. How Mecp2 contributes to the experience-dependent refinement of specific cortical circuits and their impairment remains unknown. We analyzed vision in gene-targeted mice and observed an initial normal development in the absence of Mecp2. Visual acuity then rapidly regressed after postnatal day P35–40 and cortical circuits largely fell silent by P55-60. Enhanced inhibitory gating and an excess of parvalbumin-positive, perisomatic input preceded the loss of vision. Both cortical function and inhibitory hyperconnectivity were strikingly rescued independent of Mecp2 by early sensory deprivation or genetic deletion of the excitatory NMDA receptor subunit, NR2A. Thus, vision is a sensitive biomarker of progressive cortical dysfunction and may guide novel, circuit-based therapies for Mecp2 deficiency.
PROM1 is the gene encoding prominin-1 or CD133, an important cell surface marker for the isolation of both normal and cancer stem cells. PROM1 transcripts initiate at a range of transcription start sites (TSS) associated with distinct tissue and cancer expression profiles. Using high resolution Cap Analysis of Gene Expression (CAGE) sequencing we characterize TSS utilization across a broad range of normal and developmental tissues. We identify a novel proximal promoter (P6) within CD133+ melanoma cell lines and stem cells. Additional exon array sampling finds P6 to be active in populations enriched for mesenchyme, neural stem cells and within CD133+ enriched Ewing sarcomas. The P6 promoter is enriched with respect to previously characterized PROM1 promoters for a HMGI/Y (HMGA1) family transcription factor binding site motif and exhibits different epigenetic modifications relative to the canonical promoter region of PROM1.
PROM1 protein; human; AC133 antigen; transcription start site; promoter regions; genetic; melanoma; cancer stem cells
Changes in environmental conditions lead to expression variation that manifests at the level of gene regulatory networks. Despite a strong understanding of the role noise plays in synthetic biological systems, it remains unclear how propagation of expression heterogeneity in an endogenous regulatory network is distributed and utilized by cells transitioning through a key developmental event.
Here we investigate the temporal dynamics of a single-cell transcriptional network of 45 transcription factors in THP-1 human myeloid monocytic leukemia cells undergoing differentiation to macrophages. We systematically measure temporal regulation of expression and variation by profiling 120 single cells at eight distinct time points, and infer highly controlled regulatory modules through which signaling operates with stochastic effects. This reveals dynamic and specific rewiring as a cellular strategy for differentiation. The integration of both positive and negative co-expression networks further identifies the proto-oncogene MYB as a network hinge to modulate both the pro- and anti-differentiation pathways.
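To make the idea of positive and negative co-expression networks concrete, the following sketch builds both from pairwise Pearson correlations across single cells. It illustrates the general technique only and is not the inference procedure used in the study; the correlation threshold, the gene labels (`TF_A`, `TF_B`, `TF_C`) and the expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Split gene pairs into positively and negatively co-expressed edges.

    `expression` maps a gene name to its expression values across cells;
    `threshold` is an assumed cut-off on the absolute correlation.
    """
    positive, negative = [], []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            positive.append((g1, g2))
        elif r <= -threshold:
            negative.append((g1, g2))
    return positive, negative


# Hypothetical expression of three factors in four single cells.
expression = {
    "TF_A": [1.0, 2.0, 3.0, 4.0],
    "TF_B": [2.0, 4.0, 6.0, 8.0],
    "TF_C": [4.0, 3.0, 2.0, 1.0],
}
positive, negative = coexpression_edges(expression)
```

A gene that appears in edges of both lists, as `TF_A` and `TF_B` do here, sits between a positively and a negatively correlated group, which is the kind of position the text describes for a network hinge.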
Compared to averaged cell populations, temporal single-cell expression profiling provides a much more powerful technique to probe for mechanistic insights underlying cellular differentiation. We believe that our approach will form the basis of novel strategies to study the regulation of transcription at a single-cell level.
Analyzing the RNA pool or transcription start sites requires effective means to convert RNA into cDNA libraries for digital expression counting. With current high-speed sequencers, it is necessary to flank the cDNAs with specific adapters. Adding template-switching oligonucleotides to reverse transcription reactions is the most commonly used approach when working with very small quantities of RNA even from single cells.
Here we compared the performance of DNA-RNA, DNA-LNA and DNA oligonucleotides in template-switching during nanoCAGE library preparation. Test libraries from rat muscle and HeLa cell RNA were prepared in technical triplicates and sequenced for comparison of the gene coverage and distribution of the reads within transcripts. The DNA-RNA oligonucleotide showed the highest specificity for capped 5′ ends of mRNA, whereas the DNA-LNA provided similar gene coverage with more reads falling within exons.
While confirming the cap-specific preference of DNA-RNA oligonucleotides in template-switching reactions, our data indicate that DNA-LNA hybrid oligonucleotides could potentially find other applications in random RNA sequencing.
CAGE; Template-switching; LNA; Transcriptome; Quantitative sequencing
Efficient isolation of specific, intact, living neurons from the adult brain is problematic due to the complex nature of the extracellular matrix consolidating the neuronal network. Here, we present significant improvements to the protocol for isolation of pure populations of neurons from mature postnatal mouse brain using fluorescence activated cell sorting (FACS). The 10-fold increase in cell yield enables cell-specific transcriptome analysis by protocols such as nano-CAGE and RNA seq.
FACS; parvalbumin; pyramidal; nanoCAGE; RNA seq
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific sub-cellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic sub-cellular localizations are also poorly understood. Since RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modifications and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
LINE-1 (L1) retrotransposons are mobile genetic elements comprising ∼17% of the human genome. New L1 insertions can profoundly alter gene function and cause disease, though their significance in cancer remains unclear. Here, we applied enhanced retrotransposon capture sequencing (RC-seq) to 19 hepatocellular carcinoma (HCC) genomes and elucidated two archetypal L1-mediated mechanisms enabling tumorigenesis. In the first example, 4/19 (21.1%) donors presented germline retrotransposition events in the tumor suppressor mutated in colorectal cancers (MCC). MCC expression was ablated in each case, enabling oncogenic β-catenin/Wnt signaling. In the second example, suppression of tumorigenicity 18 (ST18) was activated by a tumor-specific L1 insertion. Experimental assays confirmed that the L1 interrupted a negative feedback loop by blocking ST18 repression of its enhancer. ST18 was also frequently amplified in HCC nodules from Mdr2−/− mice, supporting its assignment as a candidate liver oncogene. These proof-of-principle results substantiate L1-mediated retrotransposition as an important etiological factor in HCC.
► L1 retrotransposons promote tumorigenesis in hepatocellular carcinoma (HCC)
► Germline L1 and Alu insertions in MCC activate β-catenin/Wnt signaling
► L1 mobilization in tumor cells accelerates transformation of the HCC genome
► A tumor-specific L1 insertion interrupts a negative feedback loop regulating ST18
L1 retrotransposons, which are widespread in the human genome, can mobilize and activate oncogenes in the livers of individuals infected with the hepatitis B or hepatitis C virus, promoting the development and growth of hepatocellular carcinoma. Genes identified by the L1 insertions present new options for cancer screening and intervention.
Non-coding RNAs (ncRNAs) are involved in an increasing number of cellular events1. Some ncRNAs are processed by DICER and DROSHA ribonucleases to give rise to small double-stranded RNAs involved in RNA interference (RNAi)2. The DNA-damage response (DDR) is a signaling pathway that originates from the DNA lesion and arrests cell proliferation3. So far, DICER or DROSHA RNA products have not been reported to control DDR activation. Here we show that DICER and DROSHA, but not downstream elements of the RNAi pathway, are necessary to activate DDR upon oncogene-induced genotoxic stress and exogenous DNA damage, as studied also by DDR foci formation in mammalian cells and zebrafish and by checkpoint assays. DDR foci are sensitive to RNase A treatment, and DICER- and DROSHA-dependent RNA products are required to restore DDR foci in treated cells. Through RNA deep sequencing and studies of DDR activation at an inducible unique DNA double-strand break (DSB), we demonstrate that DDR foci formation requires site-specific DICER- and DROSHA-dependent small RNAs, named DDRNAs, which act in a MRE11-RAD50-NBS1 (MRN) complex-dependent manner. Chemically synthesized or in vitro-generated by DICER cleavage, DDRNAs are sufficient to restore DDR in RNase A-treated cells, also in the absence of other cellular RNAs. Our results describe an unanticipated direct role of a novel class of ncRNAs in the control of DDR activation at sites of DNA damage.
DICER; DROSHA; small non coding RNAs; DNA damage response (DDR); ATM; cellular senescence; zebrafish
Mutations in the MECP2 gene are found in a large proportion of girls with Rett Syndrome. Despite extensive research, the principal role of MeCP2 protein remains elusive. Is MeCP2 a regulator of genes, acting in concert with co-activators and co-repressors, predominantly as an activator of target genes, or is it a methyl CpG binding protein acting globally to change the chromatin state and to suppress transcription from repeat elements? If MeCP2 has no specific targets in the genome, what causes the differential expression of specific genes in the Mecp2 knockout mouse brain? We discuss the discrepancies in current data and propose a hypothesis to reconcile some differences in the two viewpoints. Since transcripts from repeat elements contribute to piRNA biogenesis, we propose that piRNA levels may be higher in the absence of MeCP2 and that increased piRNA levels may contribute to the mis-regulation of some genes seen in the Mecp2 knockout mouse brain. We provide preliminary data showing an increase in piRNAs in the Mecp2 knockout mouse cerebellum. Our investigation suggests that global piRNA levels may be elevated in the Mecp2 knockout mouse cerebellum and strongly supports further investigation of piRNAs in Rett syndrome.
Rett Syndrome; MeCP2; piRNAs; LINE 1; short RNAs
Template switching (TS) is an inherent mechanism of reverse transcriptase that has been exploited in several transcriptome analysis methods, such as CAGE, RNA-Seq and short RNA sequencing. TS is an attractive option given the simplicity of the protocol, which does not require an adaptor-mediated step and thus minimizes sample loss. As such, it has been used in several studies that deal with limited amounts of RNA, such as single-cell studies. TS has also been used to introduce DNA barcodes or indexes into different samples, cells or molecules. This labeling allows several samples to be pooled into one sequencing flow cell, which increases the data throughput of sequencing and takes advantage of the increasing throughput of current sequencers. Here, we report TS artifacts that form owing to a process called strand invasion. Because of the way in which barcodes/indexes are introduced by TS, strand invasion becomes more problematic by introducing unsystematic biases. We describe a strategy that eliminates these artifacts in silico and propose an experimental solution that suppresses biases.
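The abstract does not spell out the in silico strategy, so the sketch below shows only one plausible form such a filter could take, under stated assumptions: a read is flagged as a likely strand-invasion artifact when the genomic bases immediately upstream of its 5' end closely match the 3' tail of the template-switching oligonucleotide (barcode plus GGG). The tail sequence `ACAGGG`, the mismatch tolerance and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def upstream_bases(chrom_seq, five_prime, strand, length):
    """Genomic bases just upstream of a read's 5' end, in read orientation.

    `five_prime` is the 0-based position of the read's first base.
    Returns None when the window would run off the sequence.
    """
    if strand == "+":
        start = five_prime - length
        if start < 0:
            return None
        return chrom_seq[start:five_prime]
    end = five_prime + 1 + length
    if end > len(chrom_seq):
        return None
    return reverse_complement(chrom_seq[five_prime + 1:end])


def is_strand_invasion(upstream, oligo_tail, max_mismatches=1):
    """Flag a read whose upstream sequence resembles the oligo's 3' tail."""
    if upstream is None or len(upstream) != len(oligo_tail):
        return False
    mismatches = sum(a != b for a, b in zip(upstream.upper(), oligo_tail.upper()))
    return mismatches <= max_mismatches


# Hypothetical oligo tail: barcode "ACA" followed by "GGG".
oligo_tail = "ACAGGG"
plus_chrom = "TTTTACAGGGCATCATCAT"
flag_artifact = is_strand_invasion(
    upstream_bases(plus_chrom, 10, "+", len(oligo_tail)), oligo_tail)
flag_genuine = is_strand_invasion(
    upstream_bases(plus_chrom, 7, "+", len(oligo_tail)), oligo_tail)
minus_chrom = "ATGATGCCCTGT"
flag_minus = is_strand_invasion(
    upstream_bases(minus_chrom, 5, "-", len(oligo_tail)), oligo_tail)
```

Because the oligo tail includes the barcode, the set of genomic positions prone to such matches differs between barcodes, which is consistent with the barcode-dependent biases described above.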
Retrotransposons are mobile genetic elements that employ a germ line “copy-and-paste” mechanism to spread throughout metazoan genomes1. At least 50% of the human genome is derived from retrotransposons, with three active families (L1, Alu and SVA) associated with insertional mutagenesis and disease2-3. Epigenetic and post-transcriptional suppression block retrotransposition in somatic cells4-5, excluding early embryo development and some malignancies6-7. Recent reports of L1 expression8-9 and copy number variation10-11 (CNV) in the human brain suggest L1 mobilization may also occur during later development. However, the corresponding integration sites have not been mapped. Here we apply a high-throughput method to identify numerous L1, Alu and SVA germ line mutations, as well as 7,743 putative somatic L1 insertions in the hippocampus and caudate nucleus of three individuals. Surprisingly, we also found 13,692 and 1,350 somatic Alu and SVA insertions, respectively. Our results demonstrate that retrotransposons mobilize to protein-coding genes differentially expressed and active in the brain. Thus, somatic genome mosaicism driven by retrotransposition may reshape the genetic circuitry that underpins normal and abnormal neurobiological processes.
Cap analysis of gene expression (CAGE) is a 5′ sequence tag technology to globally determine transcriptional starting sites in the genome and their expression levels and has most recently been adapted to the HeliScope single molecule sequencer. Despite significant simplifications in the CAGE protocol, it has until now been a labour intensive protocol.
In this study we set out to adapt the protocol to a robotic workflow, which would increase throughput and reduce handling. The automated CAGE cDNA preparation system we present here can prepare 96 ‘HeliScope ready’ CAGE cDNA libraries in 8 days, as opposed to 6 weeks by a manual operator. We compare the results obtained using the same RNA in manual libraries and across multiple automation batches to assess reproducibility.
We show that the sequencing was highly reproducible and comparable to manual libraries, with an 8-fold increase in productivity. The automated CAGE cDNA preparation system can prepare 96 CAGE sequencing samples simultaneously. Finally, we discuss how the system could be used for CAGE on Illumina/SOLiD platforms, RNA-seq and full-length cDNA generation.