Non-coding RNA (ncRNA) transcripts are RNA molecules that do not code for proteins but exert their function through other mechanisms. The vast majority of RNA produced in a cell is non-coding ribosomal RNA, produced from relatively few loci; however, more recent complementary DNA (cDNA) cloning, tag sequencing, and genome tiling array studies suggest that ncRNAs also account for the majority of RNA species produced by a cell. ncRNA-based regulation has been referred to as a ‘hidden layer’ of signals or ‘dark matter’ that controls gene expression in cellular processes by poorly described mechanisms. These terms arose because ncRNAs were, until recently, ignored by expression profiling and cDNA annotation projects, and because their modes of action are diverse (e.g. influencing chromatin structure and epigenetics, translational silencing, transcriptional silencing). Here, we highlight recent functional genomics strategies for identifying and assigning function to ncRNA transcription.
non-coding RNA; Sequencing; transcription; annotation
The analysis of CAGE (Cap Analysis of Gene Expression) time-course has been proposed
by the FANTOM5 Consortium to extend the understanding of the sequence of events
facilitating cell state transition at the level of promoter regulation. To identify
the most prominent transcriptional regulations induced by growth factors in human
breast cancer, we apply here the Complexity Invariant Dynamic Time Warping motif
EnRichment (CIDER) analysis approach to the CAGE time-course datasets of MCF-7 cells
stimulated by epidermal growth factor (EGF) or heregulin (HRG). We identify a
multi-level cascade of regulations rooted by the Serum Response Factor (SRF)
transcription factor, connecting the MAPK-mediated transduction of the HRG stimulus
to the negative regulation of the MAPK pathway by the members of the DUSP family
phosphatases. The finding confirms the known primary role of FOS and FOSL1, members
of AP-1 family, in shaping gene expression in response to HRG induction. Moreover,
we identify a new potential regulation of DUSP5 and RARA (known to antagonize the
transcriptional regulation induced by the estrogen receptors) by the activity of the
AP-1 complex, specific to HRG response. The results indicate that a divergence in
AP-1 regulation determines cellular changes of breast cancer cells stimulated by
Understanding the normal state of human tissue transcriptome profiles is essential for recognizing tissue disease states and identifying disease markers. Recently, the Human Protein Atlas and the FANTOM5 consortium have each published extensive transcriptome data for human samples using Illumina-sequenced RNA-Seq and HeliScope-sequenced CAGE, respectively. Here, we report the first large-scale complex tissue transcriptome comparison between full-length and 5′-capped mRNA sequencing data. Overall gene expression correlation was high between the 22 corresponding tissues analyzed (R > 0.8). For genes ubiquitously expressed across all tissues, the two data sets showed high genome-wide correlation (91% agreement), with differences observed for a small number of individual genes indicating the need to update their gene models. Among the identified single-tissue enriched genes, up to 75% showed consensus of 7-fold enrichment in the same tissue in both methods, while another 17% exhibited multiple tissue enrichment and/or high expression variability in the other data set, likely dependent on the cell type proportions included in each tissue sample. Our results show that RNA-Seq and CAGE tissue transcriptome data sets are highly complementary for improving gene model annotations and highlight biological complexities within tissue transcriptomes. Furthermore, integration with image-based protein expression data is highly advantageous for understanding expression specificities for many genes.
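The two comparisons described above (per-tissue expression correlation and single-tissue enrichment) can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the abstract: expression values are log2-transformed with a pseudocount of 1 before a Pearson correlation is computed, and a gene counts as "enriched" when its top tissue is at least 7-fold above the second-highest tissue. The function names and the toy values are hypothetical.

```python
import math


def log_transform(values, pseudocount=1.0):
    """Log2-transform expression values (the pseudocount is an assumption)."""
    return [math.log2(v + pseudocount) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def enriched_tissue(expr_by_tissue, fold=7.0):
    """Return the tissue whose expression is >= `fold` times the
    second-highest tissue, or None if no tissue qualifies.
    Comparing against the runner-up is an assumed definition."""
    ranked = sorted(expr_by_tissue.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_val), (_, second_val) = ranked[0], ranked[1]
    if top_val > 0 and top_val >= fold * second_val:
        return top
    return None


# Toy values for one tissue measured by two methods (hypothetical numbers).
rnaseq = log_transform([0.0, 5.0, 120.0, 33.0])
cage = log_transform([1.0, 4.0, 150.0, 20.0])
r = pearson(rnaseq, cage)

# A gene is called consistently enriched if both methods name the same tissue.
gene_rnaseq = {"liver": 70.0, "kidney": 10.0, "lung": 5.0}
gene_cage = {"liver": 90.0, "kidney": 8.0, "lung": 2.0}
consensus = enriched_tissue(gene_rnaseq) == enriched_tissue(gene_cage) == "liver"
```

Comparing the enriched-tissue call from each data set, rather than the raw values, keeps the check independent of the different units the two technologies report.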
Classically or alternatively activated macrophages (M1 and M2, respectively) play distinct and important roles in microbicidal activity, regulation of inflammation and tissue homeostasis. Despite this, their transcriptional regulatory dynamics are poorly understood. Using promoter-level expression profiling by non-biased deepCAGE we have studied the transcriptional dynamics of classically and alternatively activated macrophages. Transcription factor (TF) binding motif activity analysis revealed four motifs, NFKB1_REL_RELA, IRF1,2, IRF7 and TBP, that are commonly activated but have distinct activity dynamics in M1 and M2 activation. We observe matching changes in the expression profiles of the corresponding TFs and show that only a restricted set of TFs change expression. There is an overall drastic and transient up-regulation in M1 and a weaker and more sustained up-regulation in M2. Novel TFs, such as Thap6 and Maff (M1) and Hivep1, Nfil3 and Prdm1 (M2), among others, were suggested to be involved in the activation processes. Additionally, 52 (M1) and 67 (M2) novel differentially expressed genes and, for the first time, several differentially expressed long non-coding RNA (lncRNA) transcriptome markers were identified. In conclusion, the finding of novel motifs, TFs and protein-coding and lncRNA genes is an important step toward fully understanding the transcriptional machinery of macrophage activation.
The immediate-early response mediates cell fate in response to a variety of extracellular stimuli and is dysregulated in many cancers. However, the specificity of the response across stimuli and cell types, and the roles of non-coding RNAs are not well understood. Using a large collection of densely-sampled time series expression data we have examined the induction of the immediate-early response in unparalleled detail, across cell types and stimuli. We exploit cap analysis of gene expression (CAGE) time series datasets to directly measure promoter activities over time. Using a novel analysis method for time series data we identify transcripts with expression patterns that closely resemble the dynamics of known immediate-early genes (IEGs) and this enables a comprehensive comparative study of these genes and their chromatin state. Surprisingly, these data suggest that the earliest transcriptional responses often involve promoters generating non-coding RNAs, many of which are produced in advance of canonical protein-coding IEGs. IEGs are known to be capable of induction without de novo protein synthesis. Consistent with this, we find that the response of both protein-coding and non-coding RNA IEGs can be explained by their transcriptionally poised, permissive chromatin state prior to stimulation. We also explore the function of non-coding RNAs in the attenuation of the immediate early response in a small RNA sequencing dataset matched to the CAGE data: We identify a novel set of microRNAs responsible for the attenuation of the IEG response in an estrogen receptor positive cancer cell line. Our computational statistical method is well suited to meta-analyses as there is no requirement for transcripts to pass thresholds for significant differential expression between time points, and it is agnostic to the number of time points per dataset.
Cells respond to stimuli through a set of genes that are primed for rapid activation. These genes, known as immediate-early genes (IEGs), are regulated at the level of transcription of the messenger RNA, and at subsequent RNA processing levels. These rapid responders are then rapidly switched off in normal cells. Immediate-early genes are involved in many cellular processes, including differentiation and proliferation, that are often dysregulated in cancer where they become continuously active. We characterise IEGs in a genome-wide sequencing dataset that captures their transcriptional response over time. Using a novel analysis technique, we identify both protein-coding and non-coding genes that are activated comparably to IEGs and investigate their properties. We examine how IEGs are switched off, including through microRNAs, small non-coding RNAs that act to control the level of key IEGs. We identify a novel set of microRNAs responsible for the attenuation of the IEG response in an estrogen receptor positive cancer cell line.
Analysis of the myeloid transcriptome by integrating 91 samples from the myeloid lineage and AML cell lines to predict novel regulatory interactions, enhancers, miRNAs, and lincRNAs.
The generation of myeloid cells from their progenitors is regulated at the level of transcription by combinatorial control of key transcription factors influencing cell-fate choice. To unravel the global dynamics of this process at the transcript level, we generated transcription profiles for 91 human cell types of myeloid origin by use of CAGE profiling. The CAGE sequencing of these samples has allowed us to investigate diverse aspects of transcription control during myelopoiesis, such as identification of novel transcription factors, miRNAs, and noncoding RNAs specific to the myeloid lineage. We further reconstructed a transcription regulatory network by clustering coexpressed transcripts and associating them with enriched cis-regulatory motifs. Using bidirectional expression as a proxy for enhancers, we predicted over 2000 novel enhancers, including an enhancer 38 kb downstream of IRF8 and an intronic enhancer in the KIT gene locus. Finally, we highlight the relevance of these data for dissecting transcription dynamics during progressive maturation of granulocyte precursors. A multifaceted analysis of the myeloid transcriptome is made available (www.myeloidome.roslin.ed.ac.uk). This high-quality dataset provides a powerful resource to study transcriptional regulation during myelopoiesis and to infer the likely functions of unannotated genes in human innate immunity.
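As an illustration of how bidirectional expression can serve as an enhancer proxy, the sketch below scores a candidate locus by the balance of CAGE tags on the two strands. The directionality formula D = (F − R) / (F + R), the cutoff of 0.8 and the minimum tag count are assumptions chosen for illustration; the abstract does not specify the criteria actually used.

```python
def directionality(forward_tags, reverse_tags):
    """Strand balance in [-1, 1]; 0 means perfectly bidirectional.
    Returns None when the locus has no tags at all."""
    total = forward_tags + reverse_tags
    if total == 0:
        return None
    return (forward_tags - reverse_tags) / total


def is_bidirectional(forward_tags, reverse_tags, max_abs_d=0.8, min_total=2):
    """Flag a locus as a candidate enhancer when transcription is
    roughly balanced between strands (thresholds are hypothetical)."""
    d = directionality(forward_tags, reverse_tags)
    if d is None:
        return False
    return (forward_tags + reverse_tags) >= min_total and abs(d) <= max_abs_d


# Toy loci: (forward-strand tags, reverse-strand tags), hypothetical counts.
loci = {"locus_a": (10, 8), "locus_b": (50, 1), "locus_c": (0, 0)}
candidates = [name for name, (f, r) in loci.items() if is_bidirectional(f, r)]
```

A strongly one-sided locus such as `locus_b` looks like a conventional promoter and is excluded, while a balanced locus such as `locus_a` is kept as a candidate.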
transcriptome; CAGE; hematopoiesis
Cap analysis of gene expression (CAGE) is a high-throughput method for transcriptome analysis that provides a single base-pair resolution map of transcription start sites (TSS) and their relative usage. Despite their high resolution and functional significance, published CAGE data are still underused in promoter analysis due to the absence of tools that enable their efficient manipulation and integration with other genome data types. Here we present CAGEr, an R implementation of novel methods for the analysis of differential TSS usage and promoter dynamics, integrated with CAGE data processing and promoterome mining into a first comprehensive CAGE toolbox on a common analysis platform. Crucially, we provide collections of TSSs derived from most published CAGE datasets, as well as direct access to the FANTOM5 resource of TSSs for numerous human and mouse cell/tissue types from within R, greatly increasing the accessibility of precise context-specific TSS data for integrative analyses. The CAGEr package is freely available from Bioconductor at http://www.bioconductor.org/packages/release/bioc/html/CAGEr.html.
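To make the idea of turning single-base TSS counts into promoter-level units concrete, here is a minimal Python sketch of simple distance-based clustering. It is not CAGEr's API or its implementation (CAGEr is an R package); the function name, the 20 bp gap and the toy counts are hypothetical, and the input is assumed to hold positions from one chromosome and one strand.

```python
def cluster_tss(tag_counts, max_gap=20):
    """Group TSS positions into clusters when neighbouring positions
    are at most `max_gap` bp apart.

    tag_counts: dict mapping genomic position -> CAGE tag count
                (single chromosome and strand assumed).
    Returns a list of dicts with start, end, total tags and the
    dominant (most used) TSS position of each cluster.
    """
    clusters = []
    for pos in sorted(tag_counts):
        count = tag_counts[pos]
        if clusters and pos - clusters[-1]["end"] <= max_gap:
            current = clusters[-1]
            current["end"] = pos
            current["tags"] += count
            # Track the position with the highest tag count in the cluster.
            if count > current["dominant_tags"]:
                current["dominant"] = pos
                current["dominant_tags"] = count
        else:
            clusters.append({
                "start": pos,
                "end": pos,
                "tags": count,
                "dominant": pos,
                "dominant_tags": count,
            })
    return clusters


# Toy input: three nearby TSSs and one distant TSS (hypothetical counts).
example = cluster_tss({100: 5, 105: 20, 110: 3, 500: 7})
```

Relative usage of each TSS within a cluster can then be read off as its count divided by the cluster's total tags.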
Antisense (AS) transcripts are RNA molecules that are transcribed from the opposite strand to sense (S) genes, forming S/AS pairs. The most prominent configuration is when a lncRNA is antisense to a protein-coding gene. Increasing evidence indicates that antisense transcription may control sense gene expression, acting at distinct regulatory levels. However, its contribution to brain function and neurodegenerative diseases remains unclear. We have recently identified AS Uchl1 as an antisense transcript to the mouse Ubiquitin carboxy-terminal hydrolase L1 (Uchl1) gene, the syntenic locus of UCHL1/PARK5. This gene is mutated in rare cases of early-onset familial Parkinson's Disease (PD), and loss of UCHL1 activity has been reported in many neurodegenerative diseases. Importantly, manipulation of Uchl1 expression has been proposed as a tool for therapeutic intervention. AS Uchl1 induces Uchl1 expression by increasing its translation. It is the representative member of SINEUPs (SINEB2 sequence to UP-regulate translation), a new functional class of natural antisense lncRNAs that activate translation of their sense genes. Here we take advantage of the FANTOM5 dataset to identify the transcription start sites associated with the S/AS pair at the Uchl1 locus. We show that AS Uchl1 expression is under the regulation of Nurr1, a major transcription factor involved in dopaminergic cells' differentiation and maintenance. Furthermore, AS Uchl1 RNA levels are strongly down-regulated in neurochemical models of PD in vitro and in vivo. This work positions AS Uchl1 RNA as a component of the Nurr1-dependent gene network and a target of cellular stress, extending our understanding of the role of antisense transcription in the brain.
antisense transcription; long non-coding RNA; Parkinson's disease; Nurr1; dopaminergic cells
Transcriptional Regulatory Networks (TRNs) coordinate multiple transcription factors (TFs) in concert to maintain tissue homeostasis and cellular function. The re-establishment of target cell TRNs has been previously implicated in direct trans-differentiation studies where the newly introduced TFs switch on a set of key regulatory factors to induce de novo expression and function. However, the extent to which TRNs in starting cell types, such as dermal fibroblasts, protect cells from undergoing cellular reprogramming remains largely unexplored. In order to identify TFs specific to maintaining the fibroblast state, we performed systematic knockdown of 18 fibroblast-enriched TFs and analyzed differential mRNA expression against the same 18 genes, building a Matrix-RNAi. The resulting expression matrix revealed seven highly interconnected TFs. Interestingly, suppressing four out of seven TFs generated lipid droplets and induced PPARG and CEBPA expression in the presence of adipocyte-inducing medium only, while negative control knockdown cells maintained fibroblastic character in the same induction regime. Global gene expression analyses further revealed that the knockdown-induced adipocytes expressed genes associated with lipid metabolism and significantly suppressed fibroblast genes. Overall, this study reveals the critical role of the TRN in protecting cells against aberrant reprogramming, and demonstrates the vulnerability of donor cell's TRNs, offering a novel strategy to induce transgene-free trans-differentiations.
Standard culture of human induced pluripotent stem cells (hiPSCs) requires basic Fibroblast Growth Factor (bFGF) to maintain the pluripotent state, and hiPSCs more closely resemble epiblast stem cells than true naïve-state ES cells, which require LIF to maintain pluripotency. Here we show that chemokine (C-C motif) ligand 2 (CCL2) enhances the expression of pluripotent marker genes through the phosphorylation of the signal transducer and activator of transcription 3 (STAT3) protein. Moreover, by comparing the transcriptomes of hiPSCs cultured with CCL2 and with bFGF, we found that CCL2 activates hypoxia-related genes, suggesting that CCL2 enhances pluripotency by inducing a hypoxia-like response. Further, we show that hiPSCs cultured with CCL2 can differentiate at a higher efficiency than those cultured with bFGF alone, and that CCL2 can be used in feeder-free conditions in the absence of LIF. Taken together, our findings indicate a novel function of CCL2 in enhancing pluripotency in hiPSCs.
The fibrillins and latent transforming growth factor binding proteins (LTBPs) form a superfamily of extracellular matrix (ECM) proteins characterized by the presence of a unique domain, the 8-cysteine transforming growth factor beta (TGFβ) binding domain. These proteins are involved in the structure of the extracellular matrix and controlling the bioavailability of TGFβ family members. Genes encoding these proteins show differential expression in mesenchymal cell types which synthesize the extracellular matrix. We have investigated the promoter regions of the seven gene family members using the FANTOM5 CAGE database for human. While the protein and nucleotide sequences show considerable sequence similarity, the promoter regions were quite diverse. Most genes had a single predominant transcription start site region but LTBP1 and LTBP4 had two regions initiating different transcripts. Most of the family members were expressed in a range of mesenchymal and other cell types, often associated with use of alternative promoters or transcription start sites within a promoter in different cell types. FBN3 was the lowest expressed gene, and was found only in embryonic and fetal tissues. The different promoters for one gene were more similar to each other in expression than to promoters of the other family members. Notably expression of all 22 LTBP2 promoters was tightly correlated and quite distinct from all other family members. We located candidate enhancer regions likely to be involved in expression of the genes. Each gene was associated with a unique subset of transcription factors across multiple promoters although several motifs including MAZ, SP1, GTF2I and KLF4 showed overrepresentation across the gene family. FBN1 and FBN2, which had similar expression patterns, were regulated by different transcription factors. 
This study highlights the role of alternative transcription start sites in regulating the tissue specificity of closely related genes and suggests that this important class of extracellular matrix proteins is subject to subtle regulatory variations that explain the differential roles of members of this gene family.
• We examine expression, promoter use and enhancers for the fibrillin/LTBP gene family.
• Promoter switching was observed for most family members.
• Multiple enhancers were identified for all family members.
• Family members overlapped in tissue specificity with some unique expression patterns.
• A degree of redundancy among family members is possible.
FANTOM, Functional Annotation of Mammals; CAGE, cap analysis of gene expression; ECM, extracellular matrix; TB domain, latent transforming growth factor β binding domain; Fibrillin; Latent transforming growth factor β binding protein; Transcription start sites; Gene regulation; Extracellular matrix; Promoter
Humans are composed of hundreds of cell types. As the genomic DNA of each somatic cell is identical, cell type is determined by what is expressed and when. Until recently, little has been reported about the determinants of human cell identity, particularly from the joint perspective of gene evolution and expression. Here, we chart the evolutionary past of all documented human cell types via the collective histories of proteins, the principal product of gene expression. FANTOM5 data provide cell-type–specific digital expression of human protein-coding genes and the SUPERFAMILY resource is used to provide protein domain annotation. The evolutionary epoch in which each protein was created is inferred by comparison with domain annotation of all other completely sequenced genomes. Studying the distribution across epochs of genes expressed in each cell type reveals insights into human cellular evolution in terms of protein innovation. For each cell type, its history of protein innovation is charted based on the genes it expresses. Combining the histories of all cell types enables us to create a timeline of cell evolution. This timeline identifies the possibility that our common ancestor Coelomata (cavity-forming animals) provided the innovation required for the innate immune system, whereas cells that now form the human brain have followed a trajectory of continually accumulating novel proteins since Opisthokonta (boundary of animals and fungi). We conclude that exaptation of existing domain architectures into new contexts is the dominant source of cell-type–specific domain architectures.
CAGE; transcriptome; protein domains; evolution
Obesity confers an increased risk of developing specific forms of cancer. Although the mechanisms are unclear, increased fat cell secretion of specific proteins (adipokines) may promote or facilitate the development of malignant tumors in obesity via cross-talk between adipose tissue(s) and the tissues prone to develop cancer in obese individuals. We searched for novel adipokines that were overexpressed in adipose tissue of obese subjects as well as in tumor cells derived from cancers commonly associated with obesity. For this purpose, expression data from human adipose tissue of obese and non-obese subjects, as well as from a large panel of human cancer cell lines and corresponding primary cells and tissues, were explored. We found expression of ceruloplasmin to be the most enriched in obesity-associated cancer cells. This gene was also significantly up-regulated in adipose tissue of obese subjects. Ceruloplasmin is the body's main copper carrier and is involved in angiogenesis. We demonstrate that ceruloplasmin is a novel adipokine, which is produced and secreted at increased rates in obesity. In the obese state, adipose tissue contributed markedly (up to 22%) to the total circulating protein level. In summary, we have through bioinformatic screening identified ceruloplasmin as a novel adipokine with increased expression in adipose tissue of obese subjects as well as in cells from obesity-associated cancers. Whether there is a causal relationship between adipose overexpression of ceruloplasmin and cancer development in obesity cannot be answered by these cross-sectional comparisons.
Cap analysis of gene expression (CAGE) is a 5′ sequence tag technology to globally determine transcriptional starting sites in the genome and their expression levels and has most recently been adapted to the HeliScope single molecule sequencer. Despite significant simplifications in the CAGE protocol, it has until now been a labour intensive protocol.
In this study we set out to adapt the protocol to a robotic workflow, which would increase throughput and reduce handling. The automated CAGE cDNA preparation system we present here can prepare 96 ‘HeliScope ready’ CAGE cDNA libraries in 8 days, as opposed to 6 weeks by a manual operator. We compare the results obtained using the same RNA in manual libraries and across multiple automation batches to assess reproducibility.
We show that the sequencing was highly reproducible and comparable to manual libraries, with an 8-fold increase in productivity. The automated CAGE cDNA preparation system can prepare 96 CAGE sequencing samples simultaneously. Finally, we discuss how the system could be used for CAGE on Illumina/SOLiD platforms, RNA-seq and full-length cDNA generation.
Mesothelioma is a highly malignant tumor that is primarily caused by occupational or environmental exposure to asbestos fibers. Despite worldwide restrictions on asbestos usage, further cases are expected as diagnosis is typically 20–40 years after exposure. Once diagnosed, there is a very poor prognosis, with a median survival of 9 months. Considering this, the development of early preclinical diagnostic markers may help improve clinical outcomes.
Expression microarrays on mesothelium and other tissues dissected from mice were used to identify candidate mesothelial lineage markers. Candidates were further tested by qRT-PCR and in situ hybridization across a mouse tissue panel. Two candidate biomarkers with the potential for secretion, uroplakin 3B (UPK3B) and leucine rich repeat neuronal 4 (LRRN4), and one commercialized mesothelioma marker, mesothelin (MSLN), were then chosen for validation across a panel of normal human primary cells, 16 established mesothelioma cell lines, 10 lung cancer lines, and a further set of 8 unrelated cancer cell lines.
Within the primary cell panel, LRRN4 was only detected in primary mesothelial cells, but MSLN and UPK3B were also detected in other cell types. MSLN was detected in bronchial epithelial cells and alveolar epithelial cells, and UPK3B was detected in retinal pigment epithelial cells and urothelial cells. Testing the cell line panel, MSLN was detected in 15 of the 16 mesothelioma cell lines, whereas LRRN4 was only detected in 8 and UPK3B in 6. Interestingly, MSLN levels appear to be upregulated in the mesothelioma lines compared to the primary mesothelial cells, while LRRN4 and UPK3B are either lost or down-regulated. Despite the higher fraction of mesothelioma lines positive for MSLN, it was also detected at high levels in 2 lung cancer lines and 3 other unrelated cancer lines derived from papillotubular adenocarcinoma, signet ring carcinoma and transitional cell carcinoma.
The international Functional Annotation Of the Mammalian Genomes 4 (FANTOM4) research collaboration set out to better understand the transcriptional network that regulates macrophage differentiation and to uncover novel components of the transcriptome employing a series of high-throughput experiments. The primary and unique technique is cap analysis of gene expression (CAGE), sequencing mRNA 5′-ends with a second-generation sequencer to quantify promoter activities even in the absence of gene annotation. Additional genome-wide experiments complement the setup including short RNA sequencing, microarray gene expression profiling on large-scale perturbation experiments and ChIP–chip for epigenetic marks and transcription factors. All the experiments are performed in a differentiation time course of the THP-1 human leukemic cell line. Furthermore, we performed a large-scale mammalian two-hybrid (M2H) assay between transcription factors and monitored their expression profile across human and mouse tissues with qRT-PCR to address combinatorial effects of regulation by transcription factors. These interdependent data have been analyzed individually and in combination with each other and are published in related but distinct papers. We provide all data together with systematic annotation in an integrated view as resource for the scientific community (http://fantom.gsc.riken.jp/4/). Additionally, we assembled a rich set of derived analysis results including published predicted and validated regulatory interactions. Here we introduce the resource and its update after the initial release.
Perturbation and time-course data sets, in combination with computational approaches, can be used to infer transcriptional regulatory networks which ultimately govern the developmental pathways and responses of cells. Here, we individually knocked down the four transcription factors PU.1, IRF8, MYB and SP1 in the human monocyte leukemia THP-1 cell line and profiled the genome-wide transcriptional response of individual transcription starting sites using deep sequencing based Cap Analysis of Gene Expression. From the proximal promoter regions of the responding transcription starting sites, we derived de novo binding-site motifs, characterized their biological function and constructed a network. We found a previously described composite motif for PU.1 and IRF8 that explains the overlapping set of transcriptional responses upon knockdown of either factor.
MicroRNAs (miRNAs) are short single-stranded noncoding RNAs that suppress gene expression through either translational repression or degradation of target mRNAs. The annealing between messenger RNAs and the 5′ seed region of miRNAs is believed to be essential for the specific suppression of target gene expression. One miRNA can have several hundred different targets in a cell. Rapidly accumulating evidence suggests that many miRNAs are involved in cell cycle regulation and consequently play critical roles in carcinogenesis.
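The seed-pairing idea above can be sketched in a few lines of Python: take the seed of the miRNA, compute the complementary site expected on the mRNA, and scan a target sequence for it. The seed is assumed here to be nucleotides 2–8 from the 5′ end (definitions vary, e.g. 2–7), only perfect Watson-Crick matches are counted, and the target sequence is a toy example. Real target prediction tools use additional criteria that are not modeled here.

```python
# RNA Watson-Crick complements (G:U wobble pairs are deliberately ignored).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=8):
    """Return the mRNA site (5'->3') that pairs perfectly with the
    miRNA seed, taken as nucleotides `start`..`end` (1-based, inclusive)."""
    seed = mirna[start - 1:end]
    # The mRNA site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, target):
    """Return 0-based start positions of perfect seed matches in `target`."""
    site = seed_site(mirna)
    hits = []
    index = target.find(site)
    while index != -1:
        hits.append(index)
        index = target.find(site, index + 1)
    return hits


# Toy sequences for illustration only (not taken from the study).
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCGGGCUACCUCA"
site = seed_site(toy_mirna)                      # "CUACCUC"
positions = find_seed_matches(toy_mirna, toy_utr)  # [3, 13]
```

Because the seed is only seven nucleotides long, matches like these occur by chance in many transcripts, which is consistent with one miRNA having several hundred potential targets.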
Introduction of synthetic miR-107 or miR-185 suppressed growth of human non-small cell lung cancer cell lines. Flow cytometry analysis revealed that these miRNAs induce a G1 cell cycle arrest in H1299 cells and that the suppression of cell cycle progression is stronger than that by Let-7 miRNA. By gene expression analyses with oligonucleotide microarrays, we find that hundreds of genes are affected by transfection of these miRNAs. Using miRNA-target prediction analyses and the array data, we compiled a set of likely targets of miR-107 and miR-185 for G1 cell cycle arrest and validated a subset of them using real-time RT-PCR and immunoblotting for CDK6.
We identified new cell cycle regulating miRNAs, miR-107 and miR-185, localized in frequently altered chromosomal regions in human lung cancers. For miR-107 in particular, a large number of down-regulated genes are annotated with the gene ontology term ‘cell cycle’. Our results suggest that these miRNAs may contribute to cell cycle regulation in human malignant tumors.
The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.