Non-coding RNA (ncRNA) transcripts are RNA molecules that do not code for proteins but function by other mechanisms. The vast majority of RNA produced in a cell is non-coding ribosomal RNA, transcribed from relatively few loci. More recently, however, complementary DNA (cDNA) cloning, tag sequencing and genome tiling array studies have suggested that ncRNAs also account for the majority of RNA species produced by a cell. ncRNA-based regulation has been referred to as a ‘hidden layer’ of signals, or ‘dark matter’, that controls gene expression in cellular processes by poorly described mechanisms. These terms arose because ncRNAs were, until recently, ignored by expression profiling and cDNA annotation projects, and because their modes of action are diverse (e.g. influencing chromatin structure and epigenetics, translational silencing, transcriptional silencing). Here, we highlight recent functional genomics strategies for identifying ncRNAs and assigning function to them.
Transcription of RNA is a fundamental step in the expression of genetic information, and to date most research has focused on the protein-coding fraction (messenger RNA) as the primary route by which information encoded in the genome gives rise to a cellular phenotype. However, the non-coding region of a mammalian genome far exceeds the coding fraction, and whole-transcriptome studies now recognize that non-coding RNAs (ncRNAs) constitute both the greatest fraction of total RNA in a eukaryotic cell and the greatest fraction of transcribed bases within a genome. It has long been known that housekeeping ncRNAs involved in translation (rRNA and tRNA) account for approximately 95% of total RNA. More recently, new classes of ncRNAs have been identified within the remaining 5%. These include very short RNA species of under 200 bases, such as microRNAs (miRNA), small interfering RNAs (siRNA), piwi-interacting RNAs (piRNAs), miRNA offset RNAs (moRs), transcription initiation RNAs (tiRNAs), promoter- and terminator-associated short RNAs (PASRs/TASRs) and promoter upstream transcripts (PROMPTS), as well as large ncRNAs (lincRNAs), which range from medium-length mRNA-like transcripts such as totally intronic RNAs (TINs) and AIR to very long ncRNAs such as XIST and TSIX, which is 40 kb. The subject of ncRNAs and their different classes has been well described in other recent reviews [14–17]. This review instead focuses on current and potential future strategies to identify putative ncRNAs and annotate them functionally.
The classical identification of ncRNAs has been based on full-length cDNA sequences that lack open reading frames of 100 amino acids (aa) or greater [18, 19]. Although this cutoff is somewhat arbitrary (short peptides of fewer than 100 aa have been confirmed experimentally), the working definition is a good separator of coding and non-coding sequences. Additional reported features that might help to discriminate specific non-coding classes from coding RNAs include RNA length, sub-cellular localization, protein interactions, splicing structure, 5′ and 3′ end modifications and abundance.
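The working definition above can be illustrated with a minimal Python sketch. It is not the pipeline used in the cited studies. It assumes that an open reading frame runs from the first in-frame ATG to the next in-frame stop codon on the forward strand of the cDNA, and that ORFs running off the end of the sequence without a stop codon are ignored. The function names `longest_orf_aa` and `is_putative_ncrna` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(seq: str) -> int:
    """Return the length, in amino acids, of the longest ATG-to-stop ORF.

    Assumption: only the three forward-strand frames are scanned, and an
    ORF is counted only if it is closed by an in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabet
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG in the current open stretch
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # codons from the ATG up to, but excluding, the stop codon
                best = max(best, (i - start) // 3)
                start = None
    return best


def is_putative_ncrna(seq: str, min_orf_aa: int = 100) -> bool:
    """Apply the 'no ORF of 100 aa or greater' working definition."""
    return longest_orf_aa(seq) < min_orf_aa
```

For example, a toy sequence made of ATG, 98 further sense codons and a stop codon encodes 99 aa and would be called a putative ncRNA, whereas the same construct with 99 further codons (100 aa) would not.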
In the case of small ncRNAs, a simple size selection for RNAs shorter than 300 bases will by definition select RNAs lacking open reading frames of 100 aa or greater, because 100 codons span 300 bases. Many groups have used this strategy to identify new miRNA, siRNA and piRNA members [21–25], and also to identify novel RNA classes such as miRNA offset RNAs (moRs) and transcription initiation RNAs (tiRNAs). For small ncRNAs this approach has been highly successful. Considering that genuine human coding mRNAs are estimated to have an average combined UTR length of 1263 bases [4, 5], it may even be reasonable to extend size selection further (up to 500 bp or 1 kb) to enrich for longer ncRNAs.
For ncRNAs with lengths similar to mRNAs, selective enrichment is more difficult and other strategies must be considered. To date, a brute-force approach of sequencing subtracted libraries of full-length cDNAs, followed by computational filtering to remove cDNAs that contain open reading frames, has identified most of these mRNA-like ncRNAs (see FANTOM3 [18, 26]). However, evidence is accumulating in the literature that fractionation of RNAs based on sub-cellular localization and protein complexes can enrich for specific classes of ncRNAs. For example, it has long been known that snRNAs (small nuclear RNAs) and snoRNAs (small nucleolar RNAs) [27, 28] localize to the nucleus, and more recently longer ncRNAs have also been shown to localize to the nucleus [29–33]. Profiling the nuclear fraction is therefore likely to enrich for non-coding transcripts, as is currently being explored in the ENCODE project. In addition, some non-coding RNAs are reported to interact with chromatin or chromatin-modifying enzymes, and their activity is thought to be exerted through modification of chromatin state, so isolation of chromatin should also be considered [34–36]. In the case of cytoplasmic RNAs, polysome fractionation protocols have been used to identify and enrich for translated mRNAs. Conversely, it may be possible to enrich for cytoplasmic ncRNAs, such as the Nkx2.2 antisense RNA, by depleting the polysome fraction.
RNA immunoprecipitation is another major tool for enriching for ncRNAs. Although truly catalytic RNAs (ribozymes and riboswitches [39–42]) do exist in nature, the majority of ncRNAs for which a function has been determined to date act within ribonucleoprotein complexes that contain a protein component (e.g. the RNA-induced silencing complex RISC, the polycomb repressive complex PRC2 and the signal recognition particle SRP [34, 43, 44]). RNA immunoprecipitation using antibodies against RNA-binding proteins is therefore likely to be useful for enriching for these ncRNAs.
The previous paragraphs dealt specifically with the isolation of RNA enriched for non-coding transcripts. Having generated this material, the next step is to determine the sequences in the population. Full-length cDNA sequencing, whole-genome tiling arrays and short tag sequencing methods can all be used to identify non-coding transcripts, but each has specific advantages and disadvantages. Full-length cDNA sequencing is the gold standard, as it provides the full-length sequence required to determine exonic structure and confirm non-coding potential; however, it is expensive and time consuming because individual clones must be handled. Tiling arrays and tag sequencing have the advantage of lower cost and of providing expression patterns across multiple samples (thereby allowing an estimate of tissue restriction etc.), but they cannot provide a complete picture of the connections between distant exons. Cap analysis of gene expression (CAGE), paired-end tags and whole-transcriptome RNA shotgun sequencing (RNAseq) [48, 49] can all be used to identify novel transcribed regions of the genome. The disadvantage of these techniques is that an additional stage, determining the full-length cDNA sequence, is required to confirm the non-coding status of the transcript.
Having identified a putative ncRNA, the next stage is to annotate the sequence. Bioinformatic analysis can provide a host of annotations, such as genomic structure (spliced, unspliced, cleaved), genomic location (intergenic or overlapping), overlap orientation (sense/antisense), cross-species conservation, potential RNA folding structure and correlation with public or in-house genome-wide datasets (histone modifications, transcription factor binding and expression of RNAs by RNAseq, tiling array and small RNA sequencing).
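Two of the annotations listed above, genomic location and overlap orientation, can be sketched in a few lines of Python. This is an illustrative sketch only: the `Interval` record, the half-open coordinate convention and the three category labels are assumptions made for the example, and a real annotation would work at exon level against a full gene model set.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def classify(transcript: Interval, genes: List[Interval]) -> str:
    """Label a transcript relative to a list of annotated gene intervals."""
    hits = [g for g in genes if overlaps(transcript, g)]
    if not hits:
        return "intergenic"
    # Same strand as any overlapped gene: sense; otherwise antisense.
    if any(g.strand == transcript.strand for g in hits):
        return "sense-overlapping"
    return "antisense-overlapping"
```

With a single toy gene at `chr1:100-500` on the plus strand, a transcript at `chr1:450-700` would be labelled sense-overlapping on the plus strand and antisense-overlapping on the minus strand, while one starting at position 500 would be intergenic.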
Bioinformatic analysis of FANTOM3 ncRNAs found that they were generally shorter than coding mRNAs and contained fewer exons, with the largest fraction being single-exon transcripts. In addition, other pseudo-mRNA-like transcripts were observed, some derived from pseudogenes and some from transcripts that partially overlap coding sequences. As many short RNAs may derive from longer precursors, future annotation will need to consider motifs in RNA that are targets of RNA-cleaving proteins, such as the hairpin that is cleaved by Drosha and subsequently by Dicer to produce miRNA. Further motifs or consensus sequences may be identified in the future, which will help to classify as yet unknown cleavage motifs in other types of RNAs.
A key task in the functional annotation of these putative ncRNAs is identification of the full-length sequence. This is required to (i) confirm that the transcript is non-coding rather than an additional untranslated exon of a longer coding gene, (ii) provide a full-length sequence for folding and overlap analysis and (iii) provide a full-length reagent for over-expression, localization and interaction studies. To address this, the full-length sequence of putative ncRNAs must be determined experimentally, with rapid amplification of cDNA ends (RACE) currently the method of choice [56–58]. Potentially, enrichment arrays designed against novel non-coding regions identified by these methods could be used to enrich specifically for ncRNAs prior to cDNA synthesis and full-length cDNA sequencing, using a strategy similar to targeted re-sequencing of genomic regions [59–61].
Having identified and partially annotated an ncRNA, the next question is how to determine its function or validate a predicted one. The observation that an ncRNA is dynamically expressed and has a distinct sub-cellular localization may be a good indication of regulation and possible function. However, the classical functional approaches that have been applied to coding loci, such as knock-down, knock-out [62–65] and over-expression (potentially of wild-type, constitutively active, dominant-negative and deletion mutants), should all be considered for function determination.
In the case of miRNAs, both the bioinformatic prediction of targets [66, 67] and high-throughput experimental systems for functional validation are well developed. Over-expression (by synthetic RNAs or by lentiviral or plasmid expression constructs) and knock-down (by locked nucleic acid antagomirs or miRNA sponges) can be used to increase or decrease miRNA concentrations, and the effects can be monitored by phenotypic screens [70–72], expression profiling [73, 74] or high-throughput proteomics. In addition, the mRNA targets responsible for the phenotype can be confirmed using luciferase reporter assays or, more recently, by immunoprecipitation of silencing complex-associated RNAs followed by deep sequencing. Finally, knockout mice have been generated for several miRNAs, including mir-155, mir-17-92 and mir-223 [62, 64, 65].
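As a rough illustration of what bioinformatic target prediction involves, the Python sketch below scans a 3′ UTR for perfect matches to a miRNA seed. It assumes the common simplification that the seed is nucleotides 2–8 of the mature miRNA and that a candidate site is the exact reverse complement of that seed; this is not the method of the prediction tools cited above, which also weigh conservation and site context. The function names and the toy UTR in the usage note are hypothetical.

```python
from typing import List

# Translation table for complementing a DNA-alphabet sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seed_site(mirna: str) -> str:
    """Return the DNA-alphabet target site complementary to miRNA positions 2-8."""
    seed = mirna.upper().replace("U", "T")[1:8]  # nucleotides 2..8 (1-based)
    return seed.translate(COMPLEMENT)[::-1]      # reverse complement


def find_seed_sites(mirna: str, utr: str) -> List[int]:
    """Return 0-based start positions of all perfect seed matches in the UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("U", "T")
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping matches
    return positions
```

For a let-7a-like sequence (`UGAGGUAGUAGGUUGUAUAGUU`) the seed site is `CTACCTC`, so a toy UTR such as `AAACTACCTCGGGCTACCTCA` would yield two candidate sites. Such matches are only candidates and still require the experimental validation described above.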
For other classes of ncRNAs, such a systematic approach has not yet evolved, although in some cases both over-expression [77, 78] and knock-down [79, 80] have been used to test function. Adapting this to a systematic ncRNA screen, however, raises issues similar to those of systematic screens of protein-coding sequences. Knock-down using synthetic siRNA or shRNA reagents is more convenient than over-expression and easier to apply genome wide, as reagents can be ordered for any known or predicted ncRNA and then tested in a systematic fashion. Over-expression, by contrast, requires full-length cDNA resources. Whereas for coding sequences an open reading frame with truncated UTRs may be enough, for non-coding sequences truncation may remove critical RNA folding domains required for ncRNA function. A whole-genome over-expression screen of all ncRNAs will have to wait until a complete set of full-length clones is available in suitable expression vectors.
As a recent example, siRNA knock-down of six large intergenic ncRNAs (lincRNAs) showed that each caused significant perturbation of the transcriptome when assessed by standard mRNA arrays, but that each had a distinct subset of targets. These six were identified as part of a larger bioinformatic screen of chromatin state maps, which identified 3300 putative lincRNAs. The maps also identified the previously known lincRNA HOTAIR, which binds to the polycomb repressive complex 2 (PRC2), and so the authors tested whether any of the additional lincRNAs were also bound to this complex. Approximately 20% of the lincRNAs were bound to PRC2 (as found by RNA immunoprecipitation against the PRC2 components SUZ12 and EZH2), and lincRNAs were also significantly associated with other chromatin-modifying complexes such as CoREST, a repressor of neuronal genes, and SMCX, a histone H3K4me3 demethylase (in combination, 38% of lincRNAs were associated with one of these three components). Finally, comparison of the genes affected by each lincRNA knock-down found an overlap with the effect of PRC2 knock-down, suggesting a potential function in RNA-mediated transcriptional gene silencing. In another paper, a small RNA directed against the promoter of the UBC gene caused silencing of the gene, first through histone modifications and then through DNA methylation changes. This was dependent upon the activities of AGO1, DNMT3A, HDAC1 and DNMT1.
The application of so-called ‘next generation’ DNA sequencers is revolutionizing our understanding of the genome and the transcriptome. Sequencers such as the Applied Biosystems SOLiD platform, Illumina's Genome Analyzer and the Helicos Biosciences HeliScope can generate very deep, relatively unbiased views of the transcriptome. However, all of these are short-read technologies, and putative ncRNAs identified with them will still require RACE and/or full-length cDNA cloning. So-called third- or fourth-generation platforms such as those of Oxford Nanopore and Pacific Biosciences promise to provide very long sequences (potentially full length), and potentially direct RNA sequencing, in the near future. With these it will be possible to identify and measure the expression of independent transcripts from complex overlapping (coding/non-coding, sense/antisense) loci.
In function determination, additional strategies may need to be developed. Some RNAs, such as the promoter-associated tiRNAs, are expressed at very low levels, so a phenotype derived from over-expression may not be biologically relevant (promoter-targeted sRNAs are known to affect gene expression). Similarly, knock-down may not be meaningful. In such cases very fine control of over-expression or knock-down levels may be required, or deletion of the element may be needed to give a more definitive answer. In other examples there may be a fine balance between sense and antisense transcription, and knock-down of one transcript may inadvertently knock down the other.
In some cases the ncRNA sequence itself may have no function. It could be the product of spurious transcription, or an RNA degradation product that is observed in very deep sequencing data. Alternatively, although the RNA itself has no functional elements, the promoter that drives its expression and the act of transcription through the region may help to open up chromatin and recruit additional transcription factors to adjacent promoters. In these cases knock-down or over-expression of the RNA may have no effect, and deletion may be required.
Additional pathways to ncRNA generation are also being identified. Very recently, TERT has been shown to have an RNA-dependent RNA polymerase activity that can be used to make dsRNAs, which can then be processed by DICER. Another recent paper has found small capped RNAs that appear to be processed from longer coding (and possibly non-coding) transcripts. These short capped RNAs ‘paint’ the exons of known genes and are occasionally generated across splice junctions. This suggests that they are generated by a post-transcriptional mechanism in which full-length spliced transcripts are broken down (or specifically cleaved) into smaller products, some of which are recapped, possibly by a recently identified novel cytoplasmic capping enzyme. Further testing will be required to determine whether these RNAs are functional and whether different types of cap direct the localization of these novel RNAs to specific cell compartments and drive their function there.
Finally, it will be critical to incorporate these ncRNA components into our understanding of transcriptional systems. We have previously published on the transcription factor-based regulation of the transition from one stable state (transcriptional basin) to another. The next stage will be to see how non-coding regulators fit into this system.
The National Institutes of Health grant no. U54 HG004557; research grant to the RIKEN Omics Science Center from the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
The authors thank Thomas Rodgers for precious assistance in editing the manuscript and the colleagues at OSC for broad support.
Alistair R. R. Forrest obtained his PhD in Bioinformatics in 2006 from the Institute for Molecular Bioscience at the University of Queensland, Australia, and has been involved in the FANTOM (Functional Annotation of Mammals) project since 2001. He joined RIKEN in 2007 as a CJ Martin travelling fellow and is currently a Senior Scientist within the LSA Technology Development Group at RIKEN, Yokohama, Japan.
Rehab F. Abdelhamid graduated from the Faculty of Sciences, Mansoura University, Egypt, in 1999. She obtained her PhD in 2007 from the Institute of Applied Beam Sciences, Ibaraki University, Japan, and worked as a postdoctoral researcher in the venture business laboratory at the same university. She is now undertaking postdoctoral training in the Functional Genomics Technology Team at RIKEN, Japan.
Piero Carninci obtained his doctoral degree in 1989 from the University of Trieste, Italy. He joined RIKEN in 1995 as a postdoctoral fellow. He currently serves as Leader of the Functional Genomics Technology Team and the Omics Resource Development Unit, and as Deputy Project Director of the LSA Technology Development Group at the RIKEN Yokohama Institute.