The classical identification of ncRNAs has been based on full-length cDNA sequences that lack open reading frames of 100 amino acids (aa) or greater [18]. Although this cutoff is somewhat arbitrary (peptides shorter than 100 aa have been confirmed experimentally [20]), this working definition separates coding from non-coding sequences well in practice. Additional reported features that might help to discriminate specific non-coding classes from coding RNAs include RNA length, sub-cellular localization, protein interactions, splicing structure, 5′ and 3′ end modifications, and abundance.
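The working definition above can be illustrated with a minimal sketch. It assumes a stranded cDNA sequence read 5′ to 3′, so only the three forward frames are scanned, and it counts an open reading frame (ORF) only if it starts at an ATG and ends at an in-frame stop codon. The function names and the `min_orf_aa` parameter are hypothetical and chosen for illustration; this is not the pipeline used in the cited studies.

```python
# Illustrative sketch only: names and defaults are assumptions, not a published pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def longest_orf_aa(seq: str) -> int:
    """Length in amino acids of the longest ATG-to-stop ORF in the three forward frames.

    The count includes the initiating methionine and excludes the stop codon.
    ORFs that run off the end of the sequence without a stop are ignored.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def is_putative_ncrna(seq: str, min_orf_aa: int = 100) -> bool:
    """Apply the working definition: no ORF of min_orf_aa amino acids or greater."""
    return longest_orf_aa(seq) < min_orf_aa


# Usage: a 100-aa ORF (Met + 99 Ala) fails the filter, a 99-aa ORF passes it.
coding_like = "ATG" + "GCT" * 99 + "TAA"
short_orf = "ATG" + "GCT" * 98 + "TAA"
print(longest_orf_aa(coding_like), is_putative_ncrna(coding_like))  # 100 False
print(longest_orf_aa(short_orf), is_putative_ncrna(short_orf))      # 99 True
```

A transcript that passes such a filter is only a candidate: as noted above, peptides shorter than 100 aa have been confirmed experimentally, so the threshold trades sensitivity for simplicity.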
In the case of small ncRNAs, a simple size selection for RNAs of less than 300 bases will by definition select RNAs lacking open reading frames of 100 aa or greater. Many groups have used this strategy to identify new miRNA, siRNA, and piRNA members [21–25], and also to identify novel RNA classes such as miRNA offset RNAs (moRs) [6] and transcription initiation RNAs (tiRNAs) [7]. For small ncRNAs this approach has been highly successful, and considering that the combined UTRs of genuine human coding mRNAs are estimated to average 1263 bases in length [4], it may even be reasonable to extend the size selection further (up to 500 bases or 1 kb) to enrich for longer ncRNAs.
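The arithmetic behind the size cutoffs can be made explicit with a short sketch. It assumes three bases per codon and, optionally, one in-frame stop codon that is not counted in the peptide length; the function name and the `require_stop` parameter are hypothetical.

```python
def max_orf_aa(transcript_nt: int, require_stop: bool = True) -> int:
    """Upper bound on ORF length (aa) that fits in a transcript of the given length.

    Assumes 3 nt per codon; if require_stop is True, 3 nt are reserved for a stop codon.
    """
    usable = transcript_nt - 3 if require_stop else transcript_nt
    return max(usable // 3, 0)


print(max_orf_aa(299))         # 98: below the 100-aa threshold
print(max_orf_aa(299, False))  # 99: still below, even without a stop codon
print(max_orf_aa(500))         # 165
print(max_orf_aa(1000))        # 332
```

Under these assumptions, the 300-base cutoff excludes ORFs of 100 aa or greater purely by length, whereas cutoffs of 500 bases or 1 kb do not. The case for extending the cutoff therefore rests on the UTR estimate rather than on length alone: a transcript carrying 1263 bases of UTR plus a coding region of at least 300 bases would already exceed 1.5 kb.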
For ncRNAs with lengths similar to those of mRNAs, selective enrichment is more difficult and other strategies must be considered. To date, most of these mRNA-like ncRNAs have been identified by a brute-force approach: sequencing subtracted libraries of full-length cDNAs and then computationally filtering out cDNAs that contain open reading frames (see FANTOM3 [18]). However, evidence is accumulating in the literature that fractionation of RNAs based upon sub-cellular localization and protein complexes can enrich for specific classes of ncRNAs. For example, it has long been known that snRNAs (small nuclear RNAs) and snoRNAs (small nucleolar RNAs) [27] localize to the nucleus, and more recently longer ncRNAs have also been shown to localize there [29–33]. Profiling the nuclear fraction is therefore likely to enrich for non-coding transcripts, an approach currently being explored in the ENCODE project. In addition, some non-coding RNAs are reported to interact with chromatin or chromatin-modifying enzymes, and their activity is thought to be exerted via modification of chromatin state; isolation of chromatin should therefore also be considered [34–36]. In the case of cytoplasmic RNAs, polysome fractionation protocols have been used to identify and enrich for translated mRNAs [37]; conversely, it may be possible to enrich for cytoplasmic ncRNAs such as the Nkx2.2 antisense RNA [38] by depleting the polysome fraction.
RNA immunoprecipitation is the next major tool for enriching for ncRNAs. Although truly catalytic RNAs (ribozymes and riboswitches [39–42]) do exist in nature, the majority of ncRNAs for which a function has been determined to date act within ribonucleoprotein complexes (e.g. the RNA-induced silencing complex RISC, the Polycomb repressive complex PRC2, and the signal recognition particle SRP [34]). RNA immunoprecipitation using antibodies against RNA-binding proteins is therefore likely to be useful for enriching for these ncRNAs.
The previous paragraphs dealt specifically with the isolation of RNA enriched for non-coding transcripts. Having generated this material, the next step is to determine the sequences present in this population. Full-length cDNA sequencing, whole-genome tiling arrays [45], and short-tag sequencing methods can all be used to identify non-coding transcripts; however, each has specific advantages and disadvantages. Full-length cDNA sequencing is the gold standard, as it provides the full-length sequence required to determine exonic structure and confirm non-coding potential; however, it is expensive and time-consuming because individual clones must be handled. Tiling arrays and tag sequencing have the advantages of lower cost and of providing expression patterns across multiple samples (thereby allowing an estimate of, for example, tissue restriction); however, they cannot provide a complete picture of the connections between distant exons. Cap analysis of gene expression (CAGE) [46], paired-end tags [47], and whole-transcriptome RNA shotgun sequencing (RNAseq) [48] can all be used to identify novel transcribed regions of the genome. The disadvantage of these techniques is that an additional stage, determining the full-length cDNA sequence, is required to confirm the non-coding status of the transcript.