In recent years, established views of transcription have been challenged by the observation that a much larger portion of the human and mouse genomes is transcribed than can be accounted for by currently annotated coding and noncoding genes. The bulk of these findings have come from experiments using “tiling” microarrays with probes that cover the non-repetitive genome at regular intervals [1]–[9], or from sequencing efforts of full-length cDNA libraries enriched for rare transcripts [10], [11]. Additionally, cap analysis of gene expression (CAGE) in human and mouse shows that a significant number of sequenced 5′ tags map to intergenic regions [12]. Estimates of the proportion of transcripts that map outside known exons range from 47% to 80%, and these transcripts are distributed approximately equally between introns and intergenic regions. This additional transcription has been dubbed transcriptional “dark matter” [13], the “hidden” transcriptome [1], or transcripts of unknown function (TUFs) [4], [14]. Its exact nature is largely unclear, but it has been presumed to comprise a combination of novel protein coding transcripts, extensions of existing transcripts, noncoding RNAs (ncRNAs), antisense transcripts, and biological or experimental background. Determining the relative contributions of each of these potential sources is important for understanding the nature and possible biological function of transcriptional dark matter.
Homology searches for transcripts mapping outside known annotation boundaries [10], as well as cDNA sequencing efforts, indicate that it is still possible to find new exons of protein coding genes [10], [15], [16]. The genomic positions of TUFs are also biased towards known transcripts [8], suggesting that at least a portion may represent extensions of current gene annotations. Nevertheless, the majority of dark matter transcripts is thought to be noncoding [2], [4], [5], [10]. Previous efforts to characterize dark matter transcripts have revealed the existence of thousands of ncRNAs with evidence for tissue-specific expression [17], [18], as well as over a thousand large intervening noncoding RNAs (lincRNAs) originating from intergenic regions bearing chromatin marks associated with transcription [19]. Other studies have reported new classes of ncRNAs, such as those that cluster close to the transcription start sites (TSSs) of protein coding genes [20]–[24]. These promoter-associated RNAs (pasRNAs) typically initiate in the nucleosome-free regions that mark a TSS, with transcription occurring in both directions. Finally, results from the ENCODE pilot project have suggested a highly interleaved structure of the human transcriptome, with an estimate that as much as 93% of the human genome may give rise to primary transcripts [9]. This estimate was based on a combination of sources that included rapid amplification of cDNA ends coupled to detection on tiling arrays (RACE-tiling), manually curated GENCODE annotations, and paired-end sequencing of long cDNAs (GIS-PET). However, it was dominated by the results of RACE-tiling experiments, which alone found 80% genome coverage, compared to 64.6% and 66.4% for GENCODE annotations and GIS-PET, respectively.
The fact that most TUFs do not appear to be under evolutionary selective pressure [25] has prompted suggestions that at least some of the transcriptional dark matter may constitute “leaky” background transcription [9], [26]. Consistent with this notion, many of the intergenic and intronic transcripts are detected at low levels, close to the detection limit of qPCR or Northern blots [13]. Presumably as a consequence, validation rates for unannotated transcribed regions detected in tiling array experiments have varied between 25% and 70% [1], [5], [27], and a comparison [13] of human chromosome 22 data from three major tiling array studies done on different platforms [1], [3], [27] also revealed little overlap of expressed probes, with 89% of overlapping positive probes mapping to exons or introns of known transcripts. While this low overlap may be due to differences in the samples analyzed [4], there is also evidence that some dark matter transcripts may be due to experimental artifacts. For example, a reassessment of the analysis parameters used in the tiling array study by Kampa et al. [2] revealed a similar number of transcribed fragments in real and randomized microarray data [28]. These issues make it difficult to assess the level of false positives in tiling array experiments.
Transcriptome sequencing (RNA-Seq) has emerged as a new technology that does not suffer from many of the limitations of array platforms, such as cross-hybridization [29]. The technique has a wide dynamic range spanning at least four to five orders of magnitude [30], [31] and allows accurate quantitation of expression levels, as determined by experiments using externally spiked-in RNA controls and quantitative PCR [30]. These characteristics make RNA-Seq well suited to accurately assessing the relative proportion of sequence from the known versus the dark matter transcriptome. Comparisons between studies of eukaryotic transcriptomes have shown that the estimated proportion of transcriptional dark matter reported in RNA-Seq studies is consistently lower than estimates from tiling arrays [32]. Although most RNA-Seq studies to date have focused on polyadenylated (PolyA+) RNA, which would be enriched for coding transcripts, this cannot fully account for the differences, as most tiling array studies show nearly the same degree of nonexonic transcription for PolyA+ as for total RNA sources [1]–[9]. Indeed, it was reported that even in the most mature form of PolyA+ RNA isolated from the cytosol, approximately half of the transcribed sequence does not correspond to known exons [5]. Moreover, RNA-Seq data from Arabidopsis rRNA-depleted total RNA samples contained a relatively small proportion (3.5%) of intergenic reads [33]. These results may not be characteristic of the larger and more complex human and mouse transcriptomes, but they do present an example in which the proportion of dark matter transcripts is relatively low in a more heterogeneous RNA pool. Other studies, in contrast, reported a higher proportion of nonexonic reads in yeast [34] and for total RNA in human [35], leaving unresolved the question of the quantity and character of dark matter transcripts.
To investigate the extent and nature of transcriptional dark matter, we have analyzed a diverse set of human and mouse tissues and cell lines using tiling microarrays and RNA-Seq. A meta-analysis of single- and paired-end read RNA-Seq data reveals that the proportion of transcripts originating from intergenic and intronic regions is much lower than identified by whole-genome tiling arrays, which appear to suffer from high false-positive rates for transcripts expressed at low levels. The majority of RNA-Seq reads that map to intergenic regions either display a high degree of correlation with neighboring genes or are associated with more than 10,000 potential novel exonic fragments we identified in human and mouse. A genome-wide analysis of “de novo” splice junctions in human samples further revealed 2,789 previously uncharacterized transcript fragments that have no overlap with exons of known gene annotations, 1,259 of which map to intergenic regions. We also find 4,544 additional exons for annotated transcripts, 723 of which extend transcripts at the 5′ end and include likely alternative promoters. The novel exons from spliced transcripts are supported by EST data, are generally more conserved, and derive from coding as well as noncoding transcripts. We conclude that analysis of data from tiling arrays leads to vast overestimates of the proportion of transcriptional dark matter. However, the mammalian transcriptome does contain thousands of unannotated transcripts, exons, promoters, and termination sites. Intriguingly, there is a strong overlap of short intergenic transcripts with DNase I hypersensitive sites, suggesting that they may be the equivalent of pasRNAs for distant enhancers.
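As an illustration of what "the proportion of transcripts originating from intergenic and intronic regions" means operationally, the following minimal sketch assigns mapped read positions to exonic, intronic, or intergenic categories and reports the fraction in each. It is not the pipeline used in this study: the function names, the single-chromosome simplification, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration only.

```python
from bisect import bisect_right


def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def contains(merged, starts, pos):
    """True if pos lies inside one of the merged, sorted intervals."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < merged[i][1]


def classify_reads(read_positions, genes, exons):
    """Return the fraction of reads that are exonic, intronic, or intergenic.

    Hypothetical simplification: one chromosome, one position per read,
    and "intronic" means inside a gene span but outside every exon.
    """
    exon_iv, gene_iv = merge(exons), merge(genes)
    exon_starts = [s for s, _ in exon_iv]
    gene_starts = [s for s, _ in gene_iv]
    counts = {"exonic": 0, "intronic": 0, "intergenic": 0}
    for pos in read_positions:
        if contains(exon_iv, exon_starts, pos):
            counts["exonic"] += 1
        elif contains(gene_iv, gene_starts, pos):
            counts["intronic"] += 1
        else:
            counts["intergenic"] += 1
    total = len(read_positions)
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


# Toy annotation (assumed values): one gene with two exons.
toy_genes = [(100, 500)]
toy_exons = [(100, 200), (400, 500)]
toy_reads = [150, 450, 300, 50, 120, 600, 410, 199]
proportions = classify_reads(toy_reads, toy_genes, toy_exons)
print(proportions)
```

In a real analysis the categories would be computed per chromosome and strand from a full gene annotation, and the result depends strongly on which annotation set defines "known" exons.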