There has been a recent debate about whether the human genome is pervasively transcribed and about the number and abundance of intergenic transcripts. Until recently, a key missing component of this debate has been an analysis of ultra-deep RNA-seq data sampling a wide array of tissue types. Without this, insufficient read depth can result in a failure to identify low-abundance intergenic transcripts, and limited tissue sampling results in missed tissue-specific expression. During the course of this study, the ENCODE project released a large-scale analysis of RNA-seq data that provided clear evidence that the human genome is pervasively transcribed. We analyzed a distinct, complementary set of RNA-seq data that also fulfills these requirements of read depth and tissue breadth, covering both polyadenylated and nonpolyadenylated RNA fractions. In strong agreement with the ENCODE results, we observed that approximately 85% of the genome is transcribed, supporting prior observations of pervasive transcription based on tiling arrays that have recently been questioned.
There is an apparent discrepancy between this observed pervasive transcription and the relative paucity of annotated lincRNAs, the most numerous intergenic RNAs. It should be expected that intergenic regions encode far more lincRNAs than are currently annotated. Indeed, here we found that there are many more lincRNAs than previously known, even after aggressive filtering that removed the vast majority of previously annotated long noncoding RNAs and newly discovered intergenic transcripts (Dataset S2). These observations clearly demonstrate that the human genome is pervasively transcribed, and that lincRNAs make up an extremely common class of intergenic transcripts.
In agreement with prior observations of smaller lincRNA annotation sets, our analyses of the expanded lincRNA catalog presented here revealed that most lincRNAs are expressed at lower levels than protein coding genes. Though most lincRNAs are expressed at only a few copies per cell, we found that many lincRNAs are highly expressed, with nearly 4,000 expressed at FPKM >10 and nearly 1,000 expressed at FPKM >30, rivaling the expression of many messenger RNAs. We chose to apply an expression cutoff to remove very lowly expressed transcripts from the catalog of lincRNAs. However, there may exist many functional lincRNAs with very low expression levels, below our expression filter cutoff. For example, the functional human lincRNA HOTTIP is expressed in approximately one out of three cells. Furthermore, recent findings have shown that the intergenic transcriptome may be vastly more complex than currently appreciated when very lowly expressed transcripts are considered. It is possible that some of these are functional transcripts despite their apparent low expression, perhaps having brief bursts of expression during stages of the cell cycle or functioning in single cells in a heterogeneous population, as has been previously observed. Therefore, while we have provided the most complete lincRNA catalog to date, there may be additional lowly expressed, yet potentially functional, lincRNAs that were excluded here.
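As a rough illustration of the expression filter and the abundance tiers discussed above, the following Python sketch computes FPKM from a fragment count, applies a cutoff, and tallies transcripts above given thresholds. It is a minimal sketch under stated assumptions, not the pipeline used in this study: the function names and the cutoff value passed by the caller are hypothetical, and only the FPKM 10 and FPKM 30 tiers are taken from the text.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # fragments / (length in kb) / (library size in millions)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def apply_expression_filter(fpkm_by_transcript, cutoff):
    """Keep transcripts whose FPKM is at or above a caller-chosen cutoff.

    The cutoff is a hypothetical parameter; the text does not state its value.
    """
    return {name: v for name, v in fpkm_by_transcript.items() if v >= cutoff}


def count_above(fpkm_values, thresholds=(10.0, 30.0)):
    """Count transcripts strictly above each FPKM threshold."""
    return {t: sum(1 for v in fpkm_values if v > t) for t in thresholds}
```

Any fixed cutoff of this kind trades sensitivity for specificity: raising it removes more assembly noise but also removes transcripts that are lowly expressed across the bulk sample yet possibly functional in a subset of cells.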
In order to minimize any potential contamination of the lincRNA catalog with protein coding transcripts, the filtering approach used was very aggressive. In fact, most previously annotated noncoding RNAs failed to pass our filters and were therefore excluded from the lincRNA catalog (Table S3 and Dataset S9). The vast majority of these transcripts (including most GENCODEv6 “lincRNAs” and “processed transcripts”) overlap known or predicted protein coding genes, pseudogenes, or non-lincRNA noncoding RNAs (e.g., microRNAs) (Table S3). Some of these removed transcripts may be functional long noncoding RNAs, such as GAS5 (removed because it contains 10 snoRNA genes within its introns). However, in order to most confidently identify only lincRNAs, rather than potential unannotated extensions of known genes, these were removed.
Of those previously annotated noncoding RNAs that are intergenic, more than half contain predicted ORFs longer than 100 amino acids. For example, two previously characterized functional human lincRNAs, Xist and HOTAIR, were found to contain ORFs longer than 100 amino acids. These results indicate that our filtering approach, which eliminates all transcripts with ORFs larger than 100 amino acids, may have removed some lincRNAs with large, nonfunctional ORFs. However, the use of a 100 amino acid ORF cutoff, a commonly used threshold to define potential protein coding genes, is justifiable because ORFs of this size infrequently occur by chance and instead indicate potential for protein coding capacity.
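The ORF size filter described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this study: an ORF is assumed here to run from an ATG to the next in-frame stop codon under the standard genetic code, both strands are scanned because most transcripts in the catalog lack strand information, and all function names are hypothetical. The last function gives a back-of-the-envelope estimate of how often a stretch of codons stays free of stop codons by chance, assuming independent, uniformly distributed nucleotides (61 of 64 codons are not stops).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_codons(seq):
    """Longest ATG-to-stop ORF in the three forward frames.

    Length is in codons, counting the ATG and excluding the stop.
    ORFs that run off the end without a stop codon are not counted.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def passes_orf_filter(seq, max_orf_aa=100):
    """True if no ORF on either strand is longer than max_orf_aa amino acids."""
    seq = seq.upper()
    longest = max(longest_orf_codons(seq),
                  longest_orf_codons(reverse_complement(seq)))
    return longest <= max_orf_aa


def chance_of_stop_free_run(n_codons):
    """Probability that n random codons contain no stop (uniform-base assumption)."""
    return (61 / 64) ** n_codons
```

Under the uniform-base assumption, the chance that a given run of 100 codons is stop-free is below 1%, which is consistent with the statement that ORFs of this size infrequently occur by chance; real sequence composition will shift this number.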
Rather than discard all transcripts with large ORFs, as we did here, one option to discriminate between transcripts that are coding versus noncoding is to analyze the frequency of synonymous codon substitutions (PhyloCSF). However, this approach is limited to ORFs that can be aligned across species, potentially missing recently evolved or otherwise nonconserved novel protein coding genes. Importantly, our approach of removing all transcripts with large open reading frames effectively removed transcripts with significant predicted coding potential, indicating that using an ORF size cutoff is at least as conservative as filtering based on PhyloCSF analysis. The lack of engagement of the ribosome, observed with ribosomal profiling data, confirms the stringency of the ORF cutoff filter. Further analysis of these removed large ORF-containing intergenic transcripts is outside the scope of this study, but we have included these annotations for investigators interested in further analyzing their coding potential in search of novel protein coding genes (Dataset S10).
Despite the fact that most previously annotated noncoding RNAs failed to pass our filters, our lincRNA catalog contains significantly more lincRNAs than previously known (>94% of lincRNAs are entirely novel at each expression level). This is the result of two unique features of our study. First, the RNA-seq read depth and diversity of tissues surveyed allowed for the detection of rare and tissue-specific transcripts that were previously unknown. Many of these novel transcripts passed all filters and are annotated as novel lincRNAs in our catalog. Second, in contrast to prior lincRNA annotation efforts that were restricted to identification of only spliced or polyadenylated lincRNAs, we sought to generate annotations of a more complete set of human lincRNAs regardless of splicing or polyadenylation status. The reasons for taking this approach are manifold. Two of the most well known and abundant functional human lincRNAs, NEAT1 and MALAT1, are single exon genes (as are approximately 5% of protein coding genes), suggesting that non-spliced transcripts may make up an important class of lincRNA. Additionally, numerous functional nonpolyadenylated noncoding RNAs have been described. Even long noncoding RNAs which can be spliced are often found in their unprocessed forms, a distinct property of long noncoding RNAs that would result in missed lincRNAs if splicing were a required attribute. Therefore, we chose not to exclude any lincRNAs from this catalog due to lack of splicing or polyadenylation. Importantly, because nonspliced, nonpolyadenylated transcripts could theoretically be erroneously de novo assembled from reads derived from contaminating genomic DNA in RNA-seq data, we took multiple measures to mitigate any contributions of genomic DNA contaminant reads (see Methods).
Due to inherent limitations of de novo transcriptome assembly using short reads of finite depth, it is not always possible to unequivocally determine the complete structure of a transcript. This is particularly true for lowly expressed transcripts where the number of reads available is limited, and for genomic regions to which reads cannot be uniquely mapped. In the case of shallow read depth, exons of multi-exonic transcripts may lack reads connecting the exons, and de novo assembly could result in separate annotation of each exon as a distinct transcript. In support of this, we found that lower expressed lincRNAs discovered from de novo transcript assembly were less likely to have multi-exonic structures (Table S5). Additionally, the annotated 5′ and 3′ ends of the lincRNAs may represent truncations of the full length transcripts. Indeed, our analysis of PET tag data revealed that while the majority of our lincRNA catalog is overlapped by at least one PET tag, in most cases there is minimal PET tag support for the annotated 5′ and 3′ ends of the lincRNAs (Table S6). It is therefore the case that some lincRNA annotations in the catalog we provide (Dataset S2), particularly single exon lincRNA annotations, may represent fragments of larger transcripts.
Furthermore, considering the reported prevalence of low level overlapping transcripts throughout intergenic sequence, it is not clear that full lincRNA structures can be unequivocally deconvoluted using short read RNA-seq technology. The determination of full lincRNA structures will be an important future effort in the field and may rely upon new datasets of longer read length and greater read depth, use of multiple orthogonal data types in the same tissue, new technologies such as ultra long read next generation sequencing, and further improvements in software for de novo transcriptome assembly.
In addition, the majority of RNA-seq data we analyzed lacks strand information and as a result most of the lincRNAs in our catalog are of ambiguous strandedness. Prior annotations have relied upon splice site orientation to infer the strandedness of the transcript. While this is a reasonable approach that we too have adopted when applicable in the present lincRNA catalog, stranded RNA-seq data is needed to most confidently assign strandedness to de novo assembled transcripts.
While determining the isoforms and full structures of all lincRNAs is clearly desirable, these incomplete lincRNA structure annotations are nonetheless of tremendous practical value. Knowledge of the structure of a portion of a transcript is often sufficient to test for differential expression or perform RNAi knockdown experiments, and facilitates the cloning and sequencing of the full length transcript. Because of this, instead of placing additional restrictions upon lincRNA annotations, our filtering strategy was aimed toward identification of as many transcripts as possible that fit within the definition of a lincRNA. However, for investigators interested in more refined lincRNA annotations, we have provided multiple more restrictive lincRNA catalogs (Datasets S4, S5, S6).
A key question in the field is whether the transcripts resulting from pervasive transcription of intergenic regions are functional or the result of noisy transcription. The lincRNAs we describe are specifically regulated and contain conserved sequence, attributes inconsistent with transcriptional noise. Furthermore, lincRNAs were found to be strongly enriched for intergenic TASs compared to nonexpressed intergenic regions. This striking finding supports the possibility that many intergenic SNPs mark regions that function as lincRNAs rather than DNA elements. Because nearly half of all TASs are intergenic, it is possible that lincRNAs play a significant role in the majority of human traits and diseases thus far analyzed in GWASs. One functional lincRNA (MIAT) was first identified during the experimental interrogation of an intergenic TAS, and another lincRNA, PTCSC3, was identified near a TAS found from a papillary thyroid carcinoma GWAS, perhaps representing the first of many such discoveries to come from intergenic TASs. The finding that lincRNAs are strongly enriched for TASs provides a new opportunity to revisit intergenic trait-associated regions with unknown functional mechanisms by testing whether the overlapping lincRNA is involved in the observed phenotype.
This noncoding RNA catalog represents a major step toward achieving a more complete understanding of this exciting frontier. We have identified a large number of putative lincRNAs with characteristics suggesting functionality. However, many of these lincRNAs are lowly expressed, and definitive proof of functionality for a lincRNA requires functional experiments. High throughput functional genomic approaches, such as RNAi and cDNA overexpression screens, will serve as crucial tools for future efforts to uncover the roles of lincRNAs in diverse biological systems. With the requisite technology now available for these next generation experimental approaches, the time is ripe for this dark matter of the human genome to step further into the spotlight.