|Home | About | Journals | Submit | Contact Us | Français|
More than 98% of a typical vertebrate genome does not code for proteins. Although non-coding regions are sprinkled with short (<200 bp) islands of evolutionarily conserved sequences, the function of most of these unannotated conserved islands remains unknown. One possibility is that unannotated conserved islands could encode non-coding RNAs (ncRNAs); alternatively, unannotated conserved islands could serve as promoter-distal regulatory factor binding sites (RFBSs) like enhancers. Here we assess these possibilities by comparing unannotated conserved islands in the human and mouse genomes to transcribed regions and to RFBSs, relying on a detailed case study of one human and one mouse cell type. We define transcribed regions by applying a novel transcript-calling algorithm to RNA-Seq data obtained from total cellular RNA, and we define RFBSs using ChIP-Seq and DNAse-hypersensitivity assays. We find that unannotated conserved islands are four times more likely to coincide with RFBSs than with unannotated ncRNAs. Thousands of conserved RFBSs can be categorized as insulators based on the presence of CTCF or as enhancers based on the presence of p300/CBP and H3K4me1. While many unannotated conserved RFBSs are transcriptionally active to some extent, the transcripts produced tend to be unspliced, non-polyadenylated and expressed at levels 10 to 100-fold lower than annotated coding or ncRNAs. Extending these findings across multiple cell types and tissues, we propose that most conserved non-coding genomic DNA in vertebrate genomes corresponds to promoter-distal regulatory elements.
In completely sequenced vertebrate genomes, only ~1.5% of genomic DNA codes for proteins (1–3). An additional 3.5% of the genome lacks coding sequences but is nonetheless conserved across vertebrate phylogeny, strongly suggesting its functional importance (1–3). This conserved, non-coding 3.5% of the genome clusters into >700000 unannotated conserved islands, 90% of which are <200bp (‘Materials and Methods’ section). The vast majority of these conserved islands have no known function.
Two possible functions for unannotated conserved islands are (i) to encode enhancers and other distal regulatory sequences and (ii) to encode non-coding RNAs (ncRNAs). Indeed, tens of thousands of vertebrate conserved islands have already been found to overlap enhancers (4–7), which function at a distance to regulate the expression of associated genes. Most of these enhancers were identified in a genome-wide manner based on the presence of the co-activator p300/CBP and of H3K4me1-modified histones (8). Similarly, ~10000 conserved islands have been found to overlap promoters or exonic sequences of ncRNAs (9–11), whose function may be to act in cis or in trans to regulate gene expression (12–15). However, it remains unclear how many conserved islands will ultimately prove to have enhancer-related, ncRNA-related or other functions, since both enhancers and ncRNAs remain to be completely identified. Unfortunately, comprehensive identification of enhancers and ncRNAs will ultimately require genome-scale experiments to be conducted on all cell types in a vertebrate body, since both enhancers and ncRNAs function in subsets of tissues and cell types (5,11,16).
A conceptual and experimental challenge in distinguishing whether a conserved island is an enhancer or a promoter of an ncRNA is that many enhancers produce short (<2kb) ncRNAs called enhancer RNAs (eRNAs) (7,17–20). Genomic sequence conservation at an enhancer has traditionally been thought to reflect the importance of regulatory factor binding sites (RFBSs), which recruit transcription factors initially to the enhancer and ultimately, through DNA looping, to associated promoters (21–23). However, the synthesis of eRNAs raises the possibility that conservation of sequences at enhancers may also reflect their importance for promoting eRNA transcription or for encoding functions of eRNA transcripts. Since eRNAs, much like other ncRNAs, could in theory act in cis or in trans to regulate gene expression, any given enhancer could have important functions both as a traditional enhancer and as a promoter of a non-coding eRNA. Thus, neither the presence of RNA Polymerase II (RNAPII) nor evidence of transcriptional initiation at an unannotated conserved island is on its own sufficient to determine whether its sequence conservation reflects conservation of enhancer function, conservation of ncRNA promoter function or both.
Here we estimate, using genome-wide approaches, how many conserved islands function as enhancers (and other distal regulatory elements) and how many encode ncRNAs. We comprehensively define distal regulatory elements using ChIP-Seq and DNAseI-hypersensitivity (24) assays based on data from published sources (1,18). We also comprehensively define transcribed regions of the genome by applying a novel transcript calling algorithm to RNA-Seq data obtained from total cellular RNA. Applying these approaches to mouse cortical neurons and a human (HeLa) cell line, we find that whereas hundreds of unannotated conserved islands are transcribed in these cell types into ncRNAs, tens of thousands of unannotated conserved islands can be identified as distal regulatory elements. These conserved, promoter-distal regulatory elements are distinguishable from conventional ncRNA promoters based on the low expression level, and non-polyadenylated status of the transcripts they synthesize, as well as by their lack of the promoter-specific H3K4me3 mark (25–27). We find similar ratios of conserved ncRNAs to conserved distal regulatory elements when expanding our analysis to 10 different human cell lines. Our results suggest that the underlying reason for the conservation of most unannotated conserved bases in vertebrate genomes is their importance within promoter-distal regulatory elements.
Our goal was to compare three kinds of genomic loci: conserved sequences, RFBSs and transcribed regions. To identify unannotated transcribed regions we developed a novel algorithm, Haar-wavelet Transcript Calling (HaTriC), which is described in the Supplementary Methods. Our strategy was to first identify each kind of locus and second to relate them to one another.
The conserved islands were obtained from the PhastCons scores (as compared with 30 other vertebrates) (2) using a coarse-graining procedure that identified bins of at least 10bps where the average score was >0.9 (Supplementary Figure S3). Summary statistics for the conserved islands can be found in Supplementary Table S4 and Supplementary Figure S3. The assignment of conserved islands to various non-overlapping categories was carried out as follows:
Calculating how many conserved islands overlap exons of unannotated transcripts is complicated by the fact that HaTriC does not provide the exon–intron structure of transcribed regions. Hence, we used a statistical approach where we assumed that the unannotated transcribed regions have the same distribution of exon numbers, exon lengths and exon conservation as the long annotated ncRNAs (Supplementary Table S7). Using these assumptions, it is possible to estimate the number of conserved islands explained by unannotated transcripts in a given cell type (Figure 1 and and33A).
RFBSs were identified from publicly available DNAseI hypersensitivity and ChIP-Seq datasets. To understand how RFBSs are related to transcribed regions, we categorized them based on their proximity to promoters, enhancers, introns, exons and novel transcribed regions. We classified RFBSs as conserved or non-conserved based on overlap with conserved islands. Each RFBS was assigned to a category according to the scheme outlined below, and the results are reported in Supplementary Table S3. We start with the full set of peaks, and once a peak has been assigned to a category, it cannot be assigned to any further categories.
To understand transcription across the genome, we combined annotation, de novo transcript-calling and a targeted search near enhancers and RFBSs (Table 1) to produce a set of transcribed regions. To characterize the transcriptome, we assigned each read uniquely to a transcribed region. To define the transcribed regions in the most accurate way possible, as reported in Supplementary Table S7, Supplementary Figure S4 and Table 1 in the main text, we combined the annotation, the HaTriC transcript-caller, and a targeted search close to enhancers and RFBSs. Below, we describe how each category of transcribed regions was defined, as well as the criteria for assigning reads to each category. We considered the categories sequentially, and at each step we identified (and removed from further analysis) all reads that overlapped regions in the current category.
The Supplementary Methods contain details on how the annotations of the mouse and human genomes were assembled. There is also a list of all the datasets used in this study (Supplementary Table S6). As far as possible, given the availability of datasets, we performed parallel analyses in mouse neurons and human HeLa cells.
Our strategy to understand the function of conserved elements required a comprehensive accounting of the transcriptome, including an ability to comprehensively identify non-coding transcripts and distinguish ncRNAs from protein-coding genes. To achieve this understanding, we developed an algorithm to define transcribed regions of the genome de novo (i.e., without relying on annotation) using short-read RNA-Seq data. We applied this algorithm to strand-specific RNA-Seq from mouse cortical neurons and HeLa cells. Although other RNA-Seq studies have applied de novo transcript detection in both yeast (31) and mouse (11,32), with a few exceptions [e.g., (28,33,34)], these and other studies of gene expression have only detected or defined mature polyadenylated transcripts. Here we consider polyadenylated transcripts as well as non-polyadenylated transcripts, since unannotated conserved sequences could give rise to either type of transcript.
To define transcribed regions of the genome de novo, we developed a computational approach, Haar-wavelet Transcript Calling (HaTriC). HaTriC is an iterative algorithm that combines evidence from multiple length scales to determine if a given region is actively transcribed (see Supplementary Methods for full description). Each iteration of the HaTriC algorithm involves the following three steps: (i) At each genomic locus, the change in RNA-Seq read density between upstream and downstream regions is computed [as Haar-wavelet coefficients, similar to (35)]. (ii) The genomic loci with the biggest changes in RNA-Seq density are selected. These loci represent a set of candidate boundaries of transcribed regions. (iii) Each candidate transcribed region, defined as the sequence between two candidate boundaries, is classified as either transcribed or not transcribed based on its average read density. This classification is straightforward, since the densities of the candidate regions have a bimodal distribution, with the higher density mode corresponding to transcribed regions (Supplementary Figure S1). By applying the above procedure iteratively, excluding the regions that were called as transcribed in previous rounds, we are able in each successive round to detect transcribed regions with lower read density. As regions with high read density are removed, the distribution of candidate regions' read densities goes from bimodal to unimodal, at which point the iteration terminates because no additional transcribed regions are detected. The parameters for the algorithm (‘Materials and Methods’ section) are optimized by maximizing the fraction of transcribed regions (on one strand of one chromosome) that have transcriptional initiation marker H3K4me3-modified histones (25–27) bound at their start.
To identify transcribed regions, we applied HaTriC to RNA-Seq reads obtained from sequencing ribosomal RNA-depleted total RNA from mouse neurons (~140 million reads) (18) and HeLa cells (~50 million reads). We obtained ~10000 transcribed regions in each cell type (Supplementary Table S1). These transcribed regions do not necessarily represent specific RNA transcripts [as do the transcripts defined in some other studies, e.g., (11)] but can instead correspond to multiple overlapping transcripts synthesized from the same strand. For example, the RNA-Seq reads falling within a transcribed region that corresponds to a gene typically reflect pre-mRNA transcripts (corresponding to reads aligning both to introns and exons) as well as mature mRNA transcripts (corresponding to exons only). The purpose of this approach is to identify regions that are transcribed rather than to define precisely the exon–intron structure of specific transcripts.
To evaluate the quality of transcribed regions identified by HaTriC, we compared transcribed regions with annotated protein-coding genes. Comparing the transcribed regions with the RefSeq (36), UCSC (37) and Ensembl (38) gene annotations, we found, in accordance with another recent study (28), that most transcribed regions either overlap coding genes on the correct strand (76%) or are found within 10kb of the start of an annotated coding gene (18%), despite the fact that these two sets of regions together represent just 15% of the genome (Supplementary Table S1). The majority of the transcribed regions that overlap annotated genes (78%) match the gene annotation in an unambiguous manner (Supplementary Methods, Supplementary Table S1). These results demonstrated sufficient accuracy in calling transcribed regions to allow us to begin to categorize regions lacking any annotation.
We asked to what extent novel, HaTriC-defined ncRNAs coincide with evolutionarily conserved sequences. To identify the conserved sequences, we segregated the genome into conserved and non-conserved regions using a simple coarse-graining procedure performed on PhastCons conservation scores derived from alignments of 30 vertebrate species (2) (‘Materials and Methods’ section). In each genome (human and mouse), we identified 1 million conserved islands, most of them <200bp (Supplementary Figure S3). We define ~300000 of these as annotated conserved islands based on their overlap with the promoters or exons of annotated coding or non-coding genes (RefSeq, UCSC, Ensembl, lncRNA and macroRNA annotations combined (9,11,36,37,38), Supplementary Table S4). The remaining 700 000 (presumed non-coding) conserved islands we define as unannotated conserved islands (55% extragenic, 45% intragenic).
We addressed how many unannotated conserved islands might encode promoters or exons of novel extragenic ncRNAs. Our strategy was to asssess the overlap of unannotated conserved islands with expressed sequences. To identify novel ncRNAs, we applied HaTriC and found ~200 ncRNAs that were already annotated in RefSeq, Ensembl, UCSC, the macroRNA or lincRNA datasets and ~250 unannotated extragenic ncRNAs (Supplementary Table S1) (Although 200 annotated ncRNAs may seem like a small number to find, this result is consistent with the reported lower levels of expression and greater tissue specificity of ncRNAs (39), especially those lacking strong experimental support.). Annotated and novel ncRNAs accounted for 2% of transcribed regions called by HaTriC and an estimated 3% of the total number of RNA-Seq reads but <0.01% of the length of the genome (Table 1 and Supplementary Table S1). Given the small number of novel ncRNA loci that are actively transcribed in either mouse neurons or HeLa cells, relatively few unannotated conserved islands are estimated to function as promoters (<400) or exons (<1800) of novel extragenic ncRNAs in these cells (Figure 1 and ‘Materials and Methods’ section). These estimates depend on assumptions about the gene structure of novel ncRNAs and the overlap of their exons with conservation islands, since HaTriC does not define intron-exon structure directly. Our assumptions are based on the gene structure of annotated ncRNAs in RefSeq, UCSC and Ensembl. A recently published survey of human lincRNAs (39) provided an independent set of gene structures that have less exonic overlap with conserved islands (Supplementary Table S7). Thus, our above estimates may overemphasize the extent of conservation explained by novel ncRNAs.
We also addressed how many unannotated conserved islands might encode novel non-coding AS transcripts, which have been proposed to regulate sense gene expression (40–43). Previously described AS transcripts include<2kb promoter AS transcripts (34,44,45), synthesized upstream of genic promoters; <2kb eRNAs from intronic enhancers (18,19); and other, sometimes longer AS transcripts (46). It remains unclear how common these transcripts are, how highly expressed they are relative to annotated genes, and how frequently they overlap unannotated conserved islands. We found a substantial number of AS transcribed regions, accounting for 15% of all transcribed regions detected by HaTriC (Supplementary Table S1). Most AS transcribed regions correspond to promoter AS transcripts (Table 1). Because AS transcripts are generally short (Supplementary Figure S4c,d), the total fraction of the genome transcribed into AS in these cell types is small. Thus, in vertebrate genomes as in yeast (44,45), AS transcription is composed predominantly of short (<2kb), lowly expressed transcripts synthesized from promoters. Accordingly, AS transcripts do not explain the function of many unannotated conserved islands, since AS transcripts originate predominantly from conserved islands that are already annotated as promoter regions.
Our analysis suggests that few unannotated conserved islands encode ncRNAs or serve as their promoters. One reason we could be underestimating the overlap of conservation with ncRNAs is that our ability to detect ncRNAs spanning the 45% of conserved islands that are intronic is limited to detection of anti-sense ncRNAs. This limitation arises because our method does not allow us to distinguish specific ncRNAs that overlap pre-mRNAs or mRNAs on the same strand. However, the similar numbers of H3K4me3 binding sites at extragenic and intronic loci (Supplementary Table S3) suggest that this limitation is unlikely to result in a dramatic revision to our findings. A second potential reason that we could be underestimating the extent of overlap between conserved islands and ncRNAs would be if there were additional, less highly expressed novel ncRNAs not detected by HaTriC. Like any algorithm for identifying transcribed regions or transcripts, HaTriC has the least statistical power to detect and define lowly expressed transcripts, especially if the transcripts are short. For example, few eRNAs are detected by HaTriC (Supplementary Table S1). To evaluate the extent of this challenge, we asked how many RNA-Seq reads could be explained by different sources of annotation, including annotated transcripts HaTriC-defined transcripts, and transcripts associated with enhancers or other promoter-distal RFBSs (Table 1 and Supplementary Methods). We found that 99.8% of reads can be explained by these combined sources of genome annotation. Of the remaining <0.2% of unaccounted reads, >99% could be explained by a negative binomial background model, suggesting a low level of technical noise (Supplementary Figure S2).
We cannot fully rule out the alternative possibility that these reads are derived from tens of thousands of very lowly expressed ncRNAs. We estimate, however, that the expression levels of such transcripts would be <1 transcript per 100 cells (Supplementary Methods). This estimate, detailed in the supplement, is based on a reference point of 240000 mRNAs per cell (47). While we note that this reference point is imprecise, our overall copy number distributions accord well with those obtained using reference points based on digital in situ hybridization (48) (data not shown). Even if our copy number estimates were off by an order of magnitude, these results imply that if the unexplained reads do not represent technical noise, they may well represent biological (transcriptional) noise (28,49).
Having observed that few unannotated conserved islands appear to coincide with ncRNAs, we investigated an alternative hypothesis, that most unannotated conserved islands function as promoter-distal RFBSs. Such sites include enhancers, insulators and silencers (1,50). Each of these types of RFBS is marked by DNAseI Hypersensitivity Sites (DHSs), since the relatively open chromatin associated with RFBSs is vulnerable to DNAseI digestion (1,51,52). Thus, to identify RFBSs, we examined DHSs [Supplementary Table S2a (1)]. We found that DHSs overlapped ~9000 unannotated conserved islands, suggesting that a large number of unannotated conserved islands function as RFBSs. This overlap is highly significant (P-value <10−16, hypergeometric test where the intersection is 9000 between 700000 conserved islands and 60000 DHSs, assuming a total of 30 000000 potential loci), consistent with the idea that sequence conservation at overlapping unannotated conserved island/DHS loci reflects the importance of regulatory factor binding to DNA.
In theory, DHSs should correspond to all RFBSs bound in a given cell type, but in practice any DHS experiment will miss some RFBSs. In an independent approach to finding RFBSs, we turned to ChIP-Seq experiments to identify binding sites for a number of different regulatory factors. The factors immunoprecipitated were NPAS4, CREB, SRF and CBP in mouse neurons (18) and AP2α, AP2γ, MAX, cFOS, cMYC, E2F4 and E2F6 in HeLa cells (1). Analysis of the HeLa data revealed that ~90% of the binding sites for these factors overlapped a DHS, indicating that the DHSs represent a sensitive means of detecting RFBSs (Supplementary Table S2b) (53). Unsurprisingly in light of this high degree of overlap, ~12000 conserved RFBSs identified by factor binding were distributed across promoters, enhancers and unclassified RFBSs in much the same way as those identified by DNAse hypersensitivity (Figure 1; Supplementary Table S2a and S3a). Thus, regardless of the method used to identify RFBSs, many more unannotated conserved islands overlap with RFBSs than with ncRNAs. Moreover, the finding that a much larger fraction of the non-coding genomic elements serve as binding sites for regulatory factors rather than exons or promoters of ncRNAs is true for non-conserved parts of the genome as well. For the mouse neurons, we estimate that there are ~40000 non-conserved enhancers and RFBSs, compared with only 2700 promoters and exons for ncRNAs.
The large number of unannotated conserved islands found to overlap RFBSs led us to investigate what regulatory function unannotated conserved RFBSs (ucRFBSs) might serve. Insulators, which represent one major class of regulatory site, are involved in partitioning active and inactive regions of the genome (50,54). We asked how many ucRFBSs identified in HeLa cells were insulators, defined by the presence of the protein CCCTC-binding factor (CTCF). We found that ~3000 ucRFBSs could be classified as insulators on this basis (last three rows in Supplementary Table S3b).
Although the remaining ucRFBSs are distal both to annotated and HaTriC-defined promoters, we considered the possibility that they might be weak, unannotated ncRNA promoters not detected by HaTriC. We found that <500 ucRFBSs could be classified as promoters based on the presence of the promoter-specific H3K4me3 mark (Figure 2F, Supplementary Table S3c). Nonetheless, a larger number of ucRFBSs could represent inactive promoters that drive high levels of transcription in other cell types. We investigated this possibility by examining sequence, conservation and expression profiles of ucRFBSs (Figure 2). First, we reasoned that if ucRFBSs act as promoters in any tissue, they should share sequence characteristics with known promoters. Contrary to this prediction, ucRFBSs (like enhancers) lack high densities of CpG dinucleotides that are frequently found at coding promoters (Figure 2H). Second, we reasoned that promoters and promoter-distal regulatory sequences might be distinguished based on their extent of conservation. We found that the lengths of conserved islands at ucRFBSs more closely resemble those found at enhancers than those found at promoters, again suggesting that most ucRFBSs do not act as promoters in any tissue or cell-type (Figure 2G). Finally, if ucRFBSs act as promoters in any tissue, they should be enriched for 5′ ends of (5′-sequenced) ESTs (37), which have been sequenced at low depth from a wide variety of different tissues. However, the overlap between ucRFBSs and ESTs is similar to that observed between ESTs and previously defined enhancers and is dramatically less than the overlap between 5′ EST ends and promoters (Figure 2B). Moreover, only promoters of protein-coding genes have more spliced than unspliced ESTs (Supplementary Figure S9). These results suggest that ucRFBSs do not act as conventional coding or ncRNA promoters.
We hypothesized that many of the remaining ucRFBSs might represent enhancers that were missed by enhancer-identification algorithms (5,8,18) because their level of enrichment of p300/CBP or H3K4me1 was below the chosen threshold. To investigate this possibility, we examined CBP and H3K4me1 binding at ucRFBSs in mouse neurons. The majority of ucRFBSs had low but significant levels of CBP and H3K4me1 enrichment, suggesting that many of these unclassified sites could indeed be enhancers (data not shown and Supplementary Figure S5). However, a subset of these ucRFBSs may represent regulatory sites that are not enhancers. We conclude that ucRFBSs are a mixture of insulators, enhancers and perhaps other classes of promoter-distal regulatory sites (e.g., locus-control regions and silencers).
Our case studies of HeLa and mouse neuron cells suggest that most unannotated conserved islands overlap promoter-distal RFBSs (e.g. enhancers), rather than ncRNAs. However, recent genome-wide studies (18,19) have found that certain promoter-distal RFBSs (i.e. enhancers) are associated with low levels of transcription, producing ncRNAs. These results blur the distinction between promoter-distal regulatory sites and ncRNA promoters, raising questions about how well these two classes of loci can truly be distinguished. However, we find that that these enhancers and ucRFBSs can be clearly distinguished experimentally from ncRNA promoters for the following reasons. First, while both enhancers and ucRFBSs produce RNAs, these RNAs are expressed at much lower levels than those produced from traditional coding or non-coding promoters (Figure 2A,D, P-value <10−9, KS-test on distributions from Figure 2). The approximate expression levels of transcripts emanating from ucRFBSs averaged <1 transcript per 100 cells (Supplementary Figure S4a,b, Supplementary Methods), or ~10 to 100-fold lower than the copy number of an average genic mRNA. Second, relative to ncRNAs, the transcripts emanating from promoter-distal RFBSs are less likely to be spliced or polyadenylated (18,19) (Figure 2B,C,S9, P-value <10−16, KS-test on distributions from Figure 2). Third, while the length distributions of ncRNAs and ucRFBSs overlap, ucRFBSs are shorter (<2kb) than the typical lincRNA (P-value <10−2, KS-test on distributions from Figure 2). Fourth, sequence conservation at ucRFBSs in most cases does not extend beyond the specific location where regulatory factors bind, whereas coding and ncRNAs promoters exhibit conservation over a longer genomic region (Figure 2G). Thus, conserved distal regulatory sites differ significantly from ncRNA promoters based on their chromatin and transcriptional profiles.
To the extent that we can ascribe functions to unannotated conserved islands in our case study of one mouse and one human cell type, the ascribed functions are overwhelmingly related to binding of regulatory factors to DNA, with relatively few unannotated conserved islands corresponding to ncRNAs (Figure 1). However, our case study explains only ~20000 out of 700000 unannotated conserved islands, presumably because the remainder function only in cell types (or cellular conditions) other than those examined here. Since we are only able so far to investigate a small number of cell-types, our predictions about the functional role of the majority of conserved islands requires an extrapolation of our findings to additional cell types. In this extrapolation, how many additional unannotated conserved islands may be attributed to RFBSs and ncRNAs as more cell types are examined will depend on the relative cell-type specificity of RFBSs and ncRNAs.
To extrapolate how many additional conserved islands could be ascribed to conserved unannotated RFBSs (ucRFBSs) and ncRNAs as additional cell types are examined, we identified ncRNAs (using H3K4me3-binding) and ucRFBSs (using DHS sites) in 10 additional human cell lines using data from the ENCODE project (1) (‘Materials and Methods’ section). As we examined additional cell types, we discovered novel ucRFBSs and ncRNAs at similar rates (Figure 3A), implying that ucRFBSs and ncRNAs have similar cell-type specificity. However, this conclusion relies on the accuracy of using H3K4me3 to identify ncRNA promoters. To confirm that unannotated H3K4me3 loci are a reasonable proxy for ncRNA promoters, we used HaTriC to identify 800 novel ncRNAs in 10 human tissues, using RNA-Seq data generated from total RNA (Supplementary Table S5). The approximate number of novel ncRNAs found from each additional ENCODE cell line (using H3K4me3) or human tissue (using HaTriC) was similar (~100), suggesting that unannotated H3K4me3 sites are a reasonable proxy for ncRNA promoters [as was previously found by others (11)]. In fact relatively weak H3K4me3 sites are frequently found in locations where no transcription can be detected by RNA-Seq (Figure 2, cyan line), suggesting that our method of promoter identification by H3K4me3 may over-represent the number of ncRNA promoters. Because the rate of additional ncRNAs and RFBSs found with each new cell type examined is roughly equal, examining additional cell types does not radically alter our conclusion that more unannotated conserved islands have RFBS-related than ncRNA-related function.
We considered finally how many unannotated conserved islands might ultimately be assigned to RFBS-related or ncRNA-related functions once RFBSs and ncRNAs have been identified in all vertebrate cell types. The adult human body contains ~400 cell types (55). However, this number is likely to be an underestimate in the sense that it does not account for rare (and unknown) adult cell types, developmental stage-specific cell types, or for the ability of cells to adopt different gene expression and chromatin states depending on environmental cues. The true number is very difficult to approximate, but it is nonetheless instructive to extrapolate how conserved ncRNA and RFBS discovery might scale with increasing numbers of cell types. We therefore arbitrarily took 2000 as a potential upper limit on the number of cell types across human development and adulthood. Extrapolating our findings using the ENCODE cell lines to an estimated 1000–2000 cell type-conditions, we find it to be plausible that 20% of conserved islands encode ncRNAs or their promoters, whereas 80% of conserved islands function as promoter-distal RFBSs (Figure 3B). Even though there are many uncertainties associated with this estimate, it is to the best of our knowledge the first quantitative attempt to address the question of how many conserved non-coding sequences function as distal regulatory elements versus encode ncRNAs. We expect that access to RNA-Seq and ChIP-Seq data from additional cell-types, as well as more accurate multi-species alignments will ultimately allow similar approaches to produce more accurate estimates.
The function of the ~60% of conserved bases in vertebrate genomes that are non-coding is unknown (1,3). We provide genome-wide evidence suggesting that the reason most of these DNA bases are conserved across vertebrate evolution is due to their importance as promoter-distal regulatory elements, including enhancers. In individual cell types (HeLa cells or mouse neurons), we find that 4-fold more conserved islands are associated with promoter-distal regulatory elements rather than with ncRNAs. As we examine additional cell types, we find in each additional cell type four times as many conserved islands corresponding to promoter-distal regulatory elements as to ncRNAs. Extrapolating these results to an estimated 1500 cell type-conditions in a vertebrate body, it is plausible that almost ~300000 unannotated conserved islands function as promoter-distal regulatory elements, while ~60000 encode functions related to ncRNAs. At the same time, we cannot rule out the alternative possibility that many or even most unannotated conserved islands may have novel, as-yet unknown functions that are unrelated to RFBSs, ncRNAs, or even gene expression. For example, we used the H-rule (56) to estimate (Supplementary Figure S10) that 4% of conserved islands could correspond to MARs, which anchor chromatin to the nuclear matrix (57).
When characterizing the transcriptome, a crucial experimental parameter is the total number of reads captured in the experiment, since the extent of sequencing determines one's ability to detect lowly expressed transcripts. To evaluate how the sequencing depth affects our results, we down-sampled the RNA-Seq data. For mouse neurons >90% of the novel ncRNAs are discovered using only 50% of the original reads (Supplementary Figure S8), and extrapolation suggests that additional novel transcripts would be discovered at a rate of ~2 regions/10 million reads. Most of the regions that would be discovered from additional sequencing would express transcripts at even lower abundance than the regions that have been discovered at the current sequencing depth.
This low degree of extragenic transcription is in contrast with the results reported in a recent study by Mercer et al. (58). Using Capture-Seq, a combination of microarray capture and RNA-Seq, they identified 257 novel transcripts from 0.77Mb of human fibroblast genome. Extrapolation of their result suggests that there could be ~1 million lowly expressed extragenic transcripts in the entire genome, whereas extrapolation of our results suggests a number that is around three orders of magnitude lower. This difference is likely due in large part to their much higher sensitivity, which allows them to detect transcripts expressed at an average of 0.0006 mRNAs per cell.
Detection and expression level thresholds are crucial to understanding the ongoing debate about the extent of intergenic transcription. Initial estimates based on tiling arrays (1) suggested that the majority of the genome was transcribed in at least one cell type. Even though this finding has been challenged (28), other more recent studies (58,59) have argued that there is ubiquitous transcription, albeit at levels as low as an average of 0.0006 transcripts per cell. At the given sequencing depth, where we estimate our detection threshold at about one mRNA per cell, our findings are consistent with van Bakel et al. (28). Thus, while higher detection sensitivity allows for identification of further transcripts, these transcripts are expressed at very low levels. The weaker the expression level of a protein-coding or ncRNA, the less it tends to be conserved (60), suggesting that deeper the transcriptome is sequenced, the less overlap identified transcripts will have with evolutionary conservation.
Despite most of the genome not being transcribed, we do find thousands of short regions that produce low levels of transcripts associated with promoters, insulators, enhancers and other regulatory sites. These transcripts tend to be unspliced and non-polyadenylated (Figure 2B,C), suggesting that they do not leave the nucleus. Even if these transcripts themselves are not functionally important (49), the act of transcribing them may nonetheless be necessary. For example, much of the observed ncRNA transcription may be important for establishing and maintaining chromatin states such as histone acetylation or methylation [e.g., (61,62)]. In this case, the low levels of associated transcripts may reflect either the low stability of the non-functional transcripts that are produced or the low frequency of transcriptional initiation required for chromatin maintenance.
We propose that most sequence conservation in the non-coding genome reflects the importance of RFBSs. This proposition may appear to conflict with the observation that binding of certain regulatory factors (CEBPA, HNF4A) in liver tissue is poorly (<10% of sites) conserved between any two mammals (63). However, it has recently been argued that many of non-conserved binding sites are less likely to be functional and that the degree of conservation is significantly increased if expression of nearby genes is taken into consideration (64). More broadly, the extent of conservation of binding that is observed in any particular experiment likely depends on the particular tissues or regulatory factors examined, as is suggested by the higher conservation of binding of the transcription factor Twist across Drosophila species (65). Thus, in order to determine conclusively whether a particular conserved genomic sequence is associated with conserved binding, it will be necessarily to examine multiple bound factors at that locus across multiple tissues. Together our findings suggest that while many RFBSs come and go over an evolutionary time scale, a core set of conserved RFBSs account for most of the non-coding sequence conservation in vertebrate genomes.
Supplementary Data are available at NAR Online: Supplementary Tables 1–8, Supplementary Figures 1–10, Supplementary Methods and Supplementary References [66–85].
National Institutes of Health (NIH) [1DP2OD006461-01, 1R21NS070250-01A1, NS028829]; National Science Foundation (NSF) [BCS-0954570]. Funding for open access charge: NIH.
Conflict of interest statement. None declared.
We would like to thank Tae-Kyung Kim for preparing the ChIP-Seq libraries for the mouse neurons, Jian Gu and Kristithe Lea for preparing the HeLa RNA-Seq libraries and the ENCODE project for making their data publicly available.