RNA polymerases are highly regulated molecular machines. We present a method (global run-on sequencing, GRO-seq) that maps the position, amount, and orientation of transcriptionally engaged RNA polymerases genome-wide. In this method, nuclear run-on RNA molecules are subjected to large-scale parallel sequencing and mapped to the genome. We show that peaks of promoter-proximal polymerase reside on ~30% of human genes, transcription extends beyond pre-messenger RNA 3′ cleavage, and antisense transcription is prevalent. Additionally, most promoters have an engaged polymerase upstream and in an orientation opposite to the annotated gene. This divergent polymerase is associated with active genes but does not elongate effectively beyond the promoter. These results imply that the interplay between polymerases and regulators over broad promoter regions dictates the orientation and efficiency of productive transcription.
Transcription of coding and noncoding RNA molecules by eukaryotic RNA polymerases requires their collaboration with hundreds of transcription factors to direct and control polymerase recruitment, initiation, elongation, and termination. Whole-genome microarrays and ultra-high-throughput sequencing technologies enable efficient mapping of the distribution of transcription factors, nucleosomes, and their modifications, as well as accumulated RNA transcripts throughout genomes (1, 2), thereby providing a global correlation of factors and transcription states. Studies using the chromatin immunoprecipitation assay coupled to genomic DNA microarrays (ChIP-chip) or to high-throughput sequencing (ChIP-seq) indicate that RNA polymerase II (Pol II) is present at disproportionately higher amounts near the 5′ end of many eukaryotic genes relative to downstream regions (3–6). However, these techniques cannot determine whether Pol II is simply promoter-bound or engaged in transcription. Small-scale analyses using independent methods have shown that this distribution likely represents transcriptionally engaged Pol II that has accumulated between ~20 and 50 bases downstream of transcription start sites (TSSs) (5, 6), indicating that transcription can be regulated at the stage of elongation as well as the recruitment and initiation stages (7). This promoter-proximal pausing or stalling (8) is proposed to be an important post-initiation, rate-limiting target for gene regulation (7, 9).
Here, we present a global run-on-sequencing (GRO-seq) assay to map and quantify transcriptionally engaged polymerase density genome-wide. These measurements provide a snapshot of genome-wide transcription and directly evaluate promoter-proximal pausing on all genes. We used nuclear run-on assays (NRO) to extend nascent RNAs that are associated with transcriptionally engaged polymerases under conditions where new initiation is prohibited. To specifically isolate NRO-RNA, we added a ribonucleotide analog [5-bromouridine 5′-triphosphate (BrUTP)] to BrU-tag nascent RNA during the run-on step (fig. S1). The length of the polynucleotide was kept short, and the NRO-RNA was chemically hydrolyzed into short fragments (~100 bases) to facilitate high-resolution mapping of the polymerase origin at the time of assay (8). BrU-containing NRO-RNA was triple-selected through immunopurification with an antibody that is specific for this nucleotide analog, resulting in a 10,000-fold enrichment of the NRO-RNA pool that was determined to be >98% pure (8). A NRO-cDNA library was then prepared for sequencing from what represents the 5′ end of the fragmented, BrU-incorporated RNA molecule by using the Illumina high-throughput sequencing platform. The origin and the orientation of the RNAs and therefore the associated transcriptionally engaged polymerases were documented genome-wide by mapping the reads to the reference human genome (8).
In total, ~2.5 × 10⁷ 33–base pair (bp) reads were obtained from two independent replicates (8) prepared from primary human lung fibroblast (IMR90) nuclei, of which ~1.1 × 10⁷ (44%) mapped uniquely to the human genome. Most reads (85.8%) align on the coding strand within boundaries of known RefSeq genes, human mRNAs, or expressed sequence tags (fig. S2). The number of transcriptionally active genes was determined by using an experimentally and computationally determined background of 0.04 reads per kilobase (8). We found 16,882 (68%) of RefSeq genes to be active (P < 0.01) compared with 8438 active genes found by a microarray experiment performed in the same cell line (3), reflecting, in part, the added sensitivity of sequencing platforms (10). Examination of several large regions shows that GRO-seq can differentiate between transcriptionally active and inactive regions in large chromosomal domains (Fig. 1). In addition, we are able to detect a generally low, but significant (P < 0.01 relative to background) amount of antisense transcription for 14,545 genes (58.7% of genes in the genome) (fig. S3).
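The background-based activity call described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that background reads on a gene follow a Poisson distribution with a mean of 0.04 reads per kilobase; the statistical model actually used is described in the supporting material (8) and may differ. The function names `poisson_upper_tail` and `is_active` are hypothetical.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) computed from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_active(read_count: int, gene_length_bp: int,
              background_per_kb: float = 0.04, alpha: float = 0.01) -> bool:
    """Call a gene active if its read count is unlikely under background.

    Assumption: background reads are Poisson-distributed with the given
    per-kilobase rate (an illustrative model, not the published procedure).
    """
    expected = background_per_kb * gene_length_bp / 1000.0
    return poisson_upper_tail(read_count, expected) < alpha
```

Under this assumed model, a 10-kb gene has an expected background of 0.4 reads, so three reads would pass the P < 0.01 threshold while two would not.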
Aligning the GRO-seq data relative to RefSeq TSSs shows that the density of reads peaks near the TSS in both sense (~50 bp) and antisense (~−250 bp) directions (see below) (Fig. 2A). Alignment of GRO-seq reads to annotated 3′ ends of genes reveals a broad peak that is maximal at about +1.5 kb and can extend greater than 10 kb downstream of polyadenylation (poly-A) sites (Fig. 2B). This peak distance is consistent with previous and recent estimates (11, 12). A small peak followed by a sharp drop off is observed at the site of polyadenylation, likely representing the known 3′ cleavage before polyadenylation of the RNA (13).
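The TSS-centered alignment can be sketched as a strand-aware composite profile. This is an illustrative sketch, not the published pipeline: the input layout (tuples of chromosome, position, and strand, with each read reduced to its 5′ end) and the function name `composite_profile` are assumptions. Positions are expressed relative to each TSS in the direction of the annotated gene, so upstream divergent reads receive negative coordinates.

```python
from collections import Counter


def composite_profile(tss_list, reads, window=1000):
    """Count read 5' ends by position relative to TSSs, split by orientation.

    tss_list: iterable of (chrom, tss_pos, gene_strand), strand is "+" or "-"
    reads:    iterable of (chrom, five_prime_pos, read_strand)
    Returns (sense, antisense) Counters keyed by relative position, where
    positive values are downstream of the TSS in the gene's orientation.
    """
    tss_by_chrom = {}
    for chrom, pos, strand in tss_list:
        tss_by_chrom.setdefault(chrom, []).append((pos, strand))

    sense, antisense = Counter(), Counter()
    for chrom, read_pos, read_strand in reads:
        for tss_pos, gene_strand in tss_by_chrom.get(chrom, []):
            # Flip the coordinate for minus-strand genes so that
            # "downstream" is always positive.
            rel = read_pos - tss_pos if gene_strand == "+" else tss_pos - read_pos
            if abs(rel) > window:
                continue
            if read_strand == gene_strand:
                sense[rel] += 1
            else:
                antisense[rel] += 1
    return sense, antisense
```

The nested loop is kept for clarity; a real data set of this size would need an interval index or sorted lookup instead.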
To identify all genes that show a peak of engaged Pol II that is characteristic of promoter-proximal pausing, we assessed whether each gene showed significant enrichment of read density in the promoter-proximal region relative to the density in the body of each gene (8). The ratio of these densities is called the pausing index (5, 6, 8), and significant pausing indices range from 2 to 10³ (fig. S4). Within the defined promoter region, 7057 genes have a significant enrichment of GRO-seq reads relative to the body of the gene (P < 0.01), representing 28.3% of all genes (41.7% of active genes). Comparison of paused genes to either microarray expression or GRO-seq data revealed four classes of genes: class I, not paused and active; class II, paused and active; class III, paused and not active; and class IV, inactive (not paused and not active) (Fig. 3). Class III was severely depleted when we used GRO-seq to classify gene activity because GRO-seq provides a more sensitive measure of gene activity. Given the low signal at the promoters of the few genes left within this class, they are likely to be classified as active with deeper sequencing. Therefore, the overwhelming majority of genes with a paused polymerase also produce significant transcription throughout the gene, albeit often to quantities not detectable by expression microarrays. A recent comparison of Pol II ChIP-seq data to RNA-seq also supports the view that nearly all genes that are bound by Pol II produce full-length transcripts (10).
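As a worked illustration of the pausing index and the four gene classes, the following sketch computes the density ratio from read counts and region lengths. The window boundaries and the significance test are defined in the supporting material (8) and are not reproduced here; the function names and the handling of a gene body with zero reads are assumptions made for this example.

```python
import math


def pausing_index(promoter_reads: int, promoter_len_bp: int,
                  body_reads: int, body_len_bp: int) -> float:
    """Ratio of promoter-proximal read density to gene-body read density."""
    promoter_density = promoter_reads / promoter_len_bp
    body_density = body_reads / body_len_bp
    if body_density == 0:
        # Assumption for this sketch: report infinity when only the
        # promoter has signal, and 0.0 when neither region has reads.
        return math.inf if promoter_reads > 0 else 0.0
    return promoter_density / body_density


def gene_class(paused: bool, active: bool) -> str:
    """Map the paused/active calls onto the four classes named in the text."""
    if active:
        return "II" if paused else "I"
    return "III" if paused else "IV"
```

For example, 50 reads in a 100-bp promoter-proximal window and 100 reads across a 10-kb gene body give densities of 0.5 and 0.01 reads per base, and therefore a pausing index of about 50.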
The density of polymerases within the promoter-proximal region generally correlates with the level of gene activity when all genes (Fig. 4A) or only genes with a paused polymerase are considered (fig. S5). Whereas nearly all paused genes show significant full-length activity by GRO-seq, the pausing index inversely correlates with gene activity (Fig. 4B). Considering that pausing is observed when Pol II enters a pause site faster than the rate of escape from pausing (9), this inverse correlation is consistent with the hypothesis that highly transcribed, but paused genes appear to be controlled, at least in part, by increasing the rate at which Pol II escapes the pause site and enters productive elongation (8).
A prominent and unexpected feature of the GRO-seq profiles around TSSs is the robust signal from an upstream, divergent, engaged polymerase. RNAs generated by these divergent polymerases can be identified at low concentrations when small RNAs are isolated from whole cells (14). These divergent polymerases cannot be accounted for by the 10% of known bidirectional promoters that are less than 1 kb apart (15) (fig. S6). We found that 13,633 genes (55% of all genes, 77% of active genes) display significant divergent transcription within 1 kb upstream of sense-oriented promoter-proximal peaks (P < 0.001), indicating that the number of bidirectional promoters exceeds even the highest estimates (16, 17). However, because it appears that the majority of these promoters produce mRNAs in only one direction (see below), we refer to this class of promoters as divergent. Although the top 10% of active genes have, on average, a slightly larger promoter-proximal than divergent peak (Fig. 3D), amounts of divergent transcription generally correlate with both the promoter-proximal signal (fig. S7) and the transcription level of the associated gene (Fig. 4C). Thus, divergent transcription is a mark for most active promoters.
Gene activity, pausing, and divergent transcription correlate with each other and with promoters containing a CpG island. These four characteristics co-occur significantly more often than would be expected by chance (P < 10⁻⁵²) (table S1). Previous mapping of capped mRNA transcripts has shown that at CpG island promoters initiation occurs broadly over hundreds of base pairs (18), and GRO-seq shows that polymerases initiate and accumulate on this large class of promoters in both orientations.
Does existing ChIP-chip data (3) show any indication of the divergent peak of polymerase? Manual inspection of a number of genes and comparison with composite profiles aligned to TSSs show that the Pol II ChIP peak at promoters is accounted for by the two divergent peaks uncovered by GRO-seq (Figs. 1B and 4E). Higher-resolution ChIP-seq data in different cell lines have identified Pol II molecules upstream of promoters that were proposed to be in the same orientation as the annotated gene; however, these instead are likely to represent the divergent promoters identified by GRO-seq (10). Additionally, active promoters are typically marked by histone modifications such as di- and trimethylation of H3-Lys4 (H3K4me2 and H3K4me3) as well as acetylation of histone H3 and H4 (H3ac and H4ac). These modifications show a bimodal distribution around TSSs, with the trough representing a nucleosome-free region encompassing the TSS (3, 4, 19). Comparison of available H3ac and H3K4me2 data in this cell line (3) with GRO-seq suggests that both upstream and downstream peaks of these histone modifications are associated with active transcription, with each peak of histone modifications being adjacent and downstream of an engaged polymerase (Fig. 4F) (8). Other studies have shown that histone modifications associated with transcription elongation (e.g., H3K36me3 and H3K79me3) do not associate in a bimodal fashion around TSSs (4, 19). This and the lack of divergent GRO-seq reads further upstream (fig. S8) indicate that the majority of promoters experience initiation in the upstream direction but that these divergent polymerases do not productively elongate transcripts. Thus, promoters can distinguish polymerase in the forward versus the reverse direction.
We envision several possible functions for divergent transcription. First, the act of transcription itself could be crucial for granting access of transcription factors to control elements that reside upstream of core promoters, possibly by creating a barrier that prevents nucleosomes from obstructing transcription factor binding sites (20, 21). Second, as proposed by Seila et al. (14), negative supercoiling produced in the wake of transcribing polymerases could facilitate initiation in these regions. Third, these short nascent RNAs could themselves be functional, through either Argonaute-dependent (22) or -independent (23) pathways. Upcoming challenges will be to decipher whether the widespread transcriptional activity that lies upstream but divergent from the direction of coding genes positively or negatively regulates transcription output and how promoter or unknown DNA elements are designed to distinguish between productive elongation in one direction versus the other.
We gratefully thank C. Haudenschild for advice on construction of our libraries and for performing the initial alignments, Q. Sun and L. Ponnala for aligning the trimmed reads, A. Siepel for computational and statistical discussion, and the members of the Lis lab for suggestions regarding this work. The work was funded by NIH grant GM25232 to J.T.L. The data discussed in this publication have been deposited in National Center for Biotechnology Information's Gene Expression Omnibus under accession number GSE13518. The authors are filing a patent based on the work in this paper.