ChIP-Seq peaks for different transcription factors cluster along the genome
Binding sites for transcription factors tend to cluster in regulatory modules [25
], and recent studies in Drosophila
] and mouse [10
] have also shown that peaks from a set of transcription factors identified by ChIP-Seq and related methodologies tend to cluster along the genome. To investigate whether human transcription factor ChIP-Seq peaks also display clustering properties, we identified overlaps between ChIP-Seq peaks for 39 factors in K562 and for 28 factors in Gm12878 (see Methods). These overlaps were compared with overlaps obtained by randomly shuffling peak positions within each chromosome. For the 39 factors in K562, a total of 151,624 individual ChIP-Seq peaks were identified. When identifying overlaps, these peaks condensed into 71,311 nonoverlapping regions, where 30,934 regions contained more than one peak. The same procedure for the randomly shuffled peaks produced 132,615 regions, where only 15,373 contained more than one peak. We defined each region containing more than one peak as a transcription factor cluster, and this definition was used throughout the rest of this study. The relative number of real transcription factor clusters compared to random clusters strongly increased with the number of transcription factors mapping to the cluster (Figures and ). We also observed an increase in the average length of the random clusters compared to transcription factor clusters as more peaks associated with the cluster (Figure ). The latter indicates that peaks within each cluster locate to more similar genomic positions than would be expected by chance. The results were similar for the 28 factors analysed in Gm12878 (Figures and ), where 126,238 peaks produced 61,209 nonoverlapping regions, and 23,491 regions contained more than one peak. The corresponding numbers for the random shuffle were 114,157 and 10,148. On the basis of these results, we conclude that human ChIP-Seq peaks from different transcription factors have a strong tendency to form clusters along the genome. We also calculated the overlap between transcription factor clusters in both cell types and found 8,320 overlapping clusters (one-third to one-fourth of all clusters). A comprehensive comparison between cell types will be the topic of a future study (T Håndstad, M B Rye, F Drabløs and P Sætrom, unpublished data). Herein we focus on the cluster patterns within each cell type individually but compare general trends between the cell lines.
Figure 1 High-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) peaks from multiple transcription factors form clusters within the genome. Number of ChIP-Seq peak clusters compared to the number of clusters produced after a random shuffle (more ...)
Histone H3 lysine 4 methylation is the chromatin mark best associated with transcription factor binding
Presumably, transcription factors require open chromatin for DNA binding, but it is unclear whether open chromatin by itself is a good predictor of transcription factor binding. To address this question, we investigated the overlap between transcription factor clusters and experimentally defined open chromatin regions (OCRs) (see Methods) or regions containing different histone marks associated with open or closed chromatin. Our first observation was the abundance of OCRs compared to regions showing mono-, di- or trimethylation of H3K4, histone modificationstypically associated with accessible chromatin [6
]. Although the number of regions enriched with H3K4me comprised less than half the number of experimentally defined OCRs (Figure ), 54 of 67 transcription factor data sets had overlap with H3K4me similar to or better than that with OCRs (Additional file 1
, Figure S1). Exceptions were the pair CTCF/Rad21, which mapped better to OCRs than to H3K4me, and a few other factors, such as NRSF and SETDB1, which mapped to neither H3K4me nor OCRs. When CTCF/Rad21 was excluded, 90% and 92% of the clusters, respectively, mapped to H3K4me compared to 92% and 93% that mapped to OCRs for K562 and Gm12878, respectively (Figures and ), despite the much smaller number of H3K4me regions. Only a few clusters (4% in K562 and 3% in Gm12878) mapped to regions without H3K4me, OCRs or H3K27ac and/or H3K9ac. We therefore conclude that H3K4me and OCRs mark somewhat different types of accessible chromatin most visible in the overlap with CTCF/Rad21 elements, but that H3K4me is a more specific signature than OCRs for the binding of most other transcription factors.
Figure 2 Methylation of lysine 4 on histone H3 (H3K4me) is a more specific marker of transcription factor clusters than is open chromatin (OCR). (A) Relative numbers of unique and shared regions for OCR and H3K4me. To compensate for the smaller region lengths (more ...)
Different transcription factors showed a preference for different general chromatin signatures (Figure and Additional file 1
, Figure S1). Factors enriched at annotated promoters (for example, YY1) generally mapped to H3K4me3, whereas other factors (for example, GATA2 and NFE2) preferentially mapped to H3K4me1, which may indicate their association with enhancers. Except for a few factors not mapping specifically to any chromatin mark, all factors analysed in this study preferentially mapped to known marks for accessible chromatin. This shows that regions enriched with transcription factors can generally be used as an alternative to histone modifications and other markers (such as DNase hypersensitivity) to identify genomic regions involved in gene regulation. We also found that data for histone modifications associated with accessible chromatin were highly redundant. Regions marked by acetylation (H3K9ac and H3K27ac) were almost totally contained within H3K4me (97% and 98% of combined H3K27ac and H3K9ac overlapped with H3K4me in K562 and Gm12878, respectively), and regions marked by H3K4 mono-, di- and tri-methylation were also highly overlapping (Additional file 1
, Figure S2). In addition, H3K27ac and H3K9ac regions where highly overlapping in K562 (83% and 89% overlap, respectively), whereas in Gm12878, H3K9ac regions (92% overlap) covered a subset of H3K27ac regions (69% overlap). Though H3K27ac and H3K9ac correlated well with other marks for accessible chromatin, these regions showed a weaker overlap with clusters compared to H3K4me and OCRs (Figures and ). All of these observations taken together show that H3K4me is a good marker for accessible chromatin and seems to be the preferred marker for transcription factor binding.
Figure 3 Examples of transcription factors mapping to different chromatin domains. YY1 has a preference for annotated promoters compared to the other factors. GATA2 maps to both methylated lysine 4 on histone H3 (H3K4me3) and H3K4me1, but not to annotated promoters, (more ...)
Few factors were associated with repressive modifications. An important exception is NRSF (also known as REST), which, in contrast to the other modifications we analysed, mainly associated with H3K27me3 (Additional file 1
, Figure S1). NRSF is a transcriptional repressor that acts as a scaffold for recruiting several chromatin-modifying complexes involved in dimethylation of H3K9 (H3K9me2) and demethylation of H3K4 [28
]. By binding long noncoding RNAs, NRSF can also colocalise with polycomb-repressive complex 2 (PRC2) [29
], which may explain the observed association between NRSF and H3K27me3. Similarly, H3K27me3 was partly associated with CTCF/Rad21, which is consistent with CTCF's interacting with PRC2 [30
Transcription factor clusters map to promoters of transcribed genes and upstream and/or downstream of promoters of both transcribed and silent genes
The binding of transcription factors is closely related to gene transcription. To test whether transcription factor clusters associate with transcribed genes, we investigated the binding pattern of clusters within the set of annotated nonredundant genes (that is, gene clusters; see Methods) and their promoters (that is, promoter clusters) and correlated this with RNA-Seq gene transcription data (see Methods). On the basis of RNA-Seq, we defined 6,228 genes as being highly transcribed in both cell lines and identified 5,591 and 4,124 genes with zero transcription in K562 and Gm12878, respectively. In accordance with previously reported results [7
], H3K4me was enriched around TSSs for the set of transcribed genes, H3K36me3 and Pol II were enriched in their transcribed regions and H3K27me3 was enriched at silent genes (Figure and Additional file 1
, Figure S3).
Figure 4 Histone modifications and RNA polymerase II (Pol II) have distinct profiles around transcription start sites (TSSs) of transcribed and silent genes. The plots are derived from the K562 cell line. Methylated lysine 4 on histone H3 (H3K4me3) is enriched (more ...)
In transcribed genes, clusters were highly abundant around TSSs, as expected (Figure ). However, we also observed clusters mapping outside the typical promoter region 2 kb upstream and 200 bp downstream from TSSs. The overall number of clusters mapping to the immediate promoter region of transcribed genes (promoter clusters) was comparable to the number of clusters mapping to other parts of the genes (gene clusters), in particular for K562 (Figure ). For silent genes, the number of promoter clusters was greatly reduced compared to the number of gene clusters. Thus gene clusters locate to both transcribed and silent genes, whereas promoter clusters mostly locate to transcribed genes. For both transcribed and silent genes, the distribution of clusters showed no clear distance dependence beyond 2 kb from TSSs. This likely reflects promoter clusters' direct regulation of local RNA polymerase recruitment or assembly, whereas gene clusters are involved in long-range regulation.
Figure 5 Transcription factor clusters are strongly enriched around transcription start sites (TSSs) in actively transcribed genes. (A) Mapping of transcription factor clusters around TSSs in annotated genes and promoters for transcribed and silent genes in K562 (more ...)
Composition of gene clusters correlates with H3K4me and differs from promoter clusters of transcribed genes
To further investigate differences in function between promoter and gene clusters, we asked whether the groups differ in their use of transcription factors. Furthermore, when correlating the locations of transcription factor clusters to histone modifications, we observed that the gene clusters could be further separated into two groups, depending on their association with H3K4me (Figure ). We therefore used correlation coefficients (r, Pearson's correlation) to measure the difference in transcription factor composition between promoter clusters, gene clusters overlapping with H3K4me and gene clusters not overlapping with H3K4me (Table ). Specifically, we calculated correlation coefficients by comparing profiles of relative enrichment of transcription factors within each group of clusters (see Methods; see Figure for two examples of profile comparisons).
Figure 6 The genomic region around transcription start sites (TSSs) can be divided into subregions. This illustration shows promoter cluster locations (dark grey), gene clusters associating with methylated lysine 4 on histone H3 (H3K4me3) (medium grey) and gene (more ...)
Correlation coefficients for composition differences in transcription factor clusters mapping to annotated genes and promoters
Figure 7 The composition of transcription factor clusters is correlated between cluster types. The correlations shown are based on the enrichment of each factor in each cluster type. (A) Promoter cluster (black) and gene clusters (white) in transcribed genes (0.46, (more ...)
Consistent with the different regulatory functions of promoter and gene clusters in transcribed genes, these two cluster types showed the largest difference in composition (r 0.46 and 0.04 for K562 and Gm12878, respectively) (see Figure ). We also observed a large compositional difference between gene clusters associated with and not associated with H3K4me (r 0.40 and 0.56, respectively) and between gene clusters associated with H3K4me and promoter clusters (r 0.52 and 0.04, respectively) in transcribed genes. Thus transcribed genes had three types of clusters with markedly different composition: promoter clusters close to TSSs, gene clusters more distal to TSSs associated with H3K4me, and gene clusters far into the gene body not associated with H3K4me.
When comparing clusters between transcribed and silent genes, we observed that the composition of gene clusters did not change with respect to transcription (r 0.99 and 0.99, respectively) and that this was true both for gene clusters associated with H3K4me (r 0.93 and 0.97, respectively) (Figure ) and for those not associated with H3K4me (r 0.99 and 0.88, respectively). In contrast, and consistent with promoter clusters' having active and local roles in transcription regulation, promoter clusters changed in composition between transcribed and silent genes (r 0.70/0.37, respectively). Thus there is a change in composition between transcribed and silent genes only for promoter clusters, whereas the composition in gene clusters remains similar. This indicates that long-range regulatory interactions are present in both transcribed and silent genes. Further supporting this conclusion, CTCF, which facilitates long-range interactions [3
], was among the most abundant transcription factors in all regions except for promoters of transcribed genes (Additional file 1
, Table S4).
For silent genes, the compositional differences between the three types of clusters showed some discrepancies between the two cell lines. Especially for K562, the composition of gene clusters associated with H3K4me mostly resembled the composition of promoter clusters (r 0.89), whereas the closest resemblance was observed with gene clusters not associated with H3K4me in Gm12878 (r 0.81) (see Discussion).
The enrichment of individual transcription factors in promoters and genes did not always reflect the composition of the different cluster groups. Most factors were present in several groups, but the relative enrichment of each factor in each group was sometimes very different (Figure and Additional file 1
, Table S4). For example, in transcribed genes, a higher number of promoter clusters contained the factor c-Jun (831) compared to gene clusters (583), but the percentage of clusters containing c-Jun was slightly higher in gene clusters (24%) than in promoter clusters (16%). Thus c-Jun may have a more important regulatory role in gene clusters, even though more ChIP-Seq peaks for this factor mapped to promoter clusters than to gene clusters. Generally, the 11,041 c-Jun peaks mapped better to TSS distal markers, such as H3K4me1 (Additional file 1
, Figure S1a) than to promoters, which indicates the involvement of c-Jun in long-range regulatory interactions.
We also noted some H3K4me and Pol II enrichment at promoters of silent genes. However, this enrichment was not transformed into transcriptional output, as evidenced by the zero RNA-Seq expression and low enrichment of H3K36me3 for these genes (Figure and Additional file 1
, Figure S3). The increased enrichment of these transcript-related features may explain the higher concentration of transcription factor clusters around TSSs for silent genes in K562 compared to Gm12878 (Figure ).
H3K4me3, H3K36me3 and Pol II identify clusters overlapping with transcripts
In addition to annotated promoters and genes, we expected a proportion of the transcription factor clusters to associate with transcripts and enhancers outside annotated regions. To separate clusters located to promoters or to the body of transcripts (transcript clusters, including possibly unannotated transcripts) from clusters not associated with promoters or transcript bodies (enhancer clusters), we used associations with histone modifications H3K36me3 and H3K4me3, together with enrichment of Pol II, as described by Mikkelsen et al
]. Because of the special regulatory function of CTCF/Rad21, these two factors were left out of the analysis at this stage. Clusters overlapping with either H3K36me3 or Pol II were classified as transcript clusters. In addition, clusters overlapping with H3K4me3 were classified as transcript clusters if the region of H3K4me3 enrichment overlapped with either H3K36me3 or Pol II. The last criterion separated isolated regions of H3K4me3 from regions of H3K4me3 that involved the other two transcription markers. We used independent data for Pol III to identify additional transcript clusters not transcribed by Pol II.
Three points must be mentioned with respect to the classification of transcript and enhancer clusters. First, we have used the term 'enhancer clusters' to describe clusters which do not contain the histone modifications and polymerase signatures characteristic of transcription and have indicated that these are more likely to be involved in long-range interactions. However, a recent study [33
] showed that a subset of enhancers involved in long-range interactions also produce short noncoding transcripts. Since such enhancers also recruit Pol II and show enrichment of H3K36me3 [34
], these regulatory elements are classified among the transcript clusters according to the definition given above. Second, a subset of enhancer clusters may represent elements that are not involved in direct gene regulation [35
]. Third, when we compared this data-driven classification with our previous annotation-based analysis, we observed some transcript clusters in promoters and gene bodies of silent genes, especially in K562. The set of silent genes showed a small enrichment of Pol II and H3K4me3 (but not H3K36me3) around their TSSs (Figure and Additional file 1
, Figure S3), but this enrichment did not result in detectable transcription. We still chose to classify clusters in these silent gene regions as transcript clusters, as these signatures were most likely a result of stalled transcription [37
] and not related to long-range interactions. So, though there may be different and possibly overlapping functions between the two classes of regulatory elements, we continue to use the notion of transcript clusters as mainly transcript-producing and enhancer clusters as mainly involved in long-range interactions throughout the rest of the text.
In K562 and Gm12878, we identified 17,836 and 12,424 transcript clusters, respectively, and 5,914 and 9,056 enhancer clusters, respectively. As expected, most of the transcript clusters and a minority of the enhancer clusters in K562 and Gm12878 mapped to annotated promoters and genes (12,775 and 9,458 transcript clusters, respectively, and 2,044 and 3,118 enhancer clusters, respectively) (Figure ). To validate these transcript clusters, we investigated their overlap with genes with respect to high, medium/low and zero gene expression as measured by RNA-Seq (Figures and ). Over 95% of the transcript clusters in K562 and 99% in Gm12878 mapped to genes with high or medium/low expression. The enhancer clusters were more evenly distributed among the three expression states, with a slight bias towards genes with medium/low or zero expression. Both results confirmed that the model based on H3K36me3, H3K4me3 and Pol II could separate enhancer clusters from transcript clusters associated with annotated genes and promoters.
Figure 8 Transcript and enhancer clusters mapping within and outside annotated genes and correlation with actual transcripts. (A) Number of transcript and enhancer clusters mapping within and outside annotated genes and promoters in both cell lines. (B) Number (more ...)
We primarily used Pol II-related transcription to separate transcript clusters from enhancer clusters. Transcripts can be produced by polymerases other than Pol II, and clusters associating with these polymerases should be identified by our model as long as they also associate with H3K36me3 and H3K4me3. However, we did not always observe this association when analysing independent data for Pol III. The overlap between Pol III and H3K36me3 was only 39% and 26%, compared to 72% and 71% for Pol II and H3K36me3, in K562 and Gm12878, respectively. We therefore included the independent Pol III data in our model. On the one hand, we do not know whether other polymerases behave similarly to Pol II or Pol III with respect to H3K36me3 and H3K4me3, so we cannot exclude the possibility that some clusters classified as enhancer clusters may be associated with transcription by other polymerases, such as Pol I. The effect of Pol I transcription may, on the other hand, be most pronounced in genomic repeat regions, which are often excluded when mapping ChIP-Seq data to the genome.
A large number of clusters mapped outside annotated genes and promoters (5,061 and 2,966 transcript clusters and 3,870 and 5,938 enhancer clusters in K562 and Gm12878, respectively) (Figure ). To validate transcript clusters mapping outside annotated genes and promoters, we used data from two independent studies that identified promoters for miRNA transcripts [38
] and long, intergenic, noncoding RNA (lincRNA) transcripts [39
]. The miRNA TSSs were cell type-independent, whereas the lincRNAs were mapped in another cell line. We found an increased overlap between these noncoding RNA transcripts and our transcript clusters compared to enhancer clusters and overlaps expected by chance (Table ). Though the independent data covered only a small part of the transcript clusters outside annotated genes and promoters, the good correspondence between transcript clusters and data from RNA-Seq, miRNA and lincRNA lead us to conclude that the transcript clusters did represent transcript-related regulatory elements. Conversely, we concluded that the enhancer clusters were enriched with long-range regulatory elements.
Overlap between independent miRNA and lincRNA transcripts and transcript and enhancer clusters outside annotated genes and promoters
H3K4me1 is a better marker for enhancer clusters than p300 or H3K27ac
Though attempts have been made to annotate enhancers [20
], the general impression is that enhancers remain poorly annotated in the human genome. The most likely transcription factor clusters associated with enhancers are clusters that lack the histone modification marks characteristic of transcription (H3K4me3, H3K36me3 and Pol II; see previous section). We therefore wondered whether these clusters, which we termed 'enhancer clusters', showed marks previously associated with enhancer regions. We focused our analysis on enhancer clusters mapping outside annotated genes and promoters. There were 3,870 clusters of this type in K562 and 5,938 in Gm12878. We then investigated the overlap of these clusters with respect to three recently used marks for enhancers: histone modifications H3K4me1 and H3K27ac and binding of the histone acetyltransferase p300 (Figure and Additional file 1
, Figure S5). Several studies have suggested that H3K4me1 is an enhancer marker [16
], and we also observed an enrichment of H3K4me1 relative to other histone modifications within our enhancer clusters. Our results thus confirm H3K4me1 as a good marker for enhancers in both cell lines.
Figure 9 Transcript and enhancer clusters show different overlap with histone modifications and open chromatin (OCR). Transcript clusters generally show a good overlap with all modification marks for accessible chromatin together with OCR, whereas enhancer clusters (more ...)
Another marker commonly used for enhancer identification is the transcription factor and histone acetyltransferase p300 [16
]. However, we did not observe a good correspondence between this factor and enhancer clusters in Gm12878. Only 8% of the enhancer-related clusters were covered by this factor (data for p300 in K562 were not available). In fact, we identified ten factors in Gm12878 with better coverage of the enhancer clusters than p300, the best ones being BATF (51%), IRF4 (45%) and PU1 (41%). The latter factors also showed a preference for H3K4me1 (Additional file 1
, Figure S1B), whereas p300 also mapped well to H3K4me3 and transcript clusters.
Although we saw poor correspondence between p300 and regulatory elements in Gm12878, p300 could be a better marker in other cell lines [40
]. Only approximately 1,500 p300 peaks were identified by the ChIP-Seq analysis, which is one reason for the low coverage in Gm12878. In addition, the identified peaks did not seem to map specifically to enhancers. We did not observe any transcription factors with specificity only towards the enhancer related clusters, and all factors with a high overlap with these clusters also overlapped well with transcript clusters (Additional file 1
, Figure S5).
H3K27ac has also recently been used as an identifier for enhancer regions [36
], but it showed a weaker overlap with enhancer clusters compared to H3K4me1 (Figure ). In addition, all enhancer clusters containing H3K27ac also contained H3K4me1. Still, H3K27ac has been shown to be a useful marker for separating active from weak enhancers [34
] and might provide valuable information on subclasses of enhancers and enhancer activity in addition to the H3K4me1 mark. The notion that H3K27ac marks specifically active regulatory elements is reinforced by the observation that H3K27ac is present at nearly all active genes (Figure ). We also observed that H3K9ac shows a mapping pattern similar to that of H3K27ac, especially in K562 (not shown). The percentage overlap of H3K27ac was also similar to that of H3K4me3 (Figure ); however, these two modifications mark somewhat different enhancer clusters. OCRs also showed good overlap with enhancer clusters, but an OCR in itself was a less specific marker for enhancer clusters because of the large number of OCRs not mapping to any transcription factors (see Discussion). Thus the overall conclusion is that, in our enhancer clusters bound by transcription factors, H3K4me1 was a superior individual marker compared to H3K27ac, p300 and OCRs.
Mapped transcription factors cover a high percentage of promoters in highly transcribed genes
There are 1,500 to 2,000 known transcription factors binding to DNA in humans [46
]. Because not all factors are expressed in each cell type and since factors generally bind to DNA in a combinatorial fashion, we investigated whether our selected clusters were sufficient to cover regulatory regions in a specific cell type. We investigated the ability to cover annotated promoters for genes with high, medium/low and zero expression in both cell lines (Figure ). The coverage estimate was based on an average of 100 calculations whereby the order in which the transcription factors were selected was randomly shuffled between each calculation. As more transcription factors are added, the percentage of regulatory elements covered by clusters increases. For promoters of highly expressed genes, the coverage reaches 91% for K562 and 55% for Gm12878 when all transcription factors are added. The coverage was lower in promoters of genes with medium/low and zero expression, which is consistent with these genes' having low or no transcription in the two cell types. To investigate whether the inclusion of singletons would increase the coverage further, we repeated the analysis, but now also including singleton peaks in addition to clusters (Additional file 1
, Figure S6). When singletons were included, the coverage increased to 96% for K562 and 80% for Gm12878. An interesting observation regarding K562 is that when the number of factors exceeds 30, the increase in coverage of highly transcribed genes by adding a new factor is only marginal. This trend was also observed when we investigated coverage of the identified transcript and enhancer clusters by the same procedure (Additional file 1
, Figure S6). Thus, despite the large number of existing human transcription factors, the results derived from the coverage analysis indicate that a relatively modest number of factors may be sufficient to map regulatory elements in a specific cell line.
Figure 10 Transcription factors mapped by high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) show good coverage of annotated promoters for transcribed genes. Graphs showing the percentage of annotated gene promoters covered by clusters (more ...)