Organizational Complexity of RNAPII-associated Chromatin Interactions
We analyzed 5 different human cell lines (MCF7, K562, HeLa, HCT116 and NB4) using ChIA-PET with a RNAPII antibody (8WG16) that recognizes the initiation form of the protein. The cell lines originated from a wide range of lineages, and provided a broad representation of human cells. In our pilot analysis, about 20 million uniquely mapped paired-end reads were generated for each of the ChIA-PET experiments (Table S1
), which resulted in two genome-wide datasets: the ChIP-enriched RNAPII binding sites and the RNAPII-bound long-range chromatin interactions. Both intra-chromosomal and inter-chromosomal interaction data were obtained, and the vast majority of chromatin interactions identified by ChIA-PET were intra-chromosomal (Table S2
). Twenty five intra-chromosomal and seven inter-chromosomal interactions were validated either by 3C, DNA-FISH or both (Figure S1
and inset of ).
Characterization of RNAPII binding peaks and chromatin interactions
To present an inclusive view of the RNAPII-associated human chromatin interactome, we combined the ChIA-PET sequence reads from the 6 pilot experiments into one dataset for analysis (Table S1
). Using embedded nucleotide barcode controls and statistical analyses, we assessed the data quality, filtered out the technical noise, and identified high-confidence binding sites and interacting PET clusters (Experimental Procedures). From the combined pilot dataset, we identified 14,604 high-confidence (FDR<0.05) RNAPII binding sites as well as 19,856 high-confidence intra-chromosomal interaction PET clusters (Table S3
). The majority (83%) of RNAPII binding sites in the combined dataset were proximal to 5′ Transcription Start Sites (TSS) of genes (). There were also distinct but relatively weaker enrichments of peaks at the 3′ Transcription End Sites (TES) of genes. Similar patterns were seen in all the individual experiments. Of the total RNAPII binding sites, 9487 (65%) were involved in chromatin interactions and these sites showed higher RNAPII occupancy than those not involved in interactions (), indicating that most highly-enriched RNAPII binding sites are involved in looped chromatin conformations.
Three basic types of interactions were identified around gene promoters in the combined pilot dataset: intra-genic (promoter to gene internal regions, 938, 5%), extra-genic (promoter to distal regulatory elements such as enhancer, 6530, 33%), and inter-genic (promoter-promoter of different genes, 8282, 42%). There was also a subcategory composed of intermediate enhancer-enhancer interactions (4106, 20%). Some interactions (2341, 12%) were standalone duplex interactions between two interacting anchor regions, whereas most (17515, 88%) were further aggregated into 1544 interaction complexes.
We speculated that the isolated RNAPII binding at promoter sites, which are not involved in interactions, may reflect the basal promoter function for gene transcription, and thus were termed “basal promoters”. By contrast, RNAPII-associated interactions might constitute a structural basis for complex regulatory mechanisms. These basic interactions further aggregated into complex architectures which we classified as “single-gene” or “multi-gene” complexes depending on the number of genes involved (). The single-gene models consisted of single or multiple enhancer interactions with only one gene promoter, whereas the multi-gene models included inter-genic promoter-promoter interactions and could also include intra-genic and extra-genic enhancer-promoter interactions. Moreover, several such complexes, distantly separated on a chromosome or on different chromosomes, further converged to form higher-order multi-gene interaction complexes (Figures S1B, D, F-G
). Many chromatin complexes had genomic spans of 150Kb-200Kb, and a few complexes spanned several megabases. Although there were only 1328 multi-gene complexes in this combined pilot dataset, 11723 genes were engaged in these complexes for an average of 8.8 genes per interaction complex (), indicating that promoter-promoter interactions were widespread and may play a significant role in transcription regulation.
To understand how these looping structures influence transcription, we characterized these RNAPII-associated chromatin models (basal promoters, single-gene and multi-gene complexes) for structural features (genomic property), functional output (transcription activity), and epigenomic marks (chromatin state).
Distinct Genomic Properties of Single- and Multi-gene Interaction Models
To determine the genomic characteristics of RNAPII-associated chromatin structures, we mapped several genomic descriptors that were known to associate with the expressivity of the human genome (Versteeg et al., 2003
), including GC content, gene density, SINE/LINE density, gene length, and the intron/exon ratio. In our analyses (, Figure S2A
), the multi-gene complexes were significantly enriched with higher GC content, higher gene and SINE density, and lower LINE density as compared to the single-gene interaction complexes and the regions of basal promoters, suggesting that multi-gene complexes were located in open chromatin and highly transcribed regions. In addition, genes in the multi-gene complex regions were relatively shorter than other gene categories, which is yet another property of highly expressed genes (Eisenberg and Levanon, 2003
). Conversely, genomic loci associated with the single-gene complexes lay in the regions with lower gene and SINE density. Moreover, the genes engaged in the single-gene complexes were significantly longer and had higher intron/exon ratios than the genes of other chromatin models (). These observations suggest that genes with enhancer-promoter interactions in single-gene complexes were more likely to be tissue-specific or developmentally regulated, in line with the previous findings that genes in gene-poor regions associated with several distant regulatory elements, tended to be longer and had a higher non-coding to coding ratio than housekeeping genes (Eisenberg and Levanon, 2003
; Taylor, 2005
Genomic properties of promoter-centered chromatin models
Interacting Genes Show Correlated Expression
To investigate the functional output of genes involved in the different chromatin models, as defined by transcriptional activity, we focused our analyses on MCF7 cells, as it is a well-characterized human cancer cell model with complementary datasets including RNA-Seq (Experimental Procedures), time-course microarray gene expression (Fullwood et al., 2009
), and GRO-Seq datasets (Hah et al., 2011
Consistent with the combined pilot dataset, 90% binding sites in MCF7 cells were found proximal to known gene promoters and 97% genes with RNAPII present at their promoters had detectable transcriptional activity by RNA-Seq (). The interactive RNAPII binding sites that were distal to gene promoters included intra- and extra-genic regulatory elements such as enhancers. Approximately 45% of the extra-genic distal regulatory sites had detectable RNA signals that could represent possible non-coding RNA (ncRNA) transcripts.
Transcriptional activities in RNAPII-associated chromatin models in MCF7 cells
For genes associated with the three chromatin models, we analyzed the transcription levels measured by RNA-Seq reads. As shown in , in general, RNAPII binding at promoter sites correlated well with the expression level of the corresponding genes. Interestingly, the genes involved in the single-gene and the multi-gene models showed higher correlation between RNAPII binding and RNA-Seq signal (Pearson's Correlation Coefficient – PCC: 0.46 and 0.45 respectively) as compared to basal promoter genes (PCC: 0.24). Moreover, we observed that genes linked by complex chromatin interactions, especially those in multi-gene complexes, had significantly higher expression levels than basal promoter genes (). This high expression appeared to be limited to genes interacting at the RNAPII anchor sites, as compared to genes located in the intervening chromatin loops. These data indicated that promoter-promoter interactions in multi-gene complexes were associated with higher transcriptional activity, which is consistent with our observations of their associated genomic features.
Next, we characterized the expression patterns of genes present in the interacting regions using microarray data derived from 84 human tissues (Su et al., 2002
). We found distinct representation of tissue-specific and housekeeping genes in the three chromatin models (, Figures S3A-B
). Most genes in single-gene complexes with enhancer-promoter connectivity were tissue-specific, consistent with growing evidence that the expression levels of developmental and tissue-specific genes are largely modulated through cis
-remote regulatory elements and trans
-protein factors (Hou et al., 2010
; Schoenfelder et al., 2010
), and consistent with their genomic features (less gene density, longer gene body and higher intron/exon ratio) as previously described. Conversely, genes involved in multi-gene complexes as well as the basal promoter genes were characterized as both tissue-specific and housekeeping categories. These observations were also supported by normalized CpG content and GC-skew at their promoter regions (Figures S3C-D
As promoter-promoter interactions cluster multiple genes, they could provide an ideal topological framework for potential transcriptional coordination of both tissue-specific and housekeeping genes. This observation agrees with the evidence that “ridges”, which are domains of highly transcribed genes, contain both housekeeping and tissue-specific genes (Versteeg et al., 2003
). Since large numbers of genes are found in multi-gene complexes, we propose that promoter-promoter interactions could serve as a dominant mechanism for transcription regulation of both housekeeping and tissue-specific genes in mammalian genomes.
Next, we sought to determine whether genes with promoter-promoter interactions were more likely to be transcriptionally coordinated. RNA-Seq data showed that most of the paired genes with promoter-promoter interactions were expressed together at high levels (; Figure S3E
). To further assess the coordinated transcription of paired genes across different conditions, we performed Pearson's correlation analysis using estrogen-induced time course of GRO-Seq data (Hah et al., 2011
) that measured transcription initiation rates of estrogen responsive genes, and observed significant transcriptional correlation (; p
-value < 2.2E-16). Interestingly, the correlation was even greater for ERα-mediated gene pairs derived from our earlier data (Fullwood et al., 2009
), suggesting stronger correlation of transcription for genes involved in multi-gene complexes mediated by specific transcription factors. Similar correlation was also observed from other gene expression datasets (Figure S3F-I
). As expected, housekeeping genes and genes belonging to the same GO classes showed even higher correlation than the rest (Figure S3J-K
). Altogether, our analyses indicated that a significant proportion of gene pairs involved in promoter-promoter interactions tended to be transcribed cooperatively.
Multi-gene Complexes Provide Structural Framework for Co-transcription
Correlated expression of interacting genes suggests that the multi-gene interaction complex might provide a molecular basis for the postulated “transcription factory” (Cook, 1999
). To elucidate the link between the multi-gene complexes revealed by ChIA-PET and transcription factories, we performed 3D DNA-FISH experiments using probes representing distinct multi-gene complexes in combination with RNAPII-IF staining in MCF7 nuclei (Experimental Procedures). All experiments on four genomic loci randomly chosen from multi-gene complexes revealed a significant association of the multi-gene complex loci with RNAPII foci (), adding further evidence to support our view that multi-gene complexes could provide a structural framework for co-transcription.
Transcriptional coordination in multi-gene chromatin complexes
Furthermore, gene families were significantly over-represented (p
-value < 0.006) in the multi-gene complexes (Figure S3L
), such as HIST, ZNF, KRT, HOXC
, etc (Table S4
). Taking the HIST1H
family as an example, the 58 genes of this family located on chromosome 6 formed three multi-gene complexes, and these three complexes converged into a higher-order super-complex, suggesting that all HIST1H
genes were organized in a single chromatin architecture for coordinated transcription (). All HIST1H
genes were actively transcribed in both MCF7 and K562 cells, and were highly co-regulated across different tissues and cellular conditions (). Interestingly, HFE
, a gene was not a part of the HIST1H
family but was located in the middle of the first HIST1H
multi-gene complex, was not anchored at the interaction sites and was not expressed. Similarly, the genes located in the intervening loop regions between the three HIST1H
interacting complexes were relatively less active and much less coordinated for co-regulation across different tissues and cellular conditions. This case exemplifies the model where multi-gene complexes organize genes with similar functions across genomic space for coordinated expression.
Multi-gene Complexes Support Synergistic Transcription Regulation
To further investigate the likelihood that the multi-gene complex structure might provide a topological framework for transcriptional co-regulation of interacting genes involved in such topology, we designed a set of perturbation experiments to test this. After comparing the RNAPII and ERα ChIA-PET data from MCF7 cells, we found that the RNAPII-bound multi-gene complex at the GREB1
locus partially overlaps with the ERα-bound chromatin loops, suggesting that this interaction complex, in part, is associated with ERα. Therefore, we performed siRNA experiments to knockdown the protein level of ERα in MCF7 cells, and monitored the alteration of chromatin interactions and gene transcription in the GREB1
multi-gene complex. Several chromatin interaction loops at this locus were disrupted by siERα transfection as tested by 3C experiments (). In addition to GREB1,
which had a strong response to estrogen induction and reduction by siERα knockdown (Figure S4A-D
), we observed that the other genes in this complex such as E2F6, KCNF1
also had various levels of response to induction by estrogen and reduction by siERα knockdown (). Interestingly, these genes did not directly interact with ERα at their promoter regions, but indirectly associated with ERα through RNAPII-bound chromatin loops. As a control, this effect was not seen in the nearby genes such as NOL10
that were in other RNAPII interaction complexes and also did not interact with ERα (). Similar results were observed at another interaction locus centered on the GPR68
genes (Figure S4E
). Thus, these results indicate that a specific stimulus (estrogen) could lead to co-activation of genes organized primarily through RNAPII-bound multi-gene complexes, and perturbation at one gene locus (loss of ERα binding in this case) in a multi-gene complex could alter the transcriptional states of other interacting genes within the same complex. Although genes in close genomic distances with each other had been reported to be correlated in expression levels (Singer et al., 2005
), our data suggests that the conjoint expression can be mediated through chromatin interactions. The functional significance of such co-regulation needs further investigation.
Epigenomic Marks Associated with Chromatin Interaction Sites
To study the association of transcription factors (TFs) with the RNAPII interactions, we examined the enrichment of 20 different TFs in K562 cells at the RNAPII interaction sites from the three chromatin models in our K562 ChIA-PET dataset (, Figure S5A-D
). General TFs such as E2F4 and E2F6 (, Figure S5A
) directly bound at TSS sites ( for a specific example). By contrast, specific TFs such as JunD and Max preferentially bound to distal regulatory sites and marked potential enhancers (Figure S5B
). Several chromatin remodeling factors and chromatin organization proteins such as INI1, BRG1, CTCF and RAD21 associated primarily with non-TSS sites, suggesting that they may mediate long-range interactions with enhancer regions (, Figure S5C
). This hypothesis is consistent with other observations that INI1 and BRG1, two subunits of the SWI/SNF complex, were involved in transcriptional looping (Euskirchen et al., 2011
). A common observation among all the factors was that interaction sites in the multi-gene complexes consistently showed elevated levels of factor enrichment, suggesting that the co-operative binding of factors in gene-rich domains leads to higher transcriptional activity, or these transcriptionally active open chromatin domains might converge to distinct specialized transcription factories, each enriched with general and specific TFs.
Epigenomic profiles of chromatin interactions and combinatorial regulation of interacting promoters
We further explored the chromatin modification data available from the ENCODE Consortium. Collectively, we found high enrichment of active histone modification marks coupled with a lack of repressive marks in RNAPII interaction sites, confirming that the RNAPII interaction sites mapped by our ChIA-PET data were located in promoter and distal regulatory regions engaged and/or poised for high transcription levels (). Interestingly, the enrichment of active marks was highest in the multi-gene complexes, indicating that these might constitute transcriptional hubs. Our observations matched previous findings that the enrichment of active histone modifications positively correlated with RNAPII occupancy (Barski et al., 2007
We observed similar histone modification profiles in MCF7 cells () using data that we generated previously (Joseph et al., 2010
). In particular, we applied the log ratio of H3K4me3/H3K4me1 signal as a quantitative measurement of the likelihood that a genomic locus can act as a promoter or enhancer. Most non-interacting RNAPII sites proximal to TSS in basal promoter model showed high log ratios (, plot 1; median=2.4; >90% of the binding regions have log ratios >0), whereas most of the RNAPII interaction sites distal to TSS in the single-gene complex model and the multi-gene complex model (conventional enhancer sites) showed low H3K4me3/me1 log ratios (, plot 4 and 6; median < -0.72), confirming that this log ratio could reflect relative capacities of promoters and enhancers. Surprisingly, examination of RNAPII interaction sites proximal to known TSSs in the multi-gene complexes ( plot 5) revealed two peaks in the histogram of the log ratios, suggesting a mixture of enhancer and promoter elements in the promoter regions. Detailed profiles of H3K4me3 and H3K4me1 marks around the center (±5Kb) of those RNAPII interaction sites showed distinct characteristics of promoter-like, enhancer-like sub-groups (, heatmap). Moreover, enhancer-like RNAPII interaction sites, on average, showed lower transcriptional activity than the promoter-like RNAPII sites (Figure S5J
). Thus, a large portion of interacting promoters may also have potential enhancer functions. We observed the same inverse correlation of H3K4me3/me1 log ratio at the TSS proximal and TSS distal RNAPII sites for K562 (), indicating that this observation is a general phenomenon applicable to all cell types.
Interacting Promoters Possess Combinatorial Regulatory Functions
To examine potential enhancer activity of promoters, we performed luciferase reporter gene assays, a commonly used method for promoter and enhancer characterization (Pan et al., 2008
). In these assays, approximately 500bp fragments of the expected promoter regions were cloned upstream of a luciferase reporter gene construct either in a proximal position as the driving promoter or in a distal position as a presumed enhancer, and the constructs were transfected into MCF7 cells (Experimental Procedures, Figure S5E-I
). As shown in , the two interacting loci INTS1
were 26Kb apart, and our RNA-Seq data suggested that both genes were active in MCF7 cells. However, the normalized log ratio of H3K4me3/me1 was 0.36 for the INTS1
promoter and 1.13 for the MAFK
promoter, suggesting that the INTS1
promoter may have enhancer properties. To test this, we cloned the INTS1
promoter fragment in both orientations upstream of the MAFK
promoter flanking the luciferase gene. The luciferase reporter gene assay showed at least 7-fold enhancement of luciferase expression from the MAFK
promoter activity by the INTS1
promoter fragment, indicating that a bona fide
promoter can act as an enhancer to augment the activity of other promoters.
In another example (), the promoter of CALM1 interacts with an enhancer element 15Kb upstream and connects to the promoter of C14orf102 further upstream in 65Kb. Both RNA-Seq data and the H3K4me3/me1 log ratio indicated that the CALM1 promoter was strong, whereas the C14orf102 promoter was weak and enhancer-like. The luciferase reporter gene assay showed marginal enhancement to the CALM1 promoter reporter gene activity by the native CALM1 enhancer and the C14orf102 promoter individually. However, the combined CALM1 enhancer and the C14orf102 promoter together led to a significant ~3-fold enhancement of reporter expression from the CALM1 promoter. This result further validates the enhancer function by interacting promoters and elucidates a possibility of combinatorial effect among interacting elements in multi-gene interaction complexes for transcription regulation.
Next, we asked whether promoters with enhancer activity act specifically on their target genes. We swapped the promoter elements in the two examples of INTS1
for additional reporter genes assays (). Intriguingly, when placed upstream to the CALM1
promoter, the INTS1
promoter showed remarkable enhancement of CALM1
promoter activity. Similarly, the combined construct of C14orf102
promoter and CALM1
enhancer also increased MAFK
promoter activity significantly. Meanwhile, a TATA box deleted promoter and other control promoters (either active or inactive), taken from the nearby genes that are not involved in a promoter-promoter relationship, did not show cooperative enhancement to MAFK
promoter activities (Figure S5H-I
). Thus, these results suggest a common property for promoters with enhancer capacity that could influence other promoters.
In addition, we also tested the combination of inserting the enhancer-like promoter fragment in the position proximal to luciferase gene and the strong promoter in the distal position in the reporter gene construct. Of the 20 such luciferase experiments, we observed that the weaker promoters conveyed significant enhancer function to their stronger interacting partners in luciferase activity rather than the reverse (Figure S5K
). In the case of interacting pair INTS1
(enhancer-like promoter) and MAFK
(strong promoter), the strong promoter MAFK
did not demonstrate significant enhancer activity (Figure S5L
). Thus, at promoter sites, there is an inverse relationship between enhancer and promoter functions.
Cell-line Specificity of Long-range Chromatin Interactions
To elucidate the cell-line specificity of chromatin interactions, we saturated the coverage of chromatin interactions through deep sequencing of more MCF7 and K562 ChIA-PET replicates (Experimental Procedures). The saturated libraries are highly reproducible for interactions, and thus highly reliable for inter-cell line comparative analysis. These libraries exhibit the same pattern of genomic descriptors as the pilot libraries (Figures S2B-C
). With comprehensive ChIA-PET and RNA-Seq datasets, we performed comparative analysis between the two cell lines and identified cell-line specific genes and chromatin interactions (). Most of the genes specifically expressed in their respective cells also showed cell-specific interactions (), implying that cell-specific chromatin interactions provide the structural basis for cell-specific transcription. Gene Ontology (GO) analysis revealed significant enrichment of erythroid related GO terms such as response to stimulus and blood circulation for genes with specific expression and chromatin interactions in K562 cells, whereas GO terms such as ectoderm development and related biological process were enriched in MCF7 cells (, Figure S6A
). As expected, the genes common in both cell lines showed enrichment of housekeeping functions like metabolism, cell-cycle and signal transduction (Figure S6B
Cell-specific chromatin interactions
Among the chromatin interactions specific to K562 cells, we captured many previously characterized interactions including the α- and β-globin loci (Bau et al., 2011
; Hou et al., 2010
). shows extensive interactions identified by ChIA-PET data between the α-globin gene locus and the DNase hyper-sensitive (DHS) sites present in the gene body of the C16orf35
gene. Additionally, we found that the α-globin locus in K562 extended its interactions to the neighboring domains, which were constitutively active in both K562 and MCF7 cells, whereas the interactions to α-globin genes are K562-specific, suggesting a complex chromatin architecture for spatiotemporal regulation of both constitutive and cell-specific transcription. Similarly, the β-globin gene locus also displayed previously known K562-specific interactions with the nearby locus control region (Figure S6C
is a well characterized MCF7-specific gene. As expected, we found abundant chromatin interactions associated with RNAPII at this locus in MCF7, but not in K562 cells (). In addition to recapitulating the previously identified ERα-associated interactions (Fullwood et al., 2009
), RNAPII interaction data showed an additional interaction site on the far most upstream (left in ) side of this complex. A strong H3K4me1 mark on this site suggested that this is potentially an enhancer site for a transcription factor other than ERα. Intriguingly, a significant RNA-Seq peak was also identified at this site, indicating a possible enhancer RNA transcript, a new class of non-coding RNA species (Kim et al., 2010
Long-range Enhancer-promoter Interactions and Disease-associated Non-coding Elements
Our data showed that the enhancer-promoter interactions were significantly enriched over other types of interactions for cell-specific genes () when compared to genes commonly expressed in both cell lines. This finding supported the general view that distant-acting enhancers tend to be specifically involved in tissue-specific genes, and was consistent with our analysis in . Although potential enhancer sites can be identified using high throughput approaches (Heintzman et al., 2009
), it is still challenging to connect enhancers to their target genes that are hundreds of kilobases away. Moreover, many remote enhancers could be embedded in intronic regions of other distantly located genes (Visel et al., 2009
), making it notoriously difficult to relate enhancers to their specific target genes. In this study, we identified tens of thousands enhancer-promoter interactions including approximately 1000 ultra-long-distance (500Kb to megabases) events (Table S6
). We observed that ≥40% of enhancers do not interact with their nearest promoters and instead jump over to their target promoters, bypassing several intervening genes (; and Figure S7
Long-range enhancers and disease-associated non-coding elements
An interesting example is the SHH
gene that was expressed in MCF7 but not in K562 cells (). SHH
is important in development and related to certain cancers (Lettice et al., 2002
). Transcription of SHH
is controlled by its enhancer which is located 1Mb away and embedded in the intronic region of LMBR1
; point mutation in this enhancer site is known to cause preaxial polydactyly
, a common congenital limb malformation in mammals (Lettice et al., 2002
). We found abundant interaction data between the SHH
promoter and the previously characterized SHH
enhancer site in the LMBR1
intronic region in MCF7 cells, but no interaction data in K562 cells (), which correlated well with their SHH
transcription status. This is consistent with earlier observations (Amano et al., 2009
In another interesting example, we identified two major interaction sites located ~600Kb and ~1Mb downstream from the IRS1
gene promoter. IRS1 is known to participate in type-2 diabetes (T2D) mellitus, and is found specifically expressed in MCF7 cells (). A recent GWAS study uncovered a cluster of SNPs that is genetically associated with high risk to insulin resistance, T2D, and coronary artery heart disease (Kilpelainen et al., 2011
). This high risk locus is found located in one of the IRS1
enhancer sites (). Thus, our data provides experimental evidence to suggest that this disease-risk locus could be physically connected with the IRS1
promoter, potentially serving as a critical long-range enhancer to regulate the expression of IRS1
, in a similar manner as the SHH
locus. Other examples of long-range and cell-specific enhancer-promoter interactions in MCF7 and K562 are shown in Figure S7
Taken together, these results suggest that ChIA-PET interaction data may better inform the association of a SNP with a gene involved in a disease process by providing evidence for direct physical interactions.