CTCF is a multi-functional protein that has been implicated in transcriptional regulation, insulation, DNA replication, X-chromosome inactivation, splicing, chromatin packaging, and many other processes [18]. CTCF binding sites are widespread in genomes from fly to human [1]. Earlier genome-wide studies identified ~14,000 to ~27,000 CTCF binding sites in several human cell lines. Those studies also showed that 40-60% of the CTCF sites in the cell lines studied were invariant across cell types [17]. Many CTCF binding sites were also computationally identified [50] and found to be conserved [17]. However, it remained unclear how many CTCF binding sites are present in the human genome and what proportion of them is constitutively bound across most cell lines/tissues. A comprehensive CTCF binding site database containing more than 15 million sequences in 10 species has recently been updated to include long-range chromatin interaction data mediated by CTCF [52], thereby facilitating analyses like ours in non-human species.
Our analysis of 112 ENCODE CTCF ChIP-seq datasets representing 56 human cell lines suggests that there might be as many as 450,000 CTCF binding sites in the human genome. Nearly half were found in CTCF peaks in only one of the 56 cell lines. About a quarter million of the CTCF sites were found in CTCF peaks in more than one of the 56 cell lines. Moreover, ~24,000 CTCF binding sites were found in CTCF peaks in more than 90% (at least 51 of 56) of the cell lines, suggesting that those constitutive CTCF (cCTCF) sites may be implicated in some fundamental biological process or function in most or all cell lines.
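The occupancy-based definition above can be illustrated with a minimal sketch. It assumes that each candidate site has already been mapped to the set of cell lines in which it falls within a CTCF peak. The function names, site identifiers, and toy data are hypothetical, and this is not the pipeline used in the study.

```python
def min_cell_lines(n_lines, num=9, den=10):
    """Smallest count of cell lines that is at least num/den of n_lines.

    Integer ceiling division avoids floating-point rounding issues.
    """
    return -(-num * n_lines // den)


def classify_sites(site_to_lines, n_lines):
    """Label each site by how many distinct cell lines show a peak over it.

    site_to_lines: dict mapping a site id to an iterable of cell-line names
                   (hypothetical input format).
    """
    threshold = min_cell_lines(n_lines)
    labels = {}
    for site, lines in site_to_lines.items():
        k = len(set(lines))
        if k >= threshold:
            labels[site] = "constitutive"
        elif k == 1:
            labels[site] = "single-cell-line"
        else:
            labels[site] = "shared"
    return labels


# With 56 cell lines, the 90% criterion corresponds to at least 51 lines.
print(min_cell_lines(56))

# Toy example with made-up cell-line names.
all_lines = [f"line{i}" for i in range(56)]
toy = {
    "siteA": all_lines[:53],   # bound in 53 of 56 lines
    "siteB": all_lines[:1],    # bound in a single line
    "siteC": all_lines[:20],   # bound in 20 lines
}
print(classify_sites(toy, n_lines=56))
```

The integer form of the threshold makes the "at least 51 of 56" figure explicit and keeps the rule unambiguous when the number of cell lines changes.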
Of course, the exact numbers of cCTCF sites identified by our methods depend on the thresholds used for making decisions. In our analysis, we trimmed or extended all peaks to 200 bp in length, centered on the peak midpoint. Using 300 bp instead increased the number of CTCF sites declared constitutive by 1,640. Including these additional sites in our analysis of ChIA-PET interactions yielded results substantially the same as those in Table . In our analysis, a p-value cut-off of 0.0005 on the PWM score identified a CTCF binding site in 80-95% of the peaks. Adjusting the cut-off would certainly affect the number of CTCF sites identified and declared constitutive; but, like changing the peak length, changing this cut-off seems unlikely to influence our results about enrichment or our overall conclusions about the role of cCTCF sites.
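The two thresholds discussed above (fixed peak width and a p-value cut-off on the PWM score) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: coordinates are assumed to be 0-based and half-open, the background is assumed uniform, and the three-column matrix `TOY_PWM` is invented for the example and is not a CTCF motif. All function names are hypothetical.

```python
import math
from itertools import product


def resize_peak(start, end, width=200):
    """Return a window of `width` bp centered on the midpoint of the peak."""
    center = (start + end) // 2
    new_start = max(0, center - width // 2)  # do not run off the chromosome start
    return new_start, new_start + width


def log_odds(pwm, background=0.25):
    """Convert per-position base probabilities to log2-odds scores."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in pwm]


def pwm_score(seq, lod):
    """Sum of log-odds scores for a sequence of the same length as the matrix."""
    return sum(col[base] for col, base in zip(lod, seq))


def exact_p_value(observed, lod):
    """Fraction of all k-mers scoring at least `observed` (uniform background)."""
    n = len(lod)
    hits = sum(
        1 for kmer in product("ACGT", repeat=n) if pwm_score(kmer, lod) >= observed
    )
    return hits / 4 ** n


# Invented toy matrix (NOT a real CTCF motif): consensus "ACG".
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

lod = log_odds(TOY_PWM)
print(resize_peak(1000, 1500))            # 200-bp window around the midpoint
print(resize_peak(1000, 1500, width=300))  # 300-bp alternative
print(exact_p_value(pwm_score("ACG", lod), lod))
```

Exhaustive enumeration is only feasible for very short matrices. For a motif of roughly 20 positions, the score distribution would instead need to be computed by dynamic programming or estimated by sampling.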
Because many datasets used in our analysis were from cancer cell lines, which often carry genetic and chromatin aberrations, we looked for evidence that cCTCF sites might diverge between cancer and normal cell lines. We identified 27,735, 28,662, and 27,774 cCTCF sites in recently deposited CTCF ChIP-seq from 23 cancer cell lines, 20 normal cell lines, and 19 cell lines with unknown karyotypes, respectively [40]. Not only did these three groups have similar numbers of cCTCF sites, they had 19,279 (80.5% to 83.2%) cCTCF sites in common, indicating that cell origins have little effect on the number or locations of cCTCF sites.
A ChIP-seq experiment captures a snapshot of protein binding at a single point in time. Thus, the sites that we define as constitutive because they are bound in over 90% of cell lines are likely sites where a protein spends most of its time in the bound state, whether through an individual binding event of long duration or through frequent bouts of binding/unbinding with the bound state predominating. Long-duration binding might be attributed to strong binding, whereas frequent binding/unbinding would not be. Thus, the constitutive sites that we detect should not correspond exactly to sites with strong binding, though different binding motifs (canonical vs. full-spectrum) might be correlated with binding strength. On the other hand, one can imagine sites in the genome where a protein binds relatively briefly but the site is bound at some time in every tissue or cell line. Such a site would theoretically meet our definition of ‘constitutive’ but would go undetected by our analysis, because ChIP-seq snapshots would be very unlikely to capture short-term binding at the same site in multiple cell lines.
Strong binding may occur at constitutive sites, but it may not be the only explanation for their existence. We recently developed an alternative method for identifying constitutive sites using peak data only (without motif search) (manuscript in preparation). We identified constitutive sites for 22 factors with ChIP-seq data in more than six cell lines. We found that the number of constitutive sites varies among factors, from a few to many thousands. It is unlikely that the factors that bind to the highest number of constitutive sites (e.g., CTCF and Rad21) are strong binders whereas those that bind to the fewest constitutive sites (e.g., JunD) are weaker binders. In a gene ontology analysis, we also found that the target genes of the constitutive Pol II sites are highly enriched for biological processes such as metabolism and cell cycle (data not shown). Together, those results strongly suggest that the constitutive sites are biologically meaningful.
Because of CTCF’s diverse roles in genome regulation, different “classes” of CTCF binding sites might exist to carry out different functional roles. Such classes might differ in their co-factors and/or binding strength and specificity (e.g., canonical vs. full-spectrum motifs). In this study, we focused on the class of CTCF binding sites that are constitutively bound and co-localized with the constitutive cohesin loci and compared it to a class of constitutive CTCF binding sites without cohesin. We examined the genomic features, transcriptional landscape and epigenetic environments of those sites to gain insights into their functional relevance. Our analysis not only included many more datasets but also was more comprehensive than the earlier analyses of CTCF binding sites [16
We identified ~12,000 constitutive CTCF binding sites co-localized with constitutive cohesin loci. The majority of these cCTCF/cCohesin sites were located ≥ 5 kb from the TSS, in introns or in intergenic regions that lacked CpG islands. Furthermore, the cCTCF/cCohesin loci were enriched for the H3K4me1 mark and had well-positioned nucleosomes (Additional file 1). A substantial number of the cCTCF sites overlapped with cohesin in one or more cell lines without meeting the criterion that the corresponding Rad21 and Smc3 peaks were in ≥ 90% of available cell lines. In contrast, few cCTCF sites did not co-localize with cohesin loci in any cell line.
Our analysis of the constitutive sites is limited by the number of cell lines studied; some factors have data from only a limited number of cell lines. As data from additional cell lines become available, some of the cCTCF/cCohesin sites will no longer be designated as constitutive. Although the cCTCF sites were found in at least 51 of the 56 cell lines, constitutive cohesin was defined via Rad21 and Smc3 peaks, which were identified in only 6 and 4 cell lines, respectively.
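To make the limitation concrete, the sketch below works out what a ≥ 90% criterion requires when only 6 or 4 cell lines are available, and shows one simple way to split sites by whether they overlap a cohesin locus. It assumes 0-based, half-open `(chrom, start, end)` intervals and a minimum overlap of 1 bp. The function names and coordinates are hypothetical, and the overlap rule is an assumption rather than the criterion used in the study.

```python
def min_lines_required(n_lines, num=9, den=10):
    """Smallest number of cell lines that is at least num/den of n_lines."""
    return -(-num * n_lines // den)


def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_by_cohesin(ctcf_sites, cohesin_loci):
    """Partition CTCF sites into those that overlap a cohesin locus and those that do not."""
    with_cohesin, without_cohesin = [], []
    for site in ctcf_sites:
        if any(overlaps(site, locus) for locus in cohesin_loci):
            with_cohesin.append(site)
        else:
            without_cohesin.append(site)
    return with_cohesin, without_cohesin


# With 6 (Rad21) or 4 (Smc3) cell lines, ">= 90%" means a peak in every line.
for n in (56, 6, 4):
    print(n, min_lines_required(n))

# Made-up coordinates for illustration only.
ctcf = [("chr1", 100, 300), ("chr1", 5000, 5200), ("chr2", 100, 300)]
cohesin = [("chr1", 250, 450), ("chr2", 1000, 1200)]
print(split_by_cohesin(ctcf, cohesin))
```

The quadratic scan is fine for a toy example. For genome-scale peak sets, a sorted sweep or an interval tree would be the more usual choice.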
Numerous studies have shown that CTCF cooperates with cohesin in DNA loop formation and thereby helps regulate gene expression and chromatin interactions [18], DNA replication [14], and RNA Pol II pausing [11]. Our computational analysis revealed that the strength of association between CTCF and cohesin increased when both sites/loci were constitutive. The same held for CTCF and Znf143 (Additional file 1 and Additional file 2: Table S2) and for CTCF, cohesin, and Znf143 (Additional file 1 and Additional file 2: Table S3).
A footprinting study of CTCF binding to the promoter of the APP gene showed that the binding of the full-length CTCF protein generated a DNase I protected region covering 40
]. Subsequent motif analysis [33] in a set of evolutionarily conserved CTCF sites identified ~5,000 33/34-mer full-spectrum CTCF binding sites. We independently identified the same 33/34-mer motifs in the set of cCTCF/cCohesin loci. Furthermore, we also identified two potentially novel 20/26-mer CTCF motifs (Figure ). Whether those full-spectrum motifs function in transcriptional regulation or in mediating chromatin-chromatin interactions, or both, remains unclear.
Our analysis in the cancer cell lines K562 and MCF7 further revealed that the majority of the cCTCF sites were located within CTCF-mediated chromatin interactions identified by ChIA-PET [24]. The proportion of cCTCF sites within chromatin interactions was higher for those cCTCF sites that overlapped with cCohesin loci than for those that did not. These results suggest that the genomic loci that are constitutively co-bound by both CTCF and cohesin may be involved in establishing or maintaining the “common” or “ground state” chromatin architecture in most human cell lines (Figure ). This idea is consistent with the finding that the overall topological domain structure is largely unchanged between cell/tissue types and across species [23]. Hu et al. further suggested that the geometric shapes of the topological domains are strongly correlated with several genomic and epigenetic features [55]. We found that most CTCF-mediated interactions from ChIA-PET [24] involved cCTCF and were within a domain. It is conceivable that the cCTCF/cCohesin sites are an integral part of the large, discrete domains [23], possibly mediating or maintaining the sub-domain structures within a domain.
Figure 8. A proposed model of the role of cCTCF loci in chromatin structure. CTCF, cohesin (not shown), and possibly other factors such as Znf143 and mediator (not shown) mediate long-range chromatin interactions through the constitutive CTCF sites. The cCTCF-mediated ...