Transcription factors (TFs) are proteins that bind sequence elements in DNA and thereby affect expression of neighboring or distal genes. Depending on the cellular context, such as hormone stimulus, differentiation state, or cell type, a TF can bind different subsets of its potential binding sites and regulate different gene expression programs [1]. Investigating this context-dependent binding of TFs and the causes of binding differences across cellular contexts is therefore fundamental for understanding gene regulation in general, and also for understanding how differential binding by TFs contributes to disease development.
There are three main factors that determine a TF’s binding activity at a potential binding site. First, TFs bind to specific sequence motifs [2] that favor a local DNA structure recognized by the TF’s DNA-binding domain. Second, the local chromatin structure needs to be favorable for TF binding. Specifically, the chromatin must be sufficiently accessible to allow the TF to scan and bind to its sequence motif [3-5], a process that is influenced by both high-level chromatin structure and local nucleosome positioning [5,6]. Certain post-translational histone modifications are associated with open or closed chromatin and therefore also with binding site activity, but certain TFs may also directly bind specific histone modifications [7-9]. Similarly, DNA methylation affects TF binding, both by directly affecting binding motifs and by being involved in altering local chromatin structure [10]. Third, TF co-activators can recruit TFs and stabilize their binding, whereas repressors can out-compete or hinder binding to a potential binding site [11].
The TF binding activities that result from a given cellular context together form a transcriptional regulatory network. There are many different methods for inferring the structure of such regulatory networks in silico. Some of these methods rely on context-dependent data, such as experimentally determined gene expression, TF binding, or chromatin structure [12], and therefore produce networks specific to a given context. Examples include methods that rely on gene expression data only [13,14], and methods that integrate expression data and binding location data [15-18].
In comparison, many traditional methods for inferring regulatory networks are context-indifferent, typically relying on sequence motifs to map putative transcription factor binding sites (TFBS). Some of these methods use additional data such as a putative site’s conservation level in related species [19,20] and motif clustering [21-24] to increase the predictions’ signal-to-noise ratio [25]. However, newer methods increasingly take advantage of recently available experimental data such as genome-wide occurrences of histone modifications and nucleosome occupancy [26,27] and our increased understanding of how these modifications affect the likelihood of TF binding [28]. Unlike previous methods that mainly rely on sequence motifs, methods that add experimental data typically make predictions that are specific to the given experimental context.
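As a minimal illustration of the context-indifferent, motif-based mapping described above, the following Python sketch scans a DNA sequence with a log-odds position weight matrix and reports the windows that score above a threshold. The count matrix, the uniform background frequency, the pseudocount, the threshold, and all function names are hypothetical and chosen for illustration only; the sketch does not reproduce any of the cited methods.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to log2 odds against a uniform background (assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = log_odds_matrix(COUNTS)
HITS = scan("TTACGTGGACGTAA", PWM, threshold=4.0)
for start, window, score in HITS:
    print(start, window, round(score, 2))
```

Because the scan depends only on the sequence, it returns the same putative sites for every cell type, which is why such predictions are context-indifferent unless they are combined with experimental data.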
Chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) is the current high-throughput experimental technique of choice for mapping the genome-wide state of chromatin, and this technique is also used for experimentally identifying TFBS [12]. ChIP-seq captures TF binding as it happens in vivo, so using ChIP-seq data alone or as a basis in more integrative methods for modeling gene regulation will result in context-specific predictions [17,18]. But how specific are these predictions to the given context?
The few studies that have investigated cell-type specificity of TFBS show that, in general, binding differences increase with functional and evolutionary distance. A study investigating MyoD binding in the closely related cell types myoblasts and myotubes found the majority of predicted binding sites to be common to both cell types [29]. Another study looking at E2F4 binding sites in seven primary mouse tissues and a mouse cell line found that between 65% and 85% of the cells’ binding events overlapped [30], whereas a study of serum response factor (SRF) binding across three distinct human cell lines found that less than half of the observed SRF binding sites were shared across all three cell lines [31]. Studies comparing TFBS across related species have shown that TFBS in general are even less conserved between different species than between different cells within the same organism [30,32,33]. Thus, whether regulatory interactions determined for one cellular context can be used to predict functional outcomes in a different context seems to depend on both the TF itself and the context of the comparison. However, the studies also suggest that some TFBS appear to be active consistently across different cellular contexts, and it is not clear what separates such apparently context-independent TFBS from context-dependent sites and whether the genomic context for such sites differs for different TFs.
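The overlap percentages quoted above depend on how a shared site is defined. As a hedged sketch, the Python code below counts a site in one cell type as shared if its interval overlaps any site in the other cell type on the same chromosome. The half-open coordinate convention, the overlap criterion, the function names, and the example coordinates are all assumptions made for illustration; the cited studies may define overlap differently.

```python
from collections import defaultdict


def index_by_chrom(sites):
    """Group (chrom, start, end) sites into a dict of chrom -> [(start, end)]."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        by_chrom[chrom].append((start, end))
    return by_chrom


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def shared_fraction(sites_a, sites_b):
    """Fraction of sites in sites_a that overlap at least one site in sites_b."""
    if not sites_a:
        return 0.0
    index_b = index_by_chrom(sites_b)
    shared = sum(
        1
        for chrom, start, end in sites_a
        if any(overlaps(start, end, s, e) for s, e in index_b.get(chrom, []))
    )
    return shared / len(sites_a)


# Hypothetical peak coordinates for two cell types.
CELL_A = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150), ("chr3", 10, 60)]
CELL_B = [("chr1", 150, 250), ("chr2", 140, 240), ("chr2", 900, 1000)]

print(shared_fraction(CELL_A, CELL_B))
print(shared_fraction(CELL_B, CELL_A))
```

The measure is asymmetric: the fraction of shared sites differs depending on which cell type is used as the reference, which is one reason reported overlaps are often given as a range.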
To address this question, we used ChIP-seq data from two ENCODE cell lines [34] to examine cell-type specific binding sites for seven TFs with known DNA sequence preferences and six transcriptional cofactors with no known sequence preferences. Five of the six cofactors were Polymerase (Pol) III TFs [35], whereas the remaining factors were Pol II TFs. We first show that although both the number of sites and the site overlap differ substantially between TFs, stronger sites, as estimated by ChIP-seq peak height, are generally less cell-type specific than are weak sites. Second, we find that strong sites generally occur more frequently in regulatory regions such as promoters and TFBS clusters and in conserved sequences, compared to weak sites. Moreover, by analyzing cell-type specific chromatin data, we find that strong sites occur more frequently in open chromatin and at histone modifications associated with active promoters, compared to weak sites. Strong sites are also generally more conserved than are weak sites. Third, we show that differences in chromatin can be a reason for cell-type specific TFBS, both at strong and weak sites. We also show that some of the apparent cell-type specific TFBS can be due to differences in genotype that affect sequence motif regions. Finally, by training a machine learning classifier to distinguish common, context-independent sites from cell-type specific sites, we show that site strength and clustering are the most important parameters for identifying context-independent TFBS. Importantly, sites for sequence-specific TFs and sequence-independent cofactors and sites for Pol III and Pol II TFs shared these same characteristics. Thus, our results suggest that context-independent sites are strong, clustered sites in conserved genomic regions.
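The comparison between strong and weak sites can be illustrated with a small Python sketch that splits sites at the median peak height and reports the fraction of shared sites in each group. The median cutoff, the function name, and the example values are hypothetical and used only for illustration; they are not the data or the exact procedure of this study.

```python
from statistics import median


def shared_fraction_by_strength(sites):
    """Split sites at the median peak height and return the shared fraction per group.

    `sites` is a list of (peak_height, is_shared) pairs, where is_shared marks
    a site that was also detected in the other cell line.
    """
    cutoff = median(height for height, _ in sites)
    groups = {"strong": [], "weak": []}
    for height, is_shared in sites:
        groups["strong" if height >= cutoff else "weak"].append(is_shared)
    return {
        name: (sum(flags) / len(flags) if flags else 0.0)
        for name, flags in groups.items()
    }


# Hypothetical peak heights with a flag for presence in the second cell line.
SITES = [
    (95.0, True), (80.0, True), (60.0, True), (40.0, False),
    (22.0, True), (15.0, False), (9.0, False), (5.0, False),
]

RESULT = shared_fraction_by_strength(SITES)
print(RESULT)
```

In this toy input the strong half contains a larger share of common sites than the weak half, which is the pattern the first result above describes.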