The binding of transcription factors to specific DNA target sequences is the fundamental basis of gene regulatory networks. Chromatin immunoprecipitation combined with DNA tiling arrays or high-throughput sequencing—ChIP-chip and ChIP-seq—has produced many recent studies that detail the binding sites of various transcription factors. Surprisingly, data from a variety of model organisms and tissues have demonstrated that transcription factors vary greatly in their number of genomic binding sites, and that binding events can significantly exceed the number of known or possible direct gene targets. Thus, our current understanding of transcription factor function must expand to encompass what role, if any, binding might play outside of direct transcriptional target regulation. Here, we discuss the biological significance of genome-wide binding of transcription factors and present models that can account for this phenomenon.
The complex interactions between multiple transcription factors and gene targets across various tissues, cellular contexts, and time points are termed 'transcriptional regulatory networks' (Box 1). It has been stated that a truly thorough understanding of such interactions should theoretically explain how an organism is 'computed' from its DNA. The core model of gene regulation posits that transcription factors recruit a polymerase complex to the transcriptional start site. Transcription factors initiate this by binding at nearby or distant DNA sequences and directly interacting with components of the polymerase complex or with complexes that indirectly mediate the polymerase interaction. In eukaryotes, the latter may include chromatin remodelers or modifiers that facilitate access or increase protein-protein affinities via histone modifications [3,4]. The simplest view of the core model would suggest that factor binding directly correlates with transcriptional regulation. However, numerous examples of the separate regulation of factor binding and transcriptional activation suggest otherwise [5–7]. For example, recent studies indicate that the sequence of the DNA binding site can induce conformational changes in the bound transcription factor that permit transcriptional regulation by subsets of a transcription factor family whose members can bind to similar sites [8,9].
Defining the relationship between transcription factor binding and target regulation across the entire genome of various species has become an attainable goal with the recent explosion in advanced computing and information processing tools. These advances have resulted in some remarkable progress in reconstructing and predicting regulatory networks. The advent of ChIP-chip (chromatin immunoprecipitation coupled to microarray hybridization) and ChIP-seq (chromatin immunoprecipitation coupled to high-throughput sequencing) has now allowed determination of the precise, genome-wide distribution of transcription factor binding sites. The results of numerous studies employing these techniques have been at times predictable and at other times surprising. While some studies have shown the expected correlation between factor binding and gene regulation, others have observed binding events that vastly exceed the number of expected gene targets (Table 1). Given these findings, it is timely to reconsider the relationship between transcription factors and gene regulation and the role, if any, that widespread transcription factor binding may play outside of direct gene target regulation.
Several genome-wide transcription factor binding studies in various model organisms have supported a relatively direct connection between factor binding and gene regulation. One of the first genome-wide assessments of transcription factor binding in yeast reported transcription factor binding in promoter regions, in spite of the presence of binding motifs in both coding and intergenic regions . Another report evaluating over 100 tagged factors in yeast identified more than 4,000 promoter-transcription factor interactions and described numerous regulatory circuits . The subset of circuits that comprised feed-forward networks (Figure 1a) alone was extensive, involving 39 factors, 49 distinct networks, and greater than 10% of all bound areas. This study emphasized both the importance of regulatory networks in controlling gene expression, as well as the ability of ChIP studies to uncover such networks.
A later study looking at an individual transcription factor in yeast, with roles in both filamentous growth and mating behavior, also found that DNA binding tightly correlated with function. Under cellular conditions that activated either growth or mating functions individually, Ste12 was found to occupy approximately 60 unique binding sites that were located in the promoters of genes with appropriate corresponding functions . This binding was noted to be dependent on another transcription factor for the process of filamentation, an example of the importance of cooperative factor binding (Figure 1b) in mediating transcription factor activity.
The forkhead box A homolog PHA-4 regulates organogenesis of the pharynx in Caenorhabditis elegans, and provides an example of factor binding correlating closely with direct gene target effects in a multicellular organism. Initial studies demonstrated that expression of its targets correlated with PHA-4 binding sites in promoter regions, and that the timing of target expression correlated with binding affinity between the transcription factor and its target sequence. Follow-up studies refined this model, providing evidence for other factors that cooperated with PHA-4 binding to modulate timing of target expression. Taken together, the data suggested that pharyngeal organ development is regulated by a combination of PHA-4 binding affinity and cooperating factors to temporally regulate gene expression. They also suggested that it should be possible to predict the time of expression of a putative PHA-4 target gene solely from analysis of its DNA sequence.
Recent ChIP-seq data for PHA-4 have been in agreement with this assessment. The great majority (>90%) of the bound sites identified in either embryos or larvae can be designated as 'gene-associated' using a distance cutoff of 2 kb or less between a bound site and the nearest gene. When the binding data were overlapped with gene expression data (high-throughput sequencing of RNA), most (87%) of the associated genes were expressed when PHA-4 binding was present, and a decrease in factor binding was associated with a reduction in expression for most (60%) presumptive targets, suggesting that binding of the factor activated the expression of those genes.
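As an illustration of the 'gene-associated' criterion described above, the following minimal Python sketch assigns each bound site to its nearest gene and reports the fraction of sites that fall within a 2 kb cutoff. The coordinates, gene names, and function names are hypothetical and are not taken from the PHA-4 study; a real analysis would typically also consider strand and transcript annotation.

```python
def distance_to_gene(pos, start, end):
    """Distance in bp from a peak position to a gene interval (0 if inside)."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def gene_associated_fraction(peaks, genes, cutoff=2000):
    """Fraction of peaks lying within `cutoff` bp of the nearest gene.

    peaks: list of (chromosome, position)
    genes: list of (chromosome, start, end, name), with start <= end
    """
    if not peaks:
        return 0.0
    associated = 0
    for chrom, pos in peaks:
        # Distances to every gene on the same chromosome (fine for toy data).
        dists = [distance_to_gene(pos, s, e) for c, s, e, _ in genes if c == chrom]
        if dists and min(dists) <= cutoff:
            associated += 1
    return associated / len(peaks)


# Hypothetical toy data, for illustration only.
toy_genes = [("I", 2500, 4000, "geneA"), ("II", 100, 900, "geneB")]
toy_peaks = [("I", 1000), ("I", 10000), ("II", 500)]
fraction = gene_associated_fraction(toy_peaks, toy_genes)
print(f"{fraction:.2f} of toy peaks are gene-associated")
```

The same function could be applied with a different cutoff (for example, 1 kb) to see how sensitive the 'gene-associated' fraction is to that choice.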
Studies in Drosophila melanogaster have identified the importance of cis-regulatory modules (CRMs), which are short DNA sequences (~300–500 nucleotides in length) that integrate multiple input signals to control gene expression. For example, the binding of Mef2, an important factor in mesodermal development, changes temporally during the course of muscle development . At the time points evaluated, different factor motifs were noted at Mef2 binding regions, suggesting a cooperative factor mechanism used to temporally regulate the expression of various Mef2 targets. Further complexity in regulation is also suggested by a study comparing the binding profiles of Mef2 and lameduck (Lmd) . Mutants of Mef2 and Lmd show a similar defect in myoblast fusion, suggesting similar or overlapping biological roles; however, while their DNA binding profiles overlap significantly, the effect of binding is widely variable. Depending on the enhancer target, co-binding can lead to additive, synergistic, or repressive effects, as demonstrated in reporter assays using eight different characterized enhancers. For example, co-expression of Lmd and Mef2 activates the blow enhancer while expression of Lmd counteracts the positive effect of Mef2 on the CG9416 enhancer. While these results reveal the potential complexity of regulatory networks, a relatively direct relationship can still be inferred between DNA binding and target gene effects.
The close relationship between DNA binding and gene target effect has also been observed in mammalian systems. In one of the first studies to use ChIP-seq, the binding of the zinc-finger protein neuron-restrictive silencer factor (NRSF) was mapped to only ~2,000 sites in the human genome. It was found that a few hundred potential target genes showed relatively 'low' gene expression compared to average cellular transcript expression when an NRSF peak was located nearby (≤1 kb), suggesting that NRSF was exerting its transcriptionally repressive effects at those genes when bound nearby. Studies of other factors, such as pregnane X receptor (PXR) and calcium-response factor (CaRF), have also demonstrated a direct correlation of factor binding with gene regulation in mammalian cells.
In contrast to the model of direct gene regulation, several studies have demonstrated transcription factor binding at a large number of sites, many of which cannot be clearly connected with target gene regulation. In Drosophila, several ChIP-chip studies using whole genome tiling arrays have been performed for developmental transcription factors [21,22]. These studies have identified a large number of binding regions, on the order of several thousands, for individual factors in the developing embryo, indicating a greater amount of DNA binding by developmental factors than had been anticipated. For example, over 2,000 binding regions were observed for Twist in the Drosophila genome in two separate studies utilizing distinct microarray designs [21,23], vastly exceeding the number of known Twist targets and including many intronic and intergenic sites. Also unexpectedly, Twist binding overlaps significantly with both Dorsal and Snail binding sites, and many of these sites possess highly conserved motifs. Their conservation suggests they are likely to be functional sites, but their significance is still unclear.
While widespread binding of early developmental transcription factors is perhaps not entirely surprising, the unexpected finding has been the identification of numerous binding sites of unclear function, for other factors as well. Studies of the binding and gene regulation of Myc and other proteins of the dMax family in Drosophila and human cells have shown extensive binding across the genome that did not necessarily correlate with transcriptional regulation of the nearby target genes [25,26].
In an early ChIP-seq study examining the interferon-γ (IFN-γ) responsive transcription factor STAT1 in human cells, a strikingly large number of bound sites was observed . In unstimulated cells, over 10,000 binding sites were identified, and this increased more than four-fold after stimulation with IFN-γ. In both conditions, approximately 50% of the total sites were intragenic and 25% intergenic. While there was a strong overlap with sites of known STAT1 activity, the majority of binding sites were not located adjacent to STAT1 regulated genes, suggesting that many, or most, bound sites were not directly regulating a nearby gene target. The authors suggested that many of the STAT1 sites might correspond to weaker, less favored binding sites, or possibly functional sites with STAT1 bound in only a subset of the total cell population.
As another example of widespread binding, the hematopoietic factor GATA1 was reported to have over 15,000 DNA binding sites in a mouse erythroblast line . GATA1-factor binding is apparently necessary for the binding of another hematopoietic factor, the basic helix-loop-helix (bHLH) factor TAL1, to an adjacent E-box motif, the consensus binding site for bHLH factors. There is a strong association of TAL1 binding with erythroid gene regulation [29–31], with over 2000 genes, most of which (90%) were categorized as related to erythroid development, having TAL1 binding within putative regulatory elements in one study, and over half of TAL1-regulated genes containing TAL1 bound within a proximal or distal regulatory element in another study . In this case, the widespread binding of GATA1 might be identifying the sites that can be bound by TAL1, and possibly other factors at different times or in different cells, to execute cell-type specific programs of gene expression.
The myogenic bHLH factor MyoD is another transcription factor that offers potential insight into genome-wide binding. MyoD directly regulates genes expressed during skeletal muscle differentiation  and orchestrates a temporal pattern of gene expression through a feed-forward circuit . ChIP-seq on MyoD in skeletal muscle cells identified approximately 30,000–60,000 MyoD binding sites . As anticipated, genes regulated by MyoD during myogenesis had associated MyoD binding sites. However, almost 75% of all genes were associated with a MyoD binding site and about 25% of the MyoD sites were in intergenic regions. Therefore, the majority of MyoD binding events were not directly associated with gene regulation. Although regional transcription was not detected at these intergenic sites, MyoD binding was demonstrated to induce local chromatin modifications, specifically acetylation of histone H4 that is generally associated with active and/or accessible regions of the genome.
Together with the studies discussed above, these findings demonstrate that some transcription factors have binding events that are vastly in excess of the genes that they directly regulate. The remainder of this review will discuss the possible significance of this large number of transcription factor binding events that are not directly related to gene transcription. One proposed explanation for large-scale genome-wide transcription factor binding is the presence of 'non-functional' binding sites that serve no biological purpose. Alternatively, it has been proposed that transcription factors may bind to many low affinity sites in the genome and contribute to gene expression at levels that are low but sufficient to allow evolutionary conservation, an idea proposed from a large scale ChIP-chip study in yeast. Presuming that these sites are functional, other possibilities include roles in affecting the functional concentration of factors, induction of chromatin looping, changing chromatin and nuclear structure, or the evolution of new transcriptional regulatory networks.
It has been suggested that binding sites occurring outside of areas directly involved in gene regulation may be `non-specific,' or random. However, these intergenic sites contain the factor-specific binding motifs and have been validated both experimentally and statistically, the latter by passing very strict statistical cutoffs [27,34]. Thus, it seems more appropriate to conclude that the observed genome-wide binding of some transcription factors is a biologically specific event; however, the biological role at many of the sites remains largely undetermined.
Based on the binding of the lac repressor to bacterial DNA, it was suggested that genome-wide binding at non-regulatory sites might function to maintain an optimum amount of available transcription factor in the nucleus . In this model, some of the transcription factor binding sites that are located in intergenic regions or repetitive elements might serve that function, helping to fine-tune gene expression by limiting the concentration of unbound factors and preventing binding to sites that need to be regulated by co-factor occupancy and cooperative binding. In this model, the genome-wide binding serves as a reservoir for factors, sequestering them in a manner analogous to other biological buffering systems.
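The buffering idea can be made concrete with a toy equilibrium calculation. The sketch below is an illustration under strong assumptions and is not a model from the cited work: it treats all reservoir sites as independent and identical with a single dissociation constant, expresses factor, sites, and dissociation constants in the same arbitrary units, neglects the factor consumed by the single regulated site, and ignores cooperative binding. All parameter values and function names are hypothetical.

```python
import math


def free_factor(total, n_reservoir, k_reservoir):
    """Free factor F satisfying total = F + n_reservoir * F / (k_reservoir + F).

    Rearranged, this is F^2 + (k_reservoir + n_reservoir - total) * F
    - k_reservoir * total = 0; the positive root is returned.
    """
    b = k_reservoir + n_reservoir - total
    return (-b + math.sqrt(b * b + 4.0 * k_reservoir * total)) / 2.0


def occupancy(free, k_site):
    """Equilibrium occupancy of a single site with dissociation constant k_site."""
    return free / (k_site + free)


# Hypothetical values in arbitrary, matching units.
total_factor = 100.0
k_regulated = 10.0    # higher-affinity regulated site
k_reservoir = 200.0   # lower-affinity reservoir sites

for n_sites in (0, 100, 1000):
    f = free_factor(total_factor, n_sites, k_reservoir)
    print(n_sites, round(f, 1), round(occupancy(f, k_regulated), 2))
```

In this toy setting, adding reservoir sites lowers the free factor and therefore the occupancy of the regulated site, which is the qualitative behavior the reservoir model proposes.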
Some studies provide support for this model. For example, in the Drosophila studies that show binding at thousands of sites in the genome in addition to binding at regulated genes [22,37], higher-affinity binding occurred at regulated genes, and lower-affinity binding occurred in regions not regulated by the factors. This is consistent with the model that accessible DNA serves as a low-affinity reservoir for transcription factors and that these sites are not directly regulating regional gene transcription.
Other studies provide additional support for the notion that transcription factors will bind to any available sites genome-wide. ChIP-seq of 15 transcription factors and regulators involved in mouse embryonic stem (ES) cell biology demonstrated binding for multiple factors at the same 3,583 sites in both promoter and intergenic regions . Similarly, in Drosophila several of the patterning factors exhibit notable overlap in their binding sites, although there is variability in the degree of overlap. And while analyses of binding site sequences demonstrate, in general, factor specificity for preferred DNA-binding motifs previously identified in vitro, many regions also exist which lack consensus binding motifs . Therefore, some genome-wide binding might reflect factor interaction with accessible DNA regions that have not been specifically selected for a role in regional gene transcription.
Although likely correct in many instances, this model does not explain why there is an order of magnitude, or more, difference in genome-wide binding for factors with equivalently complex binding motifs. As noted above, MyoD has ~30,000–60,000 binding sites whereas TAL1 is reported to have ~3,000–6,000 sites in erythroid cells [29–31,34]. Both are bHLH factors that dimerize with an E-protein and recognize the core CANNTG E-box motif. The substantial difference in their genome-wide binding, however, suggests that sequence complexity is not the only determinant of binding. One possibility is that some factors are more constrained by site accessibility than others. MyoD can initiate chromatin remodeling at inaccessible sites and can bind independently of other factors, whereas the related bHLH factor Myogenin is more constrained to bind to accessible sites [33,34,39,40] and the TAL1 bHLH factor might require GATA1 or other factors to bind . This suggests that the difference in the number of MyoD and TAL1 binding sites might, at least in part, reflect their relative ability to make new sites accessible for binding and to bind independently of other factors.
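To illustrate how little the core motif constrains binding on its own, the following sketch scans a sequence for the CANNTG E-box and estimates how often it would occur by chance. The function names and the example sequence are hypothetical, and the chance estimate assumes equal, independent base frequencies, which real genomes do not strictly obey.

```python
import re

# Lookahead so that overlapping matches (e.g. CACATGTG) are all reported.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_eboxes(seq):
    """Return (start, motif) for every CANNTG match, overlaps included."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(seq.upper())]


def expected_ebox_count(length):
    """Expected matches in a random sequence with equal base frequencies.

    Four of the six positions are fixed, so each start position matches
    with probability 1 / 4**4 = 1/256.
    """
    return max(length - 5, 0) / 4 ** 4


# Hypothetical example sequence, for illustration only.
example = "ttCAGCTGaaCACATGTG"
for start, motif in find_eboxes(example):
    print(start, motif)
```

Because CANNTG is its own reverse complement as a motif class, scanning one strand is sufficient. Under the stated assumption the motif is expected roughly once every 256 bp, so the core sequence alone is shared by many more sites than either factor is reported to bind.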
Another, non-exclusive, model is that intergenic binding sites regulate gene transcription at a distance. Chromatin looping provides a mechanism for transcriptional control by bringing regulatory elements into proximity with target genes. Chromosome conformation capture studies indicate that the interaction of the distant locus control region (LCR) with the beta globin gene is required for high-level transcription. Interestingly, this interaction is dependent on GATA1 acting as an anchor . Given that GATA1 binds to over 15,000 sites, it is plausible that some proportion of these may affect transcription by inducing chromatin loops. In agreement with this idea, the LCR is necessary for globin genes to associate with transcriptionally-engaged PolII sites , while other experiments demonstrated the association of hundreds of specific genomic loci with the murine globin genes in `transcription factories' . In another specific example of chromatin looping leading to gene regulation, a Wnt-responsive enhancer downstream of the Myc gene has been shown to loop to cooperate with a 5' enhancer in a beta-catenin/TCF dependent fashion to regulate Myc expression . These studies suggest that genome-wide binding might establish productive long-range interactions, either by looping to bring distant enhancers together with promoters, or in more complex interactions such as the co-regulation found in transcription factories.
As noted above, many of the MyoD binding events are not directly associated with regional gene transcription, but rather with regional histone modifications associated with active or accessible chromatin . Genome-wide changes in chromatin also occur in response to Myc binding . Therefore, a major biological role of these factors, and perhaps other genome-wide binding factors, might not be to directly regulate transcription, but rather to re-organize the chromatin to make regions generally more accessible for factors expressed later in development. Such a role is supported by several studies of genome-wide influence on chromatin structure of general regulatory factors in yeast [46–48].
Although it might seem unusual to suggest that some transcription factors have a role in regional chromatin organization at some sites and function as typical transcription factors at others, these represent two related functions of many transcription factors and it is reasonable to imagine that they can be deployed independently. For example, at genes transcriptionally regulated by MyoD, MyoD recruits histone acetyltransferases and chromatin remodeling complexes prior to mediating transcriptional initiation, which often occurs following the binding of an additional transcription factor [33,49,50]. Therefore, the initial steps of transcription factor-mediated chromatin modifications can be distinguished from subsequent steps of transcriptional activation.
The suggestion that some transcription factors might have a role in regional chromatin organization that is independent of regional transcription is reminiscent of CTCF, which was originally identified as a transcription factor and is now recognized to have a broad role in chromatin organization. CTCF has also been found to have tens of thousands of binding sites in human and mouse cells [38,51]. The greatest portion of CTCF sites was located in intergenic regions and many were at the border of distinct chromatin regions, consistent with a role in demarcating different chromatin domains [51,52]. Furthermore, CTCF binding sites were flanked by arrays of well-positioned nucleosomes enriched in specific histone types (H2A.Z) and specific histone modifications, suggesting additional roles in broad changes in chromatin composition and structure .
Related to the model that some transcription factors might influence chromatin on a global scale is the idea that some of these factors might contribute to other aspects of regional nuclear organization. Apart from its role in affecting chromatin structure, CTCF may also mediate long-range chromatin interactions [54,55]. Also, as previously noted, both MyoD and Myc mediate broad epigenetic reprogramming within the nucleus, and it is reasonable to speculate that this activity might alter nuclear architecture and be important for their biological function. The ability to study changes in nuclear organization has recently become more accessible through the development of techniques such as Hi-C , and it will be interesting to determine whether the major role of some transcription factors is to re-organize the architecture of the nucleus.
The relationship between the feed-forward network motif and the evolution of new transcriptional regulatory networks is another theoretical model for understanding a potential biological role for genome-wide binding. Feed-forward regulation is the dominant motif for regulating complex biological pathways, with the ability to temporally regulate the expression of its targets while retaining the ability to rapidly cease target expression [10,57,58]. Feed-forward circuits have been found to occur repeatedly in S. cerevisiae, and have arisen via convergent evolution, suggesting their widespread utility .
Genome-wide transcription factor binding and feed-forward mechanisms might have led to the evolution of distinct regulatory networks from a common network, a theory that can be understood using MyoD as an example. MyoD directly binds and regulates genes expressed throughout the program of skeletal myogenesis. At many targets, binding alone is not sufficient for transcriptional activation, but instead requires cooperation with factors that MyoD also regulates, thereby achieving temporal patterning through the feed-forward circuit. The evolution of a feed-forward circuit can be easily understood as the refinement of an initial single-input motif (Figure 2). For example, a primitive MyoD-like factor might have initially activated all the genes necessary for a primitive muscle cell phenotype, providing some selective advantage for this initial event. Subsequently, feed-forward regulation could be superimposed on the single-input motif to gradually improve and regulate the final output.
One prediction of this model is that factors with the potential to regulate complex transcriptional programs would bind throughout the genome because mutations in factors that sample a large portion of the genome would have the highest probability of generating a new network by changing the expression of large numbers of genes. Again using MyoD as an example, MyoD binds within a regulatory distance of more than one-half of all genes . Altering the activation potential of MyoD through a translocation or mutation could drastically alter genome-wide transcription and potentially generate a novel complex phenotype from a single genetic event. In this model, genome-wide binding of a subset of transcription factors might reflect an evolutionary advantage rather than a cell-type specific function.
Comparing the findings from genome-wide transcription factor binding studies supports two general types of transcription factor binding. In some studies, the transcription factors tend to bind in the neighborhood of genes that they regulate, whereas in others the factors bind throughout the genome and relatively equivalently at both regulated and apparently non-regulated genes. A major caveat in suggesting that these might represent different biological strategies is the problem inherent to comparing results from different studies. Differences in sample preparation, data acquisition, and data processing can result in dramatically different conclusions that do not directly reflect the biology of the factors studied. Having acknowledged this important caveat, some factors appear to have binding profiles that reflect their regulatory network. For these factors it should be possible to infer their function based on knowledge of their binding sites, and, ultimately, it might be possible to compute their regulatory networks directly from knowledge of the organism's DNA sequence. The binding profiles of other factors appear much too dispersed across the genome to accurately correlate binding with regional transcription. For these factors, it might be impossible to infer their regulatory networks from DNA sequence, or even from knowledge of where they are physically bound. It remains to be determined whether these genome-wide binding events have one or more biological functions that are distinct from regulating regional transcription. Although speculative, this raises the intriguing possibility that the majority of binding events of some transcription factors might not serve the direct regulation of transcription, but rather a currently unrecognized role in genome-wide biology.
Transcription factors interact in a sequence-specific fashion with DNA to either increase or decrease transcription of gene targets. Transcription factors often bind and regulate multiple targets simultaneously, and targets, in turn, are frequently regulated by multiple factors. Regulatory networks can be constructed to describe these interactions, and represent the interactions that occur at multiple factor-target levels. Networks can be composed of various motifs, which represent the regulatory approaches taken by one or more factors at specific targets. Multiple types of motifs have been described, but two common ones are the feed-forward loop and the multi-input motif (Figure 1). Using these and other commonly found motifs (e.g., auto-regulatory loops in which a gene product downregulates its own production), transcription factors are able to establish complex and dynamic mechanisms of gene regulation.
S.J.T. was supported by NIH NIAMS R01AR045113. K.L.M. was supported by a Developmental Biology Predoctoral Training Grant T32HD007183 from the National Institute of Child Health and Human Development. A.P.F. was supported by a grant from the University of Washington Child Health Research Center, NIH U5K12HD043376-08.
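The network motifs described here can be recognized mechanically once factor-target interactions are written as directed edges. The sketch below is a minimal illustration on a hypothetical four-node network; the edge list, the node names other than MyoD, and the function names are assumptions made for the example, and the second function is a simplification that only lists targets with several regulators rather than detecting full multi-input motifs.

```python
from itertools import permutations


def feed_forward_loops(edges):
    """Return (x, y, z) triples where x regulates y and both regulate z."""
    edge_set = set(edges)
    nodes = sorted({node for edge in edge_set for node in edge})
    return [
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    ]


def multi_regulated_targets(edges, min_regulators=2):
    """Targets with at least `min_regulators` distinct regulators."""
    regulators = {}
    for factor, target in set(edges):
        regulators.setdefault(target, set()).add(factor)
    return {t: sorted(r) for t, r in regulators.items() if len(r) >= min_regulators}


# Hypothetical network: a factor activates a cofactor, and both are needed
# at a late target, while an early target needs the factor alone.
toy_edges = [
    ("MyoD", "cofactor"),
    ("cofactor", "late_gene"),
    ("MyoD", "late_gene"),
    ("MyoD", "early_gene"),
]
print(feed_forward_loops(toy_edges))
print(multi_regulated_targets(toy_edges))
```

Enumerating all ordered triples is adequate for small illustrative networks; genome-scale networks would call for a search over each factor's targets instead.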
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.