A crucial question in the field of gene regulation is whether the location at which a transcription factor binds influences its effectiveness or the mechanism by which it regulates transcription. Comprehensive transcription factor binding maps are needed to address these issues, and genome-wide mapping is now possible thanks to the technological advances of ChIP-chip and ChIP-Seq. This review discusses how recent genomic profiling of transcription factors gives insight into how binding specificity is achieved and what features of chromatin influence the ability of transcription factors to interact with the genome, and also suggests future experiments to further our understanding of the causes and consequences of transcription factor-genome interactions.
Understanding how genomic information is translated into gene regulation has been the subject of intense scientific investigation for the last several decades. Until recently, most studies focused on detailed characterization of a particular gene or gene family. These studies resulted in the development of general principles of gene regulation, but genome-scale studies are now prompting re-examination of some of these principles.
The established view of transcriptional regulation is that cis regulatory elements, such as promoters and enhancers, and proteins that bind to these elements control different levels of transcription of different genes1, 2. Promoters are composed of common sequence elements, such as a TATA box and initiator, and binding sites for other transcription factors, which work together to recruit the general transcriptional machinery to the transcriptional start site (TSS). Enhancers also contain binding sites for transcription factors but are located some distance from the site of transcription initiation. Transcriptional activity resulting from the general factors binding to the core promoter is usually quite low but can be increased by site-specific factors binding to proximal promoter regions, which can help to recruit or stabilize the interaction of the general factors at the core promoter. Promoter activity can be further stimulated by factors binding to distal enhancer regions and subsequent recruitment of a histone modifying enzyme that creates a more favorable chromatin environment for transcription, or a kinase that induces a bound initiation complex to begin elongation (Figure 1). Transcription can also be modulated by repressive factors that bind to upstream repressing sequences and/or silencers, which can interfere with activator binding (and thus prevent recruitment of the general transcriptional machinery) or recruit histone modifying complexes that create repressive chromatin structure.
Recent genome-scale studies have enabled more precise definition of thousands of promoters for known genes and identified many previously unrecognized transcription units, revealing that some previous assumptions about transcriptional regulation are not correct. For example, based on the detailed characterization of a small subset of promoters, a typical RNA polymerase II (RNAPII) promoter was thought to contain a TATA box located 30 bp upstream of the TSS. However, we now know that TATA-driven promoters are the exception and not the rule 3, 4. Other recent genomic studies suggest that ~50% of human genes have alternative promoters 5, indicating that regulatory sequences for a particular gene can be spread over a considerable distance. Clearly, access to large datasets documenting RNA expression and transcription factor binding on a genome-wide scale now provides an exciting opportunity for investigators to reevaluate previous models of transcriptional regulation. Of particular interest is the role of site-specific DNA binding factors, which is the focus of this review.
It has been estimated that there are 200-300 transcription factors, in humans, that can be considered components of the general transcriptional machinery that bind to core promoter elements (for example, subunits of RNA polymerases and complexes such as TFIID that are required for transcription of most protein-coding genes), and perhaps 1400 transcription factors that have sequence-specific DNA binding properties and thus regulate only a subset of genes by binding to site-specific cis elements 6-8. Interestingly, the site-specific factors tend to be either expressed in all or most tissues or instead are expressed in only one or two tissues, suggesting either a very broad or very specific function 7. Alterations in gene expression caused by the inappropriate level, structure, or function of a transcriptional regulator have been associated with a diverse set of human diseases, including cancers and developmental disorders 9. For example, 164 transcription factors have been shown to be directly responsible for 277 diseases 7. This is undoubtedly a large underestimate of the importance of transcription factors in human disease due to the fact that most human transcription factors are essentially uncharacterized 7. Because of the paucity of our knowledge concerning the function of transcription factors and the likelihood that increased knowledge of transcription factors will lead to increased insight into the causes of human diseases, it is of utmost importance to expand our understanding of how site-specific transcription factors contribute to gene regulation. Crucial questions that need to be addressed are: where do transcription factors bind in the genome; how is specificity of binding achieved; what features of the chromatin can influence the ability of transcription factors to stably interact with the genome; and how is binding of the factor related to its subsequent function in respect to regulation of a nearby gene?
Fortunately, recent advances in the techniques of chromatin immunoprecipitation followed by microarray (ChIP-chip) or by sequencing (ChIP-seq) (Box 1), and similar techniques such as DamID, now allow investigators to create a global map of specific protein-DNA interactions in a given cell type in a single experiment 10-19. Binding sites identified from these ChIP studies 20-28 are categorized relative to genomic features such as the nearest gene, frequency of binding relative to gene structure (for example a promoter, enhancer, exon, or intron), and the type of chromatin domain. The cost of ChIP-Seq depends partly on the depth of sequencing, but an estimate is that 10-12 million uniquely mapped reads should be sufficient for most human transcription factors, which can be obtained in 1 or 2 lanes of sequencing, for a cost of one to two thousand dollars. As multiple DNA microarrays are needed to cover the entire human genome, comprehensive studies by ChIP-chip are more expensive. However, for certain applications (such as detailed analyses of a protein complex binding to a small segment of a genome), a focused ChIP-chip experiment currently remains more cost-effective than a genome-wide ChIP-seq analysis.
Briefly, chromatin immunoprecipitation (ChIP) (illustrated in the figure) involves crosslinking DNA-binding proteins to DNA by treatment of cells with formaldehyde and preparation of chromatin by sonication or enzymatic digestion. An immunoprecipitation of the crosslinked chromatin is performed using an antibody that recognizes a specific transcription factor or histone isoform, resulting in the collection of all the binding sites in the genome for the factor of interest. After purification of the precipitated fragments, the sample can be analyzed by PCR to study particular genes. However, genome-wide analysis can be performed by microarray (ChIP-chip) or sequencing (ChIP-Seq). For ChIP-chip, the immunoprecipitated sample and input DNA, as a control, are labeled with fluorescent dyes and hybridized to microarrays. Binding sites are identified by the intensity of signal of the ChIP sample in relation to the signal of the input sample at each probe on the microarray using various ChIP-chip peak-calling programs 21, 22. For a single ChIP-chip experiment, most investigators use between 10^6 and 10^7 cells; however, recent methodological improvements using amplification methods have enabled successful ChIP-chip experiments with as few as 10^4 cells 77-80. For ChIP-seq, the immunoprecipitated sample is used to create a library that is analyzed using high throughput next generation sequencers. Binding sites are identified using various ChIP-seq peak calling programs 16, 26, 27, 81, 82, all of which identify target sites based on the number of sequenced tags from the ChIP library corresponding to each position in the genome. For a ChIP-seq experiment designed to map binding of a site-specific factor, most investigators use 10^7 to 10^8 cells, although 10^4 to 10^5 cells are sufficient for the ChIP-seq analysis of certain histone modifications 83.
It is important to note that because ChIP assays require such large cell numbers, the observed peaks in either ChIP-chip or ChIP-seq represent an average of binding of a factor at a particular site in the cell population. Thus, a small peak could represent very strong binding in only a subset of the cells (for example, cells at one stage of the cell cycle) or modest binding in the entire cell population. ChIP-seq experiments, which allow binding to be analyzed at all unique overlapping oligomers of a certain length (usually 27-50 nts are sequenced per fragment) in the genome, can provide very high resolution mapping of transcription factor binding sites. For example, three-fourths of all the ChIP-Seq peak positions for the DNA binding proteins CTCF, NRSF and STAT1 are within 18, 27 and 51 bp, respectively, of the nearest motif for that factor 82. In general, genome-scale ChIP-chip experiments are less precise in mapping the exact location of a binding site because the oligomers on the array are not overlapping but are spaced approximately 35-100 nt apart, due to the large number of arrays that would be required if overlapping oligomers were used.
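The idea that ChIP-seq binding sites are identified from the number of sequenced tags at each genomic position can be illustrated with a deliberately simplified sketch. It is not the algorithm of any of the peak-calling programs cited above; the function name, the fixed bin size, and the tag-count and fold-enrichment thresholds are all assumptions chosen for illustration, and the coordinates are made up.

```python
from collections import Counter

def call_enriched_bins(chip_tags, input_tags, bin_size=200, min_tags=5, min_fold=3.0):
    """Return start coordinates of bins where ChIP tags are enriched over input.

    chip_tags and input_tags are lists of tag start positions on a single
    chromosome (hypothetical data layout; real pipelines work from alignments).
    """
    chip_counts = Counter(pos // bin_size for pos in chip_tags)
    input_counts = Counter(pos // bin_size for pos in input_tags)
    # Scale the input library to the size of the ChIP library.
    scale = len(chip_tags) / max(len(input_tags), 1)
    enriched = []
    for bin_index, n_chip in sorted(chip_counts.items()):
        # Floor the expected count at 1 so that empty input bins do not
        # produce a division by zero or an inflated fold enrichment.
        expected = max(input_counts.get(bin_index, 0) * scale, 1.0)
        if n_chip >= min_tags and n_chip / expected >= min_fold:
            enriched.append(bin_index * bin_size)
    return enriched

# Toy data: six ChIP tags pile up between positions 1000 and 1199.
chip = [1010, 1020, 1050, 1100, 1150, 1199, 5000, 9000]
control = [100, 300, 1100, 3000, 5000, 7000, 9000, 9100]
print(call_enriched_bins(chip, control))
```

Because the counts come from a pooled cell population, an enriched bin in such a sketch carries the same ambiguity described above: it cannot distinguish strong binding in a few cells from modest binding in all of them.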
This review summarizes recent discoveries provided by genome-wide profiling of site-specific transcription factors and how they have led to new insights regarding patterns of transcription factor binding, how binding specificity is achieved, and what features of the chromatin can influence the ability of transcription factors to interact stably with the genome. The focus will be on the human genome, although relevant insights from other organisms are also incorporated (in particular when studies using model organisms are more advanced than similar studies of the human genome) as it is likely that the implications of transcription factor recruitment for gene regulation will be similar across all eukaryotes. Importantly, genome-wide studies have not only provided new information, they have also created new challenges in our understanding of gene regulation, such as why certain transcription factors bind to so many places in the genome and why so much of the regulation appears to be via steps that occur after recruitment of the site-specific factor to the DNA. Therefore, this review concludes with suggestions for future experiments that are needed to further our understanding of the causes and consequences of specific transcription factor-genomic interactions.
Two decades ago, investigators were using in vitro assays or reporter constructs to define cis elements necessary for basal transcriptional activity or regions that control cell type-specific, hormonal, or environmental transcriptional responses. In most cases, relatively small promoter segments (from 500 bp to perhaps 10 kb upstream of a TSS) were used as the starting point for mutational analyses. One common observation was that severe truncation of a fragment could cause large changes in promoter activity but that incremental deletion of the 5′ end of the fragment resulted in only minor changes in activity, suggesting that multiple transcription factor binding sites were scattered throughout the analyzed region (for example ref. 29). In contrast, other studies found that hormonal regulation or cell type-specific transcription from a promoter could not be reproduced using reporter assays (for example ref. 30). Such results raised two important questions that are now being addressed by genome-wide binding analyses: do different transcription factors bind in clusters near each other, and are most of the binding sites for a given transcription factor located in proximal promoter regions?
Transcription factors have been categorized into those that bind proximal promoters and those that bind enhancers1, 2. However, in most past work, a single binding site, or in some cases a small set of sites, was studied for a particular factor. Such focused analyses do not allow general conclusions to be drawn as to whether a factor usually binds near or distal to a promoter region. Thus, accurate categorization of factors is not possible without genome-wide analysis of binding sites. Knowing the location, relative to the TSS, at which a factor binds is of interest as it can provide insight into the mechanisms by which it regulates transcription (Figure 1). For example, factors that bind close to TSSs have been proposed to regulate transcription by stabilizing general transcription factors at the core promoter elements; factors that bind to distal regions, either upstream or downstream of a gene, might regulate transcription by mediating protein-protein contacts between distal complexes and the general transcriptional machinery bound at start sites (that is, by a looping mechanism). Thus, comprehensive location analysis of a factor can not only allow the development of a genomic map but can also provide insight into the mechanisms by which it regulates transcription.
Initial large-scale analyses of transcription factor binding, by ChIP-chip, focused on the identification of binding sites near CpG islands or within 1-5 kb of the TSS of known genes 15, 31-34. Although these studies identified hundreds, and in some cases thousands, of promoters that were bound by a particular transcription factor, they were limited to target sites in proximal promoter regions, so it was not known whether the identified sites were representative of the majority of the genomic binding sites for a given factor. Analyses of 1% of the human genome as part of the ENCODE pilot project, which are being continued both by the ENCODE Consortium and others 3, 22-24, 35, 36, have now shown that transcription factors that bind almost exclusively at proximal promoters might be the exception, not the rule. Some factors, for example E2F transcription factor family members, are almost always bound in proximal promoter regions (Figure 2A). In fact, it is often difficult to distinguish E2F binding patterns from the binding patterns of general transcription factors such as RNAPII or the TATA box binding protein-associated factor TAF1 15, 22. However, other factors that have recently been analyzed by genome-wide ChIP-chip or ChIP-seq, such as GATA1 and ZNF263, bind to diverse regions of the genome (Figure 2B), including extragenic regions distant from the TSS and intragenic regions (including both introns and exons). Other examples of transcription factors that have widespread binding patterns include p53, p63, the estrogen receptor, FoxA2, and TCF4 10, 13, 24, 36, 37.
Although it is difficult to make accurate comparisons of binding patterns generated by different research groups using different experimental platforms, genome-wide profiles for a large number of factors were compared in the ENCODE pilot project 3. This study found that less than 10% of the factors tested had greater than 50% of their binding sites within 2.5 kb of a transcription start site (see Figure 6 and Figure S31 of ref 28). Another study, which analyzed 13 site-specific factors in mouse ES cells using ChIP-seq, also found that many binding sites were located outside of proximal promoter regions 38. Clearly, a typical reporter or in vitro assay cannot monitor the contribution to promoter activity of sites distant from the proximal promoter. The new findings of the distribution of factors throughout the genome might explain many of the failed attempts in the past to demonstrate accurate regulation of a target gene using reporter assays or transgenic constructs. Also, the distributive pattern of binding seen for many factors has important implications for subsequent functional analyses. For example, it is not easy to link enhancers to specific promoters if the enhancer is between two genes, but at a great distance from both; this is discussed in more detail below.
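The comparison above rests on a simple statistic: the fraction of a factor's binding sites that lie within a fixed distance (here 2.5 kb) of a transcription start site. A minimal sketch of that calculation follows. The function names and the single-chromosome, point-coordinate data layout are assumptions for illustration, the coordinates are invented, and the TSS list is assumed to be non-empty.

```python
from bisect import bisect_left

def distance_to_nearest_tss(site, sorted_tss):
    """Distance in bp from a binding site to the closest TSS in a sorted list."""
    i = bisect_left(sorted_tss, site)
    # Only the TSS just before and just after the site can be the nearest one.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(site - tss) for tss in candidates)

def fraction_promoter_proximal(sites, tss_positions, window=2500):
    """Fraction of binding sites within `window` bp of any TSS."""
    sorted_tss = sorted(tss_positions)
    near = sum(1 for s in sites if distance_to_nearest_tss(s, sorted_tss) <= window)
    return near / len(sites)

# Invented coordinates on one chromosome.
tss = [10000, 50000, 120000]
sites = [9000, 12600, 51000, 80000]
print(fraction_promoter_proximal(sites, tss))
```

In this toy example two of the four sites fall within 2.5 kb of a TSS. A factor such as E2F would be expected to give a value close to 1 by this measure, whereas the widely distributed factors discussed above would give much lower values.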
Early studies of Drosophila melanogaster development identified regulatory regions that are bound by combinations of different transcription factors, leading to the concept that transcription factors can cluster near each other to regulate transcription cooperatively 39. For example, enhancers that regulate D. melanogaster segmentation contain a module that typically receives input from multiple transcription factors and has multiple binding sites for each of the factors; in many cases the binding sites are clustered within a small interval of 0.5-1 kb. Recently, large-scale profiling of the binding patterns of a set of D. melanogaster transcription factors revealed binding hotspots, each 1-5 kb in length and spaced ~50 kb apart 40. The D. melanogaster genome is one-tenth the size of the human genome and therefore it is not yet clear if the same sort of clustering will be commonly found for human transcription factors.
Owing to the large size of the human genome and the large number of transcription factors (~ 1400), most investigations of the concept of clustered binding sites creating a regulatory element have used computational tools 41. As detailed below, bioinformatic analyses are not sufficient to determine which of all possible binding sites are actually occupied by a transcription factor in vivo. However, there is some experimental evidence that at least a few binding hotspots do exist in the human genome. An extensively studied mammalian enhancer is the interferon beta enhanceosome 42, 43 in which 8 transcription factors bind to overlapping elements within a 55 bp region upstream of the interferon beta gene (IFNB1). This enhancer was characterized over many years using classical mutational analyses of a single regulatory element. Although very few regions of the human genome have been characterized in as much detail as the IFNB1 enhancer, several other enhancer regions have been fairly well-studied, including the mouse and chicken beta-globin locus control regions and the human growth hormone and MHCII enhancer regions 44.
Chen et al. analyzed a set of factors that work together to mediate pluripotency and maintain self-renewal properties of mouse ES cells 38. They found that some regions (termed MTL for multiple transcription factor-binding loci) were bound by several factors. Specifically, clusters of Nanog, Oct4, and Sox2 sites were identified outside of promoter regions, suggesting that these regions might be enhancers, and a subset of MTL showed strong enhancer activity in follow-up experiments. Identification of these MTL may have been facilitated by the fact that Nanog, Oct4, and Sox2 were previously known to cooperate in regulating the mouse ES cell transcriptome.
Unfortunately, to date only a handful of human factors (very few of which have been implicated in regulating the same sets of genes) have been analyzed using ChIP-seq and these factors do not seem to show a large degree of overlap in binding at locations outside of promoter regions (Figure 2B). However, it is hard to know if the lack of observed clustering is because there are in fact no hotspots for binding in the human genome or because the correct combinations of factors have not yet been studied. Knowledge of the extent of clustered binding in mammalian genomes must await the collection of more ChIP-seq data. Genome-wide analyses of enhancers based on specific histone modification patterns have also recently been initiated 45, 46. However, identifying a potential enhancer region based on histone patterns does not reveal how many site-specific factors bind to the region. If clusters of binding sites are found in mammalian genomes, they could correspond to enhanceosomes similar to the one at IFNB1, with multiple factors all working together to mediate transcriptional activation. Alternatively, they could represent nonfunctional “storage bins” for excess transcription factors, provide functional redundancy that decreases the chances that a gene might be turned off due to mutation, or allow activation of a gene by multiple different signaling cascades.
In vitro studies, such as CASTing (cyclic amplification and selection of targets), and sequence comparisons of small sets of promoters known to be bound by a factor have allowed the derivation of consensus binding motifs for some transcription factors 47. Subsequent bioinformatic analyses that search the human genome using consensus motifs or position weight matrices (a collection of motifs similar, but not identical, to the consensus motif) allow the identification of all locations in the genome to which a transcription factor might bind 41, 48. This approach provides the set of all possible locations for a given factor; however, in a mammalian genome there are clearly many more occurrences of a consensus motif for a given factor than there are binding sites 37, 49. Also, the utility of bioinformatic studies relies on the assumption that transcription factors are recruited to the genome in vivo via motifs similar to those identified in in vitro studies. These caveats have led to uncertainties as to the importance of consensus motifs for in vivo binding. ChIP-chip and ChIP-seq studies have allowed investigators to address two important questions concerning motif usage: what percentage of binding sites contain a consensus motif and what influences whether a specific motif is in fact bound by a particular factor?
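A genome search with a position weight matrix can be sketched as follows. This is a generic log-odds scan, not the method of any specific tool cited here. The four-column count matrix is an invented toy motif rather than the motif of a real factor, and the pseudocount, uniform background frequency, score threshold, and forward-strand-only scan are simplifying assumptions.

```python
import math

def log_odds_matrix(count_columns, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    matrix = []
    for counts in count_columns:
        total = sum(counts.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguous bases
        score = sum(column[base] for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Invented toy motif (consensus TGAC) built from eight hypothetical bound sites.
toy_counts = [{"T": 8}, {"G": 8}, {"A": 8}, {"C": 8}]
pwm = log_odds_matrix(toy_counts)
hits = scan("AATGACTT", pwm, threshold=4.0)
print([position for position, _ in hits])
```

A scan of this kind reports every sequence match regardless of whether the site is occupied in vivo, which is exactly the limitation described above: the predicted sites greatly outnumber the sites detected by ChIP.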
Although some factors appear to be recruited to a majority of their binding sites via a common motif, other factors seem to have a more diverse set of recruitment mechanisms. For example, members of the E2F family appear to lack a requirement for a specific motif for binding in vivo 49. In contrast, the set of binding sites for factors such as p63, STAT1 and NRSF show high enrichment for a specific motif 16, 20, 37. It should be stressed that binding detected at sites that lack a consensus motif is not due to a general, low affinity DNA binding activity. ChIP-chip and ChIP-seq measure DNA-protein interactions as an average of individual binding events in millions of cells and a peak at a site without a motif can be as high and as sharp as a peak located over a consensus motif, which is inconsistent with random protein-DNA interaction.
Several mechanisms have been proposed to explain how specific recruitment can occur in the absence of a consensus motif (Figure 3). These include: binding at a distal site that contains a consensus motif and looping to the site in question via protein-protein interactions (perhaps via a co-activator or co-repressor); ‘piggyback’ binding mediated by protein-protein interactions with a second factor, with no contribution of the DNA binding domain of the first factor; or assisted binding to a site that is somewhat similar to the consensus site, enhanced by protein-protein interaction with another site-specific DNA binding factor or with a specifically modified histone. Clearly, the greater the contribution of protein-protein interactions to the genomic localization of a factor, the greater is the difficulty of using a strictly bioinformatic approach to identify in vivo binding sites.
Sorting binding sites for a factor into subsets that contain or lack a specific motif might eventually provide insight into alternative recruitment or regulatory mechanisms mediated by that factor; the ability of a factor to be recruited to the genome in more than one way might allow a factor to participate in multiple different signaling pathways. For example, serum response factor (SRF) is ubiquitously expressed but its activity is modulated at several levels, including protein-protein interaction 50 51. Perhaps recruitment of SRF via a consensus motif allows for regulation of one set of targets in many cell types, whereas stabilized binding via protein-protein interaction to sites lacking the consensus motif allows the constitutively expressed SRF to also have some cell type-specific functions. It should be noted that even factors that prefer to bind to regions containing a specific motif can also have subsets of binding sites that lack that motif 52, 53. A recent study has shown that the ability of a factor to bind to more than one motif is not necessarily due to protein-protein interactions but instead can be observed using purified proteins and in vitro assays. Using protein binding microarrays, Badis et al. 54 found that about half of a set of 104 mouse DNA binding proteins recognized multiple different sequence motifs. Such studies suggest that motif analysis of ChIP-seq data should be performed under the assumption that more than one motif can be present in the set of identified binding regions.
As discussed above, a major difficulty with using a bioinformatics motif-driven approach to identify binding sites is that it is clear that only a small percentage of all occurrences of a motif are actually bound by that factor. Therefore, the majority of regions in the genome that contain a consensus motif for a given factor are not occupied. Lack of binding to the genome in certain regions could be due to chromatin structure (inaccessibility due to close packing of the nucleosomes in heterochromatin) or to DNA methylation (reduced binding affinity due to methylation of a critical residue in the recognition motif). However, a comparison of unoccupied E2F consensus sites in a human breast cancer cell line to sites of repressive chromatin (that is, histone H3 trimethylated on lysine 9 or lysine 27 (H3K9me3 or H3K27me3)) and DNA methylation showed that neither repressive histone marks nor DNA methylation appeared to account for the lack of E2F binding 49. An alternative possibility is that specific histone modifications enhance transcription factor recruitment to certain genomic regions. For example, recent ChIP-chip and ChIP-seq studies have shown that histone H3 monomethylated at lysine 4 (H3K4me1) is localized at enhancer regions 45, 46. Of course, it is not known whether the histone modification or the binding of a factor comes first, but it is possible that certain factors might have an affinity for a specific histone modification. For example, PHD finger domains in several proteins, such as the TAF3 subunit of TFIID, BPTF, and ING2, can mediate a specific high affinity interaction with histone H3 trimethylated on lysine 4 55 56 57, 58, which is highly localized to promoter regions 3 46. 
PHD domains in site-specific factors or co-activators could help localize DNA binding factors to consensus motifs located in proximal promoters; other domains might mediate interaction of transcription factors or co-activators with H3K4me1, resulting in preferential occupancy of motifs located in enhancer regions (Figure 3D).
Although each of the models presented in Figure 3 is possible, it is generally not clear why some consensus motifs are occupied and others are not. Perhaps once we have binding maps for hundreds of factors, it will become obvious that binding of a factor to one motif commonly prevents another motif from being occupied by a different factor. For example, an ETS and an E2F binding site overlap in the MYC promoter and it is only after mutation of the E2F site that ETS1 can bind in vivo 59. Alternatively, as described above, we might find that stable binding is rarely mediated by a single DNA-protein interaction but requires cooperative binding between adjacent site-specific factors either through direct interaction between the two site-specific factors or indirect interaction through a platform such as a co-activator or co-repressor 60.
The discovery of thousands of binding sites by genome-wide profiling has raised two important questions: can a factor occupy a certain site in many cell types but regulate transcription via binding to that site in only one (or a few) cell types, and is functional redundancy a built-in safeguard for maintaining accurate regulation of the genome?
Several recent studies have attempted to assess the functional importance of each of the thousands of binding sites for a given factor by altering the level of that factor in the cell. A frequent finding is that changing the level of a factor alters expression of 1-10% of the potential target genes 12, 37, 61, 62. One interpretation of these results is that most binding is not functional. There are, however, several caveats to this conclusion. First, the assignment of a specific binding site to a target gene is not always accurate. Investigators use the most expedient approach, which is to assign the binding site to the nearest known gene, but this can lead to false pairing of binding sites with target genes due to long-range regulation, undiscovered genes, or alternative upstream promoters. Changes in expression of a gene that does not have a nearby binding site for the factor that is altered might initially be interpreted as indicative of indirect regulation, but might be owing to direct regulation by a site many thousands of kb away (Figure 4A). Second, altering expression of a human transcription factor is fraught with problems. Down regulation of a transcription factor in human cells is usually accomplished using small interfering RNAs (siRNAs or shRNAs). However, loss of expression is rarely complete; it is possible that a reduction of 90% of the protein might not have functional consequences if there is a 10-fold excess of the factor under normal conditions. Many studies are performed in cancer cell lines that can have, as shown by western blot, a massive increase in the amount of a particular transcription factor compared to a normal cell. Thus, what appears to be an efficient knockdown in a cancer cell line might leave sufficient levels of the factor for normal regulation (Figure 4B). Very few studies have actually shown reduced binding of a transcription factor in knockdown cells by ChIP-chip or ChIP-seq. To deal with this problem, mouse knockouts can be used.
However, cells from these mice could undergo compensation for loss of a factor during development, resulting in related proteins being selected to regulate the target genes. Third, closely related family members might bind to the same sites and have the same function. Thus, elimination of one family member could allow more binding of another family member (Figure 4C). Finally, only a small proportion of the binding sites for a factor might be functional in a given cell type. For example, if a cell type-specific partner needs to be recruited for transcriptional activity, then binding of the site-specific factor is necessary but not sufficient for transcription of a target gene (Figure 4D). Thus, knockdown of a factor in 10 cell types may show 10 different subsets of affected target genes. To address this possibility, one would have to collect ChIP-chip or ChIP-seq data and gene expression data before and after knockdown of the factor in a diverse set of cell lines. However, most transcription factors have been studied on a genome-wide scale in only one cell type. The ENCODE consortium (http://www.genome.gov/10005107) has chosen a set of different cell types for thorough characterization of binding of a large number of site-specific factors and initial studies appear to show that factors can be grouped into those that show very little cell type specificity in binding, such as E2F4 and YY1 (H. O’Geen and P. Farnham, unpublished observations) and those that show considerable cell type-specific binding, such as JunD (D. Raha and M. Snyder, personal communication) and the estrogen receptor 15, 63, 64. Continuing studies will address whether factors that have small numbers of cell type-specific binding sites show regulation of a large percentage of their target genes in a given cell type compared to factors that show constitutive binding to a large number of sites and might regulate only a subset of target genes in each cell type.
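The second caveat, that an apparently efficient knockdown can leave enough protein for normal regulation, is a matter of simple arithmetic. The sketch below makes it explicit under a strong simplifying assumption that is not taken from the studies cited: that a factor's function depends only on whether its level stays at or above a fixed requirement. The function name and parameters are hypothetical.

```python
def residual_relative_to_requirement(fold_excess, fraction_remaining):
    """Factor level after knockdown, divided by the level needed for regulation.

    fold_excess: normal level of the factor relative to the required level
                 (10 means a 10-fold excess).
    fraction_remaining: fraction of the protein left after knockdown
                        (0.1 corresponds to a 90% reduction).
    """
    if not 0.0 <= fraction_remaining <= 1.0:
        raise ValueError("fraction_remaining must be between 0 and 1")
    return fold_excess * fraction_remaining

# A 90% knockdown of a factor present in 10-fold excess leaves roughly the
# required amount, so little or no change in target gene expression is expected.
print(residual_relative_to_requirement(10, 0.1))

# The same knockdown of a factor with only a 2-fold excess leaves about 20%
# of the required amount.
print(residual_relative_to_requirement(2, 0.1))
```

Under this assumption, the larger the excess of a factor in a cell line (for example in cancer cells that overexpress it), the more complete a knockdown must be before any effect on target genes would be expected.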
Many previous analyses of transcriptional regulation used the assumption that transcription factors act as “individuals”, with a specific assigned role in regulating a particular gene and a specific mechanism of action. However, it is possible that a factor acts as an individual at a subset of its sites (perhaps those that show altered regulation of a nearby gene upon loss or enhanced expression of that factor), but has a very different “community” function at other sites. For example, binding of a set of factors in a cluster might regulate transcription throughout a chromatin domain by helping to keep an open chromatin structure, through recruitment of histone acetyltransferases or histone methyltransferases. Loss of a single factor would not affect transcription of the nearby genes; it would take the removal of a large proportion of factors bound in the cluster to alter gene regulation (Figure 5a). Alternatively, a cluster of bound factors could serve to define a local genomic search space for a second binding factor. Recent studies have shown that many transcription factors have a very fast dissociation rate in vivo 65. A factor might rebind to the same region of DNA, but in a nonspecific manner, and begin scanning for its high affinity binding site. If the factor moves unimpeded in the wrong direction, there could be a detrimental time lag before it finds another binding site. However, a cluster of bound factors that blocks scanning in the wrong direction might favor release, rebinding, and perhaps scanning in the correct direction. That is, binding of a cluster of factors might affect the expression of a nearby gene whose activation is controlled by an entirely different factor. Again, reduced expression of one of the “bumper proteins” may be fairly inconsequential; loss of several factors from the cluster would be required to cause a significant effect (Figure 5b). 
Data to support either of these possibilities is not yet available due to the lack of genome-wide binding information for most transcription factors.
Enormous progress has been made in mapping transcription factor binding sites throughout the genome, and expanding the number of transcription factors for which we have information about global binding patterns remains very important. However, simply collecting genome-wide datasets will not be sufficient to answer all crucial questions. A number of methodological problems now need to be tackled.
It is not yet possible to conclusively link a specific binding site with a specific target gene. It remains possible that many binding sites, scattered perhaps tens or hundreds of kb away from each other (or perhaps even on different chromosomes), all cooperate to regulate a single target gene. If so, then linking a binding site to the nearest gene is not appropriate and leads both to an incorrect assignment of target genes and to an underestimate of the number of binding sites that contribute to transcriptional regulation. Methods that define features of chromosomal architecture such as transcription factories 66, 67 could aid in defining co-regulated groups of genes, perhaps by collapsing thousands of seemingly unlinked binding sites into a smaller number of interactomes. For example, 3C, a technique that can identify chromosomal loops mediated by multiple, long-range protein-protein interactions 68, might reveal a connection between an enhancer binding protein and the promoter of a distant gene, and thereby allow a more accurate interpretation of the regulatory role of that factor in the cell.
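The nearest-gene rule criticized above is simple to state in code, which also makes its limitation easy to see. The following is a minimal sketch, not a method from any cited study: the function name, the gene names, and the coordinates are hypothetical, and it assumes a single chromosome with one transcriptional start site (TSS) per gene.

```python
from bisect import bisect_left


def nearest_gene(site_pos, tss_list):
    """Assign a binding site to the gene with the closest TSS.

    site_pos: genomic coordinate of the binding site
    tss_list: list of (tss_position, gene_name) tuples, sorted by position,
              all on the same chromosome (hypothetical input format)
    Returns (gene_name, distance_in_bp). Ties go to the upstream TSS.
    """
    if not tss_list:
        raise ValueError("tss_list must contain at least one gene")
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, site_pos)
    # Only the TSS just before and just after the site can be the nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - site_pos))
    return gene, abs(pos - site_pos)


# Toy annotation with made-up coordinates.
tss = [(1_000, "geneA"), (50_000, "geneB"), (400_000, "geneC")]
print(nearest_gene(60_000, tss))
```

In this toy annotation the site at position 60,000 is assigned to geneB. If the site in fact loops to the promoter of geneC, the rule reports the wrong target, and no choice of distance cutoff repairs that without independent evidence of the physical interaction, such as 3C data.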
Although ChIP-seq can identify all the binding sites for a given factor in a given cell type, the possibility that we must perform ChIP-seq experiments in many different cell types to determine all possible binding sites for a given factor is quite daunting. The ENCODE Consortium is currently performing studies to estimate how many cell types are needed to identify most binding sites for a set of factors. If a limited, but diverse, set of cell types could be identified that are representative of many different human tissues, then perhaps genome-wide analyses will not have to be performed in every possible cell type.
Most approaches designed to study the relationship between a specific cis element and a potential target gene involve creating a reporter construct that includes the regulatory element of interest 4, 69. Unfortunately, as reporter analyses remove the cis element from its normal genomic context they cannot reveal effects on long-range regulation. Precise mutation or deletion of a single cis element within the genome can be performed in model organisms such as yeast, for which efficient methods to substitute genomic sections have been developed. Theoretically, mutations could be engineered to alter a specific binding site in animal models or human cell lines. However, mutagenesis of specific small regions of the mouse or human genome is not routinely used to study the significance of individual binding sites, due to low frequencies of homologous recombination that limit the efficiency of this technique. New approaches in site-specific targeting of DNases using artificial zinc fingers 70 might improve the efficiency of genomic replacement, so mutagenesis could become a practical method for dissecting the role of individual cis elements. Also, artificial zinc fingers fused to transcriptional activation or repression domains have been used to specifically regulate cellular promoters 71. It is therefore possible that artificial zinc fingers (without either an activation or repression domain) could be used to simply block access of a factor to a single binding site in the genome, but this has not yet been demonstrated successfully. Other possible methods include the use of pyrrole-imidazole polyamides or peptide nucleic acids to bind to (and perhaps also mutate) specific cis elements in the genome 71-73. 
Although very few studies have used these methods to target a specific site and even fewer have examined the consequences of such agents on the entire transcriptome, they do hold the promise of providing a method to test the function of a specific binding site in its natural genomic context.
ChIP-chip and ChIP-seq have greatly advanced our understanding of gene regulation. First, genomic studies have confirmed that RNAPII together with general and site-specific factors are bound to thousands of proximal promoters that are active at very low levels 74-76, thus supporting the first step in the model set out in the introduction. These studies have also revealed that binding of a factor to an enhancer region can be necessary, but not sufficient, for high levels of promoter activity, which leads to the inclusion of a new step in the model (Figure 6, step 3): the binding of a cell type-specific partner protein that allows the recruitment of a coactivator, resulting in cell type-specific function of a constitutively expressed factor. Although the principle that binding of a transcription factor can be necessary, but not sufficient, for regulation of a specific gene was previously established using one-gene-at-a-time approaches, it was not clear whether a cooperative mode of regulation was the exception or the rule for most genes. Recent genome-wide analyses suggest that this type of regulation is very common. For example, of the ~3700 Oct4, ~4500 Sox2 and ~10,000 Nanog binding sites identified in mouse ES cells, only a small number of regions were bound by all three factors and by the co-activator p300 38. These studies support the hypothesis that occupancy of an upstream site by a single factor (Oct4) was not functional (as in Figure 6, step 2), but binding of Sox2 and/or Nanog near the occupied Oct4 site resulted in recruitment of the p300 coactivator (step 3) and transcriptional activation.
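Identifying regions bound by all three factors and by p300, as described above, amounts to intersecting peak lists. The sketch below shows one simple way to do this and is not the procedure used in the cited study. The function names and peak coordinates are hypothetical, peaks are treated as half-open (start, end) intervals on a single chromosome, and "co-bound" is assumed to mean an overlap of at least one base.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def co_bound(anchor_peaks, other_peak_sets):
    """Return the anchor peaks that are overlapped by at least one peak from every other set.

    anchor_peaks: list of (start, end) peaks for the anchor factor
    other_peak_sets: dict mapping factor name -> list of (start, end) peaks
    """
    return [
        peak
        for peak in anchor_peaks
        if all(
            any(overlaps(peak, other) for other in peaks)
            for peaks in other_peak_sets.values()
        )
    ]


# Made-up peaks for illustration only.
oct4 = [(100, 200), (1000, 1100), (5000, 5100)]
others = {
    "Sox2": [(150, 250), (5050, 5150)],
    "Nanog": [(180, 260), (1050, 1150), (5090, 5200)],
    "p300": [(190, 300)],
}
print(co_bound(oct4, others))
```

With these toy peaks only the first Oct4 site is overlapped by Sox2, Nanog and p300 together, which mirrors the observation that fully co-occupied regions are a small subset of each factor's sites. The pairwise scan is quadratic in the number of peaks; for real genome-wide peak sets a sorted sweep or an interval tree would be the more practical choice.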
Other discoveries, such as the findings that most transcription factors bind to thousands of places in the genome, that binding sites are not localized only in proximal promoter regions, and that some binding sites lack sequences similar to the consensus motif, have also stimulated new ideas concerning long-range and combinatorial regulation. However, current genomic studies have not yet determined whether most transcription factors cluster at hotspots in the human genome or with what frequency binding events have a functional outcome. The answers to these two questions will require the genomic profiling of many more factors. It is likely that a true understanding of the role of a given factor at a particular site in the genome will require the identification of all other factors binding nearby and knowledge of histone modifications in that region. These studies will be best performed by cooperation between large groups, such as the ENCODE Consortium (http://www.genome.gov/10005107) and the NIH Roadmap Epigenomics Program (http://nihroadmap.nih.gov/epigenomics/), which can identify binding sites for a large number of transcription factors and develop reference epigenomes in many different cell types, and individual investigators, who can perform the follow-up functional analyses of the role of a specific factor in a particular cell type. The next several years of large-scale data collecting should provide investigators with a plethora of information that will form the basis for hundreds of follow-up experiments that address important biological questions.
The author thanks Xiaoqin Xu, Henriette O’Geen, and Seth Frietze for providing data used in Figure 2 and the members of the Farnham lab for their insights and discussions.
Dr. Farnham earned a B.A. from Rice University, Houston, USA, a PhD from Yale University, New Haven, USA, and performed postdoctoral work at Stanford University, Palo Alto, USA. She was a faculty member at the University of Wisconsin, Madison, USA, from 1987 to 2004 and moved to the University of California Davis, USA, in 2004, where she is Associate Director of Genomics. Dr. Farnham has been a leader in using chromatin immunoprecipitation (ChIP) to study mammalian transcription factors. Currently, she is using ChIP with high throughput sequencing (ChIP-seq) to analyze chromatin structure, as a member of a Reference Epigenome Mapping Center, and for identification of target genes of human transcription factors, as a member of the ENCODE Consortium.