Understanding how genomic information is translated into gene regulation has been the subject of intense scientific investigation for the last several decades. Until recently, most studies focused on detailed characterization of a particular gene or gene family. These studies resulted in the development of general principles of gene regulation, but genome-scale studies are now prompting re-examination of some of these principles.
The established view of transcriptional regulation is that cis-regulatory elements, such as promoters and enhancers, and the proteins that bind to these elements control different levels of transcription of different genes 1, 2. Promoters are composed of common sequence elements, such as a TATA box and initiator, and binding sites for other transcription factors, which work together to recruit the general transcriptional machinery to the transcriptional start site (TSS). Enhancers also contain binding sites for transcription factors but are located some distance from the site of transcription initiation. Transcriptional activity resulting from the general factors binding to the core promoter is usually quite low but can be increased by site-specific factors binding to proximal promoter regions, which can help to recruit or stabilize the interaction of the general factors at the core promoter. Promoter activity can be further stimulated by factors binding to distal enhancer regions and subsequent recruitment of a histone-modifying enzyme that creates a more favorable chromatin environment for transcription, or of a kinase that induces a bound initiation complex to begin elongation (see the figure). Transcription can also be modulated by repressive factors that bind to upstream repressing sequences and/or silencers, which can interfere with activator binding (and thus prevent recruitment of the general transcriptional machinery) or recruit histone-modifying complexes that create a repressive chromatin structure.
Figure: Transcriptional regulation by promoters and enhancers
Recent genome-scale studies have enabled more precise definition of thousands of promoters for known genes and identified many previously unrecognized transcription units, revealing that some previous assumptions about transcriptional regulation are not correct. For example, based on the detailed characterization of a small subset of promoters, a typical RNA polymerase II (RNAPII) promoter was thought to contain a TATA box located 30 bp upstream of the TSS. However, we now know that TATA-driven promoters are the exception and not the rule 3, 4. Other recent genomic studies suggest that ~50% of human genes have alternative promoters 5, indicating that regulatory sequences for a particular gene can be spread over a considerable distance. Clearly, access to large datasets documenting RNA expression and transcription factor binding on a genome-wide scale now provides an exciting opportunity for investigators to reevaluate previous models of transcriptional regulation. Of particular interest is the role of site-specific DNA binding factors, which is the focus of this review.
It has been estimated that, in humans, there are 200-300 transcription factors that can be considered components of the general transcriptional machinery that bind to core promoter elements (for example, subunits of RNA polymerases and complexes such as TFIID that are required for transcription of most protein-coding genes), and perhaps 1400 transcription factors that have sequence-specific DNA binding properties and thus regulate only a subset of genes by binding to site-specific cis elements 6-8. Interestingly, the site-specific factors tend to be expressed either in all or most tissues or in only one or two tissues, suggesting either a very broad or a very specific function 7. Alterations in gene expression caused by the inappropriate level, structure, or function of a transcriptional regulator have been associated with a diverse set of human diseases, including cancers and developmental disorders 9. For example, 164 transcription factors have been shown to be directly responsible for 277 diseases 7. This is undoubtedly a large underestimate of the importance of transcription factors in human disease, because most human transcription factors are essentially uncharacterized 7. Because of the paucity of our knowledge concerning the function of transcription factors and the likelihood that increased knowledge of transcription factors will lead to increased insight into the causes of human diseases, it is of utmost importance to expand our understanding of how site-specific transcription factors contribute to gene regulation. Crucial questions that need to be addressed are: where do transcription factors bind in the genome; how is specificity of binding achieved; what features of the chromatin can influence the ability of transcription factors to stably interact with the genome; and how is binding of the factor related to its subsequent function with respect to regulation of a nearby gene?
Fortunately, recent advances in the techniques of chromatin immunoprecipitation followed by microarray (ChIP-chip) or by sequencing (ChIP-seq) (Box 1), and similar techniques such as DamID, now allow investigators to create a global map of specific protein-DNA interactions in a given cell type in a single experiment 10-19. Binding sites identified from these ChIP studies 20-28 are categorized relative to genomic features such as the nearest gene, frequency of binding relative to gene structure (for example a promoter, enhancer, exon, or intron), and the type of chromatin domain. The cost of ChIP-seq depends partly on the depth of sequencing, but an estimate is that 10-12 million uniquely mapped reads should be sufficient for most human transcription factors, which can be obtained in 1 or 2 lanes of sequencing, for a cost of one to two thousand dollars. As multiple DNA microarrays are needed to cover the entire human genome, comprehensive studies by ChIP-chip are more expensive. However, for certain applications (such as detailed analyses of a protein complex binding to a small segment of a genome), a focused ChIP-chip experiment currently remains more cost-effective than a genome-wide ChIP-seq analysis.
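The categorization of binding sites relative to the nearest gene can be illustrated with a short sketch. This is only a minimal example and not the procedure used in any of the cited studies: the gene names, the TSS coordinates, and the 1 kb cutoff for calling a site promoter-proximal are hypothetical values chosen for illustration, and chromosome and strand handling are omitted.

```python
from bisect import bisect_left

# Hypothetical TSS positions on one chromosome, sorted by coordinate.
TSS = [(1_000, "geneA"), (25_000, "geneB"), (90_000, "geneC")]
PROXIMAL_CUTOFF = 1_000  # assumed cutoff (bp) for "promoter-proximal"


def nearest_tss(peak_pos, tss=TSS):
    """Return (gene, signed distance in bp) for the TSS nearest to a peak."""
    positions = [pos for pos, _ in tss]
    i = bisect_left(positions, peak_pos)
    # Only the TSSs immediately flanking the peak can be the nearest one.
    candidates = tss[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(peak_pos - t[0]))
    return gene, peak_pos - pos


def classify(peak_pos):
    """Label a peak as promoter-proximal or distal to its nearest TSS."""
    gene, dist = nearest_tss(peak_pos)
    label = "promoter-proximal" if abs(dist) <= PROXIMAL_CUTOFF else "distal"
    return gene, dist, label


if __name__ == "__main__":
    for peak in (1_200, 60_000):
        print(peak, classify(peak))
```

A fuller annotation would also record whether each site falls in an exon, an intron, or a particular chromatin domain, which requires gene models and chromatin maps that are not represented in this sketch.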
Box 1: Chromatin immunoprecipitation methods
Briefly, chromatin immunoprecipitation (ChIP) (illustrated in the figure) involves crosslinking DNA-binding proteins to DNA by treatment of cells with formaldehyde and preparation of chromatin by sonication or enzymatic digestion. An immunoprecipitation of the crosslinked chromatin is performed using an antibody that recognizes a specific transcription factor or histone isoform, resulting in the collection of all the binding sites in the genome for the factor of interest. After purification of the precipitated fragments, the sample can be analyzed by PCR to study particular genes. Alternatively, genome-wide analysis can be performed by microarray (ChIP-chip) or sequencing (ChIP-seq). For ChIP-chip, the immunoprecipitated sample and input DNA, as a control, are labeled with fluorescent dyes and hybridized to microarrays. Binding sites are identified by the intensity of signal of the ChIP sample in relation to the signal of the input sample at each probe on the microarray using various ChIP-chip peak-calling programs 21, 22. For a single ChIP-chip experiment, most investigators use between 10^6 cells; however, recent methodological improvements using amplification methods have enabled successful ChIP-chip experiments with as few as 10^4 cells. For ChIP-seq, the immunoprecipitated sample is used to create a library that is analyzed using high-throughput next-generation sequencers. Binding sites are identified using various ChIP-seq peak-calling programs 16, 27, 81, 26, 82, all of which identify target sites based on the number of sequenced tags from the ChIP library corresponding to each position in the genome. For a ChIP-seq experiment designed to map binding of a site-specific factor, most investigators use 10^7 cells, although 10^4 cells are sufficient for the ChIP-seq analysis of certain histone modifications 83. It is important to note that because ChIP assays require such large cell numbers, the observed peaks in either ChIP-chip or ChIP-seq represent an average of binding of a factor at a particular site in the cell population. Thus, a small peak could represent very strong binding in only a subset of the cells (for example, cells at one stage of the cell cycle) or modest binding in the entire cell population. ChIP-seq experiments, which allow binding to be analyzed at all unique overlapping oligomers of a certain length in the genome (usually 27-50 nt are sequenced per fragment), can provide very high resolution mapping of transcription factor binding sites. For example, three-fourths of all the ChIP-seq peak positions for the DNA binding proteins CTCF, NRSF and STAT1 are within 18, 27 and 51 bp, respectively, of the nearest motif for that factor 82. In general, genome-scale ChIP-chip experiments are less precise in mapping the exact location of a binding site because the oligomers on the array are not overlapping but are spaced approximately 35-100 nt apart, due to the large number of arrays that would be required if overlapping oligomers were used.
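The idea that ChIP-seq target sites are identified from the number of sequenced tags at each genomic position can be sketched as follows. This is a deliberately simplified toy and not the algorithm of any of the cited peak-calling programs, which use more careful statistical models. The fixed 200 bp window, the minimum tag count, the fold-enrichment threshold, and the use of an input library as the control are all assumptions made for illustration.

```python
from collections import Counter


def window_counts(tag_positions, window=200):
    """Count sequenced tags falling in each fixed-size genomic window."""
    return Counter(pos // window for pos in tag_positions)


def call_enriched_windows(chip_tags, input_tags, window=200,
                          min_tags=5, min_fold=4.0):
    """Return (start, end, tag_count) for windows enriched in the ChIP library.

    Thresholds are hypothetical; real peak callers estimate significance
    rather than applying fixed cutoffs.
    """
    chip = window_counts(chip_tags, window)
    ctrl = window_counts(input_tags, window)
    # Scale the control counts to the size of the ChIP library.
    scale = len(chip_tags) / max(len(input_tags), 1)
    peaks = []
    for w, n in sorted(chip.items()):
        # Floor the expected count at 1 tag to avoid dividing by zero.
        expected = max(ctrl.get(w, 0) * scale, 1.0)
        if n >= min_tags and n / expected >= min_fold:
            peaks.append((w * window, (w + 1) * window, n))
    return peaks


if __name__ == "__main__":
    chip_tags = list(range(1000, 1010)) + [5000, 9000]
    input_tags = [3000, 5000, 7000, 9000]
    print(call_enriched_windows(chip_tags, input_tags))
```

Because the tag counts come from a population of cells, a window that passes such thresholds still reflects average occupancy, so the sketch cannot distinguish strong binding in a few cells from modest binding in all of them.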
This review summarizes recent discoveries provided by genome-wide profiling of site-specific transcription factors and how they have led to new insights regarding patterns of transcription factor binding, how binding specificity is achieved, and what features of the chromatin can influence the ability of transcription factors to interact stably with the genome. The focus will be on the human genome, although relevant insights from other organisms are also incorporated (in particular when studies using model organisms are more advanced than similar studies of the human genome) as it is likely that the implications of transcription factor recruitment for gene regulation will be similar across all eukaryotes. Importantly, genome-wide studies have not only provided new information, they have also created new challenges in our understanding of gene regulation, such as why certain transcription factors bind to so many places in the genome and why so much of the regulation appears to be via steps that occur after recruitment of the site-specific factor to the DNA. Therefore, this review concludes with suggestions for future experiments that are needed to further our understanding of the causes and consequences of specific transcription factor-genomic interactions.