|Home | About | Journals | Submit | Contact Us | Français|
Transcriptional regulation is a critical mediator of many normal cellular processes, as well as disease progression. Transcription factors (TFs) often co-localize at cis-regulatory elements on the DNA, form protein complexes, and collaboratively regulate gene expression. Machine learning and Bayesian approaches have been used to identify TF modules in a one-dimensional context. However, recent studies using high throughput technologies have shown that TF interactions should also be considered in three-dimensional nuclear space. Here, we describe methods for identifying TF modules and discuss how moving from a one-dimensional to a three-dimensional paradigm, along with integrated experimental and computational approaches, can lead to a better understanding of TF association networks.
Transcriptional regulation is a critical step in transmission of information from genotype (i.e. DNA) to phenotype (e.g. expression of proteins and noncoding RNAs or microRNAs). A large number of proteins, namely transcription factors (TFs),2 play an essential role in gene regulatory networks by binding to short DNA sequences called cis-regulatory elements (CREs), which include promoters, enhancers, repressors, and insulators (1). In many cases, multiple TFs function as a regulatory complex (hereafter referred to as a TF module) at a CRE. The concept of cooperative regulation by TFs bound near each other in the genome has been recognized for decades. For example, the enhancer located upstream of the interferon-β gene (IFNB1) has served as a wonderful example of cooperative regulation (2, 3). Another example of TF modules is the regulatory region upstream of the Drosophila gene eve. Expression of the eve gene is modulated by different combinations of multiple TFs that bind to the upstream CRE (4). Although the identification of modules composed of TFs bound to adjacent genomic sites is increasing, due in part to ChIP-seq analysis of a large set of TFs by the ENCODE Consortium,3 it has recently become clear that cooperative regulation can be achieved by means other than the interaction of TFs bound next to each other on the genome. TFs can also be brought into close spatial proximity with other TFs bound to a regulatory element located a great distance away via the three-dimensional conformation of a chromosome. In fact, three-dimensional genomic organization, which brings together two distant loci, has been shown to be involved in both gene regulation and nuclear compartmentalization (5–8). The regulatory effects of a TF bound to a CRE can be either active or repressive, often switching from one to the other depending on other interacting factors. Therefore, a detailed understanding of the association of TFs with other TFs bound at adjacent or distal sites is required to comprehend the complex molecular mechanisms involved in transcriptional regulation of the genome. Also, elucidating cell type-specific TF modules may help to understand the mechanisms driving cell differentiation and disease progression. New experimental techniques facilitated by high throughput sequencing allow investigators to more globally address questions concerning the relationship between three-dimensional chromatin organization and TF modules. However, it is a remarkably complex task to extend the analysis of TF modules from a one-dimensional to a three-dimensional scale, requiring tremendous efforts from both experimental and computational biologists, as well as effective communication and collaboration among these specialists. In this minireview, we focus on the experimental and computational methods involved in the identification of TF modules, concluding with a suggested pathway by which investigators can identify both one- and three-dimensional TF modules.
Although a variety of methods have been developed to investigate TF binding throughout the genome (9), the technique of chromatin immunoprecipitation (ChIP) is the most common. This technique, which was developed during the 1980s and 1990s, has been modified extensively for the analysis of site-specific factors and histones (10–17). The steps in a ChIP experiment include 1) cross-linking TFs to the genome, 2) shearing DNA (usually by sonication) to fragments ranging from 100 to 500 bp in length, 3) enriching for TF-DNA complexes using target TF-specific antibodies, 4) removing proteins by reversing the cross-links, and 5) purifying the enriched DNA fragments for further analyses (Fig. 1A). When ChIP was first developed, a polymerase chain reaction (PCR) assay would be performed to determine whether the TF bound to a specific genomic position. Although this assay is still used to study single loci, the sequencing of the human genome (18–20) and the development of high throughput technologies (21) have enabled genome-wide profiling of TFBSs. ChIP-chip, a high throughput technique that hybridizes the enriched DNA fragments to microarrays (22–25), was first used to survey TFBSs genome-wide. ChIP-seq (26–31), a new technology that combines ChIP and massively parallel sequencing (on platforms such as the Illumina Genome Analyzer and HiSeq machines, the ABI SOLiD system, and the Roche 454 system), was first developed in 2007 and has rapidly proved to have several advantages such as the complete coverage of the unique portions of the genome along with high resolution and sensitivity.
Although the aforementioned techniques provide a detailed map of TFBSs, they are not sufficient to identify TF modules coordinated by higher order chromatin organization. In the past, co-localization methods such as fluorescent in situ hybridization (FISH) have been used to investigate three-dimensional chromatin structures (32). However, the higher resolution of the chromosome conformation capture (3C) technique (33) has greatly improved our ability to examine the effects of chromatin conformation on transcriptional regulation. The 3C assay can detect pairs of genomic loci that are in close proximity in the three-dimensional space of the nucleus. In a 3C experiment, formaldehyde is used to cross-link non-adjacent regions of chromatin that are spatially close. The DNA is then digested with a restriction enzyme, and the fragments within the cross-linked complexes are joined by ligation. This is followed by cross-link reversal and PCR using primers specific for two different genomic regions. A high signal for the hybrid DNA sequence indicates a high ligation rate between the two genomic loci, which is likely produced by their close proximity and high interaction frequency (Fig. 1B). Several high throughput variations on the 3C assay have been developed that allow a larger scale screening of chromatin interactions. For example, chromosome conformation capture-on-chip (4C) (34, 35) detects many genomic regions interacting with one particular locus using a microarray containing a set of specifically designed probes. After the cross-link reversal step of 3C, a second enzyme restriction digestion step is performed to shorten the hybrid fragments, and then the small hybrid fragments are circularized and subjected to inverse PCR. To identify the interacting regions for the locus of interest (which is called the bait), specific primers within the bait region of the circularized hybrid fragment are designed such that they face the portion of the circularized fragment that is derived from the interacting region. After amplification, the products of inverse PCR are hybridized to the custom microarray. The major obstacle for wide application of the 4C assay is that it can detect only regions interacting with one chosen genomic locus per experiment. Another 3C-based large-scale DNA interaction profiling method is chromosome conformation capture carbon copy (5C) (36). Similar to 4C, 5C allows detection of many potential chromatin interactions, but a multiplex ligation-mediated amplification (LMA) step distinguishes it from 4C. The universal primers of LMA are designed to fit near the restriction enzyme cutting sites and have a specific orientation so that the amplification products in the 5C library theoretically contain the ligation junctions of all the hybrid fragments in the 3C library. The 5C library is analyzed using microarrays or next-generation sequencing. Both 4C and 5C involve multiplex primers or probe designs, which substantially increase the cost and decrease the applicability of the assay. Two newly developed next-generation sequencing-based techniques, ChIA-PET (37, 38) and Hi-C (39), are more suitable for unbiased identification of chromatin interactions across the entire genome. ChIA-PET incorporates enrichment of cross-linked complexes that contain a target protein using antibody-based immunoprecipitation before cross-link reversal; thus, it has been used mainly to interrogate interactomes of a specific TF such as estrogen receptor-α (ER-α) (37) or CCCTC-binding factor (CTCF) (38). On the other hand, Hi-C uses biotin labeling of the ligation sites, which are then purified using avidin. In contrast to ChIA-PET, Hi-C can map the location of all possible interacting loci in the genome in an unbiased manner, but it does not provide information as to which TFs are involved in formation of the different interactions. However, by combining other experimental and computational assays with Hi-C (as described below), one can link TFs to sets of interacting loci, allowing the identification of three-dimensional TF modules.
In a ChIP-seq experiment, millions of short DNA sequence tags are aligned with a reference genome, and binding sites of the target protein are identified as genomic regions that are enriched within the set of sequenced tags. However, not all enriched regions correspond to binding sites, and therefore, computational methods have been developed to identify true binding sites. Model-based analysis of ChIP-seq (MACS) (40), PeakSeq (41), QuEST (17), site identification from short sequence reads (SISSRs) (42), Sole-Search (43, 44), and other peak identification programs (45–48) are available for ChIP-seq data analysis, and each of these programs applies a different algorithm (see Ref. 49 for a review of these tools). For example, because only the end of a ChIP fragment is sequenced, the MACS algorithm begins by shifting the sequenced tags toward the binding site for a certain number of base pairs and then locates the binding site by calculating the summit within a peak region. Instead of tag shifting, PeakSeq utilizes a strategy in which tags are extended to better represent the precipitated fragments. QuEST identifies binding sites using a tag enrichment profile of a peak region. SISSRs screens binding sites in a certain window by a threshold of tag counts on both forward and reverse strands that is calculated based on a Poisson distribution. Sole-Search extends the tags to represent the length of the precipitated fragments and allows the identification of both narrow peaks and longer binding regions based on prior knowledge of the factor being studied. Another recent method applies a mixture model to provide higher resolution of TFBS localization and the ability to distinguish closely positioned TFBSs (50). wBELT uses a bin-based enrichment threshold to identify TFBSs and applies tag shifting and statistical methods to define a false discovery rate (FDR); this software has been integrated into a user-friendly web-based application called W-ChIPeaks (51).
Accurate identification of individual TFBSs can be achieved using ChIP-seq and ChIP-chip. However, as discussed above, TFs usually do not function alone. Cooperating factors can influence the specificity and affinity of TF binding and can significantly alter the function of a bound TF, greatly influencing gene regulation. For example, the serum response factor (SRF) activates distinct sets of genes via interaction with different cofactors upon serum stimulation (52, 53). By interacting with different cofactors, the Mcm1 protein can either promote or repress transcription of a group of genes (54). Thus, to completely understand the function of a TF, it is important to characterize the TF at the level of interacting modules. Many computational approaches have been developed to search for TF modules, most of which are based on the fact that TFs bind to a specific sequence of DNA called a motif. A motif can be represented by a position weight matrix (PWM), which is a probability matrix that indicates the chance of each position of the motif being a certain nucleotide; motifs for many TFs are collected in the TRANSFAC data base (56). A high concurrence of motifs for two different TFs within a relatively short region of DNA may indicate a potential TF module. One example of a computational approach to predict TF modules is CisModule (55), which applies Bayesian inference and uses a two-layer hierarchical mixture model, in which the first layer represents the mixture of modules and the second layer represents the mixture of motifs in the modules. Parameters, including the product multinomial parameters for each motif, the width of each motif, the probability of a module start, and the probability of a motif start, are updated with each iteration. Studies have shown that methods that take advantage of both the TRANSFAC TF motif data base (56) and sequence conservation information within multiple organisms (57–60) generally outperform other algorithms in terms of the accuracy of prediction of TF binding. More recently, integrated computational and experimental genomics approaches have been combined to identify one-dimensional TF modules from ChIP-chip and ChIP-seq data (61). In these approaches, the binding sites of a TF are first identified using peak detection software. Then, regions adjacent to a set of high confidence TFBSs are searched for putative motifs of other TFs whose PWMs have been characterized in the TRANSFAC data base (56). One such method called ChIPModules employs a classification and regression tree model to generate TF modules based on the co-occurrence rate of the PWMs. Using this method, E2F1 target genes were classified into distinct groups regulated by five different modules. ChIP-chip analysis demonstrated that one predicted cofactor, activating enhancer binding protein 2α, did in fact form binding modules with E2F1. Another example of this type of method is hypergeometric optimization of motif enrichment (HOMER) (62), which includes a set of programs for de novo motif discovery. A recent study using HOMER found that PU.1 co-localizes with distinct sets of TFs in macrophages versus B cells (62). These findings demonstrated that different TF modules are indeed lineage-specific and responsible for the development of characteristic features of macrophages and B cells. Cistrome is another tool suite that includes a de novo motif discovery algorithm and allows cofactor identification through co-localization analysis of TF motifs (63). A recent study utilized Cistrome and epigenetic information to define CREs and predict TFBSs with high accuracy (64). Although this method was applied primarily to predict TFBSs, it could also increase the accuracy of predicting cell type-specific TF modules by eliminating false positive regions within closed chromatin. Studies using programs such as ChIPModules, HOMER, and Cistrome clearly demonstrate that searching for motifs in regions located near TFBSs can identify putative collaborating TFs. However, to find TF associations that occur through higher order chromatin structure, one must also integrate chromatin interaction data.
Among the chromatin interaction profiling methods, the Hi-C technique has the potential to identify a whole genome interactome in an unbiased manner without relying on known protein interactions. In the original description of the Hi-C assay (39), a probability matrix at 1-Mb scale was used to model the data. However, this method identified only low resolution interacting loci and is thus not suitable for analysis of the fine structure of the chromatin. To increase the resolution and better utilize the great resources of Hi-C data, a more recent method utilized a mixture Poisson regression model (MPRM) to increase the resolution of the identified interacting loci to 20 kb (65). There are three major types of hybrid fragments in the Hi-C library: 1) proximate ligation, formed by the joining of closely positioned DNA fragments; 2) random ligation, formed by the random interaction of two “floating” fragments; and 3) self-ligation, formed when the two ends of a single fragment are joined. In the MPRM, the self-ligation fragments are easily discounted by their distinctive characteristics, but the proximate ligation events and random ligation events are considered as two independent Poisson distributions; thus, the overall ligation events can be represented by a latent class model with two hidden variables. An expectation maximization algorithm is used to estimate the hidden variables, and a FDR is calculated based on the cumulative distribution function of the Poisson distribution. The MPRM identified 96,137 interacting loci with a FDR of 5.76%. Consistent a the previous study (39), the majority of the identified interactions were intrachromosomal and within a distance of 1 million bp and occurred close to CREs marked by histone depletion and flanked by H3K4me1 (65). Because of sequencing depth limitations in the existing data, it was estimated that the identified interacting loci represent only ~25% of the complete set of interacting loci in K562 cells. These analyses suggest that the future Hi-C analysis of human cells should be derived from sequence information of at least 100 million proximate ligation hybrid fragments. Another method developed to explore Hi-C data is based on a probabilistic model, which takes into account several systematic biases that reside in the Hi-C protocol (66). These biases include spurious ligation fragments, the size of the enzyme-digested DNA fragments, the ligation efficiency, the CpG content of the ligation sites, and the mappability of the sequenced tags. This method also showed a high interaction rate of nucleosome-depleted regions and active promoters. Furthermore, this study demonstrated that the genome can be divided into active and inactive domains based on the connection intensity between CREs in those domains. As experimental techniques such as Hi-C become more commonly used, there will be an urgent demand for optimized methodologies to analyze the derived data. Perhaps by combining the advantages of the above-mentioned approaches, new methods with high resolution that appropriately consider system biases can provide a robust reconstruction of the chromatin structure.
As described above, 3C has been used to study looping between defined genomic regions such as the β-globin locus control region (LCR) and DNase I-hypersensitive sites upstream or downstream of the locus (67). In another study, expression of UBE2C was shown to be regulated by FoxA1- and MED1-mediated chromatin interactions between the UBE2C promoter and enhancer (68). In breast cancer cells, the 3C assay was used to demonstrate that ER-α drives chromatin looping at a cluster of genes on genomic region 16p11.2 (69). In these cases, specific TFs were analyzed at specific loci. Because very few genome-wide chromatin interactions studies have been published, there are only a few examples in which TF binding data have been integrated with large-scale genomic structure information. However, a recent study (70) integrated Hi-C information with CTCF ChIP-chip and ChIP-seq data and found that strong interchromosomal interactions are highly correlated with CTCF-binding sites, suggesting that CTCF plays an important role in the organization of the human genome. This study served as a proof of principle that it is possible to identify chromosomal hubs that are associated with a specific TF. The extensive ChIP-seq data that are now available from the ENCODE Consortium provide investigators with the opportunity to identify associations of interacting loci with TF binding, chromatin modifications, and open chromatin. A recent study used 5C to comprehensively interrogate long-range looping interactions between genes and distal elements in the 1% of the human genome representing the ENCODE Pilot regions.4 These 5C maps of ENCODE regions in GM12878, K562, HeLa, and human embryonic stem cells (hESCs) identify thousands of long-range interactions and provide new insights into the cell-type specificity of chromatin looping. We have used data from the ENCODE Consortium to develop a TF interaction network using integrated genome-wide Hi-C data, epigenomic profiles, open chromatin, and TF binding data (65). In this study, the Apriori algorithm (71) was used to search for the association of TFs bound at the two ends of sets of interacting loci, e.g. if TF1 was bound to a set of loci and if binding of TF2 was statistically enriched in the genomic regions interacting with those loci, this suggested a potential association of TF1 and TF2. By incorporating ChIP-seq for 45 different TFs, the analysis showed 1) a high concurrence of CTCF and RAD21 at the ends of interacting loci, which is consistent with their similar binding preference and with previous reports (72); 2) that, consistent with previous studies, E2F4 and RNA polymerase II are highly linked (73); and 3) that c-Jun, GATA1, GATA2, INI1, and BRG1 are closely linked, which suggests interactions between chromatin modifiers and cell type-specific TFs. We note that this type of detailed TF network analysis is possible only if the genome-wide experimental data sets for chromatin interactions and TF binding are available for the same cell line. To date, Hi-C data are available only for K562 and GM06990 cells (39). Although Hi-C data from additional cell lines will become more readily available due to reduced sequencing costs, it will be a longer period before hundreds of TFs are characterized by ChIP-seq in all cell types. The ENCODE Consortium is producing large ChIP-seq data sets for numerous commonly used cell lines. However, most of these cell lines are derived from cancer cells. The analysis of primary cells and/or specific subpopulations of cells from different tissues will be problematic. Although it is possible to obtain enough material from primary cells for Hi-C, current ChIP-seq technologies use large numbers of cells, and thus, it is difficult to obtain enough primary cells for hundreds of ChIP-seq experiments. In this case, searching for putative TFBSs using motif analysis and the TRANSFAC data base (56) can serve as an alternative method to integrate chromatin structure and TF information. This approach has been used to correlate CTCF motifs with Hi-C data (70). However, it is critical to keep in mind that binding of TFs to the genome is highly influenced by certain histone modifications; a consensus motif may not be bound by a TF if the chromatin is in a closed confirmation (i.e. it is marked by H3K27me3 or H3K9me3). For example, ER-α was shown to bind to consensus motifs that are in open chromatin and marked by the histone modification H3K4me1 (64). Thus, if motifs are used to predict TF binding, it is imperative that epigenomic information be included to identify regions that are biologically relevant. Fortunately, relatively small amounts of chromatin are needed for epigenetic analyses, which can be easily performed on primary cells (Roadmap Epigenomics Project). Putative motifs can then be identified within interacting loci that fall within regions of open chromatin and are marked by H3K4me1 or H3K27ac. A TF association network can be predicted using methods such as the Apriori algorithm, a Bayesian approach, neural networks, etc. We note that a derived network using only motif information will be more complex than a network produced using experimentally determined TFBSs. Therefore, it is essential that putative TF associations be validated by downstream experimental approaches.
High throughput techniques such as ChIP-seq, ChIP-chip, ChIP-PET, ChIA-PET, and Hi-C can provide detailed information of genome-wide binding of TFs, histone modifications, and higher order organization of the chromatin. Integration of these different types of data has greatly facilitated our understanding of TF associations and has brought our view of transcriptional regulation from a linear paradigm to a three-dimensional model. Although such studies are in their infancy, it is clear that elucidating the relationship between TFs and chromatin interactions is essential to understand the complexity of the underlying biology of transcriptional regulation. Fig. 2 illustrates a possible workflow for a genome-wide identification of three-dimensional TF modules. In this pathway, epigenetic profiling to identify sites of DNA methylation, modified histones, and open chromatin is performed to segment the genome into different epigenomic states (Step 1). Next, the Hi-C method is used to identify all the interacting loci in the genome (Step 2). Then, the clustering of interacting loci based on epigenetic status is performed to identify distinct sets of interacting chromatin loci (Step 3). TF binding data, either predicted (Step 4A) or experimentally determined (Step 4B), are integrated with the chromatin structure information (Step 5), and computational methods such as the Apriori algorithm, Bayesian approaches, or neural networks are used to develop TF modules and association networks (Step 6). Finally, experimental validation of the predicted TF associations can be performed (Step 7). We also note that although moving from a one-dimensional to a three-dimensional model of gene regulation has been a major advance in the field, it is critical to realize that a complete understanding of transcription requires that we take into account changes caused by drug treatments, environment challenges, and time (74–80). There is increasing evidence suggesting a dynamic model of nuclear organization and gene regulation. For example, studies have shown that the interaction of TFs with chromatin can change in a cyclical manner after treatment with hormone, and there is an intensive chromatin reorganization induced by estrogen treatment of Matrigel-derived endothelial cells (69). These observations support a new transcriptional regulatory paradigm in which sets of associated TFs serve as a driving force for highly dynamic formations and dissociations of chromatin interactions, resulting in a profound impact on transcription. Thus, we look forward to the future when a series of three-dimensional TF networks can be combined to provide a four-dimensional motion picture of gene regulation and chromatin organization.
*This work was supported, in whole or in part, by National Institutes of Health Grants CA45250 and 1U54HG004558 from the United States Public Health Service (to P. J. F.). This work was also supported by the Department of Biomedical Informatics of The Ohio State University (to V. X. J.). This is the fourth article in the Thematic Minireview Series on Results from the ENCODE Project: Integrative Global Analyses of Regulatory Regions in the Human Genome.
3The ENCODE Consortium (2012) Integrative analysis of the human genome. Nature, in press.
4J. Dekker, personal communication.
2The abbreviations used are: