Differential gene expression is the fundamental mechanism underlying animal development and cell differentiation. However, it is a challenge to identify comprehensively and accurately the DNA sequences required to regulate gene expression, called cis-regulatory modules (CRMs). Three major features (singly or in combination) are used to predict CRMs: clusters of transcription-factor binding-site motifs, noncoding DNA under evolutionary constraint, and biochemical marks associated with CRMs, such as histone modifications and protein occupancy. The validation rates for predictions indicate that identifying diagnostic biochemical marks is the most reliable method, and understanding is enhanced by analysis of motifs and conservation patterns within those predicted CRMs.
The development of animals from zygotes to adults and the differentiation of cells into distinct tissues and organs requires the expression of a specific set of genes at each developmental stage and in each cell type1. The features distinguishing humans from apes have long been attributed to differences in gene expression2, and aberrant gene expression lies at the heart of multiple diseases. Thus, identifying the DNA sequences required for regulating gene expression, called cis-regulatory modules (CRMs), can both expand our understanding of biology and have applications in several fields including evolution and medicine. For example, most of the genetic variants significantly associated with susceptibility to disease do not lie in protein-coding regions3, and we surmise that many affect the regulation of gene expression.
Three major approaches have emerged for predicting CRMs. The first is to search genomic DNA for clusters of short motifs that are needed for the specific binding of transcription factors (TFs). Although CRMs should contain multiple such motifs, this approach to identifying CRMs has had limited success. A second approach for identifying CRMs involves comparing homologous, noncoding DNA sequences between related species. These methods can reveal important subsets of conserved CRMs that are under purifying selection, such as developmental enhancers, but they miss lineage-specific ones. More recently, high-throughput, direct assays for DNA sequences that have epigenetic features characteristic of regulatory regions provide a third approach that has potentially high predictive power for identifying CRMs. This method, which involves mapping the locations of TF-binding and histone modifications in a wide range of tissues and developmental stages, yields an unbiased genomic view of potential gene-regulatory regions that is not restricted to conserved regions or those with known regulatory motifs.
We briefly review the major types of CRMs being studied in animals and then review the strengths and weaknesses of the three approaches to CRM prediction, assessing the success rates of each. We suggest ways to use the three approaches in combination to improve predictions, and discuss important questions for future research. Improvements in CRM prediction and classification are already leading to advances in understanding how genetic variants affect susceptibility to disease4–7.
Our emphasis in this review is to assess the efficacy of these methods and suggest ways in which they can be improved. Readers are referred to other recent reviews for more details on the biochemical features of chromatin around CRMs8–13, prediction methods that are based on conservation and motifs14,15, and earlier comparisons of the different approaches16,17.
Regulation of gene expression involves an interaction between TFs and CRMs, and it is important to be clear about how one refers to the DNA sequences that TFs can bind (Box 1). In this review we emphasize the TF binding sites (TFBS) that are occupied in living cells. The emphasis on in vivo occupancy is crucial. Biochemical assays in solution, such as electrophoretic mobility shift assays, in vitro footprints, and capture of TF-bound sequences, can define the sequence required for recognition of DNA by TFs; algorithms that assess DNA sequence similarity to a TFBS motif will therefore be able to detect millions of motif instances in a mammalian genome18,19. While any motif instance could potentially be bound in vivo, only about one in 500 is actually bound in organisms with large genomes18. As a specific example, the mouse genome contains about 8 million instances of a match to the GATA-binding factor 1 (GATA1) binding site motif, but only about 15,000 DNA segments (some with multiple motif instances) are bound by this transcription factor in erythroid cells18,20.
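The GATA1 figures above can be turned into a rough ratio with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `one_in_n` is ours, and the calculation treats each bound segment as a single motif instance, so the true ratio is somewhat smaller because some segments contain several instances.

```python
# Approximate figures quoted in the text for GATA1 in mouse erythroid cells.
motif_instances = 8_000_000   # genome-wide matches to the GATA1 motif
bound_segments = 15_000       # DNA segments occupied in vivo


def one_in_n(instances: int, bound: int) -> float:
    """Return N such that roughly one in N motif instances lies in a bound segment."""
    return instances / bound


ratio = one_in_n(motif_instances, bound_segments)
print(f"about 1 in {ratio:.0f} motif instances")  # about 1 in 533
```

The result is of the same order as the "one in 500" estimate given for large genomes.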
CRMs in animal genomes are usually placed into one of three categories, defined by their role in gene expression. The ability of the three prediction methods to detect a CRM depends on the properties of each particular CRM class. This review covers work in both flies (Drosophila melanogaster) and mammals because a large number of studies have been done in these species and the fundamental mechanisms of regulation are similar in insects and mammals. However, some genomic features and proteins are present only in one clade, and the smaller size of the fly genome coupled with the lower proportion of noncoding DNA may contribute to a greater success of CRM predictions in this species.
A promoter directs RNA polymerase to initiate at the transcription start site (TSS)21. In promoters for RNA polymerase II, general transcription factors bind to a core promoter of about 100 bp around the TSS and facilitate binding of the polymerase complex8 (Box 1). Some core promoters contain well-known motifs such as a TATA box and have a discrete start site for transcription; however, most promoters in mammalian genomes are GC- and CpG-rich regions that lack TATA boxes and tend to support initiation of transcription at a broad range of positions within a roughly 100 bp interval22. The heterogeneity in sequence composition and genomic structure of promoters has complicated the accurate prediction of this CRM class based on single sequences. Furthermore, CpG islands are not present in Drosophila melanogaster. However, promoters do reside in chromatin with distinctive modifications (Box 1), and they can be identified by mapping the start sites for transcription (see below). The functional and mechanistic implications of the differences in promoter classes, along with their distinct chromatin modifications, were recently reviewed13.
Enhancers23–25 and silencers26 are defined operationally by their positive or negative effects, respectively, on a reporter gene after transfer into a transgenic animal or transfected cells in culture. They can act independently of position and orientation in gene transfer assays (Box 4). However, depending on the trans-environment in a cell, a given DNA segment can switch between enhancing and silencing, presumably reflecting the recruitment of co-activators and co-repressors, respectively27,28. Hence the successful prediction of enhancers will probably identify some silencers as well. Currently, few silencers are well-characterized, and they will not be covered further in this review.
Enhancers can be located close to their target promoter29, but many of them are located a long distance away; an enhancer for the mouse Shh gene is 1 Mb away from the Shh promoter30. An enhancer contains multiple TFBSs (Box 1), and this multiplicity is a requirement for enhancement31,32. Genes can have multiple, distinct enhancers that drive expression in specific tissues depending on the particular TFBS motifs and the TFs that bind to them1,33–35.
The variability in the distance of enhancers from a TSS and the diversity in their composition make prediction of enhancers particularly challenging. The set of mammalian TFs, estimated to be at least 1,000 in number, bind to hundreds of TFBS motifs, but these motifs are short and the vast majority of motif instances are not bound by a TF. As will be developed in a later section, these sequence features are not sufficient for consistently accurate predictions of enhancers. However, including signatures of purifying selection and especially direct evidence for distinctive epigenetic features (Box 1) improves the prediction accuracy.
Insulators are CRMs that restrict the effect of long-range regulatory modules, such as enhancers, so that they act on the appropriate promoter target36,37. One way to do this is via an enhancer-blocking activity. When located between an enhancer and a target promoter, such an insulator can block the activity of the enhancer and thereby reduce gene expression38. CCCTC-binding factor (CTCF) is a protein required for the enhancer-blocking activity of mammalian insulators39 (Box 1), whereas Drosophila species have at least four additional proteins sufficient for enhancer blocking activity, some of which can be identified in other insects40. Insulators that serve as barriers can prevent position effects when they surround a stably integrated reporter gene41, presumably by blocking the spread of repressive heterochromatin from the site of integration into the reporter gene. This is a separate activity from enhancer blocking, and it requires different proteins such as upstream stimulatory factor (USF), which in turn recruits histone modifying enzymes42. The enhancer blocking and barrier activities can occur together in some insulators or separately in others.
As for enhancers, an insulator can be located almost anywhere relative to a gene, and thus location offers no predictive power. Known insulators are located in chromatin with a histone modification profile similar to that of enhancers, but the requirement for CTCF distinguishes enhancer-blocking insulators from enhancers (Box 1). A major complication is that CTCF has many functions in addition to insulation43. Thus, finding CTCF-bound DNA segments should identify most instances of this type of insulator44, but many of the CTCF-bound segments will not necessarily be insulators. The challenge is to identify those other functions.
The observation that clusters of TFBS motifs are necessary for TF binding to CRMs motivated initial motif-based approaches for predicting enhancers and promoters. The advantage of these approaches is that predictions can be made using only genomic DNA sequence and models of the TFBS motifs for the TFs involved in the process under study (Box 2). However, clusters of TFBS motifs occur frequently in large genomes, and alone they are not sufficient for TF binding (e.g. epigenetic marks are required). Thus, genome-wide CRM predictions based on TF motifs typically make many false positive predictions, and consequently have low validation rates. When the search space can be reduced, e.g. by interrogating species with smaller genomes, restricting to relevant genes or using general epigenetic marks, TFBS motif approaches can be effective. Unlike more general epigenetic marks, they can also be useful for classifying elements based on the particular TFs involved. However, for many biological processes, the TFs involved are not fully known, and so these approaches cannot be applied.
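The motif-cluster idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 4-bp matrix `TOY_PWM` is invented (loosely shaped like a GATA core), the uniform background, the score threshold and the window size are arbitrary, and the sequences must contain only uppercase A, C, G and T. Published tools use curated matrices and statistical models of clustering.

```python
import math

# Invented position-specific base probabilities for a toy 4-bp motif.
TOY_PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds(window, pwm):
    """Log2 odds of the window under the motif versus the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, window))


def motif_hits(seq, pwm, threshold):
    """Start positions of windows scoring at or above the threshold."""
    w = len(pwm)
    return [i for i in range(len(seq) - w + 1)
            if log_odds(seq[i:i + w], pwm) >= threshold]


def clustered_windows(hits, window, min_hits):
    """Spans (first_start, last_start) where min_hits hit starts fall within `window` bp."""
    clusters = []
    for i in range(len(hits) - min_hits + 1):
        if hits[i + min_hits - 1] - hits[i] < window:
            clusters.append((hits[i], hits[i + min_hits - 1]))
    return clusters


demo_seq = "TTGATACCGATA" + "T" * 20 + "GATAAA"
demo_hits = motif_hits(demo_seq, TOY_PWM, threshold=6.0)
demo_clusters = clustered_windows(demo_hits, window=20, min_hits=2)
print(demo_hits, demo_clusters)
```

In the toy sequence, two of the three matches lie close together and form a cluster, whereas the isolated third match does not. On a genome scale, the same procedure would return many such clusters by chance, which is the false-positive problem discussed above.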
In early applications, detailed information about TFs involved in muscle determination, such as myogenic factors (MYFs) and myocyte enhancer factor 2 (MEF2), and their TFBS motifs enabled the prediction of elements that are active in muscle, based on clustering of the TFBS motifs45,46 (Fig. 1a, Table 1). These and related methods can find up to two-thirds of known muscle enhancers but the validation rate can be low45. In Drosophila melanogaster, knowledge of the TFs and their cognate TFBS motifs that regulate expression of genes controlling early development enabled several approaches to finding clusters of TFBS motifs relevant to different developmental processes47–51. All had good sensitivity, in that each found at least one novel enhancer active in transgenic flies, but in most cases the predictions had a low positive predictive value (14 to 33%). Modelling matches to TFBS motif matrices as a thermodynamic affinity instead of making binary calls on TFBS motif instances has a substantially higher success rate, probably because many weak matches were able to contribute to the predictions51. In this case, only known segmentation genes were investigated; in general, the larger the search domain for predicting CRMs (e.g. whole genome), the lower the positive predictive value.
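The contrast between binary motif calls and an affinity-style score can be sketched as follows. This is not the published thermodynamic model; it is a simplified stand-in in which the likelihood ratio of every window is summed, so that weak matches still contribute. The matrix `TOY_PWM`, the uniform background and the cutoff of 100 are all invented for illustration, and sequences must contain only uppercase A, C, G and T.

```python
# Invented 4-bp motif (illustrative only), as position-specific base probabilities.
TOY_PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def site_ratio(window, pwm):
    """Likelihood ratio (motif versus background) for a single window."""
    ratio = 1.0
    for col, base in zip(pwm, window):
        ratio *= col[base] / BACKGROUND
    return ratio


def binary_hit_count(seq, pwm, min_ratio):
    """Number of windows called as motif instances under a hard cutoff."""
    w = len(pwm)
    return sum(site_ratio(seq[i:i + w], pwm) >= min_ratio
               for i in range(len(seq) - w + 1))


def summed_affinity(seq, pwm):
    """Sum of likelihood ratios over all windows; weak matches add to the total."""
    w = len(pwm)
    return sum(site_ratio(seq[i:i + w], pwm) for i in range(len(seq) - w + 1))


weak_only = "GATTGATTGATT"   # three one-mismatch sites, no perfect match
print(binary_hit_count(weak_only, TOY_PWM, min_ratio=100.0))  # 0
print(summed_affinity(weak_only, TOY_PWM) > 0)                # True
```

With a hard cutoff the toy sequence has no motif instances at all, yet its summed score is well above zero. This is one plausible reading of why affinity-based models can recover CRMs that binary calls miss.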
When relevant TFs and motifs are unknown, motif discovery and CRM discovery can be performed simultaneously. For example, the CisModule software52 (Table 1) models TFBS motifs and CRMs simultaneously. When applied to the muscle expression dataset described above, this approach recovered some of the known TFBS motifs and showed good specificity in discriminating the true muscle CRMs from random sequences63. Training models to discriminate different classes of CRMs (rather than just CRMs from background) can improve the inference of TFBS motifs and CRMs. Smith et al.53 combine known motifs with motifs discovered to be discriminative between datasets in promoter proximal regions to construct a logistic regression model that can significantly predict tissue specific expression in 45 of the 56 human and mouse tissues considered. The ability to discover novel TFBS motifs, especially in the process of CRM identification and classification, will remain important as long as TFBS motifs have not been comprehensively defined.
As collections of TFs and their cognate TFBS motifs are more completely defined, a promising future direction is to build quantitative models that predict expression levels under diverse conditions for both naturally occurring and synthetic CRMs. Impressive success has been achieved for synthetic promoters in yeast using thermodynamic models of binding affinity of TFs to DNA and to each other54. As we strive for an understanding of the regulatory code, experiments such as these will reveal how complete (or lacking) is our knowledge.
Comparative genomic approaches for CRM prediction assume that the DNA sequences involved in gene regulation have remained significantly more similar than non-functional DNA across a wide phylogenetic span, such as multiple species of Drosophila or many eutherian mammals. Sequence changes in these regions are thus more likely to show signatures of purifying selection (Fig. 1b). While this assumption holds for most transcription factor coding sequences, it is not uniformly true for CRMs55,56, as illustrated in Box 3. Thus, comparative genomics approaches can be effective only for identifying the subset of CRMs that were under strong purifying selection since the separation of the species under comparison, and they will not reveal lineage-specific, recently evolved CRMs.
Evidence of strong evolutionary constraint in noncoding DNA, without other information such as TFBS motifs, has been used successfully as a de novo predictor of CRMs (Table 1). This approach has been applied both at the level of a single TFBS and of an entire CRM.
In alignments of orthologous sequences from a diverse set of mammals, the noncoding regions contain blocks with little or no change among species, surrounded by blocks with sequence differences (Fig. 1b). These conserved blocks are interpreted as functional DNA sequences in which substitutions were rejected during the evolution of the species being compared57–60. Noting the similarity between rejection of substitutions in DNA (revealed by the multi-species alignments) and protection of DNA from nucleases by protein binding (biochemical footprinting assays), Tagle et al.61 called these “phylogenetic footprints” and predicted that they would be reliable indicators of TF binding – even for TFs that have not yet been discovered. This prediction was validated in multiple studies of individual genes and gene families62–64 (Fig. 1b; Table 1). Subsequently, this approach was part of elegant work to identify regulatory motifs in promoters and 3' untranslated regions of mammalian genomes65 and entire genomes from multiple Drosophila species66,67. Because this approach is not dependent on a library of known TFBS motifs, novel motifs are discovered, and these can predict expression patterns of the regulated genes65 (Table 1).
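A toy version of the search for phylogenetic footprints is sketched below. It assumes a gap-aware multiple alignment supplied as equal-length strings and simply reports runs of columns in which every species carries the same base. The alignment, the function name `conserved_blocks` and the minimum block length are invented for illustration; real analyses estimate substitution rates along a phylogeny rather than requiring perfect identity.

```python
def conserved_blocks(alignment, min_len):
    """Return half-open column ranges (start, end) where all rows share one base.

    `alignment` is a list of equal-length strings; '-' marks a gap and
    breaks a block.
    """
    n_cols = len(alignment[0])
    blocks, start = [], None
    for col in range(n_cols + 1):          # one extra step closes a final block
        identical = (
            col < n_cols
            and alignment[0][col] != "-"
            and all(row[col] == alignment[0][col] for row in alignment)
        )
        if identical and start is None:
            start = col
        elif not identical and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


# Invented three-species alignment with one 7-column invariant block.
toy_alignment = [
    "ACGTTGATAAGCCT",
    "ATGCTGATAAGTCA",
    "AGGATGATAAGACG",
]
print(conserved_blocks(toy_alignment, min_len=5))  # [(4, 11)]
```

Isolated identical columns are ignored because they fall below the minimum length, which mirrors the idea that a footprint is a block of invariant positions flanked by positions that differ among species.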
Evidence of evolutionary constraint over longer segments of noncoding DNA (hundreds of base pairs) can reveal entire CRMs. Early examples are the use of human–mouse alignments to discover enhancers of immunoglobulin68 and interleukin genes69. CRMs predicted by noncoding constraint have been validated as enhancers at a very high rate using reporter gene assays after transfection of cells70,71 or production of transgenic Ciona intestinalis72, fish (Fugu rubripes)73 or mice71 (Box 4). Hundreds of human noncoding DNA segments showing signatures of extreme evolutionary constraint have been tested for the ability to drive tissue-specific expression in transgenic mouse embryos, and over half were validated34,74. In most studies (Table 1), predictions were made in the vicinity of regulated genes69–72, or a genome was scanned for evidence of extreme evolutionary constraint (e.g. conserved from humans to fish). A much lower validation frequency is observed when these criteria are relaxed75 (Table 1). Thus many constrained noncoding sequences may not be overtly involved in gene regulation, but constraint combined with other features can be effective for CRM prediction.
Combining inference of constraint from multispecies alignments with clusters of TFBS motifs can improve CRM prediction. Many known CRMs and in vivo bound TFBS motifs were found to be conserved between humans and rodents15 or among Drosophila species66,67,76, and the specificity of CRM prediction was improved when TFBS motif instances were restricted to those that are conserved in other species77–79. Blanchette et al. searched mammalian genomes and alignments for clusters of evolutionarily constrained TFBS motif instances80. These predicted regulatory modules (PReMods) encompass a large fraction of known CRMs (Table 1). A subsequent genome-wide mapping of likely enhancers found that over 40% of the DNA segments occupied by the co-activator p300 (which marks many enhancers) overlap with PReMods81. Some CRMs are bound by multiple molecules of a TF, each at an individual TFBS, and multiple instances of conserved motifs could represent homotypic clusters of TFBSs82. When DNA segments with more than one conserved instance of a given motif are tested, they validate at a high rate in transgenic fish and mice (Table 1).
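Overlap statistics of the kind quoted for p300-occupied segments and PReMods reduce to an interval comparison. The sketch below shows the calculation for hypothetical coordinates; the intervals are invented, half-open `(chrom, start, end)` tuples, and the quadratic loop is only suitable for small examples (genome-scale work would use sorted intervals or an interval tree).

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(query, reference):
    """Fraction of query intervals that overlap any reference interval."""
    if not query:
        return 0.0
    hit = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hit / len(query)


# Invented coordinates standing in for factor-bound segments and predicted modules.
bound_segments = [
    ("chr1", 100, 400), ("chr1", 1000, 1300),
    ("chr2", 50, 350), ("chr2", 900, 1200), ("chr3", 10, 310),
]
predicted_modules = [("chr1", 350, 600), ("chr2", 0, 60)]

print(fraction_overlapping(bound_segments, predicted_modules))  # 0.4
```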
Other efforts focus on specific cell types, e.g. using TFBS motifs for known hematopoietic TFs in addition to multispecies alignments83. A limited set of these predictions was tested, and all were validated (Table 1). Recently, Narlikar et al.84 predicted heart enhancers by applying a model of known and novel TFBS motifs learned from a large set of known heart enhancers to conserved noncoding sequences. This model predicts 42,000 heart enhancers in humans. Of these, 26 were tested in transgenic fish, and an impressive 62% of these were validated.
Although a CRM may be constrained among species, individual TFBS motif instances can tolerate sequence level change56. Modelling the evolution of CRMs can capture the signatures of this change without assuming sequence level conservation. The MorphMS model85 identifies regions in an existing pairwise alignment that fit an evolutionary model derived from a set of existing TFBS motifs, and was found to have the best performance for recovering known Drosophila melanogaster CRMs in a comparison of several computational approaches17. A promising extension of this approach incorporates gain and loss of binding sites86 but, due to additional computational complexity, this approach has not yet been employed for genome-wide CRM detection.
Because not all TFBS motifs are known, it is desirable to develop “motif-blind approaches” to prediction that are not limited by current knowledge of TFBS motifs. Approaches that search for patterns in a training set of known CRMs that distinguish them from non-functional DNA have been used for this purpose. One method finds patterns in multi-species alignment columns with significantly more frequent occurrences in training sets of alignments of known CRMs compared to alignments of presumably non-functional DNA87. The resulting “regulatory potential” score has been computed across the human and mouse genomes aligned with multiple mammals. Like the approaches based on modelling CRM evolution, this can capture signatures of change rather than just constraint, but it uses heuristics rather than an explicit evolutionary model. In the vicinity of erythroid-regulated genes, over half of the DNA segments with high regulatory potential that also have a preserved match to an erythroid TFBS motif are validated as enhancers in transfected cells88 (Table 1; an example is Zfpm1R13 in Box 3). Almost all the PReMods80 are found in the set of DNA segments with high regulatory potential89.
A different approach uses multiple methods to search for words (short DNA sequences) that are over-represented in a training set of known CRMs, and then further restricts the word matches to evolutionarily constrained regions90. The predicted CRMs that were tested were all validated both in transgenic Drosophila and mice (Table 1).
The studies reviewed here illustrate the power of comparative genomics approaches for predicting CRMs, but also highlight substantial differences in validation rate between approaches. The highest validation rates are found when focusing on genes likely to be regulated by a designated set of TFs, e.g. when searching for conserved instances of TFBS motifs for hematopoietic regulators around genes that are expressed in particular blood lineages. Furthermore, studies testing fewer CRMs tend to have higher validation rates. Perhaps it is not surprising that more comprehensive tests that include genes subject to a wider variety of regulatory mechanisms, such as the project examining constrained noncoding sequences on human chromosome 2175, reveal limited activity of the tested predictions. But the bulk of these studies show partial success of these approaches under favourable circumstances, i.e. involving known TFs and TFBS motifs, and a set of genes responding to a particular stimulus or differentiation pathway.
Some caveats should be kept in mind when evaluating the conservation-based methods. Only a small subset of CRMs is likely to be discovered by extreme evolutionary constraint, e.g. conservation from human to chicken or fish. While this is a strong predictor of developmental enhancers, it does not work equally well in all tissues74. Also, perhaps less than 5% of mammalian CRMs show conservation outside eutherian mammals91.
A major limitation of most comparative approaches is that they are not designed to find CRMs that are active in only one species or that are changing in a lineage-specific manner, such as enhancer GHP88 (Box 3). One would expect CRMs that are adaptive for a species to show evidence of rapid evolutionary change, but these will be missed by comparative approaches driven by a search for purifying selection. Indeed, some studies now indicate that most CRMs are species-specific92.
In future studies, comparative approaches can be developed that cover both closely-related and more divergent species, with the goal of finding lineage-specific and preserved functional sequences, respectively93. Additional types of regulatory regions should be tested. A study of silencer and insulator activities for 47 DNA segments from a 1 Mb region containing CFTR and flanking genes revealed that signatures of constraint did not improve predictions of these types of regulatory regions94. Larger scale studies and the development of models incorporating more types of features could be productive. Also, developing quantitative models that predict expression levels and patterns of target genes, followed by large-scale experimental testing, will be essential to evaluating progress toward more complete understanding of the genomics of gene regulation.
Given the limitations of methods based on sequence motifs and comparative genomics, direct measurement of diagnostic epigenetic features should lead to improved methods for CRM prediction. Epigenetic features are molecules and chemical modifications associated with genomic DNA, including covalent modifications of DNA and histones, RNA transcribed from the DNA, occupancy of DNA by transcription factors, and accessibility of DNA in chromatin to DNases95. Particular epigenetic features are highly correlated with CRMs, and progress is being made in finding combinations of these features that may distinguish different types of CRM.
Chromatin immunoprecipitation is a reliable method for purifying DNA in close contact with a particular protein in animal cells, as long as the interactions are relatively stable and a highly specific antibody is available96,97. With the introduction of sequence census methods98, in which the immunoprecipitated DNA is analyzed on massively parallel short-read sequencers, the DNA in close contact with the protein of interest can be determined with remarkable sensitivity and useful resolution (200–300 bp) across an animal genome. This methodology, called ChIP-seq, is being applied in many cell types to find DNA bound by a wide range of transcription factors or associated with chromatin having particular histone modifications (Fig. 1c, Box 3). DNase hypersensitivity, a general biochemical feature of CRMs, can also be mapped by sequence census methods called DNase-seq99,100. Consortia of multiple laboratories, such as ENCODE101, modENCODE102,103, and the NIH Roadmap Epigenomics Mapping Consortium104 are working in a coordinated manner to expand the coverage of cell types, transcription factors and modifications, and other epigenetic features. This section will summarize advances in using direct epigenetic information to predict two classes of CRM, promoters and enhancers.
The TSS is almost invariably located within the promoter (Box 1), and promoters can be successfully predicted from the locations of TSSs105,106. One study tested 152 predicted promoters by reporter gene assays in a range of mammalian cell types and found that 91% were active in at least one cell type (Table 1, Box 4). This remarkably high validation rate, which was confirmed in later studies106, shows that knowledge of a TSS leads to reliable promoter prediction, with no overt bias for sequence composition or motifs. A different epigenetic feature, the histone modification H3K4me3, is also effective for predicting active promoters81 (Table 1).
Enhancers have now been predicted with high accuracy based on several epigenetic features including histone acetylation107, monomethylation of histone H3 lysine 4 (H3K4me1)108, and binding of the co-activator p300 to a DNA segment109,110 (Table 1). The reporter gene assays were conducted in either transfected cultured cells or in transgenic mouse embryos (Box 4). Even when the tests were conducted on large groups of predicted enhancers, tissue-specific expression was driven by 75 to 87% of the DNA fragments, showing that these epigenetic marks are robust, accurate predictors of enhancers. Furthermore, a multivariate hidden Markov model (HMM) that combined information on several histone modifications provided excellent predictive power for tissue-specific enhancers in human111 (Table 1).
The success rate of predicting enhancers from occupancy by tissue-specific transcription factors is encouraging but not as high as that achieved using the epigenetic marks just described (Table 1). For instance, of a set of 63 mouse DNA sequences bound in vivo by GATA1, half are active as enhancers in transfected cells in culture112. As expected from the association of CRMs with evolutionary constraint discussed previously, the set of validated enhancers includes some DNA segments with deep phylogenetic conservation of DNA and conservation of binding between species, but it also includes DNA segments bound in mouse but not human (Box 3). A similar fraction of Myoblast determination protein (MYOD)-occupied DNA segments was validated as active enhancers after testing in transfected cells19 (40%, Table 1; applying a threshold for validation similar to that used in the GATA1 study shows that about half had enhancer activity). Examination of occupancy by multiple transcription factors increases the predictive power of the data. Wilson et al.113 identified DNA segments jointly occupied by five hematopoietic transcription factors in megakaryocytes. Rather than testing these directly for enhancer function, they looked for genes previously not known to be important for hematopoiesis that are in the vicinity of the jointly occupied DNA segments. The function of these genes was then tested using a knock-down strategy, and all but one of the knock-downs caused a reduction in blood cell production113 (Table 1). Thus the jointly occupied DNA segments were excellent predictors of likely enhancers that led immediately to novel insights into hematopoietic regulation.
Direct experimental determination of biochemical features associated with promoters and enhancers has many advantages over computational methods. It is grounded in decades of work on biochemical mechanisms of transcriptional regulation, and the proteins and histone modifications being assayed are strongly associated with regulation. The experimental approach is now almost exclusively based on high throughput sequencing and mapping to reference genomes, and while these methods do have some biases, they allow almost complete coverage of animal genomes. These recent advances are exciting, but more research is needed to assess the sensitivity and specificity of both the comparative and the epigenetic methods. For example, a method based on p300 occupancy predicts hundreds of heart enhancers, whereas almost none of the enhancers predicted by very stringent constraint on noncoding sequences are active in heart110 (Table 1). However, another approach applying a motif-based classifier to less-stringently constrained sequences predicts 42,000 heart enhancers84. Both the latter and the p300-based predictions are validated at impressive rates (Table 1), suggesting that at least some of the current ChIP-seq datasets are missing some CRMs.
One disadvantage of the direct experimental approach is that epigenetic marks must be mapped in tissues and times of development that are informative to the question at hand. Ideally, all transcription factors and all histone modifications would be mapped in all cell types and developmental stages in the species of interest. Achieving this will be difficult for many reasons beyond budgetary ones, such as the limited number of regulatory proteins for which ChIP-quality antibodies are available and the difficulty in obtaining sufficient amounts of many cell types. While the ideal of completeness may never be achieved, a substantial amount of predictive power is likely to be attained as the regulatory landscape is mapped in a large number of cell types. DNase hypersensitive sites are being mapped in a broad range of cell types and tissues99,100. These are general marks for sequences potentially involved in regulation; virtually all known CRMs reside in such hypersensitive sites. Some regulatory regions, especially promoters but also some enhancers81,105,114, are active in multiple cell types. Many others are bound by TFs in specific cell types and can only be identified when assays are done in those cells. The genome-wide maps of epigenetic marks provide a valuable resource for CRM prediction, and one that will increase in value as a broader range of cell types and developmental stages are interrogated.
A caveat in using the genome-wide maps of factor occupancy is that some of these protein–DNA interactions may not be playing an active role in regulation. The very deep coverage achieved in recent ChIP-seq studies reveals significant binding at thousands of sites, but for well-known, lineage-specific transcription factors, the number of bound sites substantially exceeds the number of genes with significant changes in expression in that lineage. For example, over half of all genes are bound by the determination factor MYOD in muscle cells19. Integration of information on multiple epigenetic features111 may allow the TF occupied segments to be partitioned into classes with more specific predicted functions (including no obvious function), thereby giving more accurate predictions.
Future work will also likely interrogate more diverse functions. As noted before, CTCF is almost always found at mammalian insulators with enhancer-blocking activities115, but it is currently unclear what fraction of the large numbers of CTCF-occupied segments have this activity. We expect that such surveys will be conducted in the near future.
The bulk of the results summarized in this section were derived from ChIP-seq approaches that yield assignments of TF occupancy at a resolution of 200–300 bp. New technologies should refine that resolution substantially. Already, deep sequencing of DNase-sensitive regions is revealing small segments that correspond to TF binding sites (10–20 bp)100,116, and a new method employing exonuclease trimming gives very high resolution117. Identifying the gene(s) responsive to the TFs at enhancers has been problematic, and many studies used the closest active gene as a proxy for the target gene. However, the high-throughput versions118–120 of chromosome conformation capture technologies yield three-dimensional interaction maps that are providing exciting new insights into how distal CRMs interact with target promoters.
Comprehensive identification of CRMs is not currently possible from sequence comparisons alone, whether they are used to find clusters of TFBS motifs or to find evidence of strong constraint in DNA. Clusters of TFBSs do not provide sufficient specificity to be used in large-scale investigations, while restricting a search to strong constraint will miss a large number, even a majority, of TF-occupied segments. In contrast, high-quality, high-throughput biochemical data on epigenetic features will capture a large fraction of CRMs, and the fraction captured will increase as the amount of data increases, particularly as more diverse cell types and conditions are assayed. This information is becoming more readily available to individual investigators, either through their own efforts or by using the publicly released data from large consortia. We recommend that this be the starting point in searches for potential regulatory regions, but that both evolutionary information and motif patterns should then be used to bring in insights about potential functions and to organize experimental tests (Box 5).
We expect that future work will show that patterns in the TFBS motifs and their conservation (or lack thereof) can lead to strong and precise functional predictions. At the present time, investigators can use de novo motif discovery121 to predict binding partners of transcription factors and guide further ChIP experiments. It may be productive to partition the TF-occupied segments into motif classes, and assess whether these tend to associate with induction, repression or other activities of likely target genes. Significant associations will probably lead to mechanistic insights.
While most of the evolutionary analysis in the review focuses on purifying selection that has taken place across a wide evolutionary timescale, recent changes in DNA sequences can also affect gene regulation. Some in vivo TFBSs are allele-specific122, e.g. the TF binds to the maternal but not paternal allele in heterozygotes. Genetic variation affecting the affinity of regulatory proteins for CRMs likely explains some of the differences in gene expression between individuals123,124. Allele-specific binding by transcription factors or chromatin opening has been found at loci associated with susceptibility to cancer4,5 and to diabetes125. These studies show the impact of recent evolution in CRMs, and point to the medical importance of understanding these recent changes.
Future work should focus on integrating the many types of epigenetic information, building on recent efforts103,111,126,127. These could be extended to include multi-species comparisons, not only for the underlying DNA sequences (e.g. to infer evolutionary constraint) but also information about occupancy in additional species. Similarly, information about in vitro binding affinities128 and motif patterns needs to be brought into the analysis. Such integrations will be challenging, and using them to formulate testable hypotheses will be even more challenging. However, these seem to be reachable goals, and it will be exciting to work toward them. Indeed, the resulting hypotheses will constitute an initial formulation of a possible regulatory code. The hypotheses will need to be tested experimentally, likely starting with conventional gain-of-function reporter gene assays. Larger scale efforts are needed, which require development of higher throughput assays. Also, synthetic biology approaches54 will provide powerful tests of the hypotheses, and hopefully lead to a better understanding of the regulatory code.
DNA segments bound by TFs in the nucleus of cells are TF binding sites (TFBSs, panel a). These are commonly mapped by chromatin immunoprecipitation (ChIP)96,129. Most cis-regulatory modules (CRMs) comprise a cluster of TFBSs. A TFBS motif is a short sequence (often 6 to 10 bp; colored circle in panel a) found within a TFBS that is required for TF binding, as demonstrated by loss of binding upon mutation of the sequence. The motif can be characterized as a consensus or as a position-specific weight matrix. Any match to a TFBS motif in a DNA sequence is a motif instance.
The size of a TFBS is determined by the resolution of the experimental technique employed. Using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq)130,131, binding of transcription factors in vivo can be mapped to a DNA segment about 200–300 bp in length (panel a; note that the TFBS mapped for the bound red motif includes DNA also bound by the orange protein). DNase footprints100,116 and the recently developed ChIP-exo117 provide higher resolution, approaching that of the bound motif instance.
Different classes of CRM (promoter, enhancer/silencer, insulator) share some chromatin modifications (circles with different shades of green on the blue histone tails extending from nucleosomes, panel b), such as acetylation (Ac) of histones H3 and H4 for all three classes132 (for simplicity, the Ac is only shown on H4 in the figure) and monomethylation of histone H3K4 (H3K4me1) for both enhancers and insulators (and distal to the TSS around promoters)108. Other modifications are distinctive for a CRM class. Active promoters have a nucleosome-depleted region just upstream from the TSS, flanked by nucleosomes with high levels of trimethylation at lysine 4 of histone H3 (H3K4me3)81,132. Promoters can also be identified by ChIP-seq for RNA polymerase II. Nucleosomes at enhancers have high levels of H3K4me1 (ref. 108) and are positioned adjacent to the TFBSs133. The co-activator P300 is frequently found at enhancers108. Insulators that work by blocking enhancement require binding by CCCTC-binding factor (CTCF) in mammals39.
Bioinformatics approaches for cis-regulatory module identification typically employ supervised machine learning, in which models are built from trusted training data and then used for prediction. Training data are generally derived from experimental data, such as binding footprints or regions identified by ChIP-seq that are enriched for a specific transcription factor, or they are the result of functional assays for enhancer or other CRM functions.
TFBS motifs are most often described using a position weight matrix (PWM), a model for a fixed length sequence that specifies the probability of each nucleotide at each position. Given a background model, a PWM can be converted to a position specific scoring matrix (PSSM), which can directly compute the log-odds of a given string being generated by the PWM model versus the background model. The log-odds score evaluates a single site, but does not assess the likelihood of finding such a site in a longer sequence. Several approaches can be used to evaluate the statistical significance of these log-odds scores, either through simulating the score distribution134,135 or from a sequence database136.
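The conversion from a PWM to a PSSM described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited tool: the four-column GATA-like matrix, the uniform background, the pseudocount and all function names are assumptions chosen for the example.

```python
import math

BASES = "ACGT"

# Hypothetical 4-column PWM loosely resembling a GATA core (illustrative values only).
GATA_LIKE_PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]

# Assumed background model: equal nucleotide frequencies.
UNIFORM_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_to_pssm(pwm, background, pseudocount=0.01):
    """Convert per-position probabilities to log2-odds scores against a background."""
    pssm = []
    for column in pwm:
        # The pseudocount avoids log(0) for nucleotides never seen at a position.
        total = sum(column[b] + pseudocount for b in BASES)
        pssm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def score_site(pssm, site):
    """Log-odds score of one candidate site of the same length as the motif."""
    return sum(column[base] for column, base in zip(pssm, site))


def scan(pssm, sequence):
    """Score every window of motif length; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, score_site(pssm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
```

As the text notes, each score evaluates a single window; judging whether the best score in a long sequence is significant requires a separate step, such as comparison with a simulated score distribution.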
Many approaches have been developed for identifying motif clusters. Simply scanning genomic sequence for windows containing multiple motif matches has been used for predicting CRMs58, but choosing appropriate significance thresholds can be difficult. One of the first machine learning approaches for CRMs used the positions and scores of strong matches to PWMs as predictors in a logistic regression model. An advantage of such an approach is the ability to capture not just clustering of motifs, but constraints on the organization of motifs in a cluster (such as order). Regardless of the approach used to find clusters, several methods have been developed that use statistical models to assess the significance of motif clusters in a sequence46, even in the presence of constraints on organization136.
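The simplest strategy mentioned above, scanning for windows that contain multiple motif matches, might look like the following sketch. The window size, the minimum number of matches and the function name are assumptions for illustration; the sketch does not assess statistical significance, which is the difficult part noted in the text.

```python
def find_motif_clusters(hits, window=200, min_hits=3):
    """Report regions where at least `min_hits` motif matches fall within `window` bp.

    `hits` is a list of (position, motif_name) pairs on one chromosome.
    Overlapping qualifying windows are merged, so a reported region can be
    longer than `window`. Returns (start, end, number_of_matches) tuples.
    """
    positions = sorted(pos for pos, _name in hits)
    merged = []  # list of [start, end] regions
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window from the left until it spans less than `window` bp.
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 >= min_hits:
            start = positions[left]
            if merged and start <= merged[-1][1]:
                merged[-1][1] = pos  # extend the previous region
            else:
                merged.append([start, pos])
    return [
        (start, end, sum(start <= p <= end for p in positions))
        for start, end in merged
    ]
```

This sketch ignores motif identity, order and match strength; approaches such as the logistic regression model described above use those features as predictors.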
Three examples illustrate that while some enhancers are subject to strong evolutionary constraint over long phylogenetic distances, others show less constraint and still others appear to be lost in a lineage-specific manner. The deeply preserved enhancer R13 (ref. 88) in the gene Zfpm1 (panel a) is bound in vivo by GATA transcription factors and TAL1 in both human and mouse erythroid cells, as shown by ChIP-seq data20,137,138, has a strong signature of evolutionary constraint by the phyloP score139, and contains phylogenetic footprints for a GATA1 binding site motif (boxed in red). The genome browser views are shown at increasing resolution, as appropriate for each feature; known enhancers are shown as blue-filled boxes.
The enhancer GHP10 (ref. 112; panel b) is occupied by erythroid TFs in mouse and human, but other predicted CRMs for the Chsy1 gene differ between mouse and human. GHP10 has a sparser phyloP signal for constraint compared to Zfpm1 R13, but some GATA1 binding site motifs are preserved.
The enhancer GHP88 (ref. 112; panel c) is found in an intron of the mouse Abhd2 gene, but no GATA1 occupancy is observed at this position in human. In contrast, human-specific binding is seen upstream from ABHD2 (arrow or asterisk). While a GATA1 binding site motif is conserved in the rodent, horse and cow homologs of GHP88, no homologous sequence is found in human or rhesus, indicating a primate-specific deletion that leads to a negative signal for phyloP. Gray lines in the alignments indicate that no orthologous sequence is found in the comparison species.
The most common methods for demonstrating that a DNA segment can function in the regulation of gene expression are gain-of-function assays after transferring a reporter gene, encoding a readily assayed enzyme, into cultured cells (transfections, panel a) or whole animals (transgenic assays, panel b). For promoter assays (panel a), the predicted CRM is placed in front of a reporter gene (Luciferase, Luc) lacking a promoter and transferred into cultured cells. For enhancer assays (panel a, lower; panel b), the predicted CRM is added to a reporter gene already driven by a low-activity promoter (pr). The enzyme assays after cell transfection give a quantitative estimate of enhancer activity (panel a, right; box plots show the distribution of enhancement measurements for multiple determinations88,112). Information about tissue- and developmental-stage specificity is limited by the cell types investigated by transfection. Staining transgenic mouse or fly embryos carrying the lacZ gene encoding beta-galactosidase shows blue staining in the tissues in which an enhancer is active, providing information on tissue specificity. Loss-of-function tests of predicted CRMs (preCRMs), e.g. by targeted deletion, are desirable, but they are more difficult and not used as frequently.
Other methods for investigating predicted CRMs examine the expression patterns of presumptive target genes. The most common assumption is that the gene with a transcription start site closest to the predicted CRM is the likely target. If a likely target gene has an expression pattern expected for the features used to predict CRMs, such as expression in muscle for CRMs predicted by the occurrence of binding site motifs for muscle determination factors, then this supports the validity of the predicted CRM. Of course, this is not as powerful as a direct experimental demonstration.
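The closest-gene assumption can be written down as a short sketch. The gene names and coordinates in the usage note are hypothetical, the function name is an assumption, and the sketch handles one chromosome with a non-empty, position-sorted list of transcription start sites (TSSs). It encodes only the proxy described above, not a demonstrated regulatory relationship.

```python
import bisect


def nearest_gene(crm_midpoint, tss_list):
    """Return (gene_name, distance) for the TSS closest to a predicted CRM.

    `tss_list` must be a non-empty list of (tss_position, gene_name) pairs
    sorted by position, all on the same chromosome as the CRM. If two TSSs
    are equally close, the one with the lower coordinate is returned.
    """
    positions = [pos for pos, _gene in tss_list]
    i = bisect.bisect_left(positions, crm_midpoint)
    # Only the TSS just before and the TSS at or after the midpoint can be closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda entry: abs(entry[0] - crm_midpoint))
    return gene, abs(pos - crm_midpoint)
```

For example, with hypothetical TSSs `[(1000, "geneA"), (50000, "geneB"), (120000, "geneC")]`, a predicted CRM centered at position 30000 is assigned to `geneB`. In practice one would also restrict the candidates to genes active in the relevant cell type.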
A novel approach that uses expression of presumptive target genes is to search for genes not previously known to be required in a tissue of interest. Instead of testing the function of the preCRMs, the effect of specific knock-down of the presumptive target can be monitored, e.g. employing morpholino oligonucleotides that interfere with gene function. Defective development or aberrant function of the tissue would serve to validate the activity of the predicted CRMs.
Prediction of binding by a transcription factor to a DNA sequence can be tested by measurement of occupancy in vivo, e.g. using chromatin immunoprecipitation. This method is appropriate for determining that a protein is bound to the DNA sequence, but it provides no information about a role in regulation. The older literature contains many studies of binding by purified proteins or proteins in nuclear extracts to specific DNA sequences. Studies with appropriate controls to distinguish specific binding have some utility, but these results are largely superseded by current ChIP-seq data on in vivo occupancy.
Investigators using genomic data to find transcriptional regulatory regions in animal DNA will find all three approaches to be useful, but each should be employed for a different aspect of their investigations. If data on epigenetic features can be obtained, that should be the starting point for predicting CRMs. High quality datasets on such features provide a relatively unbiased view of the regulatory landscape. We expect that most of the important regulatory regions will be present, assuming the relevant transcription factors are examined in an appropriate cell type for the question of interest. Even if that is not the case, the profile of DNase HSs in a battery of cells across loci of interest could be a good initial guide101.
The approaches based on multi-species alignments can then be applied to infer the evolutionary histories of the predicted CRMs and the motifs within them. Indeed, a large number of CRMs may be predicted based on the epigenetic features, and partitioning them based on the extent of phylogenetic conservation can be informative. Conservation can prioritize candidates for functional testing; conservation of TFBS motifs across multiple species of Drosophila was found to be strongly associated with regulatory function66, and GATA1-occupied DNA segments with TFBS motifs that are deeply preserved across mammals were active as enhancers substantially more frequently than those with lineage-specific motifs112. However, the hypothesis that evolutionary constraint helps to distinguish TF-occupied segments that are active from bound but passive sites needs much more extensive testing. Most DNA segments bound by a liver transcription factor in one mammal are not bound by that factor at the homologous DNA in a different mammal92,140, and some lineage-specific occupied segments are active in regulation (Box 3). Thus we recommend using conservation as a means to partition predicted CRMs and to infer their history, but not as a filter to remove them from further consideration.
Partitioning predicted CRMs by depth of conservation may provide insight into the functions of their target genes. An initial exploration of that question found significant enrichments that differed for CRMs conserved to distinct phylogenetic distances91, and this could be a productive area for more complete investigation. Also, the depth of conservation could reflect variation in the severity of constraint on different aspects of regulatory mechanisms. For example, an interesting hypothesis to test is that CRMs conserved across all vertebrates play a more central mechanistic role in regulation, while lineage-specific CRMs could be modulating that core activity.
Just as analysis of conservation leads to insights about CRMs predicted by epigenetic features, so will an analysis of TFBS motifs. It is still important to find TFBS motifs for several reasons, including generalizing insights from a well-studied set of CRMs to whole-genome analysis, making predictions about function, and understanding the structure of a particular CRM. Epigenetic marks have limited resolution, and motif-based bioinformatics approaches can help to resolve the organization of binding sites within the modules. Indeed, the conservation analysis just discussed is most informative when applied to the TFBS motifs rather than the entire TF-occupied segment67,91,112,132. Furthermore, recent work shows that combining one or more datasets on epigenetic features with TFBS motif models improves the ability to find TF-occupied sites141.
Support is from NIH grants R01 DK065806, RC2 HG005573, and U54 HG004695, and funds from Emory University to JT.
Ross C. Hardison received his Ph.D. in biochemistry from the University of Iowa and was a postdoctoral fellow at the California Institute of Technology in the laboratory of Dr. Tom Maniatis. He is the T. Ming Chu Professor of Biochemistry and Molecular Biology at the Pennsylvania State University. His current research uses mapping of epigenetic features and comparative genomics to identify cis-regulatory modules, their cognate transcription factors, and chromatin states involved in global mechanisms of gene regulation, with a special emphasis on hematopoiesis.
James Taylor received his Ph.D. in Computer Science and Engineering from the Pennsylvania State University and was a postdoctoral fellow at the Courant Institute of Mathematical Sciences at New York University. He is an Assistant Professor of Biology and Mathematics & Computer Science at Emory University in Atlanta, Georgia, USA. His current research in bioinformatics and computational biology focuses on understanding how complex function is encoded in the genome and on making “high-end” computational biology more accessible.