Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.
A human body has numerous different types of cells, which are further organized into tissues with distinct structures and functions. The proper development and functioning of all these tissues require precise spatial and temporal expression of the thousands of genes encoded in the human genome. A cell achieves this primarily by regulating the rate of transcription of its genes, a mechanism commonly referred to as transcriptional regulation.
Transcriptional regulation is mostly mediated by sequence-specific binding of proteins called transcription factors (TFs) to regions on the DNA called TF binding sites. Rarely does a single TF–DNA binding event control the transcription of the target gene in eukaryotes. Instead, different combinations of ubiquitous and cell type-specific TFs act together, binding to regulatory elements, which harbour the respective TF binding sites. As a result of this combinatorial control, a human cell is able to regulate the transcription of its large number of protein-coding genes (between 20 000 and 25 000) by a relatively small number of TFs (8% of all proteins). Furthermore, a regulatory element can operate tens of thousands of base pairs (bp) away from the target gene, adding another layer of complexity to transcriptional regulation.
While most genes have been successfully annotated in the human genome, our knowledge of regulatory elements controlling these genes in different cell types, at various time-points and under different environmental stimuli is still limited. Recent studies have shown that mutations in many of the known regulatory elements are associated with diseases , indicating the important role regulatory elements could play in disease diagnostics and drug discovery.
Here we review the problem of identifying regulatory elements and their functions in various cellular processes. We briefly look at the different bio-molecules that participate in transcriptional regulation and examine the distinct roles of regulatory elements. We then discuss the close relationship between mutations within these elements and diseases emphasizing the importance of identifying regulatory elements. Finally, we survey the popular methods used to identify regulatory elements, with a special focus on computational approaches.
The transcript of a protein-coding gene, which originates from the transcription start site (TSS), is produced by the enzyme RNA polymerase II. However, the enzyme by itself does not directly recognize the TSS, and requires the presence of other factors called general transcription factors (GTFs). These GTFs assemble on the DNA at a region known as the core promoter, which includes the TSS as well as other binding sites recognized by different subunits of the GTFs. See Thomas and Chiang  for a review. After the GTFs form a complex with the core promoter, the polymerase binds to it, forming a transcription initiation complex (TIC). The main players regulating the formation and activity of the TIC can be classified into two groups based on their mode of activity: trans-acting factors that are not part of the DNA and cis-acting elements that are regions along the DNA.
The assembly of the TIC on DNA has been shown to produce RNA transcripts from DNA templates in vitro . However, in vivo, the recruitment of TIC requires additional factors, which can be classified into two groups:
The final rate of transcription of a gene depends on the combined effect of gene activating and gene repressing mechanisms. As in the case of gene activation, cells employ many different mechanisms to repress genes . The proteins involved in repression can be similarly classified into two groups:
In this review, we primarily focus on the DNA elements that contribute to transcriptional regulation. There are several different kinds of such regulatory elements that are utilized by activators or repressors, or are responsible in changing the chromatin landscape to either activate or repress transcription.
Promoters can be classified as core promoters (regions within 100 bp around the TSS ) and proximal promoters (regions further away from the TSS, but generally limited to a few hundred base pairs ). As mentioned previously, core promoters contain binding sites for ubiquitous GTFs, which are instrumental in recruiting the polymerase to the TSS. The proximal promoters contain binding sites for activator proteins that interact with the GTFs and can drive tissue-specific expression [13–15].
An enhancer is a regulatory region found at a greater distance from the TSS (compared to promoters), and can be either upstream or downstream of the gene or within an intron . Most enhancers act as modules independent of orientation and distance from the TSS of the target gene ; however, some cases have been reported where this is not true [17, 18]. An enhancer usually harbours binding sites of multiple activators spatially constrained to allow for a stable DNA–protein complex. Two mechanisms have been proposed for enhancer activity. The popular looping theory states that once activators bind the enhancer, the DNA between the enhancer and the core promoter loops out, bringing the activators close to the promoter . Specific protein–protein interactions between activators binding the enhancer and the promoter ensure that the correct target gene is activated. DNA scanning is the alternative proposed mechanism, where after binding enhancers, activators move continuously along the DNA until they encounter their target promoter .
Silencers are regulatory regions that have a repressing effect on the target gene. Many silencers act in a distance- and orientation-independent manner, although some have been reported to act only within promoters and UTRs. Silencers can be present within enhancers or can act as independent modules with binding sites for repressors.
Insulators are regulatory regions that stop the activating or repressing transcriptional activity in a locus from spreading to an adjacent locus. They can do so in one of two ways, either by inhibiting the interaction between enhancer/silencer and promoter, or by preventing the spread of heterochromatin through the formation of a barrier. An insulator usually contains multiple binding sites for TFs and the strength of the insulator is directly proportional to the number of binding sites.
Eukaryotic genomes contain two additional types of regulatory regions: locus control regions (LCRs) and matrix attachment regions (MARs). LCRs are regulatory elements that enhance the expression of a cluster of genes in a specific cell type. LCRs can contain several different enhancers, silencers or insulators, each of which can be bound by different TFs . MARs are elements on the DNA that make contact with the nuclear matrix. These regions are AT-rich and are believed to facilitate dynamic changes in chromatin structure to allow accessibility to TFs at their binding sites .
As described in the previous subsections, transcriptional regulation is a collaborative effort between different TFs, chromatin remodeling complexes and other non-DNA-binding co-factors. These proteins can be either ubiquitous or cell type specific, but together activate or repress genes by targeting specific regulatory elements. As a result, regulatory elements harbouring multiple TF binding sites are often referred to as cis-regulatory modules (CRMs), with each CRM contributing to a specific spatial and temporal expression pattern of the gene. CRMs typically range from 50 bp to a few hundred bp in length and are rarely longer than 1 kb.
Activation or repression is seldom a binary switch of ON and OFF. Rather, the rate of transcription is modulated between the two extremes by the relative concentrations of activators and repressors in each cell type. For example, the Yellow gene in Drosophila has two upstream tissue-specific enhancers. One of the enhancers drives expression of Yellow at low levels in large parts of the wings, giving them a light grey colour. The other enhancer is stronger, driving expression at high levels in the abdomen, giving it a darker hue .
Genetic disorders are commonly associated with mutations in protein coding genes, including non-synonymous nucleotide substitutions, deletions, insertions and introduction of premature stop codons. Genetic abnormalities associated with gene mutations have been reported for Parkinson's disease , breast cancer , cystic fibrosis  and hundreds of other diseases .
Mutations in regulatory elements are generally assumed to be less likely to have a pronounced phenotypic impact, as these affect the expression pattern of a gene, not the structure or function of a protein. Furthermore, it is common for a gene to have multiple regulatory elements, each making a small contribution [31, 32]. This redundancy in the function of regulatory elements in a locus [31, 33, 34] has been argued to explain why deletions of certain ultraconserved non-coding elements with enhancer activity lead to no observable phenotype. However, contrary to these findings, the number of recorded cases of non-coding mutations linked to human diseases has been growing rapidly. An HTRA1 promoter mutation has been linked to macular degeneration, a PKLR promoter mutation to pyruvate kinase deficiency and an erythropoietin promoter mutation to diabetic eye and kidney complications. Multiple other promoter mutations have also been associated with different diseases. An intronic mutation in the RET gene has been linked to Hirschsprung disease risk with a 20-fold greater contribution to risk than rare alleles. A DAX-1 intronic mutation has been shown to cause X-linked adrenal hypoplasia. There are many cases of distant intergenic mutations as well. One of the classical examples is the SHH mutation that causes pre-axial polydactyly and resides 1 Mb away from the misregulated gene in a well-conserved region. Another polymorphism, in the IRF6 enhancer located 10 kb upstream of the TSS, is associated with cleft lip.
Genome-wide association studies (GWAS) provide a high-throughput approach to rapidly identify disease-associated polymorphisms by scanning markers across the genomes of many people. Results of many GWAS are available in the database of genotypes and phenotypes (dbGaP). A GWAS of myocardial infarction (MI) warrants a special mention in the context of disease-associated mutations in non-coding elements. MI is a common presentation of ischaemic heart disease, which accounts for >12% of deaths worldwide. A recent study of early-onset MI performed by the Myocardial Infarction Genetics Consortium narrowed the strong genetic associations with the disease down to nine single nucleotide polymorphisms (SNPs). All of them reside in non-coding regions of the human genome.
Identifying mutations within coding regions that cause diseases and/or disrupt normal biological processes is generally an easier task than identifying those within non-coding regulatory regions. This is primarily due to the inherent difficulty in identifying non-coding regulatory regions. Whereas coding regions have characteristic exonic features making them relatively easy to spot, non-coding regulatory regions have few distinguishing sequence signatures. Moreover, establishing the identity of the target gene can be a further challenge since these elements do not necessarily control the gene closest to them. As a result, although several regulatory elements have so far been identified, the list is far from comprehensive. Indeed, the fraction of genes regulated by each type of regulatory element (enhancers, silencers, insulators, LCRs and MARs) has not yet been established . In the next two sections, we discuss current approaches geared towards identifying and characterizing regulatory elements.
Since the discovery of the first long-range regulatory element acting upon a mammalian gene  almost three decades ago, technologies for detecting regulatory elements have advanced tremendously. In this section, we examine experimental strategies, both small-scale and high-throughput, for identifying regulatory elements.
One of the most effective ways of examining the regulatory activity of a DNA region is with a reporter gene assay. In such assays, plasmids containing the region of interest and a reporter gene whose expression level can be measured accurately (e.g. green fluorescent protein) are introduced into cells of the organism of interest. The structure of the plasmid depends largely on the kind of role the element is expected to play in regulation. If the element is being tested for promoter activity, it is placed immediately upstream of the reporter gene. If the element is suspected of being an enhancer, a weak promoter that needs an enhancer to drive expression is placed immediately upstream of the reporter gene and the element to be tested is placed either upstream or downstream of the promoter-gene construct. If the element is a silencer, the weak promoter is replaced by a strong promoter that is sufficient to drive ubiquitous expression. If the element to be tested is an insulator, it is placed between a well-characterized enhancer–promoter pair, upstream of the reporter gene. In this case, it is important to check that the placement of the element beyond the enhancer (upstream of both enhancer and promoter) does not repress transcription. For a review on designing reporter gene assays, see Carey and Smale .
Transfection assay is a commonly used reporter gene assay where the plasmid is introduced into cultured cells using a transfection procedure. This can be done in a transient or stable manner. In the former, the plasmid usually remains episomal and does not get integrated into the host genome. As a result, these regions may not be in an appropriate chromatin configuration and could lead to aberrant observations. This limitation can be overcome in stable transfection assays, where special measures are taken to ensure that the plasmid gets integrated into the host genome. A major advantage of transfection assays is that they can be performed in a high-throughput manner [47, 48].
Transfection assays are performed in immortalized cell lines that may not resemble environments naturally occurring in the organism. Transgenic assays overcome this limitation by employing animal models. In these assays, the plasmid is integrated into a fertilized egg at several random locations within the host genome. The in vivo expression pattern of the reporter gene in the embryo indicates the tissue-specific activity of the inserted element. These assays have been successful in various animal models like fly, fish, frogs, chickens and mice [49–54]. A large-scale study involving human regions tested in mouse embryos identified 75 enhancers active at a particular time-point . Since the publication of the original study, the data set of these tissue-specific enhancers has grown to 497 enhancers. 
Assay-based methods are usually time-consuming and expensive. In addition, they are limited by the number of elements that can be tested at a time. In this section, we review some high-throughput techniques which can locate regulatory elements on a genome-wide scale.
A chromatin immunoprecipitation (ChIP) experiment is used to determine the genomic sequences bound by a particular protein in vivo. The protein of interest is cross-linked to the chromatin in the cells, which are then lysed and the DNA is sheared into pieces of desired size. Using an antibody specific to the protein of interest, protein–DNA complexes are precipitated from the mixture. The identity of DNA regions that are part of the complex can be determined either by using microarrays (ChIP-chip) or by high-throughput sequencing (ChIP-seq). A major advantage of this technology is that the whole genome is tested for in vivo binding of the protein of interest. Also, this method can detect different kinds of regulatory elements depending on the function of the profiled protein. For instance, ChIP experiments profiling the insulator protein CTCF have identified locations of putative CTCF-binding insulators in multiple organisms and cell types [56–58]. Similar experiments with different proteins have been used to identify promoters, enhancers and silencers [59–62].
One drawback of this technology is that a specific antibody needs to be created for every TF of interest. Furthermore, finding all regulatory elements active in a particular cell type in principle requires the identity of all TFs likely to act in that cell type. Visel et al.  approached this problem in a different manner: they profiled co-activator p300 associated with enhancers  instead of a specific DNA-binding TF. ChIP-seq profiling of this protein in three different tissues of the developing mouse embryo identified distinct putative enhancers. Visel et al. further demonstrated that a large fraction of these regions were indeed tissue-specific enhancers.
Chromatin structure is another indirect indicator of regulatory elements. DNase hypersensitive sites (HSs), i.e. nucleosome-depleted regions that are easily digested by the DNase I enzyme, have long been associated with regulatory elements that are bound by TFs. The novel DNase-chip (or DNase-seq) technology has provided a genome-wide view of DNase HSs in T cells, which are indeed enriched for binding sites of TFs active in the same cell type. Similarly, regulatory elements are known to be enriched for certain histone modifications [181, 182]. Recent genome-wide profiles of various histone modifications using ChIP experiments have revealed the location of several putative regulatory regions in different cell types [183, 184].
Small-scale experiments, while specific and generally reproducible, are labour-intensive and impractical when many elements need to be tested. Current high-throughput experiments test several regions simultaneously, but are usually noise-prone and still limited to a few cell types and environmental conditions. With 98% of the 3 Gb human genome being non-coding and therefore likely to harbour regulatory signals, computational approaches towards detecting them are proving invaluable. In this section, we discuss two related parts of the problem. As mentioned previously, TFs bind DNA in a sequence-specific manner, and hence, detecting binding specificities of individual TFs constitutes the first part of the problem. These binding specificities can then be used to determine potential binding sites of TFs in the genome, which leads to the second part of the problem: identifying functional clusters of binding sites of TFs constituting regulatory elements.
TF binding specificities are often represented as position weight matrices (PWMs), with each position in the binding site modelled as a multinomial distribution over the four nucleotides. Small-scale experiments like electrophoretic mobility shift assays and DNA footprinting can test the binding affinity of a TF with a few DNA templates at a time; doing so for a large number of DNA templates is highly impractical. Indeed, only a small fraction of human TFs have been well characterized using such methods and are listed in the TRANSFAC and JASPAR databases. Recently, Berger and Bulyk developed a novel large-scale technology, protein binding microarrays, in which a large number of DNA substrates can be tested simultaneously for binding by a purified protein. The database UniPROBE contains over 200 eukaryotic TFs characterized by this methodology.
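The PWM representation described above can be illustrated with a minimal sketch. It assumes a set of pre-aligned, equal-length binding sites and a uniform background; the function names, the pseudocount and the example sites are hypothetical and are not taken from any of the cited databases or tools.

```python
import math

NUCLEOTIDES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    Each column is a multinomial distribution over A, C, G, T.
    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {n: pseudocount for n in NUCLEOTIDES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({n: counts[n] / total for n in NUCLEOTIDES})
    return pwm


def log_odds_score(pwm, seq, background=0.25):
    """Sum of log2(p_motif / p_background) over the positions of seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match PWM length")
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))


# Hypothetical aligned binding sites, invented for illustration only.
example_sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
example_pwm = build_pwm(example_sites)
```

A sequence resembling the training sites, such as `log_odds_score(example_pwm, "TATAAA")`, receives a positive score, whereas an unrelated sequence such as `"GCGCGC"` scores below zero under this uniform-background assumption.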
Large-scale in vivo experiments like ChIP-chip or ChIP-seq can locate all genomic regions bound by the profiled TF. A common overrepresented signature or ‘motif’ can then be identified from these regions using de novo motif discovery programs, yielding a PWM for the TF. Similar programs are also applied to detect motifs in promoters of co-expressed genes, the assumption being that such a set of genes is likely to be regulated (and therefore bound) by a common TF. A plethora of de novo motif discovery programs have been developed so far, from early ones that identified signals close to TSSs  in prokaryotes to the more recent ones focused on eukaryotes .
Motif discovery methods usually fall in one of two main categories: (i) enumerative, which examine the frequency of all DNA strings and compute overrepresented strings to form a PWM [76–79] and (ii) probabilistic, which tackle the problem by creating a multiple local alignment of all sequences while simultaneously learning the PWM parameters using methods like expectation–maximization [80–83], Gibbs sampling [84–89] or greedy approaches . Each category has certain advantages over the other. Enumerative approaches exhaustively search the whole space and therefore (unlike probabilistic methods) do not run the risk of getting stuck in a local optimum. In contrast, probabilistic methods can handle arbitrary variations in the motif model and are not affected by the length of the motif. A combination of the two approaches has also been proposed [91, 92].
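The enumerative strategy can be sketched in a few lines. The example below is a simplified illustration, not the algorithm of any cited program: it counts every k-mer in a foreground set and ranks k-mers by their smoothed frequency ratio against a background set. The function names, the pseudocount smoothing and the toy sequences are assumptions of this sketch.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented_kmers(foreground, background, k, pseudocount=1.0, top=5):
    """Rank foreground k-mers by smoothed frequency ratio versus background."""
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_possible = 4 ** k  # number of distinct DNA k-mers
    ratios = {}
    for kmer, count in fg.items():
        fg_freq = (count + pseudocount) / (fg_total + pseudocount * n_possible)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_possible)
        ratios[kmer] = fg_freq / bg_freq
    # Highest ratio first; ties broken alphabetically for determinism.
    return sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy data: the hexamer TGACTC is planted in every foreground sequence.
toy_foreground = ["AAATGACTCAAA", "CCCTGACTCCCC", "GGGTGACTCGGG"]
toy_background = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
top_kmers = overrepresented_kmers(toy_foreground, toy_background, k=6)
```

Because every k-mer is counted, the search is exhaustive, which mirrors the advantage noted above; the cost is that the number of possible strings grows as 4^k, so exact enumeration becomes impractical for long or highly degenerate motifs.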
An assessment of 13 publicly available methods  showed that no method consistently surpassed others in all data sets, indicating that the problem of motif discovery is far from solved. In addition, most tools performed better on yeast data sets than similarly created data sets from more complex organisms like human and mouse.
Recently developed methods have approached the problem of improving the detection of motifs in two ways . The first is by improving the model for representing binding sites. Since a PWM cannot model the dependence of nucleotide preferences between positions, more flexible models like pair-correlation models , trees , mixtures of PWMs and trees , non-parametric models  and feature-based models  have been developed and shown to be more effective for some TFs. The other direction has been towards using additional biological information like sequence conservation [99–107], TF concentration  computed based on gene-expression data, locational preferences of binding sites within co-bound sequences [109, 110], chromatin state of the genome , TF structural information [112–117] and DNA structural information .
The aforementioned methodologies identify PWMs recognized by TFs and the short 5–15 bp regions most likely to be bound by TFs within the set of input sequences. These methods usually treat each site independently and are employed when the search is carried out in a set of co-bound sequences not longer than a few hundred bp. However, searching for regulatory elements, even if the TF PWM is known, is trickier: a simple scan of the genome for sequences similar to learned PWMs often yields spurious matches, which occur frequently in the genome by chance and are not necessarily utilized by the TFs in vivo. One way to solve this problem is by finding clusters of TF binding sites. As mentioned previously, transcriptional regulation is a collaborative effort between different TFs binding next to each other, forming CRMs. Simply put, solitary binding sites are less likely to act as regulatory elements than binding sites occurring in clusters, which is the primary basis of the rapidly growing field of CRM detection.
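The idea that clustered sites are more informative than solitary ones can be sketched as a simple sliding-window filter. The example assumes that predicted binding-site positions have already been obtained (for instance from a PWM scan); the function name, the 200 bp window and the minimum of three sites are illustrative choices, not parameters from any published CRM detector.

```python
def find_site_clusters(positions, window=200, min_sites=3):
    """Return merged (start, end) spans holding >= min_sites hits within `window` bp.

    `positions` are genomic coordinates of predicted binding sites.
    """
    positions = sorted(positions)
    clusters = []
    left = 0
    for right in range(len(positions)):
        # Advance the left edge until all hits fit inside one window.
        while positions[right] - positions[left] > window:
            left += 1
        if right - left + 1 >= min_sites:
            start, end = positions[left], positions[right]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous span: extend it instead of adding a new one.
                clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
            else:
                clusters.append((start, end))
    return clusters


# Hypothetical hit coordinates: two dense groups and two isolated hits.
example_hits = [100, 150, 220, 5000, 9000, 9050, 9100, 9400]
example_clusters = find_site_clusters(example_hits)
```

With these toy coordinates, the isolated hits at 5000 and 9400 are discarded and two candidate spans remain. Real CRM predictors additionally weigh motif scores, motif identity and conservation, which this sketch deliberately omits.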
CRM detection was initially developed to identify core promoters. Two early programs used different approaches to solve the problem: PromoterScan  used known GTF motifs, the TATA box and motifs of other TFs known to be enriched near the core, while PromFind  used the variations in hexamer frequencies across promoter, coding and non-coding regions. Since then, several programs have employed more complex computational techniques [121–127] to solve this problem (see Bajic et al.  and Bajic et al.  for an assessment of various promoter prediction methods on human genomic data and Table 1 for details of some of these methods). Not surprisingly, the accuracy of these programs increases with an increase in high-quality hand-curated training data. Incorporation of large-scale data from recent cap analyses of gene expression (CAGE)  experiments, which identify the 5′ ends of cDNAs, has enabled computational approaches to detect core promoters and TSSs with high resolution and remarkable accuracy [131, 132].
Many CRM-detection algorithms have also been developed to detect distal regulatory elements, from early ones which modelled the co-occurrence of two TF binding sites [133, 134] to more complex ones which use sequence conservation, gene-expression data, inter-dependence between various TF binding sites, etc. Predictions based solely on sequence conservation have been shown to achieve remarkable success [31, 33, 135], although they are likely to miss many species-specific elements or functional elements that do not produce ‘high scoring’ alignments with currently available tools . Conversely, conservation across species does not necessarily imply regulatory functionality of the region [137, 138]. However, when interpreted appropriately, sequence conservation holds tremendous potential in reducing the vast search space of the non-coding genome and is used extensively by most CRM detection algorithms.
TF motifs have been used to produce a genome-wide map of TF binding sites , and predicting CRMs based on their higher densities has been shown to be beneficial [140–143]. If the identity of TFs active in the cell type of interest and their motifs is known, the predictive power of the methods increases for that cell type [144–150]. In a complementary approach, the loci of genes with a similar function can be searched for common TF binding sites [151–154]. In such approaches, TFs specific to that function can also be learned. This has also been attempted without prior knowledge of motifs, by learning overrepresented words [155, 156] in loci of co-regulated genes.
Methods have been targeted to find a special class of CRMs, those containing binding site clusters of the same TF, also known as homotypic clusters. Homotypic clusters have been widely studied in Drosophila [157–159], but are yet relatively unexplored in mammalian genomes. A large fraction of methods use a set of elements believed to be functional in a particular process or cell type and train a model based on the frequencies and relative distributions of motifs within them. One of the earliest CRM discoveries in mammalian co-regulated sequences was performed by Wasserman and colleagues in muscle cells  and later in liver cells . The approach there was to compile PWMs of known muscle (liver) TFs and use them to learn a logistic regression model to classify between muscle (liver) and non-muscle (non-liver) regulatory regions. Since then many methods have been developed that train a model based on TF motifs occurring in a set of CRMs to make novel predictions; in many cases, a set of motifs is needed to be provided by the user [149, 150, 162–164], in others overrepresented motifs or words are learned de novo from the data [165–169].
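The classification idea behind such approaches can be sketched as follows. This is not a reconstruction of the published muscle or liver models: it is a minimal logistic regression trained by batch gradient descent on hypothetical per-sequence motif counts, with invented data, feature definitions and hyperparameters.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    """Probability that a sequence with motif counts x is a regulatory region."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical training set: each row holds hit counts for two motifs.
# Label 1 = known tissue-specific regulatory region, 0 = control sequence.
toy_features = [[3, 2], [2, 3], [4, 1], [0, 0], [1, 0], [0, 1]]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_weights, toy_bias = train_logistic(toy_features, toy_labels)
```

After training, `predict(toy_weights, toy_bias, [3, 3])` gives a probability above 0.5 and `predict(toy_weights, toy_bias, [0, 0])` one below 0.5. In practice the features would be derived from PWM scans of real sequences, and a held-out set would be needed to estimate accuracy.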
Table 2 shows a description of several CRM-detection methods grouped according to the type of data they require. All these methods make use of a subset of the following biological data: libraries of binding specificities of known TFs, PWMs of TFs known to act in a cooperative manner, cross-species sequence conservation, known CRMs and gene-expression data. More recently, methods have been devised to exploit other kinds of biological information. Quantitative high-resolution imaging has made available concentrations of regulatory proteins targeting segmentation genes in the nuclei of a Drosophila embryo at different time-points during development . These concentrations of TFs and their PWMs have been used to model the likelihood of a DNA sequence driving expression of a segmentation gene and hence being a regulatory element [171, 172]. Some methods have shown significant improvements in the accuracy of detecting CRMs using chromatin information, either in the form of histone modification data [173, 174] or DNase HS data .
During the last decade that encompassed the sequencing of the human and many other vertebrate genomes, our understanding of mechanisms of gene regulation has grown remarkably. The convergence of computer algorithms and bio-technology has played a major role in deciphering the architecture of the regulatory landscape of complex genomes.
While this review provides additional details, we summarize main aspects of the developments that have had the most notable impact on the advancement of the field. ChIP-chip (and later ChIP-seq) experiments have been instrumental in describing the genome map of active TF binding sites, histone modifications and chromatin structure. With the rapid sampling of additional TFs and broader sets of cell lines, we are moving towards a comprehensive landscape of the regulatory genome. Assay-based testing in model organisms like Drosophila, mouse and zebrafish, has produced large data sets of tissue-specific developmental regulatory sequences. On the computational front, modelling the composition of TF binding sites and inter-TF interactions in CRMs has greatly improved the precision of CRM predictors. Most importantly, it is the clever use of high-throughput data arising from various experiments (which directly or indirectly indicate functionality of DNA regions) that has enabled machine learning approaches to make accurate novel predictions. We must emphasize the leading role of the ENCODE project  in facilitating and supporting many of these studies; first by targeting only 1% of the human genome, and then by expanding to the entire human genome and genomes of model organisms (modENCODE) .
With the increasing amount of publicly available GWAS data for multiple diseases, we have begun to observe the large role that gene regulation plays in human disorders. As our understanding of the functions of non-coding mutations in diseases grows, our ability to effectively screen patients for fitness and survival will increase. That understanding is closely tied to advances in computational and high-throughput technologies that accurately identify regulatory elements and predict their function. Having such tools will reduce the search space from 2.9 Gb of non-coding DNA in the 3 Gb human genome to a manageable subset of functionally relevant regulatory elements, thereby ensuring stronger associations. Additionally, knowing the structure of regulatory elements and the TFs that utilize these elements will benefit drug therapeutics through characterization of novel candidate drug targets.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
We are grateful to Leila Taher and Valer Gotea for critical comments and assistance with manuscript preparation.
Leelavati Narlikar is a postdoctoral fellow at the Computational Biology Branch of the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH). Her research interests include modeling the architecture of tissue-specific enhancers and developing computational techniques to identify novel regulatory elements.
Ivan Ovcharenko is a Principal Investigator at the Computational Biology Branch of the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH). His research is focused on the computational analysis of gene regulation in the human and other vertebrate genomes. The Ovcharenko laboratory is particularly interested in determining the genomic code of tissue-specific regulatory elements, evolutionary divergence of enhancers and silencers, population variation in non-coding DNA and non-coding polymorphisms associated with genetic disorders.