Despite remarkable recent advances in genomics that have enabled us to identify most of the genes in the human genome, comparable efforts to define the transcriptional cis-regulatory elements that control gene expression are lagging behind. The difficulty of this task stems from two equally important problems: our knowledge of how regulatory elements are encoded in genomes remains elementary, and there is a vast genomic search space for regulatory elements, since most of the mammalian genome is noncoding. Comparative genomic approaches are having a remarkable impact on the study of transcriptional regulation in eukaryotes and currently represent the most efficient and reliable methods of predicting noncoding sequences likely to control the patterns of gene expression. By subjecting eukaryotic genomic sequences to computational comparisons and subsequent experimentation, we are inching our way toward a more comprehensive catalog of common regulatory motifs that lie behind fundamental biological processes. We are still far from comprehending how the transcriptional regulatory code is encrypted in the human genome and from providing an initial global view of regulatory gene networks, but collectively, the continued development of comparative and experimental approaches will rapidly expand our knowledge of the transcriptional regulome.
In contrast to the genomes of many prokaryotic organisms, which are compact and gene rich, most eukaryotic, particularly metazoan, genomes have a small ratio of genes to noncoding DNA, since only a minority of the genome is transcribed and translated into proteins. The focus of the initial analysis of both the human (Lander et al., 2001; Venter et al., 2001) and mouse genomes (Waterston et al., 2002) has been to catalog all mammalian protein-coding genes, now estimated at ~25,000 unique transcripts (Collins, 2004) and spanning less than 2% of the human genome. An additional 40–45% of the human genome is covered by repetitive DNA elements, while the remaining ~53% is composed of noncoding DNA (Lander et al., 2001; Venter et al., 2001; Waterston et al., 2002). Despite this vast amount of noncoding DNA, little progress has been made in conclusively determining whether it plays any vital functional role. Although some parts of the noncoding regions within our genome will eventually reveal no detectable biological function, an increasingly popular hypothesis holds that much of an organism’s genetic complexity is due to elaborate transcriptional regulatory signals embedded in our noncoding DNA that determine when, where, and at what level a gene transcript is expressed.
Natural selection is a major driving force in stabilizing functionally important regions within genomes, preserving the sequences of orthologous coding exons and transcriptional regulatory elements. Evolutionary comparisons have long held promise for identifying transcriptional response elements in eukaryotic genomes; initial searches were conducted using consensus sequences and position weight matrices in a method often termed phylogenetic footprinting (Blanchette et al., 2002; Chiang et al., 2003; Fink et al., 1996; Wasserman and Fickett, 1998). A new generation of ab initio approaches has increasingly shown great potential for identifying novel functional motifs, but in general the sheer size and complexity of mammalian genomes has precluded the extension of these approaches to the study of mammalian transcription on a genome scale. This has been primarily because these simple motifs are short and highly degenerate and create overwhelming numbers of predictions, with high rates of false positives, when used in whole-genome analysis. The availability of large amounts of sequence data from numerous organisms and new user-friendly alignment tools have fundamentally altered our contemporary approach to transcriptional regulation, where evolutionary comparisons have become the first tier of analysis routinely performed when searching for regulatory elements. This is reflected by the dramatic increase in the number of studies reporting the identification of functional sequences through the use of comparative genomics, and these emerging studies are providing compelling evidence in support of the use of evolutionary comparisons as a robust strategy for highlighting functional coding (Gilligan et al., 2002; Pennacchio et al., 2001) and noncoding sequences (Gottgens et al., 2000; Loots et al., 2000; Nobrega et al., 2003; Pennacchio et al., 2006; Touchman et al., 2000).
In particular, aligning whole genomes and identifying evolutionarily conserved regions (ECR) on a large scale has become a robust approach for discovering transcriptional regulatory elements in noncoding DNA (Pennacchio et al., 2006; Woolfe et al., 2005). Here we will discuss methods of applying comparative genomics to the identification of transcriptional regulatory elements in the human genome, and functional approaches for validating and characterizing computationally predicted elements.
It is not fully understood how one could precisely define a human gene locus, since all functional elements have yet to be determined for each transcript. In general, however, one could view a typical animal gene as a promoter linked to its transcript, both of which are embedded in a sea of positively and negatively acting regulatory elements positioned anywhere in relation to the transcriptional start site (5′, 3′, and intronic) and acting at a distance across large segments of DNA (up to megabases in length) (Fig. 10.1). The positively acting elements, or enhancers, are each responsible for a subset of the total gene expression pattern and usually drive transcription within a specific tissue or subset of cell types. A typical enhancer can range in size from as little as 100 base pairs (bp) (Banet et al., 2000; Catena et al., 2004; Krebsbach et al., 1996) to several kilobases (kb) in length (Chi et al., 2005; Danielian et al., 1997), but on average is about 500 bp in length (Kamat et al., 1999; Loots et al., 2005; Marshall et al., 2001). Within enhancer elements are docking sites for regulatory proteins, or transcription factors (TFs), that physically interact with specific DNA sequences, or transcription factor binding sites (TFBSs). TFs recognize and bind to short (6–12 bp), highly degenerate sequence motifs that occur very frequently in a genome; therefore, computationally predicting TFBSs that are functionally significant is a great challenge. It is not yet known how many TFBSs are needed to build a functional enhancer, nor how many different TFs need to cooperate synergistically to drive expression. One hypothesis suggests that a typical enhancer contains a minimum of 10 TFBSs for at least 3 different TFs (Levine and Tjian, 2003).
The core promoter serves as a platform for the assembly of the transcriptional preinitiation complex (PIC), which includes TFIIA, TFIIB, TFIID, TFIIE, TFIIF, TFIIH, and RNA polymerase II (Pol II), and which functions collectively to specify the transcription start site. PIC assembly usually begins with TFIID binding to the TATA box (a DNA sequence found in the promoter region of eukaryotic genes, specified as 5′-TATAAA-3′ or a variant), initiator, and/or downstream promoter element (DPE) found in most core promoters, followed by the entry of other general transcription factors (GTFs) and Pol II through either a sequential assembly or a preassembled Pol II holoenzyme pathway (Thomas and Chiang, 2006). This promoter-bound complex is sufficient to drive a basal level of transcription, but requires additional cofactors to transmit regulatory signals between gene-specific activators and the general transcription machinery. Three classes of general cofactors, including TATA-binding protein (TBP)-associated factors (TAFs), mediator, and upstream stimulatory activity (USA)-derived positive cofactors (PC1/PARP-1, PC2, PC3/DNA topoisomerase I, and PC4) and negative cofactor 1 (NC1/HMGB1), normally function independently or in combination to fine-tune promoter activity in a gene-specific or cell-type-specific manner. In addition, other cofactors, such as TAF1, BTAF1, and NC2, can also modulate TBP or TFIID binding to the core promoter. Many genes also contain TFBSs for proximal regulatory factors located just 5′ of the core promoter. These factors do not always function as classic activators or repressors; instead, they might serve as a connector between distal enhancers and the core promoter. In addition, a different class of regulatory elements, insulators or boundary elements, serve as gatekeepers and prevent enhancers from inappropriately regulating neighboring genes.
What makes transcriptional genomics in vertebrates a highly intricate problem stems from two recent observations: (1) the regulatory elements associated with a transcript can be scattered over great distances, reaching megabases (Mb) in length (Nobrega et al., 2003; Sagai et al., 2005), and (2) some regulatory elements are capable of controlling multiple transcripts, skipping intervening genes, or regulating one transcript while being positioned within a different transcript (Loots et al., 2000; Zuniga et al., 2004). In Fig. 10.2 we depict three such examples. For the interleukin 4 (IL4) gene cluster on human chromosome 5, it was long hypothesized that a common regulatory element or locus control region controls the Th2 expression of several cytokine genes. Using comparative genomics, one highly conserved element positioned between IL4 and IL13 was identified and removed from a human yeast artificial chromosome transgene (Loots et al., 2000) as well as from the mouse genome (Mohrs et al., 2001), showing that removal of this element affects the expression of three cytokines, IL4, IL13, and IL5, in Th2 cells. What was peculiar about this discovery was the fact that this regulatory element was positioned 120 kb away from the promoter of the IL5 gene, and that it exclusively controlled these three cytokines at the transcriptional level, leaving the intervening gene, RAD50, unaffected when the element was removed from the genome (Fig. 10.2A) (Loots et al., 2000; Mohrs et al., 2001).
In the case of the limb deformity mutations originally mapped to the C-terminal region of the formin gene almost two decades ago (Maas et al., 1990; Woychik and Alagramam, 1998; Woychik et al., 1985, 1990), it was long hypothesized that formin is the gene responsible for the disruption of limb bud morphogenesis. Recently, when a null mutation in the neighboring gene gremlin was generated, it was shown that removing this BMP antagonist recapitulates the limb phenotypes recorded for all limb mutations mapped to the formin locus (Khokha et al., 2003). Subsequent complementation studies, together with in vivo enhancer expression assays, confirmed that the limb deformity phenotypes were indeed a result of gremlin-specific regulatory element mutations, that this limb-specific enhancer resides in an intron in the C-terminal region of the large formin transcript, and that formin does not contribute to the limb morphogenic abnormalities caused by these mutations (Fig. 10.2B) (Zuniga et al., 2004). In a more dramatic example, a region in the fifth intron of the Lmbr1 gene has long been suggested as the element responsible for the preaxial polydactyly recorded in mouse and human mutations (Heutink et al., 1994). In these mice, expression of Shh, a gene positioned 1 Mb away from the Lmbr1 transcript, is perturbed in the anterior margin of the limb bud mesenchyme (Masuya et al., 1995), recapitulating phenotypic aspects of the Shh knockout. These studies were followed by transgenic reporter experiments which revealed that this intronic region from Lmbr1 is indeed an enhancer that drives expression in the posterior mesenchyme of the developing mouse limb bud (Lettice et al., 2003).
Most recently, this enhancer was removed from the mouse genome to show conclusively that this element, located 1 Mb away from the Shh transcriptional start site, is required for the limb-specific expression of Shh; its ablation results in a limb phenotype identical to that of the Shh knockout, demonstrating that it functions as a limb-specific Shh enhancer (Fig. 10.2C) (Sagai et al., 2005).
In the absence of sequence information, one method that was employed to genetically map congenital abnormalities was to karyotype affected individuals and determine whether chromosomal abnormalities in the form of deletions and translocations segregate with the phenotype in affected families. While detailed mapping and targeted sequencing of affected individuals generally lead to the discovery of the causative gene and identification of the deleterious mutations, in some instances these chromosomal aberrations do not disrupt any genes or coding regions—several such examples are listed in Table 10.1 (Ahituv et al., 2004). For example, mutations in the coding region of the gene encoding the developmental transcription factor SALL1 lead to autosomal dominant Townes–Brocks syndrome, while a thoroughly characterized translocation in one patient 180 kb telomeric to SALL1 also leads to a similar phenotype (Marlin et al., 1999). One likely explanation for these observations is that noncoding cis-regulatory sequences have been mutated or removed from the genome, affecting the expression pattern or expression level of the gene they normally regulate. Since many recorded diseases have no documented coding mutations, it is likely that disruption of the communication between a vital cis-regulatory sequence and the gene it regulates could result in a disease that resembles hypomorphic or null alleles of the causative gene. However, providing definitive proof that a noncoding sequence change is indeed causing a particular phenotype is a highly complex problem, difficult to address experimentally. Recently, it has been suggested that engineered bacterial artificial chromosomes (BACs) may be used to determine whether noncoding deletions deleteriously impact the expression of disease-causing genes (Loots et al., 2005).
The authors investigated whether a 52-kb noncoding deletion linked to the sclerosteosis disease-causing gene, SOST (Balemans et al., 2001), and found homozygous in Van Buchem (VB) patients, affects SOST gene expression, by expressing a wild-type BAC and a genetically modified BAC mimicking the VB allele. They proceeded to show that the SOST wild-type allele expresses human SOST according to its endogenous expression pattern, primarily in the adult bone, while the VB allele fails to drive SOST expression in the bone. They further used comparative sequence analysis and enhancer assays to identify a distant enhancer element that is able to drive transgenic expression in osteocyte-like cell lines and in the mouse skeletal anlage at E14.5 (Loots et al., 2005).
The majority of available computational tools for predicting regulatory elements are based on constructing alignments between orthologous sequences and/or detecting TF DNA binding motifs. Investigators now have the option to deduce phylogenetic relationships among sequences either by generating their own alignments (Bray et al., 2003; Brudno et al., 2003; Mayor et al., 2000; Schwartz et al., 2003) or by using ready-made DNA conservation plots available at various genome browsers (Kuhn et al., 2007; Ovcharenko et al., 2004b; Schwartz et al., 2003). There are several different approaches for scanning sequences for putative regulatory elements using pattern recognition. First, a number of computational tools predict TFBSs using a library of known motifs (Heinemeyer et al., 1998, 1999; Loots and Ovcharenko, 2004), or identify conserved sequence blocks in a multiple sequence alignment (Blanchette et al., 2002; Hertz and Stormo, 1999). Clustering of TFBSs has been implemented as a second approach for predicting regulatory elements or cis-regulatory modules (CRMs). A few programs analyze homogeneous clusters of a single overrepresented DNA motif or heterogeneous clusters of multiple different sequence motifs (synergistic motifs) (Berman et al., 2002; Kim et al., 2006; Loots et al., 2002). A third approach for predicting sequences with specific regulatory properties is to identify CRMs shared by multiple functionally related sequences from the same organism (Jegga et al., 2005, 2007; Sharan et al., 2004). Expression profiling experiments have the potential to identify groups of coexpressed genes that respond to similar environmental and metabolic stimuli, and it has been speculated that such gene sets often share similar types of CRMs because their coregulation is mediated by the same set of regulatory proteins.
Several new computational approaches use microarray expression data to predict tissue-specific regulatory elements in coexpressed sets of genes (Jegga et al., 2005; Ovcharenko and Nobrega, 2005; Pennacchio et al., 2007).
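As a minimal illustration of the first, library-based approach, the sketch below scans a sequence for matches to a single degenerate consensus motif written in IUPAC codes. The motif and sequence are invented for illustration and are not drawn from TRANSFAC or any real promoter.

```python
# Hypothetical sketch: scanning a sequence for matches to a degenerate
# consensus motif, the simplest form of library-based TFBS prediction.

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_matches(sequence, consensus):
    """Return start positions where the sequence matches a degenerate
    IUPAC consensus (e.g., the E-box-like pattern CANNTG)."""
    hits = []
    k = len(consensus)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if all(base in IUPAC[code] for base, code in zip(window, consensus)):
            hits.append(i)
    return hits

# Invented example sequence containing two CANNTG matches:
print(consensus_matches("GGCACGTGATCAGCTGTT", "CANNTG"))  # → [2, 10]
```

As the text notes, such short degenerate patterns match very frequently by chance, which is precisely why conservation filters (discussed below) are needed.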
Since sequences that mediate gene expression tend to be evolutionarily conserved, one can identify putative enhancers by comparing genomes and determining regions of high homology (Loots et al., 2000, 2005; Nobrega et al., 2003). To identify evolutionarily conserved noncoding sequences (ncECRs), one needs to be able to generate reliable alignments between orthologous noncoding regions from different organisms. Aligning short sequences from closely related organisms is a straightforward process, while determining sequence similarity between larger, highly divergent regions is a more difficult task due to significant DNA rearrangements. Even highly orthologous regions are rich in insertions and deletions, as well as many single base-pair mutations; the insertions and deletions lack orthologous counterparts and are represented as gaps within alignments. An additional potential difficulty in obtaining accurate syntenic alignments is created by the large numbers of tandem and segmental duplications found in the human genome. Some assembly strategies are unable to differentiate highly homologous duplications from truly overlapping sequences, resulting in erroneous genomic assemblies with underrepresented paralogs. The most complex problem is posed by lineage-specific segmental duplications that arose since the separation of two species from their most recent common ancestor. In this situation, distinguishing truly orthologous syntenic sequences from paralogous ones is a major challenge in the absence of a one-to-one sequence match.
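The basic logic of ECR detection can be sketched as a sliding-window scan of a gapped pairwise alignment. The window size and identity cutoff below are illustrative assumptions; published tools typically require on the order of ≥70% identity over ≥100 bp.

```python
# Illustrative sketch of ECR detection: slide a fixed window along a
# gapped pairwise alignment and report windows whose percent identity
# exceeds a cutoff. Window size and cutoff are assumed values.

def conserved_windows(aln_a, aln_b, window=10, min_identity=0.7):
    """Return (start, identity) for alignment windows above the cutoff.
    Gap characters ('-') never count as matches."""
    assert len(aln_a) == len(aln_b)
    hits = []
    for i in range(len(aln_a) - window + 1):
        matches = sum(
            1 for a, b in zip(aln_a[i:i + window], aln_b[i:i + window])
            if a == b and a != "-"
        )
        identity = matches / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits
```

In practice, overlapping qualifying windows would then be merged into maximal conserved segments; that bookkeeping is omitted here.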
The majority of pairwise sequence alignment programs utilize dynamic programming for global alignments (Needleman and Wunsch, 1970), local alignments (Altschul et al., 1990), or database searches (Altschul et al., 1997). A small fraction of alignment programs use hidden Markov models (HMMs), such as WABA (Kent and Zahler, 2000), or suffix trees, such as MUMmer (Majoros et al., 2005) and AVID (Bray et al., 2003). In a very simplistic view, global alignments assume colinearity of the DNA sequences, while local alignments focus on detecting short matches between two sequences independent of their location and orientation. Local alignments are very powerful in detecting evolutionary rearrangements resulting in DNA reshuffling and segmental duplications (paralogs), as well as species-specific tandem gene expansions. In addition, local alignment tools are also useful when highly divergent genomes are compared, since gene structure and order are not well preserved over large evolutionary distances.
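The dynamic-programming scheme of Needleman and Wunsch can be sketched as follows. The scoring parameters are arbitrary illustrations, and real aligners add refinements such as affine gap penalties and traceback of the alignment itself.

```python
# A minimal Needleman-Wunsch global alignment score, the dynamic
# programming scheme referenced above; scoring values are illustrative.

def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of s and t."""
    n, m = len(s), len(t)
    # score[i][j] = best score aligning the prefixes s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # aligning s[:i] against nothing
        score[i][0] = i * gap
    for j in range(1, m + 1):           # aligning t[:j] against nothing
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diag,                    # substitute/match
                              score[i - 1][j] + gap,   # gap in t
                              score[i][j - 1] + gap)   # gap in s
            # The recurrence enforces colinearity: every base of each
            # sequence is either aligned in order or placed in a gap.
    return score[n][m]

print(needleman_wunsch("ACGT", "ACG"))  # → 1 (three matches, one gap)
```

The quadratic table explains why pure global dynamic programming does not scale to megabase loci, and why tools such as BLASTZ seed alignments with fast local matches first.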
The first available alignment tools were designed to recognize and align highly homologous protein sequences. The basic local alignment search tool (BLAST) was created to rapidly match a relatively short stretch of DNA with homologous regions from a large collection of sequences stored in the National Center for Biotechnology Information (NCBI) database. Most recently, BLAST has evolved into a family of alignment tools able to detect matches for various types of sequences and evolutionary distances, including blastn (nucleotide), blastp (protein), blastx [nucleotide query—protein database (db)], tblastn (protein query—translated db), tblastx (nucleotide query—translated db), and megablast (highly conserved matches). The “Blast2sequences” (bl2seq) tool was created to apply all the BLASTs to both nucleotide and protein pairwise sequence comparisons and is extremely powerful in annotating genomic sequences by comparing large contigs with mRNA sequences (Altschul et al., 1990, 1997). Despite the great versatility of BLAST, its application becomes limited when trying to align large genomic loci, megabases in length. Processing large alignments requires graphical interfaces that allow the compact visualization of genes and repetitive elements along with the evolutionary conservation profile of the aligned sequences.
A new generation of alignment tools can efficiently process two or more input sequences that can be up to genome size in length, are publicly accessible, and have user-friendly web interfaces. Some examples are listed in Table 10.2. PipMaker (Schwartz et al., 2000), zPicture (Ovcharenko et al., 2004a), and Mulan (Ovcharenko et al., 2005) are based on the BLASTZ (Schwartz et al., 2003) local alignment program. They combine suffix tree algorithms with dynamic programming techniques and have the capacity to align very long genomic intervals in a very short period of time. VISTA (Frazer et al., 2004; Mayor et al., 2000) is a visualization tool for global alignments generated by AVID (Bray et al., 2003) or LAGAN (Brudno et al., 2003) programs. All these alignment engines provide the user with informative, high-resolution graphical displays of the resulting alignments depicting both the genomic location of the conserved regions in the reference sequence and the degree of similarity for each aligned DNA segment. Individual features of the DNA sequence, such as coding exons, untranslated regions (UTR), and repetitive elements, can be distinguished in the graphical output through the use of different color schemes, allowing the identification of evolutionarily conserved sequences present in noncoding regions.
Comparative sequence alignment data can also be retrieved from genome browsers. Genome browsers are web-based database interfaces designed to allow navigation across an entire genome by scrolling and zooming through any region of DNA and visualizing all available annotation data. In general, annotations include mRNAs, expressed sequence tags (ESTs), gene predictions, single nucleotide polymorphisms (SNPs), as well as many other features. Users can enter a region of the genome by searching for a landmark such as the name or acronym of a known gene, the accession number of a DNA sequence, the numeric position within a chromosome, or even through a homology search by providing a piece of sequence. The two main genome browsers are Ensembl (Stalker et al., 2004) and the UCSC Genome Browser database (Kent et al., 2002; Kuhn et al., 2007), both of which were originally designed to support the assembly and annotation needs of the Human Genome Project by creating an efficient, user-friendly data storage and retrieval system with a compact visual presentation. The UCSC Browser has rapidly expanded to provide access to other available genome assemblies and their accompanying annotations, which now include 14 vertebrate species. Recently, the UCSC Browser has also incorporated comparative genomic tracks to visualize regions of DNA conservation between two fully sequenced genomes, as well as a regulatory potential track based on normalized log-odds scores calculated using a hidden Markov model that distinguishes known regulatory regions from ancestral repeats. Similar pairwise whole-genome alignments have been generated using a combination of local and global alignment strategies and can be visualized in the ECR Browser (Ovcharenko et al., 2004b) and in the VISTA Genome Browser (Brudno et al., 2004).
In addition to ready-made pairwise alignments, the ECR Browser also aligns user-provided sequences to available genomes and incorporates any available annotation into the visual display of the generated alignment. For a comprehensive review of genome browsers and databases, see Ureta-Vidal et al. (2003).
In eukaryotes, modulation of gene expression is achieved through the complex interaction of regulatory proteins (trans-factors) with specific DNA regions (cis-acting regulatory sequences). Intensive efforts over the last decades have identified numerous regulatory proteins, or TFs, and the DNA sequences they recognize. The TRANSFAC database (http://www.biobase.de) represents the most comprehensive collection of TF-binding specificities, summarized as position weight matrices (PWMs) (Heinemeyer et al., 1998, 1999). Pattern-recognition programs such as MATCH or MatInspector (Quandt et al., 1995) use these libraries of TF-PWMs to identify significant matches in DNA sequences. A major confounding factor in the use of PWMs to identify TFBSs is that TFs bind to short (6–12 bp), degenerate sequence motifs that occur very frequently in a genome, and only a small fraction of the predicted TFBSs are functionally significant.
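The following is a hedged sketch of the PWM-scanning idea behind tools such as MATCH and MatInspector: each window of a sequence is scored by summing log-odds weights, and windows above a threshold are reported as putative sites. The count matrix, background model, and threshold are invented for illustration, not taken from TRANSFAC.

```python
# Hypothetical sketch of PWM scanning: build log-odds weights from
# (invented) binding-site counts and score every sequence window.

import math

# Invented counts for a 4-bp motif; one dict of base counts per position.
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 8, "T": 1},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25  # uniform base frequencies, an assumed background model

def make_pwm(counts, pseudocount=1.0):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for position in counts:
        total = sum(position.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((n + pseudocount) / total / BACKGROUND)
            for base, n in position.items()
        })
    return pwm

def scan(sequence, pwm, threshold=4.0):
    """Return (position, score) for windows scoring above the threshold."""
    k = len(pwm)
    hits = []
    for i in range(len(sequence) - k + 1):
        s = sum(pwm[j][sequence[i + j]] for j in range(k))
        if s >= threshold:
            hits.append((i, round(s, 2)))
    return hits

print(scan("TTACGTACGTTT", make_pwm(counts)))
```

Lowering the threshold illustrates the confounding factor described above: more genomic windows clear the bar, and the false-positive rate climbs accordingly.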
It has been shown that by combining pattern recognition with comparative sequence analysis the number of false positives is dramatically reduced while the number of functional sites is preserved. These results suggest an alternative strategy for sequence-based discovery of biologically relevant regulatory elements. The rVISTA (Loots and Ovcharenko, 2004) and ConSite (Sandelin et al., 2004) web-based tools combine TFBS motif searches with cross-species sequence analysis. rVISTA analysis proceeds in four major steps: (1) identify TFBS matches in each individual sequence using PWMs from the TRANSFAC database, (2) detect each locally aligned TFBS and calculate its percent identity, (3) select TFBSs present in regions of high DNA conservation, and (4) graphically display individual or clustered TFBSs (Loots and Ovcharenko, 2004).
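Steps (1)–(3) of this strategy can be sketched as a simple conservation filter: given predicted site positions in alignment coordinates, retain only the sites whose aligned window exceeds an identity cutoff. The site positions, alignment, and cutoff below are invented for illustration, not taken from rVISTA output.

```python
# Hedged sketch of the rVISTA-style idea: keep only predicted sites
# that fall in well-conserved stretches of a pairwise alignment.

def percent_identity(aln_a, aln_b, start, length):
    """Fraction of identical, ungapped positions in one aligned window."""
    window = zip(aln_a[start:start + length], aln_b[start:start + length])
    return sum(1 for a, b in window if a == b and a != "-") / length

def conserved_sites(site_starts, site_len, aln_a, aln_b, cutoff=0.8):
    """Keep predicted TFBS starts (alignment coordinates) whose aligned
    window is at least `cutoff` identical between the two species."""
    return [s for s in site_starts
            if percent_identity(aln_a, aln_b, s, site_len) >= cutoff]

# Invented example: two hypothetical 6-bp sites predicted in the human
# sequence; only the first falls in a conserved stretch.
aln_human = "TTCACGTGATTT"
aln_mouse = "TTCACGTGAGGG"
print(conserved_sites([2, 6], 6, aln_human, aln_mouse))  # → [2]
```

The filter discards the poorly conserved prediction while retaining the conserved one, which is the mechanism by which false positives drop without losing functional sites.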
Phylogenetic footprinting is a method for identifying highly conserved DNA motifs present in a multiple sequence alignment. It is usually performed by computing a global multiple alignment of three or more orthologous sequences, and by identifying regions of high conservation in the alignment. FootPrinter (Blanchette and Tompa, 2003; Fang and Blanchette, 2006), FOOTER (Corcoran et al., 2005), TRES, and PhyloGibbs (Siddharthan et al., 2005) are some of the algorithms available for generating motif predictions and reporting motif sequences with the lowest parsimony scores, calculated with respect to the phylogenetic tree relating the input sequences. A more successful recent approach to phylogenetic footprinting is to use motif discovery algorithms such as MEME (Bailey et al., 2006). Programs like MEME neither take into account the phylogenetic relationship among the input sequences nor do they rely on precalculated PWM stored in a database; they treat input sequences individually and the patterns are learned through several rounds of ungapped local alignments. The sampled alignments are used to fit a set of weights and the best weights are used to define an alignment, similar to the Gibbs sampling method (Schug and Overton, 1997).
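The parsimony scoring that underlies several of these footprinting programs can be sketched with Fitch's algorithm, which counts the minimum number of substitutions a single alignment column requires on a fixed tree; windows whose columns sum to the lowest totals are footprint candidates. The tree topology, species names, and column below are invented for illustration.

```python
# Hypothetical sketch of parsimony scoring for phylogenetic footprinting
# using Fitch's algorithm on one alignment column and a fixed tree.

def fitch(tree, column):
    """Return the minimum substitution count for one alignment column.
    `tree` is a nested tuple of leaf names, e.g. (("h","m"),("r","d"));
    `column` maps each leaf name to its base in this column."""
    def walk(node):
        if isinstance(node, str):            # leaf: its own state, 0 changes
            return {column[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = walk(node[0]), walk(node[1])
        shared = left_set & right_set
        if shared:                           # children agree: no extra change
            return shared, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return walk(tree)[1]

tree = (("human", "mouse"), ("rat", "dog"))
column = {"human": "A", "mouse": "A", "rat": "A", "dog": "G"}
print(fitch(tree, column))  # → 1 (a single substitution on the dog branch)
```

Summing such scores over a window and ranking windows from lowest to highest total is the essence of the parsimony-based motif reporting described above.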
In eukaryotes, transcriptional gene regulation is directed by a cohort of several different TFs that cooperatively bind to clusters of TFBSs known as CRMs. One of the main objectives of transcriptional genomics is to decode the structure of CRMs, distinguish the footprints of functional TFBSs from genomic intervals devoid of biological significance, and determine which CRM structures confer which tissue specificity. Searches for clusters of multiple adjacent binding sites for regulatory proteins have been successful in analyzing regulatory regions involved in mammalian muscle-specific (Wasserman and Fickett, 1998) and liver-specific gene expression (Krivan and Wasserman, 2001). MSCAN (Alkema et al., 2004) and rVISTA (Loots and Ovcharenko, 2004) are two examples of web-based tools that allow users to search for clusters of cis-elements, either using precalculated matrices from the TRANSFAC database or using consensus sequences provided by the user. Using these tools, one can search for regions with a high density of repeated TFBSs for a single TF or for multiple different TFs. Recently, a new tool, SynoR, has been made available that allows users to search the whole human genome for any configuration of TFBSs to predict functional regions with a distinct TFBS profile (Ovcharenko and Nobrega, 2005).
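The site-clustering idea can be sketched as a scan for windows that contain a minimum number of predicted binding-site hits. The hit positions, window size, and hit threshold below are invented for illustration and do not reproduce any particular tool's parameters.

```python
# Illustrative sketch of CRM prediction by site clustering: report
# windows dense enough in predicted binding-site hits to qualify as a
# putative module. All parameter values are assumptions.

def dense_windows(hit_positions, window=500, min_hits=3):
    """Return (window_start, n_hits) for windows anchored at a hit that
    hold at least `min_hits` predicted sites within `window` bp."""
    hits = sorted(hit_positions)
    clusters = []
    for i, start in enumerate(hits):
        n = sum(1 for p in hits[i:] if p < start + window)
        if n >= min_hits:
            clusters.append((start, n))
    return clusters

# Invented hit list: three sites cluster near position 100, one is isolated.
print(dense_windows([100, 250, 400, 5000]))  # → [(100, 3)]
```

Real tools additionally weight each hit by its PWM score and require the sites to belong to a specified set of TFs, but the density scan above is the common core.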
A new generation of computational tools aimed at predicting tissue-specific regulatory elements blends together three elements: (a) comparative sequence analysis, (b) TFBS analysis, and (c) microarray expression analysis. In this approach, coexpressed genes are mapped to a genome, and their promoters and surrounding conserved noncoding regions are used to identify CRMs that are overrepresented in the data set when compared with the distribution of the same CRMs across the whole genome. Pilpel and colleagues proposed such a method for modeling transcription regulatory networks in complex eukaryotes by combining microarray expression data with insights from the combinatorial structure of promoter regions (Pilpel et al., 2001). They were able to show that it is possible to discover novel functional PWMs by identifying statistically significant synergistic motifs in promoters of coexpressed genes, using a process called transcription factor centric clustering (TFCC). TFCC strategies are designed to create an explicit link between CRMs and the TFs that bind to them (Zhu et al., 2002). These methods permit the detection of enriched TFBSs that are used as a seed to bicluster genes and compare gene expression with TF distribution in a two-dimensional space. Sharan et al. (2003) built on Pilpel’s strategy by analyzing human–mouse conserved promoter elements of cell cycle and stress response-related genes. Their analyses revealed several clusters of TFBSs specific to the coexpressed genes. The significance of such co-occurrences was statistically evaluated and showed a direct correlation between the identified CRMs and the biologically validated target genes derived from the microarray expression data. They proceeded to incorporate this method into publicly available software, CREME, which can perform combinatorial cluster analysis, statistically evaluate the detected co-occurrences, and graphically display predicted CRMs in a browser (Sharan et al., 2003, 2004).
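The enrichment calculation at the heart of such approaches can be sketched with a hypergeometric tail test: given that K of N promoters genome-wide contain a CRM, it gives the probability that k or more of the n promoters in a coexpressed set would contain it by chance. All counts below are invented for illustration.

```python
# A minimal sketch of CRM overrepresentation testing in a coexpressed
# gene set using the hypergeometric tail; all numbers are invented.

from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n promoters from N, of which K carry the CRM."""
    return sum(
        comb(K, i) * comb(N - K, n - i)     # ways to pick i carriers
        for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Invented example: a CRM found in 1,000 of 20,000 promoters genome-wide
# also appears in 8 of 20 promoters of a coexpressed gene set.
p = hypergeom_tail(20000, 1000, 20, 8)
print(f"P-value: {p:.2e}")
```

A tiny tail probability flags the CRM as overrepresented in the coexpressed set; genuine pipelines additionally correct this p-value for the many motifs tested.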
In general, computational predictions have strongly correlated with functionally characterized regulatory elements, mostly because the training sets used for these analyses have been carefully chosen from biologically validated data sets. However, the majority of novel predictions have not yet been biologically confirmed, and the limitations of computational tools have not been carefully assessed. Functional characterization of noncoding sequences represents the largest bottleneck that prevents us from expanding the annotation of regulatory elements from small target regions to entire genomes. The field of in silico biology is still in its infancy, but it is evolving at a fast pace, presenting researchers with new theoretical solutions for the analysis of noncoding sequences. The computationally derived regulatory predictions may not all be functionally significant at this point, but by centering biological focus on a handful of high-priority regions to be tested, computational tools have already surpassed expectations for identifying regulatory elements. In Section IV.A we will review several experimental approaches employed to validate and characterize predicted transcriptional regulatory elements.
Almost two decades ago, Kothary et al. (1988) created a mouse heat shock 68 promoter (hsp68) β-galactosidase (LacZ) transgene (hsp68-LacZ) and generated several independent lines of transgenic mice aiming to study heat shock gene regulation in vivo. In six of these transgenic lines, the transgene was consistently silent until subjected to heat shock treatment; however, one line of transgenic mice expressed LacZ in a neural-specific pattern independent of heat shock. The transgene integrated into the gene responsible for dystonia musculorum (OMIM 113810) (Bressman, 2003), mutated the gene, and acquired its transcriptional profile. This study was able to show that the hsp68 promoter which is normally silent at physiological temperature is able to activate transcription of a reporter gene in response to positive regulatory elements and therefore can be used to trap enhancers in mammalian genomes (Kothary et al., 1988). It was not until the human and mouse genome projects were well underway that this transgene was fully exploited and transformed into an essential tool for validating tissue-specific enhancer elements in transient transgenic mice.
Unlike stable transgenic lines, which are screened for germline transmission of exogenous DNA, transient transgenic mice are transgenic animals analyzed in the F0 generation, without the transgene having to be passed to future generations. In this method, embryos are injected with the reporter construct, transferred to recipient females, and allowed to develop to a desired embryonic stage (usually between E10.5 and E14.5), at which point the females are sacrificed and the embryos are harvested and examined for reporter transgene expression. This method dramatically shortens the experimental time for collecting expression data, since founder lines do not need to be established before carrying out the expression analysis, making this procedure highly efficient for validating a set of putative enhancers at a desired developmental time point. It is also a more cost-efficient approach, since it eliminates the need for breeding mice and maintaining founder lines. The use of this method as a validation and characterization tool has grown dramatically over the past years, but in general it has remained gene centric: individual investigators have focused on testing conserved elements in the context of a well-characterized locus to identify tissue-specific enhancers that follow the expression pattern of one gene of interest (Bejerano et al., 2006; Forghani et al., 2001; Loots et al., 2005; Nobrega et al., 2003; Rojas et al., 2005; Wang et al., 2001; Zhu et al., 2004). When extended toward more comprehensive, whole-genome analysis, this approach has proven very useful in validating highly conserved human elements grouped into two major categories: deeply conserved (evolutionarily conserved from human to fish) (Nobrega et al., 2003; Pennacchio et al., 2006) and ultraconserved (elements with close to 100% identity from human to mouse) (Bejerano et al., 2006; Pennacchio et al., 2006; Poulin et al., 2005).
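The conservation criterion behind the "ultraconserved" category can be illustrated with a minimal sketch (our own illustration, not code from any of the cited studies): given a gapless pairwise human–mouse alignment, candidate windows are runs of perfectly matching bases above a length threshold. The function name is our own, and the thresholds used in the demonstration are arbitrary; published definitions of ultraconserved elements use much longer windows (e.g., ≥200 bp of perfect identity).

```python
def find_perfect_match_runs(human, mouse, min_len=200):
    """Scan two equal-length aligned sequences and return (start, end)
    intervals of perfectly matching bases at least min_len long.
    Illustrative only: real pipelines work on genome-scale alignments
    with gaps and use percent-identity windows, not just exact runs."""
    assert len(human) == len(mouse), "sequences must be aligned to equal length"
    hits, start = [], None
    for i, (h, m) in enumerate(zip(human, mouse)):
        if h == m:
            if start is None:       # a new matching run begins here
                start = i
        else:
            if start is not None and i - start >= min_len:
                hits.append((start, i))
            start = None            # mismatch ends the current run
    # close a run that extends to the end of the alignment
    if start is not None and len(human) - start >= min_len:
        hits.append((start, len(human)))
    return hits

# Toy alignment with one mismatch; a 10-bp threshold flags the leading run.
candidates = find_perfect_match_runs("AAAAAAAAAACAAAA",
                                     "AAAAAAAAAAGAAAA", min_len=10)
```

In practice such exhaustive scans are what makes genome-wide conservation screens tractable: the expensive step is not finding the runs but experimentally validating the resulting candidate list, as discussed above.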
While this method is currently considered high throughput in mice, several obstacles preclude it from being applied effectively on a genome-wide scale. First, since mouse embryos develop in utero, collecting transgenic embryos is a terminal procedure that permits a litter of mice to be analyzed at only one given time point. In the absence of expression information for the genes that putative enhancers are expected to regulate, screening for enhancer function could become a fishing expedition with a low probability of success. For example, if an enhancer drives expression only at E17.5 in the medulla oblongata, with no detectable expression in any other tissue or at any other time during development, the investigator would have to assay this particular time point to detect its function. By assaying any other time point, the consistent lack of expression would lead the investigator to conclude erroneously that the element has no function. A second drawback is posed by the visualization method for the LacZ transgene. LacZ is a bacterial gene whose product, β-galactosidase, catalyzes the hydrolysis of galactosides such as X-gal, producing a blue color that can be visualized. This procedure requires fixation and hence is terminal. Other caveats of this experimental approach include position effects, promoter specificity, and restriction to enhancer detection (one cannot detect a repressor or silencer element).
To overcome some of these problems, several alternatives have been proposed: (1) the use of a transgenic model system that develops ex utero [fish (zebra fish) or frogs (Xenopus)]; (2) the use of reporter genes that do not require terminal fixation for visualization, such as green fluorescent protein (GFP); and (3) the use of larger transgenes and "knocked-in" reporters that track protein expression from the endogenous promoter. Such transgenic systems would allow investigators to monitor transgene expression more efficiently during development and to determine both the temporal and the spatial window of enhancer activity an element may possess. Using zebra fish transgenesis, Woolfe and his colleagues tested 25 ECRs from a set of 1400 elements identified by comparisons between the human and puffer fish (Fugu rubripes) genomes. This study is of great significance because it makes two very important points. First, the authors were able to show that distant comparisons between human and Fugu enrich for a special category of regulatory elements that cluster around genes with vital roles during embryonic development (e.g., TFs). Second, they confirmed that 23 of the 25 tested ncECRs exhibit tissue-specific enhancer activity, suggesting that most deeply conserved elements do indeed function as transcriptional regulatory elements (Woolfe et al., 2005).
Recently, several reports have emerged that describe transposon-based gene delivery methods in both zebra fish and Xenopus embryos. These technologies can potentially evolve into a rapid system for transgenesis and expedite enhancer validation. In this approach, an ECR is cloned upstream of a minimal promoter driving a fluorescent reporter, where the entire transgenic cassette is flanked by transposable elements, such as Tol2 (Allende et al., 2006; Balciunas et al., 2006; Hamlet et al., 2006), Sleeping Beauty (Sinzelle et al., 2006), piggyBAC (Wu et al., 2006), or Frog Prince (Miskey et al., 2003), and the reporter constructs are assayed for cis-regulatory activity. Using the Tol2 transposable system in both zebra fish and Xenopus transgenic experiments, Allende and his colleagues have recently shown that most of the 50 ECRs they tested behave as positive modulators of gene expression and contribute to the specific temporal and spatial expression patterns of the endogenous genes they regulate (Allende et al., 2006). The continuation of studies such as the ones described above will further our understanding of tissue-specific transcriptional regulation and will aid the discovery of all functional regulatory elements in the human genome.
An alternative to using heterologous minimal promoters to assay for enhancer activity in transient transgenics is to tag a transcript with a reporter gene (LacZ or GFP) within the context of a larger genomic region by modifying a yeast artificial chromosome (YAC) or BAC. This method has several advantages. First, the reporter gene is driven by the endogenous promoter and responds to all regulatory elements included in the transgenic construct; therefore, by comparing the expression pattern of the reporter transgene to the endogenous expression pattern of the mouse gene, one can determine which components of the complete tissue-specific expression profile are controlled by elements residing within the BAC region (Bouchard et al., 2005; Gebhard et al., 2007; Mortlock et al., 2003; Tallini et al., 2006). Second, transgenic animals generated using larger DNA constructs are less likely to be affected by position effects; therefore, the transgenic expression usually faithfully resembles the endogenous gene expression (Gebhard et al., 2007). Third, by mutating individual ECRs within a BAC construct, one can identify not only regulatory elements that positively modulate transcription but also cis-elements that act as negative regulators, such as repressors, silencers, or boundary elements.
The methods described in the previous sections are considered "gain-of-function" approaches to determining whether a conserved noncoding sequence possesses biological activity. However, these experiments do not provide any information on whether these elements are functionally critical, whether mutating them can lead to serious congenital abnormalities, or whether they contribute to disease susceptibility. The ultimate functional test confirming the essential physiological activity of a ncECR is to mutate these elements by changing individual base pairs or by removing them from an animal’s genome. Loss-of-function alleles can be generated by two main methods: random mutagenesis or targeted knock-out (KO). In a random mutagenesis experiment, one subjects an animal to a mutagen that either causes large chromosomal abnormalities, such as deletions and translocations, or has smaller effects, mutating single base pairs or removing a few nucleotides. Mutagenesis experiments are feasible for most experimental organisms but require rigorous screening to detect individuals that carry a desired mutation. A targeted mutation can be engineered only in animals for which KO technologies have been established; unfortunately, rodents are the only mammals in which targeted deletions can be carried out efficiently, primarily because embryonic stem cell lines have not been derived for other mammals. In this section, we will discuss the functional characterization of putative regulatory elements through loss-of-function experiments in vivo, in genetically modified mice, by either removing an ECR from a large transgene or removing it from the mouse genome.
Homologous recombination techniques in yeast and bacteria have facilitated the genetic modification of artificial chromosomes [YACs, P1-based artificial chromosomes (PACs), and BACs] (Imam et al., 2000; Lee et al., 2001; Loots, 2006; Nistala and Sigmund, 2002; Warming et al., 2005) for studying gene expression and transcriptional regulation across large genomic loci. The attraction of such DNA constructs is primarily that they carry large fragments of genomic DNA (>100 kb) and therefore are likely to contain most of the cis-regulatory elements required for the expression of a gene; even when inserted randomly into the mouse genome, these transgenes are likely to behave as they do in their native environment, recapitulating the expression pattern of the endogenous locus. An additional advantage is the ability to modify these transgenes by inserting sequences, such as loxP or FRT sites, that recombine in the presence of site-specific recombinases. By flanking a ncECR with loxP sites, one can determine in vivo, independent of position effects, the transcriptional consequences of the ncECR being present or absent from the locus (Loots et al., 2000). In most situations, the integrated loxP sites do not affect gene expression, and the floxed allele behaves equivalently to the wild-type allele. Upon administration of Cre recombinase, recombination between the loxP sites excises the ncECR, leaving behind a new deleted allele. Finally, the investigator compares expression of the transgene with and without the ncECR to determine whether the ncECR has any impact on transcription. This method was used to show that a highly conserved noncoding DNA sequence controls the expression of three cytokine genes, IL4, IL13, and IL5, in the context of a human YAC (Loots et al., 2000).
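The logic of Cre-mediated excision described above can be sketched as a toy string model (our own illustration; the "loxP" marker below is a stand-in, not the real 34-bp recognition sequence, and the allele strings are hypothetical): recombination between two directly repeated loxP sites removes the intervening element and leaves a single loxP site behind.

```python
LOXP = "loxP"  # placeholder marker for the 34-bp loxP recognition sequence

def cre_excise(allele):
    """Model Cre-mediated excision on an allele string: if two loxP
    markers flank an element, everything between them is removed and
    a single loxP site remains; otherwise the allele is unchanged."""
    first = allele.find(LOXP)
    second = allele.find(LOXP, first + len(LOXP))
    if first == -1 or second == -1:
        return allele  # fewer than two sites: no recombination possible
    # keep sequence up to the first site, one residual loxP, and the
    # sequence after the second site; the flanked element is excised
    return allele[:first] + LOXP + allele[second + len(LOXP):]

floxed  = "promoter-" + LOXP + "-ncECR-" + LOXP + "-gene"
deleted = cre_excise(floxed)  # "promoter-loxP-gene"
```

The residual single loxP site in the deleted allele mirrors the biology: one recognition site is regenerated by the recombination event, which is why floxed and deleted alleles can be compared at the same locus.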
Similarly, one can use in vitro recombination to create several variants of a transgene by removing either a putative regulatory element (Xu et al., 2006) or a large noncoding region (Loots et al., 2005). Since each transgene integrates randomly into the mouse genome, several independent transgenic lines have to be analyzed for each allele to ensure that an observed difference in expression is due to the mutation and not to position effects. Finally, the most informative and reliable way to test whether a ncECR impacts gene expression and causes a deleterious phenotype is to remove it from the mouse genome through targeted KO strategies. Although this approach remains the ultimate proof of biological significance, generating KO animals is technically challenging, laborious, expensive, and time-consuming; as a result, to date very few ncECRs have been mutated in mice (Mohrs et al., 2001; Sagai et al., 2005).
Comparative genomic approaches are having a remarkable impact on the study of transcriptional regulation in eukaryotes. Many eukaryotic genome sequences are being explored with new computational methods and high-throughput experimental tools. These tools are enabling efficient searches for common regulatory motifs and will eventually lead to the elucidation of the genome’s second code: the building blocks of tissue-specific gene regulation encoded in noncoding DNA. Experimental validation and characterization, however, continue to be a major bottleneck, and hence extending the limits of current techniques will greatly enhance the discovery of transcriptional regulatory elements in mammals, moving us closer to a systematic deciphering of transcriptional regulatory elements and providing the first global insights into gene regulatory networks. In addition to the methods described here, other recent advances, such as probing DNA–protein interactions by ChIP-chip and comparing patterns of gene expression, are moving the field of transcriptional genomics forward.