The AthaMap database generates a genome-wide map for putative transcription factor binding sites for A. thaliana. When analyzing transcriptional regulation using AthaMap it may be important to learn which genes are also post-transcriptionally regulated by inhibitory RNAs. Therefore, a unified database for transcriptional and post-transcriptional regulation will be highly useful for the analysis of gene expression regulation.
To identify putative microRNA target sites in the genome of A. thaliana, processed mature miRNAs from 243 annotated miRNA genes were used for screening with the psRNATarget web server. Positional information, target genes and the psRNATarget score for each target site were annotated to the AthaMap database. Furthermore, putative target sites for small RNAs from seven small RNA transcriptome datasets were used to determine small RNA target sites within the A. thaliana genome.
A total of 41,965 putative genome-wide miRNA target sites and 10,442 miRNA target genes were identified in the A. thaliana genome. Taken together with genes targeted by small RNAs from small RNA transcriptome datasets, a total of 16,600 A. thaliana genes are putatively regulated by inhibitory RNAs. A novel web-tool, ‘MicroRNA Targets’, was integrated into AthaMap which permits the identification of genes predicted to be regulated by selected miRNAs. The predicted target genes are displayed with positional information and the psRNATarget score of the target site. Furthermore, putative target sites of small RNAs from selected tissue datasets can be identified with the new ‘Small RNA Targets’ web-tool.
The integration of predicted miRNA and small RNA target sites with transcription factor binding sites will be useful for AthaMap-assisted gene expression analysis. URL: http://www.athamap.de/
Arabidopsis thaliana; AthaMap; MicroRNAs; Small RNAs; Post-transcriptional regulation
The AthaMap database generates a map of cis-regulatory elements for the whole Arabidopsis thaliana genome. This database has been extended by new tools to identify common cis-regulatory elements in specific regions of user-provided gene sets. A resulting table displays all cis-regulatory elements annotated in AthaMap including positional information relative to the respective gene. Further tables show overviews with the number of individual transcription factor binding sites (TFBS) present and TFBS common to the whole set of genes. Over-represented cis-elements are easily identified. These features were used to detect specific enrichment of drought-responsive elements in cold-induced genes. For identification of co-regulated genes, the output table of the colocalization function was extended to show the closest genes and their relative distances to the colocalizing TFBS. Gene sets determined by this function can be used for a co-regulation analysis in microarray gene expression databases such as Genevestigator or PathoPlant. Additional improvements of AthaMap include display of the gene structure in the sequence window and a significant data increase. AthaMap is freely available at .
The AthaMap database generates a map of cis-regulatory elements for the Arabidopsis thaliana genome. AthaMap contains more than 7.4 × 10^6 putative binding sites for 36 transcription factors (TFs) from 16 different TF families. A newly implemented functionality allows the display of subsets of higher conserved transcription factor binding sites (TFBSs). Furthermore, a web tool was developed that permits a user-defined search for co-localizing cis-regulatory elements. The user can specify individually the level of conservation for each TFBS and a spacer range between them. This web tool was employed for the identification of co-localizing sites of known interacting TFs and TFs containing two DNA-binding domains. More than 1.8 × 10^5 combinatorial elements were annotated in the AthaMap database. These elements can also be used to identify more complex co-localizing elements consisting of up to four TFBSs. The AthaMap database and the connected web tools are a valuable resource for the analysis and the prediction of gene expression regulation at .
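The co-localization search described above can be illustrated with a minimal sketch. It is not AthaMap's implementation: the function name `colocalizing_pairs`, the coordinate convention (0-based, end-exclusive) and the example coordinates are assumptions made for illustration only, and a per-TFBS conservation filter would be applied to each site list before pairing.

```python
def colocalizing_pairs(sites_a, sites_b, min_spacer, max_spacer):
    """Pair sites of factor A with sites of factor B separated by an allowed spacer.

    Each site is a (start, end) tuple with 0-based, end-exclusive coordinates
    (an assumed convention). The spacer is the number of bases between the two
    sites; overlapping sites are skipped.
    """
    pairs = []
    for a_start, a_end in sites_a:
        for b_start, b_end in sites_b:
            if b_start >= a_end:        # B lies downstream of A
                spacer = b_start - a_end
            elif a_start >= b_end:      # B lies upstream of A
                spacer = a_start - b_end
            else:                       # the two sites overlap
                continue
            if min_spacer <= spacer <= max_spacer:
                pairs.append(((a_start, a_end), (b_start, b_end), spacer))
    return pairs


# Hypothetical site coordinates for two factors on one chromosome
example_a = [(100, 108), (500, 508)]
example_b = [(112, 120), (300, 306), (490, 498)]
example_pairs = colocalizing_pairs(example_a, example_b, min_spacer=0, max_spacer=10)
```

The nested loop is quadratic in the number of sites; for genome-scale site lists a sorted sweep or an interval index would likely be preferable.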
The AthaMap database generates a map of potential transcription factor binding sites (TFBS) and small RNA target sites in the Arabidopsis thaliana genome. The database contains sites for 115 different transcription factors (TFs). TFBS were identified with positional weight matrices (PWMs) or with single binding sites. With the new web tool ‘Gene Identification’, it is possible to identify potential target genes for selected TFs. For these analyses, the user can define a region of interest of up to 6000 bp in all annotated genes. For TFBS determined with PWMs, the search can be restricted to high-quality TFBS. The results are displayed in tables that identify the gene, position of the TFBS and, if applicable, individual score of the TFBS. In addition, data files can be downloaded that harbour positional information of TFBS of all TFs in a region between −2000 and +2000 bp relative to the transcription or translation start site. Also, data content of AthaMap was increased and the database was updated to the TAIR8 genome release.
Database URL: http://www.athamap.de/gene_ident.php
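Restricting binding sites to a region around a gene start, as in the −2000 to +2000 bp data files mentioned above, can be sketched as follows. This is an illustrative sketch rather than the AthaMap code: the function name `sites_in_region`, the single-coordinate representation of a site and the example positions are assumptions.

```python
def sites_in_region(tfbs_positions, start_site, strand, upstream=2000, downstream=2000):
    """Return (position, relative_position) for sites inside the window.

    Relative positions are negative upstream of the start site and positive
    downstream of it, taking the gene's strand into account.
    """
    hits = []
    for pos in tfbs_positions:
        # On the minus strand, larger genomic coordinates lie upstream of the gene
        rel = pos - start_site if strand == "+" else start_site - pos
        if -upstream <= rel <= downstream:
            hits.append((pos, rel))
    return hits


# Hypothetical positions around a start site at coordinate 10000
plus_hits = sites_in_region([7900, 8500, 10000, 12001], 10000, "+")
minus_hits = sites_in_region([11500, 7900], 10000, "-")
```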
Gene expression is controlled mainly by the binding of transcription factors to regulatory sequences. To generate a genomic map for regulatory sequences, the Arabidopsis thaliana genome was screened for putative transcription factor binding sites. Using publicly available data from the TRANSFAC database and from publications, alignment matrices for 23 transcription factors of 13 different factor families were used with the pattern search program Patser to determine the genomic positions of more than 2.4 × 10^6 putative binding sites. Due to the dense clustering of genes and the observation that regulatory sequences are not restricted to upstream regions, the prediction of binding sites was performed for the whole genome. The genomic positions and the underlying data were imported into the newly developed AthaMap database. These data can be accessed by positional information or the Arabidopsis Genome Initiative identification number. Putative binding sites are displayed in the defined region. Data on the matrices used and on the thresholds applied in these screens are given in the database. Considering the high density of sites it will be a valuable resource for generating models on gene expression regulation. The data are available at http://www.athamap.de.
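The general idea of screening a sequence with an alignment matrix and a score threshold can be sketched as below. This is a generic log-odds scan, not a reimplementation of Patser: the uniform background, the pseudocount, the function names and the toy counts are all assumptions, and only the forward strand is scanned, whereas a real screen would also score the reverse complement.

```python
import math


def build_log_odds(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every forward-strand window at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical alignment counts for a 4-bp motif resembling ACGT
toy_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
toy_matrix = build_log_odds(toy_counts)
toy_hits = scan("TTACGTGGACGTAA", toy_matrix, threshold=4.0)
```

The threshold controls the trade-off between sensitivity and the number of reported sites, which is why the thresholds applied in a screen need to be documented alongside the matrices.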
The number of online databases and web-tools for gene expression analysis in Arabidopsis thaliana has increased tremendously in recent years. These resources permit the database-assisted identification of putative cis-regulatory DNA sequences, their binding proteins, and the determination of common cis-regulatory motifs in coregulated genes. DNA binding proteins may be predicted by the type of cis-regulatory motif. Further questions of combinatorial control based on the interaction of DNA binding proteins and the colocalization of cis-regulatory motifs can be addressed. The database-assisted spatial and temporal expression analysis of DNA binding proteins and their target genes may help to further refine experimental approaches. Signal transduction pathways upstream of regulated genes are not yet fully accessible in databases mainly because they need to be manually annotated. This review focuses on the use of the AthaMap and PathoPlant® databases for gene expression regulation analysis and discusses similar and complementary online databases and web-tools. Online databases are helpful for the development of working hypotheses and for designing subsequent experiments.
Bioinformatics; databases; gene expression; plants; transcription; web-server.
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.
The binding of transcription factors to DNA plays an essential role in the regulation of gene expression. Numerous experiments elucidated binding sequences which subsequently have been used to derive statistical models for predicting potential transcription factor binding sites (TFBS). The rapidly increasing number of genome sequence data requires sophisticated computational approaches to manage and query experimental and predicted TFBS data in the context of other epigenetic factors and across different organisms.
We have developed D-Light, a novel client-server software package to store and query large amounts of TFBS data for any number of genomes. Users can add small-scale data to the server database and query them in a large scale, genome-wide promoter context. The client is implemented in Java and provides simple graphical user interfaces and data visualization. Here we also performed a statistical analysis showing what a user can expect for certain parameter settings and we illustrate the usage of D-Light with the help of a microarray data set.
D-Light is an easy to use software tool to integrate, store and query annotation data for promoters. A public D-Light server, the client and server software for local installation and the source code under GNU GPL license are available at http://biwww.che.sbg.ac.at/dlight.
COTRASIF is a web-based tool for the genome-wide search of evolutionary conserved regulatory regions (transcription factor-binding sites, TFBS) in eukaryotic gene promoters. Predictions are made using either a position-weight matrix search method, or a hidden Markov model search method, depending on the availability of the matrix and actual sequences of the target TFBS. COTRASIF is a fully integrated solution incorporating both a gene promoter database (based on the regular Ensembl genome annotation releases) and both JASPAR and TRANSFAC databases of TFBS matrices. To decrease the false-positives rate an integrated evolutionary conservation filter is available, which allows the selection of only those of the predicted TFBS that are present in the promoters of the related species’ orthologous genes. COTRASIF is very easy to use, implements a regularly updated database of promoters and is a powerful solution for genome-wide TFBS searching. COTRASIF is freely available at http://biomed.org.ua/COTRASIF/.
In the past several years, there has been a tremendous effort to construct physical maps and to sequence the genome of Arabidopsis thaliana. As a result, four of the five chromosomes are completely covered by overlapping clones except at the centromeric and nucleolus organizer regions (NOR). In addition, over 30% of the genome has been sequenced and completion is anticipated by the end of the year 2000. Despite these accomplishments, the physical maps are provided in many formats on laboratories' Web sites. These data are thus difficult to obtain in a coherent manner for researchers. To alleviate this problem, AtDB (Arabidopsis thaliana DataBase, URL: http://genome-www.stanford.edu/Arabidopsis/) has constructed a unified display of the physical maps where all publicly available physical-map data for all chromosomes are presented through the Web in a clickable, 'on-the-fly' graphic, created by CGI programs that directly consult our relational database.
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is the model organism database for the fully sequenced and intensively studied model plant Arabidopsis thaliana. Data in TAIR are derived in large part from manual curation of the Arabidopsis research literature and direct submissions from the research community. New developments at TAIR include the addition of the GBrowse genome viewer to the TAIR site, a redesigned home page, navigation structure and portal pages to make the site more intuitive and easier to use, the launch of several TAIR web services and a new genome annotation release (TAIR7) in April 2007. A combination of manual and computational methods was used to generate this release, which contains 27 029 protein-coding genes, 3889 pseudogenes or transposable elements and 1123 ncRNAs (32 041 genes in all, 37 019 gene models). A total of 681 new genes and 1002 new splice variants were added. Overall, 10 098 loci (one-third of all loci from the previous TAIR6 release) were updated for the TAIR7 release.
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions.
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI.
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
MetNet (http://www.botany.iastate.edu/∼mash/metnetex/metabolicnetex.html) is publicly available software in development for analysis of genome-wide RNA, protein and metabolite profiling data. The software is designed to enable the biologist to visualize, statistically analyse and model a metabolic and regulatory network map of Arabidopsis, combined with gene expression profiling data. It contains a JAVA interface to an interactions database (MetNetDB) containing information on regulatory and metabolic interactions derived from a combination of web databases (TAIR, KEGG, BRENDA) and input from biologists in their area of expertise. FCModeler captures input from MetNetDB in a graphical form. Sub-networks can be identified and interpreted using simple fuzzy cognitive maps. FCModeler is intended to develop and evaluate hypotheses, and provide a modelling framework for assessing the large amounts of data captured by high-throughput gene expression experiments. FCModeler and MetNetDB are currently being extended to three-dimensional virtual reality display. The MetNet map, together with gene expression data, can be viewed using multivariate graphics tools in GGobi linked with the data analytic tools in R. Users can highlight different parts of the metabolic network and see the relevant expression data highlighted in other data plots. Multi-dimensional expression data can be rotated through different dimensions. Statistical analysis can be computed alongside the visual. MetNet is designed to provide a framework for the formulation of testable hypotheses regarding the function of specific genes, and in the long term provide the basis for identification of metabolic and regulatory networks that control plant composition and development.
Motivation: In functional genomics, it is frequently useful to correlate expression levels of genes to identify transcription factor binding sites (TFBS) via the presence of common sequence motifs. The underlying assumption is that co-expressed genes are more likely to contain shared TFBS and, thus, TFBS can be identified computationally. Indeed, gene pairs with a very high expression correlation show a significant excess of shared binding sites in yeast. We have tested this assumption in a more complex organism, Drosophila melanogaster, by using experimentally determined TFBS and microarray expression data. We have also examined the reverse relationship between the expression correlation and the extent of TFBS sharing.
Results: Pairs of genes with shared TFBS show, on average, a higher degree of co-expression than those with no common TFBS in Drosophila. However, the reverse does not hold true: gene pairs with high expression correlations do not share significantly larger numbers of TFBS. An exception to this observation exists when comparing expression of genes from the earliest stages of embryonic development. Interestingly, semantic similarity between gene annotations (Biological Process) is much better associated with TFBS sharing than is the expression correlation. We discuss these results in light of reverse engineering approaches to computationally predict regulatory sequences by using comparative genomics.
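The comparison between gene pairs that share TFBS and those that do not can be sketched in a few lines. This is an illustration of the kind of analysis described, not the authors' pipeline: the function names, the use of Pearson correlation, the "at least one shared TFBS" criterion and the toy data are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_correlation_by_sharing(expression, tfbs):
    """Average expression correlation for gene pairs with and without a shared TFBS."""
    shared, unshared = [], []
    for gene1, gene2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene1], expression[gene2])
        if tfbs[gene1] & tfbs[gene2]:   # at least one TFBS in common
            shared.append(r)
        else:
            unshared.append(r)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(shared), mean(unshared)


# Hypothetical expression profiles and TFBS annotations
toy_expression = {"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8], "geneC": [4, 3, 2, 1]}
toy_tfbs = {"geneA": {"TF1", "TF2"}, "geneB": {"TF1"}, "geneC": {"TF3"}}
mean_shared, mean_unshared = mean_correlation_by_sharing(toy_expression, toy_tfbs)
```

Testing the reverse relationship would instead bin gene pairs by expression correlation and compare the number of shared TFBS per bin.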
DNA methylation can regulate gene expression by modulating the interaction between DNA and proteins or protein complexes. Conserved consensus motifs exist across the human genome ("predicted transcription factor binding sites": "predicted TFBS") but the large majority of these are proven by chromatin immunoprecipitation and high throughput sequencing (ChIP-seq) not to be biological transcription factor binding sites ("empirical TFBS"). We hypothesize that DNA methylation at conserved consensus motifs prevents promiscuous or disorderly transcription factor binding.
Using genome-wide methylation maps of the human heart and sperm, we found that all conserved consensus motifs as well as the subset of those that reside outside CpG islands have an aggregate profile of hyper-methylation. In contrast, empirical TFBS with conserved consensus motifs have a profile of hypo-methylation. 40% of empirical TFBS with conserved consensus motifs resided in CpG islands whereas only 7% of all conserved consensus motifs were in CpG islands. Finally, we identified a minority subset of TFs whose profiles are either hypo-methylated or neutral at their respective conserved consensus motifs, implying that these TFs may be responsible for establishing or maintaining an un-methylated DNA state, or that their binding is not regulated by DNA methylation.
Our analysis supports the hypothesis that at least for a subset of TF, empirical binding to conserved consensus motifs genome-wide may be controlled by DNA methylation.
Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial.
Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods.
Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.
Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines.
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding.
Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation.
The Genome Database (GDB, http://www.gdb.org) is a public repository of data on human genes, clones, STSs, polymorphisms and maps. GDB entries are highly cross-linked to each other, to literature citations and to entries in other databases, including the sequence databases, OMIM, and the Mouse Genome Database. Mapping data from large genome centers and smaller mapping efforts are added to GDB on an ongoing basis. The database can be searched by a variety of methods, ranging from keyword searches to complex queries. Major functionality extensions in the last year include the ongoing computation of integrated human genome maps, called Comprehensive Maps, and the use of those maps to support positional queries and graphic displays. The capabilities of the GDB map viewer (Mapview) have been extended to include map printing and the graphical display of ad hoc query results. The HUGO Nomenclature Committee continues to curate the proposed and official gene symbols and related data in collaboration with GDB. As genome research shifts its emphasis from mapping to sequencing and functional analysis, the scope of the GDB schema is being extended. We are in the process of adding representations of gene function and expression, and improving our representation of human polymorphism and mutation.
The number of transcription factor binding sites (TFBS) in an organism’s genome positively correlates with the complexity of the regulatory network of the organism. However, the manner by which TFBS arise and accumulate in genomes and the effects of regulatory network complexity on the organism’s fitness are far from being known. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly from an evolutionary perspective.
We analyzed TFBS data from five model organisms – E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, A. thaliana – and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism’s genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population genetic simulations. The results support our hypothesis, showing neutral evolutionary forces alone can explain TFBS patterns, and that selection on the regulatory network function does not alter this finding.
The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as complexity on both the binding site and pathway level, which has significant implications on many directions in microbiology, genetics, and synthetic biology.
This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information.
We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se. Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA.
We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.
Chromatin immunoprecipitation (ChIP) coupled with high-throughput techniques (ChIP-X), such as next generation sequencing (ChIP-Seq) and microarray (ChIP–chip), has been successfully used to map active transcription factor binding sites (TFBS) of a transcription factor (TF). The targeted genes can be activated or suppressed by the TF, or are unresponsive to the TF. Microarray technology has been used to measure the actual expression changes of thousands of genes under the perturbation of a TF, but is unable to determine if the affected genes are direct or indirect targets of the TF. Furthermore, both ChIP-X and microarray methods produce a large number of false positives. Combining microarray expression profiling and ChIP-X data allows more effective TFBS analysis for studying the function of a TF. However, current web servers only provide tools to analyze either ChIP-X or expression data, but not both. Here, we present ChIP-Array, a web server that integrates ChIP-X and expression data from human, mouse, yeast, fruit fly and Arabidopsis. This server will assist biologists to detect direct and indirect target genes regulated by a TF of interest and to aid in the functional characterization of the TF. ChIP-Array is available at http://jjwanglab.hku.hk/ChIP-Array, with free access to academic users.
S-adenosyl-l-methionine-dependent rRNA dimethylases mediate the methylation of two conserved adenosines near the 3′ end of the rRNA in the small ribosomal subunits of bacteria, archaea and eukaryotes. Proteins related to this family of dimethylases play an essential role as transcription factors (mtTFBs) in fungal and animal mitochondria. Human mitochondrial rRNA is methylated and human mitochondria contain two related mtTFBs, one proposed to act as rRNA dimethylase, the other as transcription factor. The nuclear genome of Arabidopsis thaliana encodes three dimethylase/mtTFB-like proteins, one of which, Dim1B, is shown here to be imported into mitochondria. Transcription initiation by mitochondrial RNA polymerases appears not to be stimulated by Dim1B in vitro. In line with this finding, phylogenetic analyses revealed Dim1B to be more closely related to a group of eukaryotic non-mitochondrial rRNA dimethylases (Dim1s) than to fungal and animal mtTFBs. We found that Dim1B was capable of substituting for the E. coli rRNA dimethylase activity of KsgA. Moreover, we observed methylation of the conserved adenines in the 18S rRNA of Arabidopsis mitochondria; this modification was not detectable in a mutant lacking Dim1B. These data provide evidence: (i) for rRNA methylation in Arabidopsis mitochondria; and (ii) that Dim1B is the enzyme catalyzing this process.
rRNA dimethyltransferases; mitochondria; Arabidopsis; mitochondrial transcription; molecular phylogeny
In the post-genomic era, the study of transcriptional regulation is pivotal to decoding genetic information. Transcription factors (TFs) are central proteins for transcriptional regulation, and interactions between TFs and their DNA targets (TFBSs) are important for the expression of downstream genes. However, the lack of knowledge about interactions between TFs and TFBSs still hampers investigation of the mechanism of transcription.
To expand the knowledge about interactions between TFs and TFBSs, three biological features (sequence feature, structure feature, and evolution feature) were utilized to build TFBS identification models for studying binding preference between TFs and their DNA targets in mammals. Results show that each feature performs fairly well at capturing TFBSs, and that the hybrid model combining all three features is more robust for TFBS identification. Subsequently, correspondence between TFs and their TFBSs was investigated to explore interactions among them in mammals. Results indicate that TFs and TFBSs are reciprocal at the sequence, structure, and evolution levels.
Our work demonstrates that, to some extent, TFs and TFBSs have developed a coevolutionary relationship in order to keep their physical binding and maintain their regulatory functions. In summary, our work will help to understand transcriptional regulation and to interpret the binding mechanisms between proteins and DNA.
The genome of the hyperthermophile archaeon Pyrococcus furiosus encodes two transcription factor B (TFB) paralogs, one of which (TFB1) was previously characterized in transcription initiation. The second TFB (TFB2) is unusual in that it lacks recognizable homology to the archaeal TFB/eukaryotic TFIIB B-finger motif. TFB2 functions poorly in promoter-dependent transcription initiation, but photochemical cross-linking experiments indicated that the orientation and occupancy of transcription complexes formed with TFB2 at the strong gdh promoter are similar to the orientation and occupancy of transcription complexes formed with TFB1. Initiation complexes formed by TFB2 display a promoter opening defect that can be bypassed with a preformed transcription bubble, suggesting a mechanism to explain the low TFB2 transcription activity. Domain swaps between TFB1 and TFB2 showed that the low activity of TFB2 is determined mainly by its N terminus. The low activity of TFB2 in promoter opening and transcription can be partially relieved by transcription factor E (TFE). The results indicate that the TFB N-terminal region, containing conserved Zn ribbon and B-finger motifs, is important in promoter opening and that TFE can compensate for defects in the N terminus through enhancement of promoter opening.
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide [1,2] has successfully identified specific subtypes of regulatory elements [3]. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements [4], chromatin states [5], transcription factor binding sites (TFBS) [6–9], PolII regulation [8], and insulator elements [10]; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.