The AthaMap database generates a map of predicted transcription factor binding sites (TFBS) for the whole Arabidopsis thaliana genome. AthaMap has now been extended to include data on post-transcriptional regulation. A total of 403 173 genomic positions of small RNAs have been mapped in the A. thaliana genome. These identify 5772 putative post-transcriptionally regulated target genes. AthaMap tools have been modified to improve the identification of common TFBS in co-regulated genes by subtracting post-transcriptionally regulated genes from such analyses. Furthermore, AthaMap was updated to the TAIR7 genome annotation, a graphic display of gene analysis results was implemented, and the TFBS data content was increased. AthaMap is freely available at http://www.athamap.de/.
The AthaMap database generates a map of cis-regulatory elements for the whole Arabidopsis thaliana genome. This database has been extended by new tools to identify common cis-regulatory elements in specific regions of user-provided gene sets. A resulting table displays all cis-regulatory elements annotated in AthaMap including positional information relative to the respective gene. Further tables show overviews with the number of individual transcription factor binding sites (TFBS) present and TFBS common to the whole set of genes. Over represented cis-elements are easily identified. These features were used to detect specific enrichment of drought-responsive elements in cold-induced genes. For identification of co-regulated genes, the output table of the colocalization function was extended to show the closest genes and their relative distances to the colocalizing TFBS. Gene sets determined by this function can be used for a co-regulation analysis in microarray gene expression databases such as Genevestigator or PathoPlant. Additional improvements of AthaMap include display of the gene structure in the sequence window and a significant data increase. AthaMap is freely available at .
The AthaMap database generates a map of potential transcription factor binding sites (TFBS) and small RNA target sites in the Arabidopsis thaliana genome. The database contains sites for 115 different transcription factors (TFs). TFBS were identified with positional weight matrices (PWMs) or with single binding sites. With the new web tool ‘Gene Identification’, it is possible to identify potential target genes for selected TFs. For these analyses, the user can define a region of interest of up to 6000 bp in all annotated genes. For TFBS determined with PWMs, the search can be restricted to high-quality TFBS. The results are displayed in tables that identify the gene, position of the TFBS and, if applicable, individual score of the TFBS. In addition, data files can be downloaded that harbour positional information of TFBS of all TFs in a region between −2000 and +2000 bp relative to the transcription or translation start site. Also, data content of AthaMap was increased and the database was updated to the TAIR8 genome release.
Database URL: http://www.athamap.de/gene_ident.php
The AthaMap database generates a genome-wide map for putative transcription factor binding sites for A. thaliana. When analyzing transcriptional regulation using AthaMap it may be important to learn which genes are also post-transcriptionally regulated by inhibitory RNAs. Therefore, a unified database for transcriptional and post-transcriptional regulation will be highly useful for the analysis of gene expression regulation.
To identify putative microRNA target sites in the genome of A. thaliana, processed mature miRNAs from 243 annotated miRNA genes were used for screening with the psRNATarget web server. Positional information, target genes and the psRNATarget score for each target site were annotated to the AthaMap database. Furthermore, putative target sites for small RNAs from seven small RNA transcriptome datasets were used to determine small RNA target sites within the A. thaliana genome.
Putative 41,965 genome wide miRNA target sites and 10,442 miRNA target genes were identified in the A. thaliana genome. Taken together with genes targeted by small RNAs from small RNA transcriptome datasets, a total of 16,600 A. thaliana genes are putatively regulated by inhibitory RNAs. A novel web-tool, ‘MicroRNA Targets’, was integrated into AthaMap which permits the identification of genes predicted to be regulated by selected miRNAs. The predicted target genes are displayed with positional information and the psRNATarget score of the target site. Furthermore, putative target sites of small RNAs from selected tissue datasets can be identified with the new ‘Small RNA Targets’ web-tool.
The integration of predicted miRNA and small RNA target sites with transcription factor binding sites will be useful for AthaMap-assisted gene expression analysis. URL: http://www.athamap.de/
Arabidopsis thaliana; AthaMap; MicroRNAs; Small RNAs; Post-transcriptional regulation
Gene expression is controlled mainly by the binding of transcription factors to regulatory sequences. To generate a genomic map for regulatory sequences, the Arabidopsis thaliana genome was screened for putative transcription factor binding sites. Using publicly available data from the TRANSFAC database and from publications, alignment matrices for 23 transcription factors of 13 different factor families were used with the pattern search program Patser to determine the genomic positions of more than 2.4 × 106 putative binding sites. Due to the dense clustering of genes and the observation that regulatory sequences are not restricted to upstream regions, the prediction of binding sites was performed for the whole genome. The genomic positions and the underlying data were imported into the newly developed AthaMap database. This data can be accessed by positional information or the Arabidopsis Genome Initiative identification number. Putative binding sites are displayed in the defined region. Data on the matrices used and on the thresholds applied in these screens are given in the database. Considering the high density of sites it will be a valuable resource for generating models on gene expression regulation. The data are available at http://www.athamap.de.
The number of online databases and web-tools for gene expression analysis in Arabidopsis thaliana has increased tremendously during the last years. These resources permit the database-assisted identification of putative cis-regulatory DNA sequences, their binding proteins, and the determination of common cis-regulatory motifs in coregulated genes. DNA binding proteins may be predicted by the type of cis-regulatory motif. Further questions of combinatorial control based on the interaction of DNA binding proteins and the colocalization of cis-regulatory motifs can be addressed. The database-assisted spatial and temporal expression analysis of DNA binding proteins and their target genes may help to further refine experimental approaches. Signal transduction pathways upstream of regulated genes are not yet fully accessible in databases mainly because they need to be manually annotated. This review focuses on the use of the AthaMap and PathoPlant® databases for gene expression regulation analysis and discusses similar and complementary online databases and web-tools. Online databases are helpful for the development of working hypothesis and for designing subsequent experiments.
Bioinformatics; databases; gene expression; plants; transcription; web-server.
The availability of a draft human genome sequence and ability to monitor the transcription of thousands of genes with DNA microarrays has necessitated the need for new computational tools that can analyze cis-regulatory elements controlling genes that display similar expression patterns. We have developed a tool designated EZ-Retrieve that can: (i) retrieve any particular region of human genome sequence from the NCBI database and (ii) analyze retrieved sequences for putative transcription factor-binding sites (TFBSs) as they appear on the TRANSFAC database. The tool is web-based, user-friendly and offers both batch sequence retrieval and batch TFBS prediction. A major application of EZ-Retrieve is the analysis of co-expressed genes that are highlighted as expression clusters in DNA microarray experiments.
Synonymous gene regulation, defined by regulatory elements driving shared temporal and/or spatial aspects of gene expression, is most probably predicated on genomic elements that contain similar modules of certain transcription factor binding sites (TFBS). We have developed a method to scan vertebrate genomes for evolutionary conserved modules of TFBS in a predefined configuration, and created a tool, named SynoR that identifies synonymous regulatory elements (SREs) in vertebrate genomes. SynoR performs de novo identification of SREs utilizing known patterns of TFBS in active regulatory elements (REs) as seeds for genome scans. Layers of multiple-species conservation allow the use of differential phylogenetic sequence conservation filters in search of SREs and the results are displayed such as to provide an extensive annotation of the genes containing the detected REs. Gene Ontology categories are utilized to further functionally classify the identified genes, and integrated GNF Expression Atlas 2 data allow the cataloging of tissue-specificities of the predicted SREs. SynoR is publicly available at .
A new strategy is proposed for identifying synergistic transcription factors by function conservation, leading to the identification of 51 homotypic transcription-factor combinations.
Previous methods employed for the identification of synergistic transcription factors (TFs) are based on either TF enrichment from co-regulated genes or phylogenetic footprinting. Despite the success of these methods, both have limitations.
We propose a new strategy to identify synergistic TFs by function conservation. Rather than aligning the regulatory sequences from orthologous genes and then identifying conserved TF binding sites (TFBSs) in the alignment, we developed computational approaches to implement the novel strategy. These methods include combinatorial TFBS enrichment utilizing distance constraints followed by enrichment of overlapping orthologous genes from human and mouse, whose regulatory sequences contain the enriched TFBS combinations. Subsequently, integration of function conservation from both TFBS and overlapping orthologous genes was achieved by correlation analyses. These techniques have been used for genome-wide promoter analyses, which have led to the identification of 51 homotypic TF combinations; the validity of these approaches has been exemplified by both known TF-TF interactions and function coherence analyses. We further provide computational evidence that our novel methods were able to identify synergistic TFs to a much greater extent than phylogenetic footprinting.
Function conservation based on the concordance of combinatorial TFBS enrichment along with enrichment of overlapping orthologous genes has been proven to be a successful means for the identification of synergistic TFs. This approach avoids the limitations of phylogenetic footprinting as it does not depend upon sequence alignment. It utilizes existing gene annotation data, such as those available in GO, thus providing an alternative method for functional TF discovery and annotation.
Construction of transcriptional regulatory networks (TRNs) is of priority concern in systems biology. Numerous high-throughput approaches, including microarray and next-generation sequencing, are extensively adopted to examine transcriptional expression patterns on the whole-genome scale; those data are helpful in reconstructing TRNs. Identifying transcription factor binding sites (TFBSs) in a gene promoter is the initial step in elucidating the transcriptional regulation mechanism. Since transcription factors usually co-regulate a common group of genes by forming regulatory modules with similar TFBSs. Therefore, the combinatorial interactions of transcription factors must be modeled to reconstruct the gene regulatory networks.
Description For systems biology applications, this work develops a novel database called Arabidopsis thaliana Promoter Analysis Net (AtPAN), capable of detecting TFBSs and their corresponding transcription factors (TFs) in a promoter or a set of promoters in Arabidopsis. For further analysis, according to the microarray expression data and literature, the co-expressed TFs and their target genes can be retrieved from AtPAN. Additionally, proteins interacting with the co-expressed TFs are also incorporated to reconstruct co-expressed TRNs. Moreover, combinatorial TFs can be detected by the frequency of TFBSs co-occurrence in a group of gene promoters. In addition, TFBSs in the conserved regions between the two input sequences or homologous genes in Arabidopsis and rice are also provided in AtPAN. The output results also suggest conducting wet experiments in the future.
The AtPAN, which has a user-friendly input/output interface and provide graphical view of the TRNs. This novel and creative resource is freely available online at http://AtPAN.itps.ncku.edu.tw/.
A strategy combining classical motif overrepresentation in co-regulated genes with comparative footprinting is applied to identify 80 transcription factor binding sites and 139 regulatory modules in Arabidopsis thaliana.
Transcriptional regulation plays an important role in the control of many biological processes. Transcription factor binding sites (TFBSs) are the functional elements that determine transcriptional activity and are organized into separable cis-regulatory modules, each defining the cooperation of several transcription factors required for a specific spatio-temporal expression pattern. Consequently, the discovery of novel TFBSs in promoter sequences is an important step to improve our understanding of gene regulation.
Here, we applied a detection strategy that combines features of classic motif overrepresentation approaches in co-regulated genes with general comparative footprinting principles for the identification of biologically relevant regulatory elements and modules in Arabidopsis thaliana, a model system for plant biology. In total, we identified 80 TFBSs and 139 regulatory modules, most of which are novel, and primarily consist of two or three regulatory elements that could be linked to different important biological processes, such as protein biosynthesis, cell cycle control, photosynthesis and embryonic development. Moreover, studying the physical properties of some specific regulatory modules revealed that Arabidopsis promoters have a compact nature, with cooperative TFBSs located in close proximity of each other.
These results create a starting point to unravel regulatory networks in plants and to study the regulation of biological processes from a systems biology point of view.
Detecting conserved noncoding sequences (CNSs) across species highlights the functional elements. Alignment procedures combined with computational prediction of transcription factor binding sites (TFBSs) can narrow down key regulatory elements. Repeat masking processes are often performed before alignment to mask insertion sequences such as transposable elements (TEs). However, recently such TEs have been reported to influence the gene regulatory network evolution. Therefore, an alignment approach that is robust to TE insertions is meaningful for finding novel conserved TFBSs in TEs.
We constructed a web server 'ReAlignerV' for complex alignment of genomic sequences. ReAlignerV returns ladder-like schematic alignments that integrate predicted TFBSs and the location of TEs. It also provides pair-wise alignments in which the predicted TFBS sites and their names are shown alongside each sequence. Furthermore, we evaluated false positive aligned sites by focusing on the species-specific TEs (SSTEs), and found that ReAlignerV has a higher specificity and robustness to insertions for sequences having more than 20% TE content, compared to LAGAN, AVID, MAVID and BLASTZ.
ReAlignerV can be applied successfully to TE-insertion-rich sequences without prior repeat masking, and this increases the chances of finding regulatory sequences hidden in TEs, which are important sources of the regulatory network evolution. ReAlignerV can be accessed through and downloaded from .
The computational analysis of regulatory SNPs (rSNPs) is an essential step in the elucidation of the structure and function of regulatory networks at the cellular level. In this work we focus in particular on SNPs that potentially affect a Transcription Factor Binding Site (TFBS) to a significant extent, possibly resulting in changes to gene expression patterns or alternative splicing. The application described here is based on the MAPPER platform, a previously developed web-based system for the computational detection of TFBSs in DNA sequences.
rSNP-MAPPER is a computational tool that analyzes SNPs lying within predicted TFBSs and determines whether the allele substitution results in a significant change in the TFBS predictive score. The application's simple and intuitive interface supports several usage modes. For example, the user may search for potential rSNPs in the promoters of one or more genes, specified as a list of identifiers or chosen among the members of a pathway. Alternatively, the user may specify a set of SNPs to be analyzed by uploading a list of SNP identifiers or providing the coordinates of a genomic region. Finally, the user can provide two alternative sequences (wildtype and mutant), and the system will determine the location of variants to be analyzed by comparing them.
In this paper we outline the architecture of rSNP-MAPPER, describing its intuitive and powerful user interface in detail. We then present several examples of the use of rSNP-MAPPER to reproduce and confirm experimental studies aimed at identifying regulatory SNPs in human genes, that show how rSNP-MAPPER is able to detect and characterize rSNPs with high accuracy. Results are richly annotated and can be displayed online or downloaded in a number of different formats.
rSNP-MAPPER is optimized for large scale work, allowing for the efficient annotation of thousands of SNPs, and is designed to assist in the genome-wide investigation of transcriptional regulatory networks, prioritizing potential rSNPs for subsequent experimental validation. rSNP-MAPPER is freely available at http://genome.ufl.edu/mapper/.
PromAn is a modular web-based tool dedicated to promoter analysis that integrates distinct complementary databases, methods and programs. PromAn provides automatic analysis of a genomic region with minimal prior knowledge of the genomic sequence. Prediction programs and experimental databases are combined to locate the transcription start site (TSS) and the promoter region within a large genomic input sequence. Transcription factor binding sites (TFBSs) can be predicted using several public databases and user-defined motifs. Also, a phylogenetic footprinting strategy, combining multiple alignment of large genomic sequences and assignment of various scores reflecting the evolutionary selection pressure, allows for evaluation and ranking of TFBS predictions. PromAn results can be displayed in an interactive graphical user interface, PromAnGUI. It integrates all of this information to highlight active promoter regions, to identify among the huge number of TFBS predictions those which are the most likely to be potentially functional and to facilitate user refined analysis. Such an integrative approach is essential in the face of a growing number of tools dedicated to promoter analysis in order to propose hypotheses to direct further experimental validations. PromAn is publicly available at .
The gene regulatory information is hardwired in the promoter regions formed by cis-regulatory elements that bind specific transcription factors (TFs). Hence, establishing the architecture of plant promoters is fundamental to understanding gene expression. The determination of the regulatory circuits controlled by each TF and the identification of the cis-regulatory sequences for all genes have been identified as two of the goals of the Multinational Coordinated Arabidopsis thaliana Functional Genomics Project by the Multinational Arabidopsis Steering Committee (June 2002).
AGRIS is an information resource of Arabidopsis promoter sequences, transcription factors and their target genes. AGRIS currently contains two databases, AtTFDB (Arabidopsis thaliana transcription factor database) and AtcisDB (Arabidopsis thaliana cis-regulatory database). AtTFDB contains information on approximately 1,400 transcription factors identified through motif searches and grouped into 34 families. AtTFDB links the sequence of the transcription factors with available mutants and, when known, with the possible genes they may regulate. AtcisDB consists of the 5' regulatory sequences of all 29,388 annotated genes with a description of the corresponding cis-regulatory elements. Users can search the databases for (i) promoter sequences, (ii) a transcription factor, (iii) a direct target genes for a specific transcription factor, or (vi) a regulatory network that consists of transcription factors and their target genes.
AGRIS provides the necessary software tools on Arabidopsis transcription factors and their putative binding sites on all genes to initiate the identification of transcriptional regulatory networks in the model dicotyledoneous plant Arabidopsis thaliana. AGRIS can be accessed from .
The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu.
Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements.
Based on the information theory, we propose a model for transcription factor binding of regulatory DNA sites. Our model incorporates position interdependencies in effective ways. The model computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion to determine whether the sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. By theoretically proving and testing our model using both real and artificial data, we found that our model provides highly accurate predictive results.
In this study, we present a novel model for transcription factor binding regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Motivation: The prediction and annotation of the genomic regions involved in gene expression has been largely explored. Most of the energy has been devoted to the development of approaches that detect transcription start sites, leaving the identification of regulatory regions and their functional transcription factor binding sites (TFBSs) largely unexplored and with important quantitative and qualitative methodological gaps.
Results: We have developed ReLA (for REgulatory region Local Alignment tool), a unique tool optimized with the Smith–Waterman algorithm that allows local searches of conserved TFBS clusters and the detection of regulatory regions proximal to genes and enhancer regions. ReLA's performance shows specificities of 81 and 50% when tested on experimentally validated proximal regulatory regions and enhancers, respectively.
Availability: The source code of ReLA's is freely available and can be remotely used through our web server under http://www.bsc.es/cg/rela.
Supplementary data are available at Bioinformatics online.
Identifying cis-regulatory elements is crucial to understanding gene expression, which highlights the importance of the computational detection of overrepresented transcription factor binding sites (TFBSs) in coexpressed or coregulated genes. However, this is a challenging problem, especially when considering higher eukaryotic organisms.
We have developed a method, named TFM-Explorer, that searches for locally overrepresented TFBSs in a set of coregulated genes, which are modeled by profiles provided by a database of position weight matrices. The novelty of the method is that it takes advantage of spatial conservation in the sequence and supports multiple species. The efficiency of the underlying algorithm and its robustness to noise allow weak regulatory signals to be detected in large heterogeneous data sets.
TFM-Explorer provides an efficient way to predict TFBS overrepresentation in related sequences. Promising results were obtained in a variety of examples in human, mouse, and rat genomes. The software is publicly available at .
Identifying and characterizing the transcription factor binding site (TFBS) patterns of cis-regulatory elements represents a challenge, but holds promise to reveal the regulatory language the genome uses to dictate transcriptional dynamics. Several studies have demonstrated that regulatory modules are under positive selection and, therefore, are often conserved between related species. Using this evolutionary principle, we have created a comparative tool, rVISTA, for analyzing the regulatory potential of noncoding sequences. Our ability to experimentally identify functional noncoding sequences is extremely limited, therefore, rVISTA attempts to fill this great gap in genomic analysis by offering a powerful approach for eliminating TFBSs least likely to be biologically relevant. The rVISTA tool combines TFBS predictions, sequence comparisons and cluster analysis to identify noncoding DNA regions that are evolutionarily conserved and present in a specific configuration within genomic sequences. Here, we present the newly developed version 2.0 of the rVISTA tool, which can process alignments generated by both the zPicture and blastz alignment programs or use pre-computed pairwise alignments of several vertebrate genomes available from the ECR Browser and GALA database. The rVISTA web server is closely interconnected with the TRANSFAC database, allowing users to either search for matrices present in the TRANSFAC library collection or search for user-defined consensus sequences. The rVISTA tool is publicly available at http://rvista.dcode.org/.
The elucidation of transcriptional regulation in plant genes is important area of research for plant scientists, following the mapping of various plant genomes, such as A. thaliana, O. sativa and Z. mays. A variety of bioinformatic servers or databases of plant promoters have been established, although most have been focused only on annotating transcription factor binding sites in a single gene and have neglected some important regulatory elements (tandem repeats and CpG/CpNpG islands) in promoter regions. Additionally, the combinatorial interaction of transcription factors (TFs) is important in regulating the gene group that is associated with the same expression pattern. Therefore, a tool for detecting the co-regulation of transcription factors in a group of gene promoters is required.
This study develops a database-assisted system, PlantPAN (Plant Promoter Analysis Navigator), for recognizing combinatorial cis-regulatory elements with a distance constraint in sets of plant genes. The system collects the plant transcription factor binding profiles from PLACE, TRANSFAC (public release 7.0), AGRIS, and JASPER databases and allows users to input a group of gene IDs or promoter sequences, enabling the co-occurrence of combinatorial transcription factor binding sites (TFBSs) within a defined distance (20 bp to 200 bp) to be identified. Furthermore, the new resource enables other regulatory features in a plant promoter, such as CpG/CpNpG islands and tandem repeats, to be displayed. The regulatory elements in the conserved regions of the promoters across homologous genes are detected and presented.
In addition to providing a user-friendly input/output interface, PlantPAN has numerous advantages in the analysis of a plant promoter. Several case studies have established the effectiveness of PlantPAN. This novel analytical resource is now freely available at .
COTRASIF is a web-based tool for the genome-wide search of evolutionary conserved regulatory regions (transcription factor-binding sites, TFBS) in eukaryotic gene promoters. Predictions are made using either a position-weight matrix search method, or a hidden Markov model search method, depending on the availability of the matrix and actual sequences of the target TFBS. COTRASIF is a fully integrated solution incorporating both a gene promoter database (based on the regular Ensembl genome annotation releases) and both JASPAR and TRANSFAC databases of TFBS matrices. To decrease the false-positives rate an integrated evolutionary conservation filter is available, which allows the selection of only those of the predicted TFBS that are present in the promoters of the related species’ orthologous genes. COTRASIF is very easy to use, implements a regularly updated database of promoters and is a powerful solution for genome-wide TFBS searching. COTRASIF is freely available at http://biomed.org.ua/COTRASIF/.
The comprehensive identification of functional transcription factor binding sites (TFBSs) is an important step in understanding complex transcriptional regulatory networks. This study presents a motif-based comparative approach, STAT-Finder, for identifying functional DNA binding sites of STAT3 transcription factor. STAT-Finder combines STAT-Scanner, which was designed to predict functional STAT TFBSs with improved sensitivity, and a motif-based alignment to minimize false positive prediction rates. Using two reference sets containing promoter sequences of known STAT3 target genes, STAT-Finder identified functional STAT3 TFBSs with enhanced prediction efficiency and sensitivity relative to other conventional TFBS prediction tools. In addition, STAT-Finder identified novel STAT3 target genes among a group of genes that are over-expressed in human cancer cells. The binding of STAT3 to the predicted TFBSs was also experimentally confirmed through chromatin immunoprecipitation. Our proposed method provides a systematic approach to the prediction of functional TFBSs that can be applied to other TFs.
The discovery of cis-regulatory elements is a challenging problem in bioinformatics, owing to distal locations and context-specific roles of these elements in controlling gene regulation. Here we review the current bioinformatics methodologies and resources available for systematic discovery of cis-acting regulatory elements and conserved transcription factor binding sites in the chick genome. In addition, we propose and make available, a novel workflow using computational tools that integrate CTCF analysis to predict putative insulator elements, enhancer prediction and TFBS analysis. To demonstrate the usefulness of this computational workflow, we then use it to analyze the locus of the gene Sox2 whose developmental expression is known to be controlled by a complex array of cis-acting regulatory elements. The workflow accurately predicts most of the experimentally verified elements along with some that have not yet been discovered. A web version of the CTCF tool, together with instructions for using the workflow can be accessed from http://toolshed.g2.bx.psu.edu/view/mkhan1980/ctcf_analysis. For local installation of the tool, relevant Perl scripts and instructions are provided in the directory named “code” in the supplementary materials.
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide 1,2 has successfully identified specific subtypes of regulatory elements 3. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements 4, chromatin states 5, transcription factor binding sites (TFBS) 6–9, PolII regulation 8, and insulator elements 10; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.