The computational identification of functional transcription factor binding sites (TFBSs) remains a major challenge of computational biology.
We have analyzed the conserved promoter sequences for the complete set of human RefSeq genes using our conserved transcription factor binding site (CONFAC) software. CONFAC identified 16296 human-mouse ortholog gene pairs, and of those pairs, 9107 genes contained conserved TFBS in the 3 kb proximal promoter and first intron. To attempt to predict in vivo occupancy of transcription factor binding sites, we developed a novel marginal effect isolator algorithm that builds upon Bayesian methods for multigroup TFBS filtering and predicted the in vivo occupancy of two transcription factors with an overall accuracy of 84%.
Our analyses show that integration of chromatin immunoprecipitation data with conserved TFBS analysis can be used to generate accurate predictions of functional TFBS. They also show that TFBS cooccurrence can be used to predict transcription factor binding to promoters in vivo.
Microarray analysis has been used to understand how gene regulation plays a critical role in neuronal injury, survival and repair following ischemic stroke. To identify the transcriptional regulatory elements responsible for ischemia-induced gene expression, we examined gene expression profiles of rat brains following focal ischemia and performed computational analysis of consensus transcription factor binding sites (TFBS) in the genes of the dataset. In this study, rats were sacrificed 24 h after middle cerebral artery occlusion (MCAO) stroke and gene transcription in brain tissues following ischemia/reperfusion was examined using Affymetrix GeneChip technology. The CONserved transcription FACtor binding site (CONFAC) software package was used to identify over-represented TFBS in the upstream promoter regions of ischemia-induced genes compared to control datasets. CONFAC identified 12 TFBS that were statistically over-represented from our dataset of ischemia-induced genes, including three members of the Ets-1 family of transcription factors (TFs). Microarray results showed that mRNA for Ets-1 was increased following tMCAO but not pMCAO. Immunohistochemical analysis of Ets-1 protein in rat brains following MCAO showed that Ets-1 was highly expressed in neurons in the brain of sham control animals. Ets-1 protein expression was virtually abolished in injured neurons of the ischemic brain but was unchanged in peri-infarct brain areas. These data indicate that TFs, including Ets-1, may influence neuronal injury following ischemia. These findings could provide important insights into the mechanisms that lead to brain injury and could provide avenues for the development of novel therapies.
Ischemia; Microarray; Reperfusion; Stroke; Transcription Factors; Rat
The advent of DNA microarray technology and the sequencing of multiple vertebrate genomes has provided a unique opportunity for the integration of comparative genomics with high-throughput gene expression analysis. Here we describe the conserved transcription factor binding site (CONFAC) software that enables the high-throughput identification of conserved transcription factor binding sites (TFBSs) in the regulatory regions of hundreds of genes at a time (http://morenolab.whitehead.emory.edu/cgi-bin/confac/login.pl). The CONFAC software compares non-coding regulatory sequences between human and mouse genomes to enable identification of conserved TFBSs that are significantly enriched in promoters of gene clusters from microarray analyses compared to sets of unchanging control genes using a Mann–Whitney U-test. Analysis of random gene sets demonstrated that using our approach, over 98% of TFBSs had false positive rates below 5%. As a proof-of-principle, we have validated the CONFAC software using gene sets from four separate microarray studies and identified TFBSs known to be functionally important for regulation of each of the four gene sets.
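The enrichment comparison described above (TFBS counts in a gene cluster versus unchanging control genes, compared with a Mann-Whitney U-test) can be illustrated with a minimal sketch. This is not the CONFAC implementation: the function name, the input format (one conserved-site count per gene) and the normal approximation without continuity correction are all assumptions made for illustration.

```python
from math import erfc, sqrt

def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value by normal approximation).

    Assumption: each sample is a list of per-gene TFBS counts.
    Ties receive average ranks and the variance is tie-corrected.
    """
    combined = [(v, 0) for v in sample_a] + [(v, 1) for v in sample_b]
    combined.sort(key=lambda pair: pair[0])
    n = len(combined)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0
    z = (u_a - mean_u) / sqrt(var_u)
    return u_a, erfc(abs(z) / sqrt(2))

# Hypothetical counts of one conserved TFBS per gene.
cluster_counts = [3, 4, 5, 6]
control_counts = [0, 1, 1, 2]
u_stat, p_value = mann_whitney_u(cluster_counts, control_counts)
```

For small gene sets an exact test would be more appropriate than the normal approximation used here.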
Transcription factor binding site (TFBS) identification plays an important role in deciphering gene regulatory codes. With comprehensive knowledge of TFBSs, one can understand molecular mechanisms of gene regulation. In recent decades, various computational approaches have been proposed to predict TFBSs in the genome. The TFBS dataset of a TF generated by each algorithm is a ranked list of predicted TFBSs of that TF, where top ranked TFBSs are statistically significant ones. However, whether these statistically significant TFBSs are functional (i.e. biologically relevant) is still unknown. Here we develop a post-processor, called the functional propensity calculator (FPC), to assign a functional propensity to each TFBS in the existing computationally predicted TFBS datasets. It is known that functional TFBSs reveal strong positional preference towards the transcriptional start site (TSS). This motivates us to take TFBS position relative to the TSS as the key idea in building our FPC. Based on our calculated functional propensities, the TFBSs of a TF in the original TFBS dataset could be reordered, where top ranked TFBSs are now the ones with high functional propensities. To validate the biological significance of our results, we perform three published statistical tests to assess the enrichment of Gene Ontology (GO) terms, the enrichment of physical protein-protein interactions, and the tendency of being co-expressed. The top ranked TFBSs in our reordered TFBS dataset outperform the top ranked TFBSs in the original TFBS dataset, justifying the effectiveness of our post-processor in extracting functional TFBSs from the original TFBS dataset. More importantly, assigning functional propensities to putative TFBSs enables biologists to easily identify which TFBSs in the promoter of interest are likely to be biologically relevant and are good candidates for further detailed experimental investigation.
The FPC is implemented as a web tool at http://santiago.ee.ncku.edu.tw/FPC/.
COTRASIF is a web-based tool for the genome-wide search of evolutionary conserved regulatory regions (transcription factor-binding sites, TFBS) in eukaryotic gene promoters. Predictions are made using either a position-weight matrix search method, or a hidden Markov model search method, depending on the availability of the matrix and actual sequences of the target TFBS. COTRASIF is a fully integrated solution incorporating both a gene promoter database (based on the regular Ensembl genome annotation releases) and both JASPAR and TRANSFAC databases of TFBS matrices. To decrease the false-positives rate an integrated evolutionary conservation filter is available, which allows the selection of only those of the predicted TFBS that are present in the promoters of the related species’ orthologous genes. COTRASIF is very easy to use, implements a regularly updated database of promoters and is a powerful solution for genome-wide TFBS searching. COTRASIF is freely available at http://biomed.org.ua/COTRASIF/.
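The evolutionary conservation filter described above keeps a predicted TFBS only when the orthologous gene's promoter in a related species also carries a predicted site. A minimal sketch of that idea follows; the function name, the dictionary layout and the gene identifiers are hypothetical and are not taken from COTRASIF.

```python
def conservation_filter(reference_hits, ortholog_hits, ortholog_map):
    """Keep reference genes whose ortholog also has at least one predicted site.

    Assumptions (illustrative only):
      reference_hits: {gene_id: [site positions]} for the species of interest
      ortholog_hits:  {gene_id: [site positions]} for the related species
      ortholog_map:   {reference gene_id: orthologous gene_id}
    """
    conserved = {}
    for gene, sites in reference_hits.items():
        ortholog = ortholog_map.get(gene)
        # A missing ortholog or an empty site list both fail the filter.
        if ortholog is not None and ortholog_hits.get(ortholog):
            conserved[gene] = sites
    return conserved

# Hypothetical identifiers and promoter-relative positions.
human_hits = {"geneA": [-120, -45], "geneB": [-300], "geneC": [-80]}
mouse_hits = {"GeneA_mm": [-118], "GeneB_mm": []}
orthologs = {"geneA": "GeneA_mm", "geneB": "GeneB_mm"}
kept = conservation_filter(human_hits, mouse_hits, orthologs)
```

This gene-level check is deliberately coarse; a stricter filter could also require the orthologous sites to lie at comparable promoter positions.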
Identifying the locations of transcription factor binding is crucial to understanding transcriptional regulation. Currently, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites (TFBSs) accurately and in high throughput, and it has become the gold-standard experimental method for TFBS finding. However, due to its high cost, it is impractical to apply the method on a very large scale. Considering the large number of transcription factors, numerous cell types and various conditions, computational methods remain very valuable for accurate TFBS identification.
In this paper, we propose a novel integrated TFBS prediction system, CTF, based on Conditional Random Fields (CRFs). By integrating information from different sources, CTF captures patterns of TFBSs contained in different features (sequence, chromatin, etc.) and predicts TFBS locations with high accuracy. We compared CTF with several existing tools as well as the PWM baseline method on a dataset generated by ChIP-seq experiments (TFBSs of 13 transcription factors in the mouse genome). Results showed that CTF performed significantly better than the existing methods tested.
CTF is a powerful tool for predicting TFBSs by integrating high-throughput data and different features. It can be a useful complement to ChIP-seq and other experimental methods for TFBS identification and thus improve our ability to investigate functional elements in the post-genomic era.
Availability: CTF is freely available to academic users at: http://cbb.sjtu.edu.cn/~ccwei/pub/software/CTF/CTF.php
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. It is also unclear which TFs have TFBS concentrated in promoters and to what extent this occurs. This study aims to answer some of these questions.
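The PWM search mentioned above can be sketched as follows: positional base frequencies from a set of known sites are converted to log-odds scores, and every window of a sequence is scored against them. The pseudocount, the uniform background, the threshold and the example sites are illustrative assumptions, not values from the study, and the sketch assumes uppercase A/C/G/T input only.

```python
from math import log2

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical training sites and promoter fragment.
known_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
pwm = build_pwm(known_sites)
hits = scan("GGCTATAAAGGC", pwm, threshold=5.0)
```

Lowering the threshold quickly increases the number of matches, which is the large-output problem noted above for genomic DNA.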
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) on the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups: those strongly correlated with CGI and those not correlated with CGI.
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
Using nuclear factor-κB (NF-κB) ChIP-Seq data, we present a framework for iterative learning of regulatory networks. For every possible transcription factor-binding site (TFBS)-putatively regulated gene pair, the relative distance and orientation are calculated to learn which TFBSs are most likely to regulate a given gene. Weighted TFBS contributions to putative gene regulation are integrated to derive an NF-κB gene network. A de novo motif enrichment analysis uncovers secondary TFBSs (AP1, SP1) at characteristic distances from NF-κB/RelA TFBSs. Comparison with experimental ENCODE ChIP-Seq data indicates that experimental TFBSs highly correlate with predicted sites. We observe that RelA-SP1-enriched promoters have distinct expression profiles from that of RelA-AP1 and are enriched in introns, CpG islands and DNase accessible sites. Sixteen novel NF-κB/RelA-regulated genes and TFBSs were experimentally validated, including TANK, a negative feedback gene whose expression is NF-κB/RelA dependent and requires a functional interaction with the AP1 TFBSs. Our probabilistic method yields more accurate NF-κB/RelA-regulated networks than a traditional, distance-based approach, confirmed by both analysis of gene expression and increased informativity of Genome Ontology annotations. Our analysis provides new insights into how co-occurring TFBSs and local chromatin context orchestrate activation of NF-κB/RelA sub-pathways differing in biological function and temporal expression patterns.
DNA methylation can regulate gene expression by modulating the interaction between DNA and proteins or protein complexes. Conserved consensus motifs exist across the human genome ("predicted transcription factor binding sites": "predicted TFBS") but the large majority of these are proven by chromatin immunoprecipitation and high throughput sequencing (ChIP-seq) not to be biological transcription factor binding sites ("empirical TFBS"). We hypothesize that DNA methylation at conserved consensus motifs prevents promiscuous or disorderly transcription factor binding.
Using genome-wide methylation maps of the human heart and sperm, we found that all conserved consensus motifs, as well as the subset of those that reside outside CpG islands, have an aggregate profile of hyper-methylation. In contrast, empirical TFBS with conserved consensus motifs have a profile of hypo-methylation. 40% of empirical TFBS with conserved consensus motifs resided in CpG islands whereas only 7% of all conserved consensus motifs were in CpG islands. Finally, we identified a minority subset of TF whose profiles are either hypo-methylated or neutral at their respective conserved consensus motifs, suggesting that these TF may be responsible for establishing or maintaining an un-methylated DNA state, or that their binding is not regulated by DNA methylation.
Our analysis supports the hypothesis that at least for a subset of TF, empirical binding to conserved consensus motifs genome-wide may be controlled by DNA methylation.
Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques utilizing additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification and contrast and compare these feature sets between the factors.
Our results demonstrate good performance of classifiers both on TFBS for transcription factors used for initial training and on TFBS for other factors in cross-classification experiments. We find that distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, are effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF specific. In our experiments, Bayesian network classifiers outperform SVM classifiers.
Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.
A new strategy is proposed for identifying synergistic transcription factors by function conservation, leading to the identification of 51 homotypic transcription-factor combinations.
Previous methods employed for the identification of synergistic transcription factors (TFs) are based on either TF enrichment from co-regulated genes or phylogenetic footprinting. Despite the success of these methods, both have limitations.
We propose a new strategy to identify synergistic TFs by function conservation. Rather than aligning the regulatory sequences from orthologous genes and then identifying conserved TF binding sites (TFBSs) in the alignment, we developed computational approaches to implement the novel strategy. These methods include combinatorial TFBS enrichment utilizing distance constraints followed by enrichment of overlapping orthologous genes from human and mouse, whose regulatory sequences contain the enriched TFBS combinations. Subsequently, integration of function conservation from both TFBS and overlapping orthologous genes was achieved by correlation analyses. These techniques have been used for genome-wide promoter analyses, which have led to the identification of 51 homotypic TF combinations; the validity of these approaches has been exemplified by both known TF-TF interactions and function coherence analyses. We further provide computational evidence that our novel methods were able to identify synergistic TFs to a much greater extent than phylogenetic footprinting.
Function conservation based on the concordance of combinatorial TFBS enrichment along with enrichment of overlapping orthologous genes has been proven to be a successful means for the identification of synergistic TFs. This approach avoids the limitations of phylogenetic footprinting as it does not depend upon sequence alignment. It utilizes existing gene annotation data, such as those available in GO, thus providing an alternative method for functional TF discovery and annotation.
Detecting conserved noncoding sequences (CNSs) across species highlights the functional elements. Alignment procedures combined with computational prediction of transcription factor binding sites (TFBSs) can narrow down key regulatory elements. Repeat masking processes are often performed before alignment to mask insertion sequences such as transposable elements (TEs). However, recently such TEs have been reported to influence the gene regulatory network evolution. Therefore, an alignment approach that is robust to TE insertions is meaningful for finding novel conserved TFBSs in TEs.
We constructed a web server 'ReAlignerV' for complex alignment of genomic sequences. ReAlignerV returns ladder-like schematic alignments that integrate predicted TFBSs and the location of TEs. It also provides pair-wise alignments in which the predicted TFBS sites and their names are shown alongside each sequence. Furthermore, we evaluated false positive aligned sites by focusing on the species-specific TEs (SSTEs), and found that ReAlignerV has a higher specificity and robustness to insertions for sequences having more than 20% TE content, compared to LAGAN, AVID, MAVID and BLASTZ.
ReAlignerV can be applied successfully to TE-insertion-rich sequences without prior repeat masking, and this increases the chances of finding regulatory sequences hidden in TEs, which are important sources of the regulatory network evolution. ReAlignerV can be accessed through and downloaded from .
Gene expression is in part regulated by sequences in promoters that bind transcription factors. Thus, co-expressed genes may have shared sequence motifs representing putative transcription factor binding sites (TFBSs). However, for agriculturally important animals the genomic sequence is often incomplete. The more complete human genome may be usable for this prediction by taking advantage of the expected evolutionary conservation in TFBSs between the species.
A method of de novo TFBS prediction based on MEME was implemented, tested, and validated on a muscle-specific dataset.
Muscle-specific expression data from EST library analysis in cattle were used to predict sets of genes whose expression was enriched in muscle and cardiac tissues. The upstream 1500 bases from calculated orthologous genes were extracted from the human reference set. A set of common motifs was discovered in these promoters. Slightly over one third of these motifs were identified as known TFBSs, including known muscle-specific binding sites. This analysis also predicted several highly significantly overrepresented sites that may be novel TFBS.
An independent analysis of the equivalent bovine genomic sequences was also done; this gave less detailed results than the human analysis due to the quality of both orthologue prediction and assembly in promoter regions. However, the most common motifs could be detected in both sets.
Using promoter sequences from human genes is a useful approach when studying gene expression in species with limited or non-existing genomic sequence. As the bovine genome becomes better annotated it can in turn serve as the reference genome for other agriculturally important ruminants, such as sheep, goat and deer.
The computational analysis of regulatory SNPs (rSNPs) is an essential step in the elucidation of the structure and function of regulatory networks at the cellular level. In this work we focus in particular on SNPs that potentially affect a Transcription Factor Binding Site (TFBS) to a significant extent, possibly resulting in changes to gene expression patterns or alternative splicing. The application described here is based on the MAPPER platform, a previously developed web-based system for the computational detection of TFBSs in DNA sequences.
rSNP-MAPPER is a computational tool that analyzes SNPs lying within predicted TFBSs and determines whether the allele substitution results in a significant change in the TFBS predictive score. The application's simple and intuitive interface supports several usage modes. For example, the user may search for potential rSNPs in the promoters of one or more genes, specified as a list of identifiers or chosen among the members of a pathway. Alternatively, the user may specify a set of SNPs to be analyzed by uploading a list of SNP identifiers or providing the coordinates of a genomic region. Finally, the user can provide two alternative sequences (wildtype and mutant), and the system will determine the location of variants to be analyzed by comparing them.
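The core comparison described above, whether an allele substitution changes the predictive score of a TFBS that overlaps the SNP, can be sketched with a toy scoring matrix. This is not the rSNP-MAPPER or MAPPER scoring model; the function names, the match/mismatch scores and the example sequence are hypothetical.

```python
def best_overlapping_score(sequence, snp_index, pwm):
    """Best PWM score among windows that cover position snp_index."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(sequence) - width)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(first, last + 1)
    )

def allele_score_change(sequence, snp_index, alt_base, pwm):
    """Return (reference score, alternate score, difference) for one SNP."""
    ref_score = best_overlapping_score(sequence, snp_index, pwm)
    mutant = sequence[:snp_index] + alt_base + sequence[snp_index + 1:]
    alt_score = best_overlapping_score(mutant, snp_index, pwm)
    return ref_score, alt_score, alt_score - ref_score

# Toy matrix for a hypothetical 3-bp consensus "TGA": +2 match, -1 mismatch.
toy_pwm = [{b: (2.0 if b == c else -1.0) for b in "ACGT"} for c in "TGA"]
ref, alt, delta = allele_score_change("CCTGACC", 3, "C", toy_pwm)
```

A real analysis would additionally need a criterion for when the score difference counts as significant, which this sketch leaves open.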
In this paper we outline the architecture of rSNP-MAPPER, describing its intuitive and powerful user interface in detail. We then present several examples of the use of rSNP-MAPPER to reproduce and confirm experimental studies aimed at identifying regulatory SNPs in human genes, which show how rSNP-MAPPER is able to detect and characterize rSNPs with high accuracy. Results are richly annotated and can be displayed online or downloaded in a number of different formats.
rSNP-MAPPER is optimized for large scale work, allowing for the efficient annotation of thousands of SNPs, and is designed to assist in the genome-wide investigation of transcriptional regulatory networks, prioritizing potential rSNPs for subsequent experimental validation. rSNP-MAPPER is freely available at http://genome.ufl.edu/mapper/.
Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein–DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection.
Summary: A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements.
Based on information theory, we propose a model for transcription factor binding of regulatory DNA sites. Our model incorporates position interdependencies in effective ways. The model computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion to determine whether the sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. Through theoretical proof and testing on both real and artificial data, we found that our model provides highly accurate predictive results.
In this study, we present a novel model for transcription factor binding regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial.
Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods.
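One simple way to score a prediction method against ChIP-seq peaks, as in the benchmark described above, is to count interval overlaps between predicted sites and peaks. The sketch below is an assumption about how such a comparison could be set up, not the benchmark's actual protocol; intervals are half-open (start, end) pairs on a single chromosome and the coordinates are invented.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def evaluate(predicted, peaks):
    """Return (precision, recall) of predicted sites against peaks.

    precision: fraction of predicted sites that overlap any peak
    recall:    fraction of peaks that are overlapped by any predicted site
    """
    true_pos = sum(1 for p in predicted if any(overlaps(p, k) for k in peaks))
    recovered = sum(1 for k in peaks if any(overlaps(p, k) for p in predicted))
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = recovered / len(peaks) if peaks else 0.0
    return precision, recall

# Hypothetical coordinates.
predicted_sites = [(10, 20), (50, 60), (200, 210)]
chip_peaks = [(15, 30), (100, 150), (195, 205)]
precision, recall = evaluate(predicted_sites, chip_peaks)
```

The nested loops are quadratic in the number of intervals; genome-scale comparisons would normally sort the intervals or use an interval tree.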
Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.
Networks of regulatory relations between transcription factors (TF) and their target genes (TG), implemented through TF binding sites (TFBS), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. In particular, for CRM-modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM-prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
Transcription factors (TFs) control transcription by binding to specific regions of DNA called transcription factor binding sites (TFBSs). The identification of TFBSs is a crucial problem in computational biology and includes the subtask of predicting the location of known TFBS motifs in a given DNA sequence. It has previously been shown that, when scoring matches to known TFBS motifs, interdependencies between positions within a motif should be taken into account. However, this remains a challenging task owing to the fact that sequences similar to those of known TFBSs can occur by chance with a relatively high frequency. Here we present a new method for matching sequences to TFBS motifs based on intuitionistic fuzzy sets (IFS) theory, an approach that has been shown to be particularly appropriate for tackling problems that embody a high degree of uncertainty.
We propose SCintuit, a new scoring method for measuring sequence-motif affinity based on IFS theory. Unlike existing methods that consider dependencies between positions, SCintuit is designed to prevent overestimation of less conserved positions of TFBSs. For a given pair of bases, SCintuit is computed not only as a function of their combined probability of occurrence, but also taking into account the individual importance of each single base at its corresponding position. We used SCintuit to identify known TFBSs in DNA sequences. Our method provides excellent results when dealing with both synthetic and real data, outperforming the sensitivity and the specificity of two existing methods in all the experiments we performed.
The results show that SCintuit improves the prediction quality for TFs of the existing approaches without compromising sensitivity. In addition, we show how SCintuit can be successfully applied to real research problems. In this study the reliability of the IFS theory for motif discovery tasks is proven.
Transcription factors control gene expression by binding to short specific DNA sequences, called transcription factor binding sites (TFBSs), in the promoter of a gene. Thus, studying the spatial distribution of TFBSs in the promoters may provide insights into the molecular mechanisms of gene regulation. I developed a method to construct the spatial distribution of TFBSs for any set of genes of interest. I found that different functional gene clusters have different spatial distributions of TFBSs, indicating that gene regulation mechanisms may be very different among different functional gene clusters. I also found that the binding sites for different transcription factors (TFs) may have different spatial distributions: a sharp peak, a plateau or no dominant single peak. The spatial distributions of binding sites for many TFs derived from my analyses are valuable prior information for TFBS prediction algorithms because different regions of a promoter can be assigned different probabilities of TFBS occurrence.
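A spatial distribution of the kind described above can be approximated by binning TFBS positions relative to the transcription start site and normalizing the counts. The sketch below is only one plausible construction: the bin width, the promoter length, the coordinate convention (negative positions are upstream of the start site) and the example positions are assumptions, not details of the published method.

```python
from collections import Counter

def spatial_distribution(site_positions, bin_width=50, promoter_length=500):
    """Fraction of sites falling in each upstream bin.

    site_positions: site positions relative to the start site, negative
    values meaning upstream (e.g. -120). Sites outside the promoter
    window are ignored. Returns {bin_start: fraction}.
    """
    counts = Counter()
    for pos in site_positions:
        if -promoter_length <= pos < 0:
            # Floor division maps e.g. -120 to the bin starting at -150.
            counts[(pos // bin_width) * bin_width] += 1
    total = sum(counts.values())
    return {
        bin_start: (counts[bin_start] / total if total else 0.0)
        for bin_start in range(-promoter_length, 0, bin_width)
    }

# Hypothetical site positions pooled over a set of genes.
positions = [-120, -130, -110, -480, -20, 30]
distribution = spatial_distribution(positions)
```

The resulting fractions could serve as position-dependent prior weights when scoring candidate sites, in the spirit of the prior information mentioned above.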
yeast; promoter; TFBS; spatial distribution
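The idea of a spatial distribution of TFBSs can be illustrated with a minimal sketch. It assumes that each site is summarized by its distance upstream of the transcription start site and that the promoter is divided into fixed-width bins. The function name `spatial_distribution`, the 1000 bp promoter length and the 100 bp bin size are illustrative assumptions, not details taken from the study.

```python
def spatial_distribution(site_positions, promoter_length=1000, bin_size=100):
    """Fraction of binding sites falling in each bin upstream of the TSS.

    site_positions: distances in bp upstream of the TSS (hypothetical input
    format). Sites outside [0, promoter_length) are ignored.
    """
    n_bins = promoter_length // bin_size
    counts = [0] * n_bins
    total = 0
    for distance in site_positions:
        if 0 <= distance < n_bins * bin_size:
            counts[distance // bin_size] += 1
            total += 1
    if total == 0:
        return [0.0] * n_bins
    # Normalizing makes distributions comparable across gene sets of
    # different sizes and lets them be read as positional priors.
    return [c / total for c in counts]


# Toy example: four sites for one hypothetical TF across a gene cluster.
profile = spatial_distribution([50, 120, 130, 950])
```

A profile with most of its mass in one bin would correspond to a sharp peak, while a roughly uniform profile would correspond to no dominant peak.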
The computational prediction of Transcription Factor Binding Sites (TFBS) remains a challenge due to their short length and low information content. Comparative genomics approaches that simultaneously consider several related species and favor sites that have been conserved throughout evolution improve the accuracy (specificity) of the predictions but are limited due to a phenomenon called binding site turnover, where sequence evolution causes one TFBS to replace another in the same region. In parallel to this development, an increasing number of mammalian genomes are now sequenced and it is becoming possible to infer, to a surprisingly high degree of accuracy, ancestral mammalian sequences.
We propose a TFBS prediction approach that makes use of the availability of inferred ancestral mammalian genomes to improve its accuracy. This method aims to identify binding loci, which are regions of a few hundred base pairs that have preserved their potential to bind a given transcription factor over evolutionary time. After proposing a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length, we use it to identify regions that have preserved the number of predicted TFBS they contain to an unexpected degree given their divergence. The approach is applied to human chromosome 1 and shows significant gains in accuracy as compared to both existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates.
The source code and predictions made by the program are available at http://www.cs.mcgill.ca/~blanchem/bindingLoci.
In silico prediction of transcription factor binding sites (TFBSs) is central to the task of gene regulatory network elucidation. Genomic DNA sequence information provides a basis for these predictions, due to the sequence specificity of TF-binding events. However, DNA sequence alone is an impoverished source of information for the task of TFBS prediction in eukaryotes, as additional factors, such as chromatin structure, regulate binding events. We show that incorporating high-throughput chromatin modification estimates can greatly improve the accuracy of in silico prediction of in vivo binding for a wide range of TFs in human and mouse. This improvement is superior to the improvement gained by equivalent use of either transcription start site proximity or phylogenetic conservation information. Importantly, predictions made with the use of chromatin structure information are tissue specific. This result supports the biological hypothesis that chromatin modulates TF binding to produce tissue-specific binding profiles in higher eukaryotes, and suggests that the use of chromatin modification information can lead to accurate tissue-specific transcriptional regulatory network elucidation.
Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered.
To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), in obesity and lipid metabolism (PPAR, SREBP, HNF4), and in the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method called SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies.
To ensure higher confidence in our approach, we applied jackknife resampling and bootstrap tests for the comparison; optimized PWM and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites using the web tool, SiteGA.
Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions. Applying SiteGA and optimized PWM models together substantially reduced false positives, at least at higher stringencies.
Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
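For readers unfamiliar with PWM-based recognition, the following is a minimal sketch of a generic log-odds PWM built from aligned sites and scanned along a sequence. It is not the optimized PWM procedure or SiteGA described above. The function names `build_pwm` and `scan`, the pseudocount, the uniform background, the toy sites and the threshold are all illustrative assumptions.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy aligned sites (hypothetical, chosen only for illustration).
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
toy_pwm = build_pwm(toy_sites)
hits = scan("CCCTGACTCACCC", toy_pwm, threshold=5.0)
```

Because each column is scored independently, a matrix like this cannot capture the dinucleotide dependencies that SiteGA is described as modeling, which is one reason the two kinds of model can complement each other.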
Motivation: In functional genomics, it is frequently useful to correlate expression levels of genes to identify transcription factor binding sites (TFBS) via the presence of common sequence motifs. The underlying assumption is that co-expressed genes are more likely to contain shared TFBS and, thus, TFBS can be identified computationally. Indeed, gene pairs with a very high expression correlation show a significant excess of shared binding sites in yeast. We have tested this assumption in a more complex organism, Drosophila melanogaster, by using experimentally determined TFBS and microarray expression data. We have also examined the reverse relationship between the expression correlation and the extent of TFBS sharing.
Results: Pairs of genes with shared TFBS show, on average, a higher degree of co-expression than those with no common TFBS in Drosophila. However, the reverse does not hold true: gene pairs with high expression correlations do not share significantly larger numbers of TFBS. An exception to this observation arises when comparing expression of genes from the earliest stages of embryonic development. Interestingly, semantic similarity between gene annotations (Biological Process) is much better associated with TFBS sharing than expression correlation is. We discuss these results in light of reverse engineering approaches to computationally predict regulatory sequences by using comparative genomics.
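The comparison between TFBS sharing and co-expression can be sketched as follows, assuming expression profiles are given as equal-length lists per gene and TFBS annotations as sets of factor names per gene. The function names, the data layout and the toy values are hypothetical, and the study's actual statistical procedure is not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_coexpression_by_sharing(expression, tfbs):
    """Mean expression correlation for gene pairs that share a TFBS and pairs that do not."""
    sharing, non_sharing = [], []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if tfbs.get(g1, set()) & tfbs.get(g2, set()):
            sharing.append(r)
        else:
            non_sharing.append(r)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(sharing), mean(non_sharing)


# Toy data (hypothetical): genes a and b share a binding site for TF1.
toy_expression = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
toy_tfbs = {"a": {"TF1"}, "b": {"TF1", "TF2"}, "c": {"TF3"}}
shared_mean, unshared_mean = mean_coexpression_by_sharing(toy_expression, toy_tfbs)
```

Testing the reverse relationship would instead group pairs by correlation and compare the number of shared sites in each group.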