Collections of transcription factor binding profiles (Transfac, Jaspar) are essential to identify regulatory elements in DNA sequences. Subsets of highly similar profiles complicate large scale analysis of transcription factor binding sites.
We propose to identify and group similar profiles using two independent similarity measures: χ2 distances between position frequency matrices (PFMs) and correlation coefficients between position weight matrices (PWMs) scores.
We show that these measures complement each other and allow to associate Jaspar and Transfac matrices. Clusters of highly similar matrices are identified and can be used to optimise the search for regulatory elements. Moreover, the application of the measures is illustrated by assigning E-box matrices of a SELEX experiment and of experimentally characterised binding sites of circadian clock genes to the Myc-Max cluster.
Position-weight matrices (PWMs) are broadly used to locate transcription factor binding sites in DNA sequences. The majority of existing PWMs provide a low level of both sensitivity and specificity. We present a new computational algorithm, a modification of the Staden–Bucher approach, that improves the PWM. We applied the proposed technique on the PWM of the GC-box, binding site for Sp1. The comparison of old and new PWMs shows that the latter increase both sensitivity and specificity. The statistical parameters of GC-box distribution in promoter regions and in the human genome, as well as in each chromosome, are presented. The majority of commonly used PWMs are the 4-row mononucleotide matrices, although 16-row dinucleotide matrices are known to be more informative. The algorithm efficiently determines the 16-row matrices and preliminary results show that such matrices provide better results than 4-row matrices.
The identifying of binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping.
The present work implements the optimization algorithm applied to the existing PWM for GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing available PWM is based on experimental data adopted from Jaspar. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage specific factors in human promoters.
Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
Transcription factor; Binding sites; GATA-3; Human promoter; Position weight matrix; Optimization
By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein–DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD–DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein–DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies.
http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.
We present the webserver 3D transcription factor (3DTF) to compute position-specific weight matrices (PWMs) of transcription factors using a knowledge-based statistical potential derived from crystallographic data on protein–DNA complexes. Analysis of available structures that can be used to construct PWMs shows that there are hundreds of 3D structures from which PWMs could be derived, as well as thousands of proteins homologous to these. Therefore, we created 3DTF, which delivers binding matrices given the experimental or modeled protein–DNA complex. The webserver can be used by biologists to derive novel PWMs for transcription factors lacking known binding sites and is freely accessible at http://www.gene-regulation.com/pub/programs/3dtf/.
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions.
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group occupied 11 percent. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI.
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
Most of the position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, usually producing high false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between TF and its binding site. The whole tree-structured PWM could be considered as a mixture of different conditional-PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from the ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthew Correlation Coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach on several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method starting from existing PWM has consistently better performance than existing tools in detecting the TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of true positive rate. The findings could lead to better understanding of the mechanisms of TF-DNA interactions.
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. DNA-binding proteins often show degeneracy in their binding requirement and thus the overall binding specificity of many proteins is unknown and remains an active area of research. Although existing PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. Our previous study introduced a promising approach to PWM refinement in which known motifs are used to computationally mine putative binding sites directly from aligned promoter regions using composition of similar sites. In the present study, we extended this technique originally tested on single examples of transcription factors (TFs) and showed its capability to optimize PWM performance to predict new binding sites in the fruit fly genome. We propose refined PWMs in mono- and dinucleotide versions similarly computed for a large variety of transcription factors of Drosophila melanogaster. Along with the addition of many auxiliary sites the optimization includes variation of the PWM motif length, the binding sites location on the promoters and the PWM score threshold. To assess the predictive performance of the refined PWMs we compared them to conventional TRANSFAC and JASPAR sources. The results have been verified using performed tests and literature review. Overall, the refined PWMs containing putative sites derived from real promoter content processed using optimized parameters had better general accuracy than conventional PWMs.
Positional weight matrix (PWM) remains the most popular for quantification of transcription factor (TF) binding. PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model.
TF binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. This is also common for different TFs from the same structural family. Thus it is often necessary to measure the similarity between PWMs. The popular tools compare PWMs directly using matrix elements. Yet, for log-odds PWMs, negative elements do not contribute to the scores of highly scoring TFBS and thus may be different without affecting the sets of the best recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds.
We propose a practical approach for comparing two TFBS models, each consisting of a PWM and the respective scoring threshold. The proposed measure is a variant of the Jaccard index between two TFBS sets. The measure defines a metric space for TFBS models of all finite lengths. The algorithm can compare TFBS models constructed using substantially different approaches, like PWMs with raw positional counts and log-odds. We present the efficient software implementation: MACRO-APE (MAtrix CompaRisOn by Approximate P-value Estimation).
MACRO-APE can be effectively used to compute the Jaccard index based similarity for two TFBS models. A two-pass scanning algorithm is presented to scan a given collection of PWMs for PWMs similar to a given query.
Availability and implementation
MACRO-APE is implemented in ruby 1.9; software including source code and a manual is freely available at http://autosome.ru/macroape/ and in supplementary materials.
Transcription factor binding site; TFBS; Transcription factor binding site model; Binding motif; Jaccard similarity; Position weight matrix; PWM; P-value; Position specific frequency matrix; PSFM; Macroape
DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind to specific sequences in the genome to initiate many important biological functions. Accurate prediction of such target sequences, often represented by position weight matrices (PWMs), is an important step to understand many biological processes. Recent studies have shown that knowledge-based potential functions can be applied on protein-DNA co-crystallized structures to generate PWMs that are considerably consistent with experimental data. However, this success has not been extended to DNA-binding proteins lacking co-crystallized structures. This study aims at investigating the possibility of predicting the DNA sequences bound by DNA-binding proteins from the proteins' unbound structures (structures of the unbound state). Given an unbound query protein and a template complex, the proposed method first employs structure alignment to generate synthetic protein-DNA complexes for the query protein. Once a complex is available, an atomic-level knowledge-based potential function is employed to predict PWMs characterizing the sequences to which the query protein can bind. The evaluation of the proposed method is based on seven DNA-binding proteins, which have structures of both DNA-bound and unbound forms for prediction as well as annotated PWMs for validation. Since this work is the first attempt to predict target sequences of DNA-binding proteins from their unbound structures, three types of structural variations that presumably influence the prediction accuracy were examined and discussed. Based on the analyses conducted in this study, the conformational change of proteins upon binding DNA was shown to be the key factor. This study sheds light on the challenge of predicting the target DNA sequences of a protein lacking co-crystallized structures, which encourages more efforts on the structure alignment-based approaches in addition to docking- and homology modeling-based approaches for generating synthetic complexes.
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
Transcription factors are critical proteins for sequence-specific control of transcriptional regulation. Finding where these proteins bind to DNA is of key importance for global efforts to decipher the complex mechanisms of gene regulation. Greater understanding of the regulation of transcription promises to improve human genetic analysis by specifying critical gene components that have eluded investigators. Classically, computational prediction of transcription factor binding sites (TFBS) is based on models giving weights to each nucleotide at each position. We introduce a novel statistical model for the prediction of TFBS tolerant of a broader range of TFBS configurations than can be conveniently accommodated by existing methods. The new models are designed to address the confounding properties of nucleotide composition, inter-positional sequence dependence and variable lengths (e.g. variable spacing between half-sites) observed in the more comprehensive experimental data now emerging. The new models generate scores consistent with DNA-protein affinities measured experimentally and can be represented graphically, retaining desirable attributes of past methods. It demonstrates the capacity of the new approach to accurately assess DNA-protein interactions. With the rich experimental data generated from chromatin immunoprecipitation experiments, a greater diversity of TFBS properties has emerged that can now be accommodated within a single predictive approach.
Promoter prediction has gained increased attention in studies related to transcriptional regulation of gene expression.We developed a
web server named PMSearch (Poly Matrix Search) which utilizes Position Frequency Matrices (PFMs) to predict transcription factor binding
sites (TFBSs) in DNA sequences. PMSearch takes PFMs (either user-defined or retrieved from local dataset which currently contains 507 PFMs
from Transfac Public 7.0 and JASPAR) and DNA sequences of interest as the input, then scans the DNA sequences with PFMs and reports the sites
of high scores as the putative binding sites. The output of the server includes 1) A plot for the distribution of predicted TFBS along the DNA
sequence, 2) A table listing location, score and motif for each putative binding site, and 3) Clusters of predicted binding sites. PMSearch also
provides links for accessing clusters of PFMs that are similar to the input PFMs to facilitate complicated promoter analysis.
PMSearch is available for free at
position frequency matrix; motif; transcription factor binding site; web server
Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrixes (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at http://cg1.iis.sinica.edu.tw/~mybs/.
Identifying transcription factor (TF) binding sites (TFBSs) is an important step towards understanding transcriptional regulation. A common approach is to use gaplessly aligned, experimentally supported TFBSs for a particular TF, and algorithmically search for more occurrences of the same TFBSs. The largest publicly available databases of TF binding specificities contain models which are represented as position weight matrices (PWM). There are other methods using more sophisticated representations, but these have more limited databases, or aren't publicly available. Therefore, this paper focuses on methods that search using one PWM per TF. An algorithm, MATCHTM, for identifying TFBSs corresponding to a particular PWM is available, but is not based on a rigorous statistical model of TF binding, making it difficult to interpret or adjust the parameters and output of the algorithm. Furthermore, there is no public description of the algorithm sufficient to exactly reproduce it. Another algorithm, MAST, computes a p-value for the presence of a TFBS using true probabilities of finding each base at each offset from that position. We developed a statistical model, BaSeTraM, for the binding of TFs to TFBSs, taking into account random variation in the base present at each position within a TFBS. Treating the counts in the matrices and the sequences of sites as random variables, we combine this TFBS composition model with a background model to obtain a Bayesian classifier. We implemented our classifier in a package (SBaSeTraM). We tested SBaSeTraM against a MATCHTM implementation by searching all probes used in an experimental Saccharomyces cerevisiae TF binding dataset, and comparing our predictions to the data. We found no statistically significant differences in sensitivity between the algorithms (at fixed selectivity), indicating that SBaSeTraM's performance is at least comparable to the leading currently available algorithm. Our software is freely available at: http://wiki.github.com/A1kmm/sbasetram/building-the-tools.
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
Knowledge of transcription factor-DNA binding patterns is crucial for understanding gene transcription. Numerous DNA-binding proteins are annotated as transcription factors in the literature, however, for many of them the corresponding DNA-binding motifs remain uncharacterized.
The position weight matrices (PWMs) of transcription factors from different structural classes have been determined using a knowledge-based statistical potential. The scoring function calibrated against crystallographic data on protein-DNA contacts recovered PWMs of various members of widely studied transcription factor families such as p53 and NF-κB. Where it was possible, extensive comparison to experimental binding affinity data and other physical models was made. Although the p50p50, p50RelB, and p50p65 dimers belong to the same family, particular differences in their PWMs were detected, thereby suggesting possibly different in vivo binding modes. The PWMs of p63 and p73 were computed on the basis of homology modeling and their performance was studied using upstream sequences of 85 p53/p73-regulated human genes. Interestingly, about half of the p63 and p73 hits reported by the Match algorithm in the altogether 126 promoters lay more than 2 kb upstream of the corresponding transcription start sites, which deviates from the common assumption that most regulatory sites are located more proximal to the TSS. The fact that in most of the cases the binding sites of p63 and p73 did not overlap with the p53 sites suggests that p63 and p73 could influence the p53 transcriptional activity cooperatively. The newly computed p50p50 PWM recovered 5 more experimental binding sites than the corresponding TRANSFAC matrix, while both PWMs showed comparable receiver operator characteristics.
A novel algorithm was developed to calculate position weight matrices from protein-DNA complex structures. The proposed algorithm was extensively validated against experimental data. The method was further combined with Homology Modeling to obtain PWMs of factors for which crystallographic complexes with DNA are not yet available. The performance of PWMs obtained in this work in comparison to traditionally constructed matrices demonstrates that the structure-based approach presents a promising alternative to experimental determination of transcription factor binding properties.
To date, only a limited number of transcriptional regulatory interactions have been uncovered. In a pilot study integrating sequence data with microarray data, a position weight matrix (PWM) performed poorly in inferring transcriptional interactions (TIs), which represent physical interactions between transcription factors (TF) and upstream sequences of target genes. Inferring a TI means that the promoter sequence of a target is inferred to match the consensus sequence motifs of a potential TF, and their interaction type such as AT or RT is also predicted. Thus, a robust PWM (rPWM) was developed to search for consensus sequence motifs. In addition to rPWM, one feature extracted from ChIP-chip data was incorporated to identify potential TIs under specific conditions. An interaction type classifier was assembled to predict activation/repression of potential TIs using microarray data. This approach, combining an adaptive (learning) fuzzy inference system and an interaction type classifier to predict transcriptional regulatory networks, was named AdaFuzzy.
AdaFuzzy was applied to predict TIs using real genomics data from Saccharomyces cerevisiae. Following one of the latest advances in predicting TIs, constrained probabilistic sparse matrix factorization (cPSMF), and using 19 transcription factors (TFs), we compared AdaFuzzy to four well-known approaches using over-representation analysis and gene set enrichment analysis. AdaFuzzy outperformed these four algorithms. Furthermore, AdaFuzzy was shown to perform comparably to 'ChIP-experimental method' in inferring TIs identified by two sets of large scale ChIP-chip data, respectively. AdaFuzzy was also able to classify all predicted TIs into one or more of the four promoter architectures. The results coincided with known promoter architectures in yeast and provided insights into transcriptional regulatory mechanisms.
AdaFuzzy successfully integrates multiple types of data (sequence, ChIP, and microarray) to predict transcriptional regulatory networks. The validated success in the prediction results implies that AdaFuzzy can be applied to uncover TIs in yeast.
Predicting the binding specificity of transcription factors is a critical step in the characterization and computational identification and of cis-regulatory elements in genomic sequences. Here we use protein–DNA structures to predict binding specificity and consider the possibility of predicting position weight matrices (PWM) for an entire protein family based on the structures of just a few family members. A particular focus is the sensitivity of prediction accuracy to the docking geometry of the structure used. We investigate this issue with the goal of determining how similar two docking geometries must be for binding specificity predictions to be accurate. Docking similarity is quantified using our recently described interface alignment score (IAS). Using a molecular-mechanics force field, we predict high-affinity nucleotide sequences that bind to the second zinc-finger (ZF) domain from the Zif268 protein, using different C2H2 ZF domains as structural templates. We identify a strong relationship between IAS values and prediction accuracy, and define a range of IAS values for which accurate structure-based predictions of binding specificity is to be expected. The implication of our results for large-scale, structure-based prediction of PWMs is discussed.
Computational identification of transcription factor binding sites is an important research area of computational biology. Positional weight matrix (PWM) is a model to describe the sequence pattern of binding sites. Usually, transcription factor binding sites prediction methods based on PWMs require user-defined thresholds. The arbitrary threshold and also the relatively low specificity of the algorithm prevent the result of such an analysis from being properly interpreted. In this study, a method was developed to identify over-represented cis-elements with PWM-based similarity scores. Three sets of closely related promoters were analyzed, and only over- represented motifs with high PWM similarity scores were reported. The thresholds to evaluate the similarity scores to the PWMs of putative transcription factors binding sites can also be automatically determined during the analysis, which can also be used in further research with the same PWMs. The online program is available on the website: http://www.bioinfo.tsinghua.edu.cn/∼zhengjsh/OTFBS/.
Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as “position weight matrices” (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps.
I describe a straightforward generalization of the PWM model, that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a “dinucleotide weight matrix” (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined “core motifs” by about 10bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the “signature” in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.
Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF–DNA binding problem, which have been frequently shown to be more efficient than those methods only based on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework; and influential sequence features are identified by variable selection. We examine a few state-of-the-art learning methods including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF–TF interactions, than the PWM approach. In particular, BART and boosting show the best and the most robust overall performance among all the methods.
Different approaches have been developed to dissect the interplay between transcription factors (TFs) and their cis-acting sequences on DNA in order to identify TF target genes. Here we used a combination of computational and experimental approaches to identify novel direct targets of TFAP2A, a key TF for a variety of physiological and pathological cellular processes. Gene expression profiles of HeLa cells either silenced for TFAP2A by RNA interference or not were previously compared and a set of differentially expressed genes was revealed.
The regulatory regions of 494 TFAP2A-modulated genes were analyzed for the presence of TFAP2A binding sites, employing the canonical TFAP2A Positional Weight Matrix (PWM) reported in Jaspar http://jaspar.genereg.net/. 264 genes containing at least 2 high score TFAP2A binding sites were identified, showing a central role in "Cellular Movement" and "Cellular Development". In an attempt to identify TFs that could cooperate with TFAP2A, a statistically significant enrichment for SP1 binding sites was found for TFAP2A-activated but not repressed genes. The direct binding of TFAP2A or SP1 to a random subset of TFAP2A-modulated genes was demonstrated by Chromatin ImmunoPrecipitation (ChIP) assay and the TFAP2A-driven regulation of DCBLD2/ESDN/CLCP1 gene studied in details.
We proved that our computational approaches applied to microarray selected genes are valid tools to identify functional TF binding sites in gene regulatory regions as confirmed by experimental validations. In addition, we demonstrated a fine-tuned regulation of DCBLD2/ESDN transcription by TFAP2A.
The analysis of regulatory regions in genome sequences is strongly based on the detection of potential transcription factor binding sites. The preferred models for representation of transcription factor binding specificity have been termed position-specific scoring matrices. JASPAR is an open-access database of annotated, high-quality, matrix-based transcription factor binding site profiles for multicellular eukaryotes. The profiles were derived exclusively from sets of nucleotide sequences experimentally demonstrated to bind transcription factors. The database is complemented by a web interface for browsing, searching and subset selection, an online sequence analysis utility and a suite of programming tools for genome-wide and comparative genomic analysis of regulatory regions. JASPAR is available at http://jaspar.cgb.ki.se.
Proteins with sequence-specific DNA binding function are important for a wide range of biological activities. De novo prediction of their DNA-binding specificities from sequence alone would be a great aid in inferring cellular networks. Here we introduce a method for predicting DNA-binding specificities for Cys2His2 zinc fingers (C2H2-ZFs), the largest family of DNA-binding proteins in metazoans. We develop a general approach, based on empirical calculations of pairwise amino acid–nucleotide interaction energies, for predicting position weight matrices (PWMs) representing DNA-binding specificities for C2H2-ZF proteins. We predict DNA-binding specificities on a per-finger basis and merge predictions for C2H2-ZF domains that are arrayed within sequences. We test our approach on a diverse set of natural C2H2-ZF proteins with known binding specificities and demonstrate that for >85% of the proteins, their predicted PWMs are accurate in 50% of their nucleotide positions. For proteins with several zinc finger isoforms, we show via case studies that this level of accuracy enables us to match isoforms with their known DNA-binding specificities. A web server for predicting a PWM given a protein containing C2H2-ZF domains is available online at http://zf.princeton.edu and can be used to aid in protein engineering applications and in genome-wide searches for transcription factor targets.