DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind to specific sequences in the genome to initiate many important biological functions. Accurate prediction of such target sequences, often represented by position weight matrices (PWMs), is an important step to understand many biological processes. Recent studies have shown that knowledge-based potential functions can be applied on protein-DNA co-crystallized structures to generate PWMs that are considerably consistent with experimental data. However, this success has not been extended to DNA-binding proteins lacking co-crystallized structures. This study aims at investigating the possibility of predicting the DNA sequences bound by DNA-binding proteins from the proteins' unbound structures (structures of the unbound state). Given an unbound query protein and a template complex, the proposed method first employs structure alignment to generate synthetic protein-DNA complexes for the query protein. Once a complex is available, an atomic-level knowledge-based potential function is employed to predict PWMs characterizing the sequences to which the query protein can bind. The evaluation of the proposed method is based on seven DNA-binding proteins, which have structures of both DNA-bound and unbound forms for prediction as well as annotated PWMs for validation. Since this work is the first attempt to predict target sequences of DNA-binding proteins from their unbound structures, three types of structural variations that presumably influence the prediction accuracy were examined and discussed. Based on the analyses conducted in this study, the conformational change of proteins upon binding DNA was shown to be the key factor. 
This study sheds light on the challenge of predicting the target DNA sequences of a protein lacking co-crystallized structures, which encourages more efforts on the structure alignment-based approaches in addition to docking- and homology modeling-based approaches for generating synthetic complexes.
The discovery of regulatory motifs enriched in sets of DNA or RNA sequences is fundamental to the analysis of a great variety of functional genomics experiments. These motifs usually represent binding sites of proteins or non-coding RNAs, which are best described by position weight matrices (PWMs). We have recently developed XXmotif, a de novo motif discovery method that is able to directly optimize the statistical significance of PWMs. XXmotif can also score conservation and positional clustering of motifs. The XXmotif server provides (i) a list of significantly overrepresented motif PWMs with web logos and E-values; (ii) a graph with color-coded boxes indicating the positions of selected motifs in the input sequences; (iii) a histogram of the overall positional distribution for selected motifs and (iv) a page for each motif with all significant motif occurrences, their P-values for enrichment, conservation and localization, their sequence contexts and coordinates. Free access: http://xxmotif.genzentrum.lmu.de.
Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrices (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at http://cg1.iis.sinica.edu.tw/~mybs/.
We present the webserver 3D transcription factor (3DTF) to compute position-specific weight matrices (PWMs) of transcription factors using a knowledge-based statistical potential derived from crystallographic data on protein–DNA complexes. Analysis of available structures that can be used to construct PWMs shows that there are hundreds of 3D structures from which PWMs could be derived, as well as thousands of proteins homologous to these. Therefore, we created 3DTF, which delivers binding matrices given the experimental or modeled protein–DNA complex. The webserver can be used by biologists to derive novel PWMs for transcription factors lacking known binding sites and is freely accessible at http://www.gene-regulation.com/pub/programs/3dtf/.
Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, has consistently better performance than existing tools in detecting TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of the true positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
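As an illustration of choosing a score cutoff by the Matthews correlation coefficient, the following is a minimal sketch and not the TPD implementation. It assumes that each sequence in the positive and negative sets has already been reduced to a single motif score; the function names `mcc` and `best_cutoff` are hypothetical.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, mcc) for the cutoff that maximizes MCC.

    A sequence is called positive when its score is >= cutoff.
    Candidate cutoffs are the observed scores (illustrative assumption).
    """
    best_value, best_cut = -2.0, None
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best_value:  # ties keep the lowest cutoff
            best_value, best_cut = value, cutoff
    return best_cut, best_value
```

Restricting candidates to observed scores is sufficient here because the confusion matrix only changes at those values.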
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
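For readers unfamiliar with PWM scanning, the following minimal sketch shows the general idea: a log-odds matrix is built from aligned sites and every window of a sequence is scored against it. This is the generic PWM approach, not the structure-trained energy function described above. The pseudocount, the uniform background and the function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical aligned sites and sequence, for illustration only.
example_pwm = build_log_odds_pwm(["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"])
example_hits = scan("GGTGACTCAGG", example_pwm, threshold=5.0)
```

Because each position contributes independently to the score, short and degenerate motifs match many windows by chance, which is the false positive problem noted above.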
Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as “position weight matrices” (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps.
I describe a straightforward generalization of the PWM model that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a “dinucleotide weight matrix” (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined “core motifs” by about 10bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the “signature” in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.
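The idea of counting dinucleotides over all position pairs, including non-adjacent ones, can be sketched as follows. This is a simplified illustration under stated assumptions and not the published DWM method: it sums log-odds over every pair and does not correct for the fact that the entries are not independent probabilities. The pseudocount, the uniform 1/16 background and the function names are hypothetical.

```python
import math
from itertools import combinations

BASES = "ACGT"

def build_dwm(sites, pseudocount=0.5):
    """Log-odds table of dinucleotides for every position pair (i, j), i < j."""
    length = len(sites[0])
    total = len(sites) + 16 * pseudocount
    dwm = {}
    for i, j in combinations(range(length), 2):
        counts = {a + b: pseudocount for a in BASES for b in BASES}
        for site in sites:
            counts[site[i] + site[j]] += 1
        # Compare the observed pair frequency with a uniform background of 1/16.
        dwm[(i, j)] = {pair: math.log2((c / total) * 16) for pair, c in counts.items()}
    return dwm

def score_dwm(window, dwm):
    """Sum the pairwise log-odds over all position pairs of the window."""
    return sum(table[window[i] + window[j]] for (i, j), table in dwm.items())
```

With the hypothetical sites `ACGT`, `ACGT`, `TCGA`, `TCGA`, positions 0 and 3 are perfectly correlated. A mononucleotide PWM cannot distinguish `ACGT` from `ACGA`, whereas this pairwise score ranks `ACGT` higher. The cost grows quadratically with motif length, consistent with the remark above that the dinucleotide method is slower than PWM-based approaches.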
Many dimeric protein complexes bind cooperatively to families of bipartite nucleic acid sequence elements, which consist of pairs of conserved half-site sequences separated by intervening distances that vary among individual sites.
We introduce the Bipad Server, a web interface to predict sequence elements embedded within unaligned sequences. Either a bipartite model, consisting of a pair of one-block position weight matrices (PWMs) with a gap distribution, or a single PWM for contiguous single-block motifs may be produced. The Bipad program performs multiple local alignment by entropy minimization and cyclic refinement using a stochastic greedy search strategy. The best models are refined by maximizing incremental information contents among a set of potential models with varying half-site and gap lengths.
The web service generates information positional weight matrices, identifies binding site motifs, graphically represents the set of discovered elements as a sequence logo, and depicts the gap distribution as a histogram. Server performance was evaluated by generating a collection of bipartite models for distinct DNA binding proteins.
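The information content that underlies sequence logos and entropy-based alignment can be illustrated with a short sketch. It uses the standard Shannon formulation for DNA with a uniform background and without a small-sample correction. It is not the Bipad search procedure, and the function names are hypothetical.

```python
import math

def column_information(column_counts):
    """Information content (bits) of one alignment column of base counts."""
    total = sum(column_counts.values())
    entropy = 0.0
    for count in column_counts.values():
        if count > 0:  # the term for p = 0 is taken as 0
            p = count / total
            entropy -= p * math.log2(p)
    # Maximum entropy for four bases is log2(4) = 2 bits.
    return 2.0 - entropy

def motif_information(count_matrix):
    """Total information content of a motif: the sum over its columns."""
    return sum(column_information(column) for column in count_matrix)
```

A fully conserved column carries 2 bits and a uniform column carries 0 bits, so minimizing entropy over candidate alignments is equivalent to maximizing this quantity.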
The structures of DNA–protein complexes have illuminated the diversity of DNA–protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA–protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on ∼4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to ∼1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at http://cssb.biology.gatech.edu/skolnick/webservice/DBD-Hunter/.
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions.
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) on the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups: those strongly correlated with CGI and those not correlated with CGI.
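The role of the partial correlation coefficient can be made concrete with the standard first-order formula, which gives the correlation between two variables after removing the linear effect of a third. The sketch below assumes the three pairwise Pearson correlations are already available. How the study computed them is not stated here, and the function and argument names are hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z.

    In the setting above one might take x = promoter, y = predicted
    TFBS cluster and z = CpG island (an illustrative assignment).
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0:
        raise ValueError("partial correlation undefined when |r_xz| or |r_yz| is 1")
    return (r_xy - r_xz * r_yz) / denom
```

When the promoter-cluster correlation is fully explained by CGI (r_xy equal to r_xz times r_yz), the partial correlation is zero. When CGI is uncorrelated with both other properties, it equals the raw correlation.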
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
Transcription factor-DNA interactions, central to cellular regulation and control, are commonly described by position weight matrices (PWMs). These matrices are frequently used to predict transcription factor binding sites in regulatory regions of DNA to complement and guide further experimental investigation. The DNA sequence preferences of transcription factors, encoded in PWMs, are dictated primarily by select residues within the DNA binding domain(s) that interact directly with DNA. Therefore, the DNA binding properties of homologous transcription factors with identical DNA binding domains may be characterized by PWMs derived from different species. Accordingly, we have implemented a fully automated domain-level homology searching method for identical DNA binding sequences.
By applying the domain-level homology search to transcription factors with existing PWMs in the JASPAR and TRANSFAC databases, we were able to significantly increase coverage in terms of the total number of PWMs associated with a given species, assign PWMs to transcription factors that did not previously have any associations, and increase the number of represented species with PWMs over an order of magnitude. Additionally, using protein binding microarray (PBM) data, we have validated the domain-level method by demonstrating that transcription factor pairs with matching DNA binding domains exhibit comparable DNA binding specificity predictions to transcription factor pairs with completely identical sequences.
The increased coverage achieved herein demonstrates the potential for more thorough species-associated investigation of protein-DNA interactions using existing resources. The PWM scanning results highlight the challenging nature of transcription factors that contain multiple DNA binding domains, as well as the impact of motif discovery on the ability to predict DNA binding properties. The method is additionally suitable for identifying domain-level homology mappings to enable utilization of additional information sources in the study of transcription factors. The domain-level homology search method, resulting PWM mappings, web-based user interface, and web API are publicly available at http://dodoma.systemsbiology.net.
Computational identification of transcription factor binding sites is an important research area of computational biology. The positional weight matrix (PWM) is a model that describes the sequence pattern of binding sites. Usually, transcription factor binding site prediction methods based on PWMs require user-defined thresholds. The arbitrary threshold and the relatively low specificity of the algorithm prevent the result of such an analysis from being properly interpreted. In this study, a method was developed to identify over-represented cis-elements with PWM-based similarity scores. Three sets of closely related promoters were analyzed, and only over-represented motifs with high PWM similarity scores were reported. The thresholds used to evaluate the similarity scores against the PWMs of putative transcription factor binding sites are determined automatically during the analysis and can be reused in further research with the same PWMs. The online program is available on the website: http://www.bioinfo.tsinghua.edu.cn/~zhengjsh/OTFBS/.
We identified binding sites for Epstein-Barr virus (EBV) nuclear antigen 1 (EBNA1) in the human genome using chromatin immunoprecipitation and microarrays. The sequences for these newly identified sites were used to generate a position-weighted matrix (PWM) for EBNA1's DNA-binding sites. This PWM helped identify additional DNA-binding sites for EBNA1 in the genomes of EBV, Kaposi's sarcoma-associated herpesvirus, and cercopithecine herpesvirus 15 (CeHV-15) (also called herpesvirus papio 15). In particular, a homologue of the Rep* locus in EBV was predicted in the genome of CeHV-15, which is notable because Rep* of EBV was not predicted by the previously developed consensus sequence for EBNA1's binding DNA. The Rep* of CeHV-15 functions as an origin of DNA synthesis in the EBV-positive cell line Raji; this finding thus builds on a set of DNA-binding sites for EBNA1 predicted in silico.
Position-weight matrices (PWMs) are broadly used to locate transcription factor binding sites in DNA sequences. The majority of existing PWMs provide a low level of both sensitivity and specificity. We present a new computational algorithm, a modification of the Staden–Bucher approach, that improves the PWM. We applied the proposed technique on the PWM of the GC-box, binding site for Sp1. The comparison of old and new PWMs shows that the latter increase both sensitivity and specificity. The statistical parameters of GC-box distribution in promoter regions and in the human genome, as well as in each chromosome, are presented. The majority of commonly used PWMs are the 4-row mononucleotide matrices, although 16-row dinucleotide matrices are known to be more informative. The algorithm efficiently determines the 16-row matrices and preliminary results show that such matrices provide better results than 4-row matrices.
Binding of many eukaryotic transcription regulatory proteins to their DNA recognition sequences results in conformational changes in DNA. To test the effect of altering DNA topology by prebending a transcription factor binding site, we examined the interaction of the estrogen receptor (ER) DNA binding domain (DBD) with prebent estrogen response elements (EREs). When the ERE in minicircle DNA was prebent toward the major groove, which is in the same direction as the ER-induced DNA bend, there was no significant effect on ER DBD binding relative to the linear counterparts. However, when the ERE was bent toward the minor groove, in a direction that opposes the ER-induced DNA bend, there was a four- to eightfold reduction in ER DBD binding. Since reduced binding was also observed with the ERE in nicked circles, the reduction in binding was not due to torsional force induced by binding of ER DBD to the prebent ERE in covalently closed minicircles. To determine the mechanism responsible for reduced binding to the prebent ERE, we examined the effect of prebending the ERE on the association and dissociation of the ER DBD. Binding of the ER DBD to ERE-containing minicircles was rapid when the EREs were prebent toward either the major or minor groove of the DNA (k(on) of 9.9 x 10(6) to 1.7 x 10(7) M(-1) s(-1)). Prebending the ERE toward the minor groove resulted in an increase in k(off) of four- to fivefold. Increased dissociation of the ER DBD from the ERE is, therefore, the major factor responsible for reduced binding of the ER DBD to an ERE prebent toward the minor groove. These data provide the first direct demonstration that the interaction of a eukaryotic transcription factor with its recognition sequence can be strongly influenced by altering DNA topology through prebending the DNA.
Summary: The transcriptional activator AREA is a member of the GATA family of transcription factors and mediates nitrogen metabolite repression in the fungus Aspergillus nidulans. The nutritional versatility of A. nidulans and its amenability to classical and reverse genetic manipulations make the AREA DNA binding domain (DBD) a useful model for analyzing GATA family DBDs, particularly as structures of two AREA-DNA complexes have been determined. The 109 extant mutant forms of the AREA DBD surveyed here constitute one of the highest totals of eukaryotic transcription factor DBD mutants, are discussed in light of the roles of individual residues, and are compared to corresponding mutant sequence changes in other fungal GATA factor DBDs. Other topics include delineation of the DBD using both homology and mutational truncation, use of frameshift reversion to detect regions of tolerance to mutational change, the finding that duplication of the DBD can apparently enhance AREA function, and use of the AREA system to analyze a vertebrate GATA factor DBD. Some major points to emerge from work on the AREA DBD are (i) tolerance to sequence change (with retention of function) is surprisingly great, (ii) mutational changes in a transcription factor can have widely differing, even opposing, effects on expression of different structural genes so that monitoring expression of one or even several structural genes can be insufficient and possibly misleading, and (iii) a mutational change altering local hydrophobic packing and DNA binding target specificity can markedly influence the behavior of mutational changes elsewhere in the DBD.
Knowledge of transcription factor-DNA binding patterns is crucial for understanding gene transcription. Numerous DNA-binding proteins are annotated as transcription factors in the literature, however, for many of them the corresponding DNA-binding motifs remain uncharacterized.
The position weight matrices (PWMs) of transcription factors from different structural classes have been determined using a knowledge-based statistical potential. The scoring function calibrated against crystallographic data on protein-DNA contacts recovered PWMs of various members of widely studied transcription factor families such as p53 and NF-κB. Where it was possible, extensive comparison to experimental binding affinity data and other physical models was made. Although the p50p50, p50RelB, and p50p65 dimers belong to the same family, particular differences in their PWMs were detected, thereby suggesting possibly different in vivo binding modes. The PWMs of p63 and p73 were computed on the basis of homology modeling and their performance was studied using upstream sequences of 85 p53/p73-regulated human genes. Interestingly, about half of the p63 and p73 hits reported by the Match algorithm in the altogether 126 promoters lay more than 2 kb upstream of the corresponding transcription start sites, which deviates from the common assumption that most regulatory sites are located more proximal to the TSS. The fact that in most of the cases the binding sites of p63 and p73 did not overlap with the p53 sites suggests that p63 and p73 could influence the p53 transcriptional activity cooperatively. The newly computed p50p50 PWM recovered 5 more experimental binding sites than the corresponding TRANSFAC matrix, while both PWMs showed comparable receiver operator characteristics.
A novel algorithm was developed to calculate position weight matrices from protein-DNA complex structures. The proposed algorithm was extensively validated against experimental data. The method was further combined with Homology Modeling to obtain PWMs of factors for which crystallographic complexes with DNA are not yet available. The performance of PWMs obtained in this work in comparison to traditionally constructed matrices demonstrates that the structure-based approach presents a promising alternative to experimental determination of transcription factor binding properties.
Identifying binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping.
The present work implements the optimization algorithm applied to the existing PWM for GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing available PWM is based on experimental data adopted from Jaspar. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage specific factors in human promoters.
Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
Transcription factor; Binding sites; GATA-3; Human promoter; Position weight matrix; Optimization
Summary: In the post-genomic era, the annotation of protein function facilitates the understanding of various biological processes. To extend the range of function annotation methods to the twilight zone of sequence identity, we have developed approaches that exploit both protein tertiary structure and/or protein sequence evolutionary relationships. To serve the scientific community, we have integrated the structure prediction tools, TASSER, TASSER-Lite and METATASSER, and the functional inference tools, FINDSITE, a structure-based algorithm for binding site prediction, Gene Ontology molecular function inference and ligand screening, EFICAz2, a sequence-based approach to enzyme function inference and DBD-hunter, an algorithm for predicting DNA-binding proteins and associated DNA-binding residues, into a unified web resource, Protein Structure and Function prediction Resource (PSiFR).
Availability and implementation: PSiFR is freely available for use on the web at http://psifr.cssb.biology.gatech.edu/
EBNA3C is a potent repressor of transcription when bound to DNA as a fusion with the DNA binding domain (DBD) of GAL4. A survey of promoters has revealed that the wild-type, unfused EBNA3C can specifically repress expression from reporter plasmids containing the Epstein-Barr virus Cp latency-associated promoter. Repression of Cp activity required amino acids 207 to 368, which encompasses a region resembling a basic DBD adjacent to a leucine zipper DNA binding motif and a site which binds to the cellular factor CBF1/RBP-Jkappa. However, amino acids 207 to 368 are dispensable when the protein is bound to DNA as a fusion with the GAL4 DBD, thus implicating this region in DNA binding. Mutation of the CBF1/RBP-Jkappa binding site in EBNA3C abrogated repression, strongly suggesting that CBF1/RBP-Jkappa is necessary for targeting the viral protein to Cp. Consistent with this result, mutation of the EBNA2 response element (a CBF1/RBP-Jkappa binding site) in Cp also prevented significant repression. In addition, amino acids 346 to 543, which were previously defined as important for the repressor activity of the GAL4-EBNA3C fusion proteins, also appear to be necessary for the repression of Cp. Since repression by these fusions was not observed in all cell types, it seems likely that EBNA3C either depends on a corepressor which may interact with amino acids 346 to 543 or is modified in a cell-specific manner in order to repress. These data are consistent with EBNA3C contributing to the regulation of EBNA expression in latently infected B cells through CBF1/RBP-Jkappa and another factor, but this need not directly involve EBNA2. Finally, although it has been reported that EBNA3C can upregulate CD21 in some B cells, we were unable to demonstrate any effect of EBNA3C on reporter plasmids which contain the CD21 promoter.
Accurate prediction of transcription factor binding sites (TFBSs) is a prerequisite for identifying cis-regulatory modules that underlie transcriptional regulatory circuits encoded in the genome. Here, we present a computational framework for detecting TFBSs when multiple position weight matrices (PWMs) for a transcription factor are available. Grouping multiple PWMs of a transcription factor (TF) based on their sequence similarity improves the specificity of TFBS prediction, which was evaluated using multiple genome-wide ChIP-Seq data sets from 26 TFs. The Z-scores of the area under the receiver operating characteristic curve (AUC) values of 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF-bound ChIP loci. Motifs that co-occur with the empirical binding of E2F, JUN or MYC were evaluated in the basal or stimulated condition. The results indicate that our method can be used to systematically identify the co-occurring motifs of a TF for the given conditions.
Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.
The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software tool called TFM-PVALUE, which is freely available.
We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
To date, only a limited number of transcriptional regulatory interactions have been uncovered. In a pilot study integrating sequence data with microarray data, a position weight matrix (PWM) performed poorly in inferring transcriptional interactions (TIs), which represent physical interactions between transcription factors (TFs) and upstream sequences of target genes. Inferring a TI means that the promoter sequence of a target is inferred to match the consensus sequence motifs of a potential TF, and that their interaction type, such as AT or RT, is also predicted. Thus, a robust PWM (rPWM) was developed to search for consensus sequence motifs. In addition to rPWM, one feature extracted from ChIP-chip data was incorporated to identify potential TIs under specific conditions. An interaction type classifier was assembled to predict activation/repression of potential TIs using microarray data. This approach, combining an adaptive (learning) fuzzy inference system and an interaction type classifier to predict transcriptional regulatory networks, was named AdaFuzzy.
AdaFuzzy was applied to predict TIs using real genomics data from Saccharomyces cerevisiae. Following one of the latest advances in predicting TIs, constrained probabilistic sparse matrix factorization (cPSMF), and using 19 transcription factors (TFs), we compared AdaFuzzy to four well-known approaches using over-representation analysis and gene set enrichment analysis. AdaFuzzy outperformed these four algorithms. Furthermore, AdaFuzzy was shown to perform comparably to 'ChIP-experimental method' in inferring TIs identified by two sets of large scale ChIP-chip data, respectively. AdaFuzzy was also able to classify all predicted TIs into one or more of the four promoter architectures. The results coincided with known promoter architectures in yeast and provided insights into transcriptional regulatory mechanisms.
AdaFuzzy successfully integrates multiple types of data (sequence, ChIP, and microarray) to predict transcriptional regulatory networks. The validated success in the prediction results implies that AdaFuzzy can be applied to uncover TIs in yeast.
The RFX DNA binding domain (DBD) is a novel, highly conserved motif belonging to a large number of dimeric DNA binding proteins which have diverse regulatory functions in eukaryotic organisms, ranging from yeasts to humans. To characterize this novel motif, solid phase synthesis of a 76mer polypeptide corresponding to the DBD of human hRFX1 (hRFX1/DBD), a prototypical member of the RFX family, has been optimized to yield large quantities (approximately 90 mg) of pure compound. Preliminary two-dimensional 1H NMR experiments suggested the presence of helical regions in this sequence, in agreement with previously reported secondary structure predictions. In gel mobility shift assays, this synthetic peptide was shown to bind cooperatively to the 23mer duplex oligodeoxynucleotide corresponding to the binding site of hRFX1, with a 2:1 stoichiometry due to an inverse repeat present in the 23mer. The stoichiometry of this complex was reduced to 1:1 by decreasing the length of the DNA sequence to a 13mer oligonucleotide containing a single half-site. Surface plasmon resonance measurements were performed using this 5'-biotinylated 13mer oligonucleotide immobilized on an avidin-coated sensor chip. Using this method, an association rate constant (k_a = 4 x 10^5 M^-1 s^-1), a dissociation rate constant (k_d = 6 x 10^-2 s^-1) and an equilibrium dissociation constant (K_D = 153 nM) were determined for binding of hRFX1/DBD to the double-stranded 13mer oligonucleotide. In the presence of hRFX1/DBD, the melting temperature of the 13mer DNA was increased by 16 °C, illustrating stabilization of the double-stranded conformation induced by the peptide.
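The three kinetic values reported above are linked by the standard relation for 1:1 binding, in which the equilibrium dissociation constant is the ratio of the dissociation rate constant to the association rate constant. The short check below applies that relation to the rounded rate constants quoted in the abstract; the function name is an illustrative assumption.

```python
def equilibrium_dissociation_constant(k_off, k_on):
    """K_D in mol/L for 1:1 binding, from k_off (1/s) and k_on (1/(M*s))."""
    return k_off / k_on


# Rounded rate constants as quoted in the abstract.
k_on = 4e5    # association rate constant, 1/(M*s)
k_off = 6e-2  # dissociation rate constant, 1/s

kd_molar = equilibrium_dissociation_constant(k_off, k_on)
kd_nanomolar = kd_molar * 1e9
print(f"K_D is about {kd_nanomolar:.0f} nM")
```

The rounded inputs give roughly 150 nM, which is consistent with the reported 153 nM given that the rate constants are quoted to one significant figure.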
Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF–DNA binding problem, which have frequently been shown to be more effective than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine a few state-of-the-art learning methods including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF–TF interactions, than the PWM approach. In particular, BART and boosting show the best and the most robust overall performance among all the methods.
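To make the idea of relating sequence features to binding intensity concrete, the sketch below builds k-mer count features and picks the single k-mer most correlated with the response, which corresponds to the first step of a forward stepwise linear regression. It is a toy illustration only: the sequences and intensities are invented for the example, the function names are hypothetical, and none of the methods compared in the study is reproduced here.

```python
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping occurrences of every DNA k-mer in seq."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return counts


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_single_feature(seqs, intensities, k=2):
    """Return the k-mer with the largest absolute correlation and its r."""
    table = [kmer_counts(s, k) for s in seqs]
    scores = {
        word: pearson([row[word] for row in table], intensities)
        for word in table[0]
    }
    best = max(scores, key=lambda word: abs(scores[word]))
    return best, scores[best]


# Invented toy data: intensity equals the number of "GC" dinucleotides.
toy_seqs = ["GCGCAT", "ATGCAA", "ATATAT", "GCGCGC"]
toy_intensities = [2.0, 1.0, 0.0, 3.0]
```

A full stepwise procedure would go on to add further features conditional on those already chosen; the more flexible learners named above replace this greedy linear step with nonlinear models.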