Search tips
Search criteria

Results 1-25 (1556815)

Clipboard (0)

Related Articles

1.  DBD2BS: connecting a DNA-binding protein with its binding sites 
Nucleic Acids Research  2012;40(Web Server issue):W173-W179.
By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein–DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD–DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein–DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies.
Available at:,, and
PMCID: PMC3394304  PMID: 22693214
2.  PiDNA: predicting protein–DNA interactions with structural models 
Nucleic Acids Research  2013;41(Web Server issue):W523-W530.
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at:
PMCID: PMC3692134  PMID: 23703214
3.  A Threading-Based Method for the Prediction of DNA-Binding Proteins with Application to the Human Genome 
PLoS Computational Biology  2009;5(11):e1000567.
Diverse mechanisms for DNA-protein recognition have been elucidated in numerous atomic complex structures from various protein families. These structural data provide an invaluable knowledge base not only for understanding DNA-protein interactions, but also for developing specialized methods that predict the DNA-binding function from protein structure. While such methods are useful, a major limitation is that they require an experimental structure of the target as input. To overcome this obstacle, we develop a threading-based method, DNA-Binding-Domain-Threader (DBD-Threader), for the prediction of DNA-binding domains and associated DNA-binding protein residues. Our method, which uses a template library composed of DNA-protein complex structures, requires only the target protein's sequence. In our approach, fold similarity and DNA-binding propensity are employed as two functional discriminating properties. In benchmark tests on 179 DNA-binding and 3,797 non-DNA-binding proteins, using templates whose sequence identity is less than 30% to the target, DBD-Threader achieves a sensitivity/precision of 56%/86%. This performance is considerably better than the standard sequence comparison method PSI-BLAST and is comparable to DBD-Hunter, which requires an experimental structure as input. Moreover, for over 70% of predicted DNA-binding domains, the backbone Root Mean Square Deviations (RMSDs) of the top-ranked structural models are within 6.5 Å of their experimental structures, with their associated DNA-binding sites identified at satisfactory accuracy. Additionally, DBD-Threader correctly assigned the SCOP superfamily for most predicted domains. To demonstrate that DBD-Threader is useful for automatic function annotation on a large-scale, DBD-Threader was applied to 18,631 protein sequences from the human genome; 1,654 proteins are predicted to have DNA-binding function. Comparison with existing Gene Ontology (GO) annotations suggests that ∼30% of our predictions are new. Finally, we present some interesting predictions in detail. In particular, it is estimated that ∼20% of classic zinc finger domains play a functional role not related to direct DNA-binding.
Author Summary
DNA-binding proteins represent only a small fraction of proteins encoded in genomes, yet they play a critical role in a variety of biological activities. Identifying these proteins and understanding how they function are important issues. The structures of solved DNA protein complexes of different protein families provide an invaluable knowledge base not only for understanding DNA-protein interactions, but also for developing methods that predict whether or not a protein binds DNA. While such methods are useful, they require an experimental structure as input. To overcome this obstacle, we have developed a threading-based method for the prediction of DNA-binding domains and associated DNA-binding protein residues from protein sequence. The method has higher accuracy in large scale benchmarking than methods based on sequence similarity alone. Application to the human proteome identified potential targets of not only previously unknown DNA-binding proteins, but also of biologically interesting ones that are related to, yet evolved from, DNA-binding proteins.
PMCID: PMC2770119  PMID: 19911048
4.  Prediction of DNA-binding protein based on statistical and geometric features and support vector machines 
Proteome Science  2011;9(Suppl 1):S1.
Previous studies on protein-DNA interaction mostly focused on the bound structure of DNA-binding proteins but few paid enough attention to the unbound structures. As more new proteins are discovered, it is useful and imperative to develop algorithms for the functional prediction of unbound proteins. In our work, we apply an alpha shape model to represent the surface structure of the protein-DNA complex and extract useful statistical and geometric features, and use structural alignment and support vector machines for the prediction of unbound DNA-binding proteins.
The performance of our method is evaluated by discriminating a set of 104 DNA-binding proteins from 401 non-DNA-binding proteins. In the same test, the proposed method outperforms the other method using conditional probability. The results achieved by our proposed method for; precision, 83.33%; accuracy, 86.53%; and MCC, 0.5368 demonstrate its good performance.
In this study we develop an effective method for the prediction of protein-DNA interactions based on statistical and geometric features and support vector machines. Our results show that interface surface features play an important role in protein-DNA interaction. Our technique is able to predict unbound DNA-binding protein and discriminatory DNA-binding proteins from proteins that bind with other molecules.
PMCID: PMC3289070  PMID: 22166014
5.  Creating PWMs of transcription factors using 3D structure-based computation of protein-DNA free binding energies 
BMC Bioinformatics  2010;11:225.
Knowledge of transcription factor-DNA binding patterns is crucial for understanding gene transcription. Numerous DNA-binding proteins are annotated as transcription factors in the literature, however, for many of them the corresponding DNA-binding motifs remain uncharacterized.
The position weight matrices (PWMs) of transcription factors from different structural classes have been determined using a knowledge-based statistical potential. The scoring function calibrated against crystallographic data on protein-DNA contacts recovered PWMs of various members of widely studied transcription factor families such as p53 and NF-κB. Where it was possible, extensive comparison to experimental binding affinity data and other physical models was made. Although the p50p50, p50RelB, and p50p65 dimers belong to the same family, particular differences in their PWMs were detected, thereby suggesting possibly different in vivo binding modes. The PWMs of p63 and p73 were computed on the basis of homology modeling and their performance was studied using upstream sequences of 85 p53/p73-regulated human genes. Interestingly, about half of the p63 and p73 hits reported by the Match algorithm in the altogether 126 promoters lay more than 2 kb upstream of the corresponding transcription start sites, which deviates from the common assumption that most regulatory sites are located more proximal to the TSS. The fact that in most of the cases the binding sites of p63 and p73 did not overlap with the p53 sites suggests that p63 and p73 could influence the p53 transcriptional activity cooperatively. The newly computed p50p50 PWM recovered 5 more experimental binding sites than the corresponding TRANSFAC matrix, while both PWMs showed comparable receiver operator characteristics.
A novel algorithm was developed to calculate position weight matrices from protein-DNA complex structures. The proposed algorithm was extensively validated against experimental data. The method was further combined with Homology Modeling to obtain PWMs of factors for which crystallographic complexes with DNA are not yet available. The performance of PWMs obtained in this work in comparison to traditionally constructed matrices demonstrates that the structure-based approach presents a promising alternative to experimental determination of transcription factor binding properties.
PMCID: PMC2879287  PMID: 20438625
6.  Conformational Transitions upon Ligand Binding: Holo-Structure Prediction from Apo Conformations 
PLoS Computational Biology  2010;6(1):e1000634.
Biological function of proteins is frequently associated with the formation of complexes with small-molecule ligands. Experimental structure determination of such complexes at atomic resolution, however, can be time-consuming and costly. Computational methods for structure prediction of protein/ligand complexes, particularly docking, are as yet restricted by their limited consideration of receptor flexibility, rendering them not applicable for predicting protein/ligand complexes if large conformational changes of the receptor upon ligand binding are involved. Accurate receptor models in the ligand-bound state (holo structures), however, are a prerequisite for successful structure-based drug design. Hence, if only an unbound (apo) structure is available distinct from the ligand-bound conformation, structure-based drug design is severely limited. We present a method to predict the structure of protein/ligand complexes based solely on the apo structure, the ligand and the radius of gyration of the holo structure. The method is applied to ten cases in which proteins undergo structural rearrangements of up to 7.1 Å backbone RMSD upon ligand binding. In all cases, receptor models within 1.6 Å backbone RMSD to the target were predicted and close-to-native ligand binding poses were obtained for 8 of 10 cases in the top-ranked complex models. A protocol is presented that is expected to enable structure modeling of protein/ligand complexes and structure-based drug design for cases where crystal structures of ligand-bound conformations are not available.
Author Summary
Structure-based drug design has become a powerful tool in modern drug discovery pipelines. A critical prerequisite is a structure of the target protein close to its ligand bound conformation which is often difficult to determine experimentally. In many cases, a structure of the unbound receptor is available, but conformational changes with respect to the ligand-bound form preclude it from being used as a basis for structure-based drug design. We have developed a computational approach to predict the structure of protein/ligand complexes based solely on the unbound conformation, the ligand, and easy-to-assess experimental data. We tested our protocol on proteins that undergo substantial structural rearrangements upon binding a ligand and were able to predict structures of protein/ligand complexes which are in good agreement with experimentally determined structures. The ability to predict ligand bound receptor conformations based on structures in the unbound state enables structure-based drug design for cases where crystallization of the complex has not been successful so far.
PMCID: PMC2796265  PMID: 20066034
7.  Predicting DNA-binding locations and orientation on proteins using knowledge-based learning of geometric properties 
Proteome Science  2011;9(Suppl 1):S11.
DNA-binding proteins perform their functions through specific or non-specific sequence recognition. Although many sequence- or structure-based approaches have been proposed to identify DNA-binding residues on proteins or protein-binding sites on DNA sequences with satisfied performance, it remains a challenging task to unveil the exact mechanism of protein-DNA interactions without crystal complex structures. Without information from complexes, the linkages between DNA-binding proteins and their binding sites on DNA are still missing.
While it is still difficult to acquire co-crystallized structures in an efficient way, this study proposes a knowledge-based learning method to effectively predict DNA orientation and base locations around the protein’s DNA-binding sites when given a protein structure. First, the functionally important residues of a query protein are predicted by a sequential pattern mining tool. After that, surface residues falling in the predicted functional regions are determined based on the given structure. These residues are then clustered based on their spatial coordinates and the resultant clusters are ranked by a proposed DNA-binding propensity function. Clusters with high DNA-binding propensities are treated as DNA-binding units (DBUs) and each DBU is analyzed by principal component analysis (PCA) to predict potential orientation of DNA grooves. More specifically, the proposed method is developed to predict the direction of the tangent line to the helix curve of the DNA groove where a DBU is going to bind.
This paper proposes a knowledge-based learning procedure to determine the spatial location of the DNA groove with respect to the query protein structure by considering geometric propensity between protein side chains and DNA bases. The 11 test cases used in this study reveal that the location and orientation of the DNA groove around a selected DBU can be predicted with satisfied errors.
This study presents a method to predict the location and orientation of DNA grooves with respect to the structure of a DNA-binding protein. The test cases shown in this study reveal the possibility of imaging protein-DNA binding conformation before co-crystallized structure can be determined. How the proposed method can be incorporated with existing protein-DNA docking tools to study protein-DNA interactions deserve further studies in the near future.
PMCID: PMC3289072  PMID: 22166082
8.  From Nonspecific DNA–Protein Encounter Complexes to the Prediction of DNA–Protein Interactions 
PLoS Computational Biology  2009;5(3):e1000341.
DNA–protein interactions are involved in many essential biological activities. Because there is no simple mapping code between DNA base pairs and protein amino acids, the prediction of DNA–protein interactions is a challenging problem. Here, we present a novel computational approach for predicting DNA-binding protein residues and DNA–protein interaction modes without knowing its specific DNA target sequence. Given the structure of a DNA-binding protein, the method first generates an ensemble of complex structures obtained by rigid-body docking with a nonspecific canonical B-DNA. Representative models are subsequently selected through clustering and ranking by their DNA–protein interfacial energy. Analysis of these encounter complex models suggests that the recognition sites for specific DNA binding are usually favorable interaction sites for the nonspecific DNA probe and that nonspecific DNA–protein interaction modes exhibit some similarity to specific DNA–protein binding modes. Although the method requires as input the knowledge that the protein binds DNA, in benchmark tests, it achieves better performance in identifying DNA-binding sites than three previously established methods, which are based on sophisticated machine-learning techniques. We further apply our method to protein structures predicted through modeling and demonstrate that our method performs satisfactorily on protein models whose root-mean-square Cα deviation from native is up to 5 Å from their native structures. This study provides valuable structural insights into how a specific DNA-binding protein interacts with a nonspecific DNA sequence. The similarity between the specific DNA–protein interaction mode and nonspecific interaction modes may reflect an important sampling step in search of its specific DNA targets by a DNA-binding protein.
Author Summary
Many essential biological activities require interactions between DNA and proteins. These proteins usually use certain amino acids, called DNA-binding sites, to recognize their specific DNA targets. To facilitate the search of its specific DNA targets, a DNA-binding protein often associates with nonspecific DNA and then diffuses along the DNA. Due to the weak interactions between nonspecific DNA and the protein, structural characterization of nonspecific DNA–protein complexes is experimentally challenging. This paper describes a computational modeling study on nonspecific DNA–protein complexes and comparative analysis with respect to specific DNA–protein complexes. The study found that the specific DNA-binding sites on a protein are typically favorable for nonspecific DNA and that nonspecific and specific DNA–protein interaction modes are quite similar. This similarity may reflect an important sampling step in the search for the specific DNA target sequence by a DNA-binding protein. On the basis of these observations, a novel method was proposed for predicting DNA-binding sites and binding modes of a DNA-binding protein without knowing its specific DNA target sequence. Ultimately, the combination of this method and protein structure prediction may lead the way to high throughput modeling of DNA–protein interactions.
PMCID: PMC2659451  PMID: 19343221
9.  TFinDit: transcription factor-DNA interaction data depository 
BMC Bioinformatics  2012;13:220.
One of the crucial steps in regulation of gene expression is the binding of transcription factor(s) to specific DNA sequences. Knowledge of the binding affinity and specificity at a structural level between transcription factors and their target sites has important implications in our understanding of the mechanism of gene regulation. Due to their unique functions and binding specificity, there is a need for a transcription factor-specific, structure-based database and corresponding web service to facilitate structural bioinformatics studies of transcription factor-DNA interactions, such as development of knowledge-based interaction potential, transcription factor-DNA docking, binding induced conformational changes, and the thermodynamics of protein-DNA interactions.
TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria.
TFinDit is a specialized structural database with annotated transcription factor-DNA complex structures and other preprocessed data. We believe that this database/web service can facilitate the development and testing of TF-DNA interaction potentials and TF-DNA docking algorithms, and the study of protein-DNA recognition mechanisms.
PMCID: PMC3483241  PMID: 22943312
Transcription factor; Database; Binding site prediction; Interaction potential
10.  Optimized Position Weight Matrices in Prediction of Novel Putative Binding Sites for Transcription Factors in the Drosophila melanogaster Genome 
PLoS ONE  2013;8(8):e68712.
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. DNA-binding proteins often show degeneracy in their binding requirement and thus the overall binding specificity of many proteins is unknown and remains an active area of research. Although existing PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. Our previous study introduced a promising approach to PWM refinement in which known motifs are used to computationally mine putative binding sites directly from aligned promoter regions using composition of similar sites. In the present study, we extended this technique originally tested on single examples of transcription factors (TFs) and showed its capability to optimize PWM performance to predict new binding sites in the fruit fly genome. We propose refined PWMs in mono- and dinucleotide versions similarly computed for a large variety of transcription factors of Drosophila melanogaster. Along with the addition of many auxiliary sites the optimization includes variation of the PWM motif length, the binding sites location on the promoters and the PWM score threshold. To assess the predictive performance of the refined PWMs we compared them to conventional TRANSFAC and JASPAR sources. The results have been verified using performed tests and literature review. Overall, the refined PWMs containing putative sites derived from real promoter content processed using optimized parameters had better general accuracy than conventional PWMs.
PMCID: PMC3735551  PMID: 23936309
11.  Binding-Induced Folding of a Natively Unstructured Transcription Factor 
PLoS Computational Biology  2008;4(4):e1000060.
Transcription factors are central components of the intracellular regulatory networks that control gene expression. An increasingly recognized phenomenon among human transcription factors is the formation of structure upon target binding. Here, we study the folding and binding of the pKID domain of CREB to the KIX domain of the co-activator CBP. Our simulations of a topology-based Gō-type model predict a coupled folding and binding mechanism, and the existence of partially bound intermediates. From transition-path and Φ-value analyses, we find that the binding transition state resembles the unstructured state in solution, implying that CREB becomes structured only after committing to binding. A change of structure following binding is reminiscent of an induced-fit mechanism and contrasts with models in which binding occurs to pre-structured conformations that exist in the unbound state at equilibrium. Interestingly, increasing the amount of structure in the unbound pKID reduces the rate of binding, suggesting a “fly-casting”-like process. We find that the inclusion of attractive non-native interactions results in the formation of non-specific encounter complexes that enhance the on-rate of binding, but do not significantly change the binding mechanism. Our study helps explain how being unstructured can confer an advantage in protein target recognition. The simulations are in general agreement with the results of a recently reported nuclear magnetic resonance study, and aid in the interpretation of the experimental binding kinetics.
Author Summary
Protein-protein interactions are central to many physiological processes. Traditionally, the atomic structure of the isolated proteins, as obtained from X-ray crystallography or NMR spectroscopy, has been thought to determine the molecular recognition and binding process. However, this view has been challenged by the discovery of natively unstructured proteins, including many human transcription factors, which assume ordered molecular structures only upon binding to their targets. Understanding these transitions from a dissociated and unfolded state to a bound and folded state is key to unraveling how transcription factors, proteins involved in the system that controls the transcription of genetic information from DNA to RNA, perform their function. We conducted molecular simulations to study the binding of a well characterized transcription/co-transcription factor complex. We found that the transcription factor is unstructured when binding to its partner, with folding into an ordered structure occurring only after the initial binding event. The resulting coupled folding-and-binding mechanism is found to be in accord with state-of-the-art experimental data. Transcription-factor binding in an unstructured state may confer an advantage in protein target recognition by accelerating the rate of association without compromising the ability to bind to a diverse set of proteins with high specificity and yet relatively low affinity.
PMCID: PMC2289845  PMID: 18404207
12.  Accurate Prediction of Peptide Binding Sites on Protein Surfaces 
PLoS Computational Biology  2009;5(3):e1000335.
Many important protein–protein interactions are mediated by the binding of a short peptide stretch in one protein to a large globular segment in another. Recent efforts have provided hundreds of examples of new peptides binding to proteins for which a three-dimensional structure is available (either known experimentally or readily modeled) but where no structure of the protein–peptide complex is known. To address this gap, we present an approach that can accurately predict peptide binding sites on protein surfaces. For peptides known to bind a particular protein, the method predicts binding sites with great accuracy, and the specificity of the approach means that it can also be used to predict whether or not a putative or predicted peptide partner will bind. We used known protein–peptide complexes to derive preferences, in the form of spatial position specific scoring matrices, which describe the binding-site environment in globular proteins for each type of amino acid in bound peptides. We then scan the surface of a putative binding protein for sites for each of the amino acids present in a peptide partner and search for combinations of high-scoring amino acid sites that satisfy constraints deduced from the peptide sequence. The method performed well in a benchmark and largely agreed with experimental data mapping binding sites for several recently discovered interactions mediated by peptides, including RG-rich proteins with SMN domains, Epstein-Barr virus LMP1 with TRADD domains, DBC1 with Sir2, and the Ago hook with Argonaute PIWI domain. The method, and associated statistics, is an excellent tool for predicting and studying binding sites for newly discovered peptides mediating critical events in biology.
Author Summary
An important class of protein interactions in critical cellular processes, such as signaling pathways, involves a domain from one protein binding to a linear peptide stretch of another. Many methods identify peptides mediating such interactions but without details of how the interactions occur, even when excellent structural information is available for the unbound protein. Experimental studies are currently time consuming, while existing computational methods to predict protein–peptide structures mostly focus on interactions involving specific protein families. Here, we present a general approach for predicting protein–peptide interaction sites. We show that spatial atomic position specific scoring matrices of binding sites for each peptide residue can capture the properties important for binding and when used to scan the surface of target proteins can accurately identify candidate binding sites for interacting peptides. The resulting predictions are highly illuminating for several recently described protein–peptide complexes, including RG-rich peptides with SMN domains, the Epstein-Barr virus LMP1 with TRADD domains, DBC1 with Sir2, and the Ago hook with the Argonaute PIWI domain. The accurate prediction of protein–peptide binding without prior structural knowledge will ultimately enable better functional characterization of many protein interactions involved in vital biological processes and provide a better picture of cellular mechanisms.
PMCID: PMC2653190  PMID: 19325869
13.  De novo prediction of DNA-binding specificities for Cys2His2 zinc finger proteins 
Nucleic Acids Research  2013;42(1):97-108.
Proteins with sequence-specific DNA binding function are important for a wide range of biological activities. De novo prediction of their DNA-binding specificities from sequence alone would be a great aid in inferring cellular networks. Here we introduce a method for predicting DNA-binding specificities for Cys2His2 zinc fingers (C2H2-ZFs), the largest family of DNA-binding proteins in metazoans. We develop a general approach, based on empirical calculations of pairwise amino acid–nucleotide interaction energies, for predicting position weight matrices (PWMs) representing DNA-binding specificities for C2H2-ZF proteins. We predict DNA-binding specificities on a per-finger basis and merge predictions for C2H2-ZF domains that are arrayed within sequences. We test our approach on a diverse set of natural C2H2-ZF proteins with known binding specificities and demonstrate that for >85% of the proteins, their predicted PWMs are accurate in 50% of their nucleotide positions. For proteins with several zinc finger isoforms, we show via case studies that this level of accuracy enables us to match isoforms with their known DNA-binding specificities. A web server for predicting a PWM given a protein containing C2H2-ZF domains is available online at and can be used to aid in protein engineering applications and in genome-wide searches for transcription factor targets.
PMCID: PMC3874201  PMID: 24097433
14.  Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression 
BMC Genomics  2004;5:16.
Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions.
We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group occupied 11 percent. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI.
Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
PMCID: PMC375527  PMID: 15053842
promoter; tissue-specific gene expression; position weight matrix; regulatory motif
15.  The multiple-specificity landscape of modular peptide recognition domains 
Using large scale experimental datasets, the authors show how modular protein interaction domains such as PDZ, SH3 or WW domains, frequently display unexpected multiple binding specificity. The observed multiple specificity leads to new structural insights and accurately predicts new protein interactions.
Modular protein domains interacting with short linear peptides, such as PDZ, SH3 or WW domains, display a rich binding specificity with significant interplay (or correlation) between ligand residues.The binding specificity of these domains is more accurately described with a multiple specificity model.The multiple specificity reveals new structural insights and predicts new protein interactions.
Modular protein domains have a central role in the complex network of signaling pathways that governs cellular processes. Many of them, called peptide recognition domains, bind short linear regions in their target proteins, such as the well-known SH3 or PDZ domains. These domain–peptide interactions are the predominant form of protein interaction in signaling pathways.
Because of the relative simplicity of the interaction, their binding specificity is generally represented using a simple model, analogous to transcription factor binding: the domain binds a short stretch of amino acids and at each position some amino acids are preferred over other ones. Thus, for each position, a probability can be assigned to each amino acid and these probabilities are often grouped into a matrix called position weight matrix (PWM) or position-specific scoring matrix. Such a matrix can then be represented in a highly intuitive manner as a so-called sequence logo (see Figure 1).
A main shortcoming of this specificity model is that, although intuitive and interpretable, it inherently assumes that all residues in the peptide contribute independently to binding. On the basis of statistical analyses of large data sets of peptides binding to PDZ, SH3 and WW domains, we show that for most domains, this is not the case. Indeed, there is complex and highly significant interplay between the ligand residues. To overcome this issue, we develop a computational model that can both take into account such correlations and also preserve the advantages of PWMs, namely its straightforward interpretability.
Briefly, our method detects whether the domain is capable of binding its targets not only with a single specificity but also with multiple specificities. If so, it will determine all the relevant specificities (see Figure 1). This is accomplished by using a machine learning algorithm based on mixture models, and the results can be effectively visualized as multiple sequence logos. In other words, based on experimentally derived data sets of binding peptides, we determine for every domain, in addition to the known specificity, one or more new specificities. As such, we capture more real information, and our model performs better than previous models of binding specificity.
A crucial question is what these new specificities correspond to: are they simply mathematical artifacts coming out of some algorithm or do they represent something we can understand on a biophysical or structural level? Overall, the new specificities provide us with substantial new intuitive insight about the structural basis of binding for these domains. We can roughly identify two cases.
First, we have neighboring (or very close in sequence) amino acids in the ligand that show significant correlations. These usually correspond to amino acids whose side chains point in the same directions and often occupy the same physical space, and therefore can directly influence each other.
In other cases, we observe that multiple specificities found for a single domain are very different from each other. They correspond to different ways that the domain accommodates its binders. Often, conformational changes are required to switch from one binding mode to another. In almost all cases, only one canonical binding mode was previously known, and our analysis enables us to predict several interesting non-canonical ones. Specifically, we discuss one example in detail in Figure 5. In a PDZ domain of DLG1, we identify a novel binding specificity that differs from the canonical one by the presence of an additional tryptophan at the C terminus of the ligand. From a structural point of view, this would require a flexible loop to move out of the way to accommodate this rather large side chain. We find evidence of this predicted new binding mode based on both existing crystal structures and structural modeling.
Finally, our model of binding specificity leads to predictions of many new and previously unknown protein interactions. We validate a number of these using the membrane yeast two-hybrid approach.
In summary, we show here that multiple specificity is a general and underappreciated phenomenon for modular peptide recognition domains and that it leads to substantial new insight into the basis of protein interactions.
Modular protein interaction domains form the building blocks of eukaryotic signaling pathways. Many of them, known as peptide recognition domains, mediate protein interactions by recognizing short, linear amino acid stretches on the surface of their cognate partners with high specificity. Residues in these stretches are usually assumed to contribute independently to binding, which has led to a simplified understanding of protein interactions. Conversely, we observe in large binding peptide data sets that different residue positions display highly significant correlations for many domains in three distinct families (PDZ, SH3 and WW). These correlation patterns reveal a widespread occurrence of multiple binding specificities and give novel structural insights into protein interactions. For example, we predict a new binding mode of PDZ domains and structurally rationalize it for DLG1 PDZ1. We show that multiple specificity more accurately predicts protein interactions and experimentally validate some of the predictions for the human proteins DLG1 and SCRIB. Overall, our results reveal a rich specificity landscape in peptide recognition domains, suggesting new ways of encoding specificity in protein interaction networks.
PMCID: PMC3097085  PMID: 21525870
binding specificity; peptide recognition domains; PDZ; phage display; residue correlations
16.  Variable structure motifs for transcription factor binding sites 
BMC Genomics  2010;11:30.
Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.
We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.
We have presented a new gapped PWM model for variable length DNA binding sites that is not too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
PMCID: PMC2824720  PMID: 20074339
17.  A non-redundant structure dataset for benchmarking protein-RNA computational docking 
Protein-RNA interactions play an important role in many biological processes. The ability to predict the molecular structures of protein-RNA complexes from docking would be valuable for understanding the underlying chemical mechanisms. We have developed a novel non-redundant benchmark dataset for protein-RNA docking and scoring. The diverse dataset of 72 targets consists of 52 unbound-unbound test complexes, and 20 unbound-bound test complexes. Here, unbound-unbound complexes refer to cases in which both binding partners of the co-crystallized complex are either in apo form or in a conformation taken from a different protein-RNA complex, whereas unbound-bound complexes are cases in which only one of the two binding partners has another experimentally determined conformation. The dataset is classified into three categories according to the interface RMSD and the percentage of native contacts in the unbound structures: 49 easy, 16 medium, and 7 difficult targets. The bound and unbound cases of the benchmark dataset are expected to benefit the development and improvement of docking and scoring algorithms for the docking community. All the easy-to-view structures are freely available to the public at
PMCID: PMC3546201  PMID: 23047523
Benchmarking; protein-RNA interactions; molecular docking; scoring function; molecular recognition
18.  Tree-Based Position Weight Matrix Approach to Model Transcription Factor Binding Site Profiles 
PLoS ONE  2011;6(9):e24210.
Most of the position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, usually producing high false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between TF and its binding site. The whole tree-structured PWM could be considered as a mixture of different conditional-PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from the ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthew Correlation Coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach on several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method starting from existing PWM has consistently better performance than existing tools in detecting the TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of true positive rate. The findings could lead to better understanding of the mechanisms of TF-DNA interactions.
PMCID: PMC3166302  PMID: 21912677
19.  DrugScorePPI Knowledge-Based Potentials Used as Scoring and Objective Function in Protein-Protein Docking 
PLoS ONE  2014;9(2):e89466.
The distance-dependent knowledge-based DrugScorePPI potentials, previously developed for in silico alanine scanning and hot spot prediction on given structures of protein-protein complexes, are evaluated as a scoring and objective function for the structure prediction of protein-protein complexes. When applied for ranking “unbound perturbation” (“unbound docking”) decoys generated by Baker and coworkers a 4-fold (1.5-fold) enrichment of acceptable docking solutions in the top ranks compared to a random selection is found. When applied as an objective function in FRODOCK for bound protein-protein docking on 97 complexes of the ZDOCK benchmark 3.0, DrugScorePPI/FRODOCK finds up to 10% (15%) more high accuracy solutions in the top 1 (top 10) predictions than the original FRODOCK implementation. When used as an objective function for global unbound protein-protein docking, fair docking success rates are obtained, which improve by ∼2-fold to 18% (58%) for an at least acceptable solution in the top 10 (top 100) predictions when performing knowledge-driven unbound docking. This suggests that DrugScorePPI balances well several different types of interactions important for protein-protein recognition. The results are discussed in view of the influence of crystal packing and the type of protein-protein complex docked. Finally, a simple criterion is provided with which to estimate a priori if unbound docking with DrugScorePPI/FRODOCK will be successful.
PMCID: PMC3931789  PMID: 24586799
20.  Dinucleotide Weight Matrices for Predicting Transcription Factor Binding Sites: Generalizing the Position Weight Matrix 
PLoS ONE  2010;5(3):e9722.
Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as “position weight matrices” (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps.
Methodology/Principal Findings
I describe a straightforward generalization of the PWM model, that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a “dinucleotide weight matrix” (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined “core motifs” by about 10bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the “signature” in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.
PMCID: PMC2842295  PMID: 20339533
21.  Increasing Coverage of Transcription Factor Position Weight Matrices through Domain-level Homology 
PLoS ONE  2012;7(8):e42779.
Transcription factor-DNA interactions, central to cellular regulation and control, are commonly described by position weight matrices (PWMs). These matrices are frequently used to predict transcription factor binding sites in regulatory regions of DNA to complement and guide further experimental investigation. The DNA sequence preferences of transcription factors, encoded in PWMs, are dictated primarily by select residues within the DNA binding domain(s) that interact directly with DNA. Therefore, the DNA binding properties of homologous transcription factors with identical DNA binding domains may be characterized by PWMs derived from different species. Accordingly, we have implemented a fully automated domain-level homology searching method for identical DNA binding sequences.
By applying the domain-level homology search to transcription factors with existing PWMs in the JASPAR and TRANSFAC databases, we were able to significantly increase coverage in terms of the total number of PWMs associated with a given species, assign PWMs to transcription factors that did not previously have any associations, and increase the number of represented species with PWMs over an order of magnitude. Additionally, using protein binding microarray (PBM) data, we have validated the domain-level method by demonstrating that transcription factor pairs with matching DNA binding domains exhibit comparable DNA binding specificity predictions to transcription factor pairs with completely identical sequences.
The increased coverage achieved herein demonstrates the potential for more thorough species-associated investigation of protein-DNA interactions using existing resources. The PWM scanning results highlight the challenging nature of transcription factors that contain multiple DNA binding domains, as well as the impact of motif discovery on the ability to predict DNA binding properties. The method is additionally suitable for identifying domain-level homology mappings to enable utilization of additional information sources in the study of transcription factors. The domain-level homology search method, resulting PWM mappings, web-based user interface, and web API are publicly available at
PMCID: PMC3428306  PMID: 22952610
22.  DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions 
Nucleic Acids Research  2008;36(12):3978-3992.
The structures of DNA–protein complexes have illuminated the diversity of DNA–protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA–protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on ∼4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to ∼1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at
PMCID: PMC2475642  PMID: 18515839
23.  Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data 
BMC Bioinformatics  2013;14(Suppl 10):S2.
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what can they be applied to. In this study a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to a high-throughput SELEX data and the question of how to choose the most informative k-mers to the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide a much better discrimination between bound and unbound sequences as model prediction accuracies increased for all proteins together with the cycle number. We compared prediction performance of k-mer and position specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, performance of the k-mer model was on average 9%-units better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for k-mer and PWMs, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data demonstrating substantial improvements for some proteins. For protein GATA1 the model can distinquish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1 the performance of the model was no better than chance.
PMCID: PMC3750486  PMID: 24267147
24.  ParaDock: a flexible non-specific DNA—rigid protein docking algorithm 
Nucleic Acids Research  2011;39(20):e135.
Accurate prediction of protein–DNA complexes could provide an important stepping stone towards a thorough comprehension of vital intracellular processes. Few attempts were made to tackle this issue, focusing on binding patch prediction, protein function classification and distance constraints-based docking. We introduce ParaDock: a novel ab initio protein–DNA docking algorithm. ParaDock combines short DNA fragments, which have been rigidly docked to the protein based on geometric complementarity, to create bent planar DNA molecules of arbitrary sequence. Our algorithm was tested on the bound and unbound targets of a protein–DNA benchmark comprised of 47 complexes. With neither addressing protein flexibility, nor applying any refinement procedure, CAPRI acceptable solutions were obtained among the 10 top ranked hypotheses in 83% of the bound complexes, and 70% of the unbound. Without requiring prior knowledge of DNA length and sequence, and within <2 h per target on a standard 2.0 GHz single processor CPU, ParaDock offers a fast ab initio docking solution.
PMCID: PMC3203577  PMID: 21835777
25.  Protein-DNA docking with a coarse-grained force field 
BMC Bioinformatics  2012;13:228.
Protein-DNA interactions are important for many cellular processes, however structural knowledge for a large fraction of known and putative complexes is still lacking. Computational docking methods aim at the prediction of complex architecture given detailed structures of its constituents. They are becoming an increasingly important tool in the field of macromolecular assemblies, complementing particularly demanding protein-nucleic acids X ray crystallography and providing means for the refinement and integration of low resolution data coming from rapidly advancing methods such as cryoelectron microscopy.
We present a new coarse-grained force field suitable for protein-DNA docking. The force field is an extension of previously developed parameter sets for protein-RNA and protein-protein interactions. The docking is based on potential energy minimization in translational and orientational degrees of freedom of the binding partners. It allows for fast and efficient systematic search for native-like complex geometry without any prior knowledge regarding binding site location.
We find that the force field gives very good results for bound docking. The quality of predictions in the case of unbound docking varies, depending on the level of structural deviation from bound geometries. We analyze the role of specific protein-DNA interactions on force field performance, both with respect to complex structure prediction, and the reproduction of experimental binding affinities. We find that such direct, specific interactions only partially contribute to protein-DNA recognition, indicating an important role of shape complementarity and sequence-dependent DNA internal energy, in line with the concept of indirect protein-DNA readout mechanism.
PMCID: PMC3522568  PMID: 22966980

Results 1-25 (1556815)