By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein–DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD–DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein–DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies.
http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.
Protein-RNA interactions play an important role in many biological processes. The ability to predict the molecular structures of protein-RNA complexes from docking would be valuable for understanding the underlying chemical mechanisms. We have developed a novel non-redundant benchmark dataset for protein-RNA docking and scoring. The diverse dataset of 72 targets consists of 52 unbound-unbound test complexes, and 20 unbound-bound test complexes. Here, unbound-unbound complexes refer to cases in which both binding partners of the co-crystallized complex are either in apo form or in a conformation taken from a different protein-RNA complex, whereas unbound-bound complexes are cases in which only one of the two binding partners has another experimentally determined conformation. The dataset is classified into three categories according to the interface RMSD and the percentage of native contacts in the unbound structures: 49 easy, 16 medium, and 7 difficult targets. The bound and unbound cases of the benchmark dataset are expected to benefit the development and improvement of docking and scoring algorithms for the docking community. All the easy-to-view structures are freely available to the public at http://zoulab.dalton.missouri.edu/RNAbenchmark/.
Benchmarking; protein-RNA interactions; molecular docking; scoring function; molecular recognition
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. DNA-binding proteins often show degeneracy in their binding requirement and thus the overall binding specificity of many proteins is unknown and remains an active area of research. Although existing PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. Our previous study introduced a promising approach to PWM refinement in which known motifs are used to computationally mine putative binding sites directly from aligned promoter regions using composition of similar sites. In the present study, we extended this technique originally tested on single examples of transcription factors (TFs) and showed its capability to optimize PWM performance to predict new binding sites in the fruit fly genome. We propose refined PWMs in mono- and dinucleotide versions similarly computed for a large variety of transcription factors of Drosophila melanogaster. Along with the addition of many auxiliary sites the optimization includes variation of the PWM motif length, the binding sites location on the promoters and the PWM score threshold. To assess the predictive performance of the refined PWMs we compared them to conventional TRANSFAC and JASPAR sources. The results have been verified using performed tests and literature review. Overall, the refined PWMs containing putative sites derived from real promoter content processed using optimized parameters had better general accuracy than conventional PWMs.
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
Previous studies on protein-DNA interaction mostly focused on the bound structure of DNA-binding proteins but few paid enough attention to the unbound structures. As more new proteins are discovered, it is useful and imperative to develop algorithms for the functional prediction of unbound proteins. In our work, we apply an alpha shape model to represent the surface structure of the protein-DNA complex and extract useful statistical and geometric features, and use structural alignment and support vector machines for the prediction of unbound DNA-binding proteins.
The performance of our method is evaluated by discriminating a set of 104 DNA-binding proteins from 401 non-DNA-binding proteins. In the same test, the proposed method outperforms the other method using conditional probability. The results achieved by our proposed method for; precision, 83.33%; accuracy, 86.53%; and MCC, 0.5368 demonstrate its good performance.
In this study we develop an effective method for the prediction of protein-DNA interactions based on statistical and geometric features and support vector machines. Our results show that interface surface features play an important role in protein-DNA interaction. Our technique is able to predict unbound DNA-binding protein and discriminatory DNA-binding proteins from proteins that bind with other molecules.
Proteins with sequence-specific DNA binding function are important for a wide range of biological activities. De novo prediction of their DNA-binding specificities from sequence alone would be a great aid in inferring cellular networks. Here we introduce a method for predicting DNA-binding specificities for Cys2His2 zinc fingers (C2H2-ZFs), the largest family of DNA-binding proteins in metazoans. We develop a general approach, based on empirical calculations of pairwise amino acid–nucleotide interaction energies, for predicting position weight matrices (PWMs) representing DNA-binding specificities for C2H2-ZF proteins. We predict DNA-binding specificities on a per-finger basis and merge predictions for C2H2-ZF domains that are arrayed within sequences. We test our approach on a diverse set of natural C2H2-ZF proteins with known binding specificities and demonstrate that for >85% of the proteins, their predicted PWMs are accurate in 50% of their nucleotide positions. For proteins with several zinc finger isoforms, we show via case studies that this level of accuracy enables us to match isoforms with their known DNA-binding specificities. A web server for predicting a PWM given a protein containing C2H2-ZF domains is available online at http://zf.princeton.edu and can be used to aid in protein engineering applications and in genome-wide searches for transcription factor targets.
Accurate prediction of protein–DNA complexes could provide an important stepping stone towards a thorough comprehension of vital intracellular processes. Few attempts were made to tackle this issue, focusing on binding patch prediction, protein function classification and distance constraints-based docking. We introduce ParaDock: a novel ab initio protein–DNA docking algorithm. ParaDock combines short DNA fragments, which have been rigidly docked to the protein based on geometric complementarity, to create bent planar DNA molecules of arbitrary sequence. Our algorithm was tested on the bound and unbound targets of a protein–DNA benchmark comprised of 47 complexes. With neither addressing protein flexibility, nor applying any refinement procedure, CAPRI acceptable solutions were obtained among the 10 top ranked hypotheses in 83% of the bound complexes, and 70% of the unbound. Without requiring prior knowledge of DNA length and sequence, and within <2 h per target on a standard 2.0 GHz single processor CPU, ParaDock offers a fast ab initio docking solution.
fPOP (footprinting Pockets Of Proteins, http://pocket.uchicago.edu/fpop/) is a relational database of the protein functional surfaces identified by analyzing the shapes of binding sites in ∼42 700 structures, including both holo and apo forms. We previously used a purely geometric method to extract the spatial patterns of functional surfaces (split pockets) in ∼19 000 bound structures and constructed a database, SplitPocket (http://pocket.uchicago.edu/). These functional surfaces are now used as spatial templates to predict the binding surfaces of unbound structures. To conduct a shape comparison, we use the Smith–Waterman algorithm to footprint an unbound pocket fragment with those of the functional surfaces in SplitPocket. The pairwise alignment of the unbound and bound pocket fragments is used to evaluate the local structural similarity via geometric matching. The final results of our large-scale computation, including ∼90 000 identified or predicted functional surfaces, are stored in fPOP. This database provides an easily accessible resource for studying functional surfaces, assessing conformational changes between bound and unbound forms and analyzing functional divergence. Moreover, it may facilitate the exploration of the physicochemical textures of molecules and the inference of protein function. Finally, our approach provides a framework for classification of proteins into families on the basis of their functional surfaces.
Predicting the binding specificity of transcription factors is a critical step in the characterization and computational identification and of cis-regulatory elements in genomic sequences. Here we use protein–DNA structures to predict binding specificity and consider the possibility of predicting position weight matrices (PWM) for an entire protein family based on the structures of just a few family members. A particular focus is the sensitivity of prediction accuracy to the docking geometry of the structure used. We investigate this issue with the goal of determining how similar two docking geometries must be for binding specificity predictions to be accurate. Docking similarity is quantified using our recently described interface alignment score (IAS). Using a molecular-mechanics force field, we predict high-affinity nucleotide sequences that bind to the second zinc-finger (ZF) domain from the Zif268 protein, using different C2H2 ZF domains as structural templates. We identify a strong relationship between IAS values and prediction accuracy, and define a range of IAS values for which accurate structure-based predictions of binding specificity is to be expected. The implication of our results for large-scale, structure-based prediction of PWMs is discussed.
DNA-binding proteins perform their functions through specific or non-specific sequence recognition. Although many sequence- or structure-based approaches have been proposed to identify DNA-binding residues on proteins or protein-binding sites on DNA sequences with satisfied performance, it remains a challenging task to unveil the exact mechanism of protein-DNA interactions without crystal complex structures. Without information from complexes, the linkages between DNA-binding proteins and their binding sites on DNA are still missing.
While it is still difficult to acquire co-crystallized structures in an efficient way, this study proposes a knowledge-based learning method to effectively predict DNA orientation and base locations around the protein’s DNA-binding sites when given a protein structure. First, the functionally important residues of a query protein are predicted by a sequential pattern mining tool. After that, surface residues falling in the predicted functional regions are determined based on the given structure. These residues are then clustered based on their spatial coordinates and the resultant clusters are ranked by a proposed DNA-binding propensity function. Clusters with high DNA-binding propensities are treated as DNA-binding units (DBUs) and each DBU is analyzed by principal component analysis (PCA) to predict potential orientation of DNA grooves. More specifically, the proposed method is developed to predict the direction of the tangent line to the helix curve of the DNA groove where a DBU is going to bind.
This paper proposes a knowledge-based learning procedure to determine the spatial location of the DNA groove with respect to the query protein structure by considering geometric propensity between protein side chains and DNA bases. The 11 test cases used in this study reveal that the location and orientation of the DNA groove around a selected DBU can be predicted with satisfied errors.
This study presents a method to predict the location and orientation of DNA grooves with respect to the structure of a DNA-binding protein. The test cases shown in this study reveal the possibility of imaging protein-DNA binding conformation before co-crystallized structure can be determined. How the proposed method can be incorporated with existing protein-DNA docking tools to study protein-DNA interactions deserve further studies in the near future.
Most of the position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBS) assume each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, usually producing high false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependent structure and accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between TF and its binding site. The whole tree-structured PWM could be considered as a mixture of different conditional-PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from the ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthew Correlation Coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach on several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method starting from existing PWM has consistently better performance than existing tools in detecting the TFBSs. The improved accuracy is the result of modelling the complete dependent structure of the motifs and better prediction of true positive rate. The findings could lead to better understanding of the mechanisms of TF-DNA interactions.
Mechanistic understanding of many key cellular processes often involves identification of RNA binding proteins (RBPs) and RNA binding sites in two separate steps. Here, they are predicted simultaneously by structural alignment to known protein–RNA complex structures followed by binding assessment with a DFIRE-based statistical energy function. This method achieves 98% accuracy and 91% precision for predicting RBPs and 93% accuracy and 78% precision for predicting RNA-binding amino-acid residues for a large benchmark of 212 RNA binding and 6761 non-RNA binding domains (leave-one-out cross-validation). Additional tests revealed that the method makes no false positive prediction from 311 DNA binding domains but correctly detects six domains binding with both DNA and RNA. In addition, it correctly identified 31 of 75 unbound RNA-binding domains with 92% accuracy and 65% precision for predicted binding residues and achieved 86% success rate in its application to SCOP RNA binding domain superfamily (Structural Classification Of Proteins). It further predicts 25 targets as RBPs in 2076 structural genomics targets: 20 of 25 predicted ones (80%) are putatively RNA binding. The superior performance over existing methods indicates the importance of dividing structures into domains, using a Z-score to measure relative structural similarity, and a statistical energy function to measure protein–RNA binding affinity.
The structures of DNA–protein complexes have illuminated the diversity of DNA–protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA–protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on ∼4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to ∼1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at http://cssb.biology.gatech.edu/skolnick/webservice/DBD-Hunter/.
One of the crucial steps in regulation of gene expression is the binding of transcription factor(s) to specific DNA sequences. Knowledge of the binding affinity and specificity at a structural level between transcription factors and their target sites has important implications in our understanding of the mechanism of gene regulation. Due to their unique functions and binding specificity, there is a need for a transcription factor-specific, structure-based database and corresponding web service to facilitate structural bioinformatics studies of transcription factor-DNA interactions, such as development of knowledge-based interaction potential, transcription factor-DNA docking, binding induced conformational changes, and the thermodynamics of protein-DNA interactions.
TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria.
TFinDit is a specialized structural database with annotated transcription factor-DNA complex structures and other preprocessed data. We believe that this database/web service can facilitate the development and testing of TF-DNA interaction potentials and TF-DNA docking algorithms, and the study of protein-DNA recognition mechanisms.
Transcription factor; Database; Binding site prediction; Interaction potential
Comparison of bound and unbound DNA in protein–DNA co-crystal complexes reveals insights into controller-protein binding and DNA distortion in transcriptional regulation.
The controller protein of the type II restriction–modification (RM) system Esp1396I binds to three distinct DNA operator sequences upstream of the methyltransferase and endonuclease genes in order to regulate their expression. Previous biophysical and crystallographic studies have shown molecular details of how the controller protein binds to the operator sites with very different affinities. Here, two protein–DNA co-crystal structures containing portions of unbound DNA from native operator sites are reported. The DNA in both complexes shows significant distortion in the region between the conserved symmetric sequences, similar to that of a DNA duplex when bound by the controller protein (C-protein), indicating that the naked DNA has an intrinsic tendency to bend when not bound to the C-protein. Moreover, the width of the major groove of the DNA adjacent to a bound C-protein dimer is observed to be significantly increased, supporting the idea that this DNA distortion contributes to the substantial cooperativity found when a second C-protein dimer binds to the operator to form the tetrameric repression complex.
transcriptional regulation; DNA-binding proteins; helix–turn–helix motif; DNA distortion
Diverse mechanisms for DNA-protein recognition have been elucidated in numerous atomic complex structures from various protein families. These structural data provide an invaluable knowledge base not only for understanding DNA-protein interactions, but also for developing specialized methods that predict the DNA-binding function from protein structure. While such methods are useful, a major limitation is that they require an experimental structure of the target as input. To overcome this obstacle, we develop a threading-based method, DNA-Binding-Domain-Threader (DBD-Threader), for the prediction of DNA-binding domains and associated DNA-binding protein residues. Our method, which uses a template library composed of DNA-protein complex structures, requires only the target protein's sequence. In our approach, fold similarity and DNA-binding propensity are employed as two functional discriminating properties. In benchmark tests on 179 DNA-binding and 3,797 non-DNA-binding proteins, using templates whose sequence identity is less than 30% to the target, DBD-Threader achieves a sensitivity/precision of 56%/86%. This performance is considerably better than the standard sequence comparison method PSI-BLAST and is comparable to DBD-Hunter, which requires an experimental structure as input. Moreover, for over 70% of predicted DNA-binding domains, the backbone Root Mean Square Deviations (RMSDs) of the top-ranked structural models are within 6.5 Å of their experimental structures, with their associated DNA-binding sites identified at satisfactory accuracy. Additionally, DBD-Threader correctly assigned the SCOP superfamily for most predicted domains. To demonstrate that DBD-Threader is useful for automatic function annotation on a large-scale, DBD-Threader was applied to 18,631 protein sequences from the human genome; 1,654 proteins are predicted to have DNA-binding function. Comparison with existing Gene Ontology (GO) annotations suggests that ∼30% of our predictions are new. Finally, we present some interesting predictions in detail. In particular, it is estimated that ∼20% of classic zinc finger domains play a functional role not related to direct DNA-binding.
DNA-binding proteins represent only a small fraction of proteins encoded in genomes, yet they play a critical role in a variety of biological activities. Identifying these proteins and understanding how they function are important issues. The structures of solved DNA protein complexes of different protein families provide an invaluable knowledge base not only for understanding DNA-protein interactions, but also for developing methods that predict whether or not a protein binds DNA. While such methods are useful, they require an experimental structure as input. To overcome this obstacle, we have developed a threading-based method for the prediction of DNA-binding domains and associated DNA-binding protein residues from protein sequence. The method has higher accuracy in large scale benchmarking than methods based on sequence similarity alone. Application to the human proteome identified potential targets of not only previously unknown DNA-binding proteins, but also of biologically interesting ones that are related to, yet evolved from, DNA-binding proteins.
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
The latency-associated nuclear antigen (LANA) of Kaposi's sarcoma-associated herpesvirus functions as an origin-binding protein (OBP) and transcriptional regulator. LANA binds the terminal repeats via the C-terminal DNA-binding domain (DBD) to support latent DNA replication. To date, the structure of LANA has not been solved. Sequence alignments among OBPs of gammaherpesviruses have revealed that the C terminus of LANA is structurally related to EBNA1, the OBP of Epstein–Barr virus. Based on secondary structure predictions for LANADBD and published structures of EBNA1DBD, this study used bioinformatics tools to model a putative structure for LANADBD bound to DNA. To validate the predicted model, 38 mutants targeting the most conserved motifs, namely three α-helices and a conserved proline loop, were constructed and functionally tested. In agreement with data for EBNA1, residues in helices 1 and 2 mainly contributed to sequence-specific DNA binding and replication activity, whilst mutations in helix 3 affected replication activity and multimer formation. Additionally, several mutants were isolated with discordant phenotypes, which may aid further studies into LANA function. In summary, these data suggest that the secondary and tertiary structures of LANA and EBNA1 DBDs are conserved and are critical for (i) sequence-specific DNA binding, (ii) multimer formation, (iii) LANA-dependent transcriptional repression, and (iv) DNA replication.
Biological function of proteins is frequently associated with the formation of complexes with small-molecule ligands. Experimental structure determination of such complexes at atomic resolution, however, can be time-consuming and costly. Computational methods for structure prediction of protein/ligand complexes, particularly docking, are as yet restricted by their limited consideration of receptor flexibility, rendering them not applicable for predicting protein/ligand complexes if large conformational changes of the receptor upon ligand binding are involved. Accurate receptor models in the ligand-bound state (holo structures), however, are a prerequisite for successful structure-based drug design. Hence, if only an unbound (apo) structure is available distinct from the ligand-bound conformation, structure-based drug design is severely limited. We present a method to predict the structure of protein/ligand complexes based solely on the apo structure, the ligand and the radius of gyration of the holo structure. The method is applied to ten cases in which proteins undergo structural rearrangements of up to 7.1 Å backbone RMSD upon ligand binding. In all cases, receptor models within 1.6 Å backbone RMSD to the target were predicted and close-to-native ligand binding poses were obtained for 8 of 10 cases in the top-ranked complex models. A protocol is presented that is expected to enable structure modeling of protein/ligand complexes and structure-based drug design for cases where crystal structures of ligand-bound conformations are not available.
Structure-based drug design has become a powerful tool in modern drug discovery pipelines. A critical prerequisite is a structure of the target protein close to its ligand bound conformation which is often difficult to determine experimentally. In many cases, a structure of the unbound receptor is available, but conformational changes with respect to the ligand-bound form preclude it from being used as a basis for structure-based drug design. We have developed a computational approach to predict the structure of protein/ligand complexes based solely on the unbound conformation, the ligand, and easy-to-assess experimental data. We tested our protocol on proteins that undergo substantial structural rearrangements upon binding a ligand and were able to predict structures of protein/ligand complexes which are in good agreement with experimentally determined structures. The ability to predict ligand bound receptor conformations based on structures in the unbound state enables structure-based drug design for cases where crystallization of the complex has not been successful so far.
In this study, we present the DNA-Binding Site Identifier (DBSI), a new structure-based method for predicting protein interaction sites for DNA binding. DBSI was trained and validated on a data set of 263 proteins (TRAIN-263), tested on an independent set of protein-DNA complexes (TEST-206) and data sets of 29 unbound (APO-29) and 30 bound (HOLO-30) protein structures distinct from the training data. We computed 480 candidate features for identifying protein residues that bind DNA, including new features that capture the electrostatic microenvironment within shells near the protein surface. Our iterative feature selection process identified features important in other models, as well as features unique to the DBSI model, such as a banded electrostatic feature with spatial separation comparable with the canonical width of the DNA minor groove. Validations and comparisons with established methods using a range of performance metrics clearly demonstrate the predictive advantage of DBSI, and its comparable performance on unbound (APO-29) and bound (HOLO-30) conformations demonstrates robustness to binding-induced protein conformational changes. Finally, we offer our feature data table to others for integration into their own models or for testing improved feature selection and model training strategies based on DBSI.
One obstacle to achieving complete understanding of the principles underlying sequence-dependent recognition of DNA is the paucity of structural data for DNA recognition sequences in their free (unbound) state. Here we carried out crystallization screening of 50 DNA duplexes containing cognate protein binding sites and obtained new crystal structures of free DNA binding sites for three distinct modes of DNA recognition: anti-parallel beta strands (MetR), helix-turn-helix motif + hinge helices (PurR), and zinc fingers (Zif268). Structural changes between free and protein-bound DNA are manifested differently in each case. The new DNA structures reveal that distinctive sequence-dependent DNA geometry dominates recognition by MetR, protein-induced bending of DNA dictates recognition by PurR, and deformability of DNA along the A-B continuum is important in recognition by Zif268. Together, our findings show that crystal structures of free DNA binding sites provide new information about the nature of protein-DNA interactions and thus lend insights towards a structural code for DNA recognition.
DNA structure; transcription factors; indirect readout; protein-DNA interactions; gene regulation
Summary: In the post-genomic era, the annotation of protein function facilitates the understanding of various biological processes. To extend the range of function annotation methods to the twilight zone of sequence identity, we have developed approaches that exploit both protein tertiary structure and/or protein sequence evolutionary relationships. To serve the scientific community, we have integrated the structure prediction tools, TASSER, TASSER-Lite and METATASSER, and the functional inference tools, FINDSITE, a structure-based algorithm for binding site prediction, Gene Ontology molecular function inference and ligand screening, EFICAz2, a sequence-based approach to enzyme function inference and DBD-hunter, an algorithm for predicting DNA-binding proteins and associated DNA-binding residues, into a unified web resource, Protein Structure and Function prediction Resource (PSiFR).
Availability and implementation: PSiFR is freely available for use on the web at http://psifr.cssb.biology.gatech.edu/
The nonlocal nature of the protein-ligand binding problem is investigated via the Gaussian Network Model with which the residues lying along interaction pathways in a protein and the residues at the binding site are predicted. The predictions of the binding site residues are verified by using several benchmark systems where the topology of the unbound protein and the bound protein-ligand complex are known. Predictions are made on the unbound protein. Agreement of results with the bound complexes indicates that the information for binding resides in the unbound protein. Cliques that consist of three or more residues that are far apart along the primary structure but are in contact in the folded structure are shown to be important determinants of the binding problem. Comparison with known structures shows that the predictive capability of the method is significant.
The LysR-type transcriptional regulators (LTTRs) comprise the largest family of prokaryotic transcription factors. These proteins are composed of an N-terminal DNA binding domain (DBD) and a C-terminal cofactor binding domain. To date, no structure of the DBD has been solved. According to the SUPERFAMILY and MODBASE databases, a reliable homology model of LTTR DBDs may be built using the structure of the Escherichia coli ModE transcription factor, containing a winged helix– turn–helix (HTH) motif, as a template. The remote, but statistically significant, sequence similarity between ModE and LTTR DBDs and an alignment generated using SUPERFAMILY and MODBASE methods was independently confirmed by alignment of sequence profiles representing ModE and LTTR family DBDs. Using the crystal structure of the E.coli OxyR C-terminal domain and the DBD alignments we constructed a structural model of the full-length dimer of this LTTR family member and used it to investigate the mode of protein–DNA interaction. We also applied the model to interpret, in a structural context, the results of numerous biochemical studies of mutated LTTRs. A comparison of the LTTR DBD model with the structures of other HTH proteins also provides insights into the interaction of LTTRs with the C-terminal domain of the RNA polymerase α subunit.
Most of the proteins in a cell assemble into complexes to carry out their function. In this work, we have created a new database (named ComSin) of protein structures in bound (complex) and unbound (single) states to provide a researcher with exhaustive information on structures of the same or homologous proteins in bound and unbound states. From the complete Protein Data Bank (PDB), we selected 24 910 pairs of protein structures in bound and unbound states, and identified regions of intrinsic disorder. For 2448 pairs, the proteins in bound and unbound states are identical, while 7129 pairs have sequence identity 90% or larger. The developed server enables one to search for proteins in bound and unbound states with several options including sequence similarity between the corresponding proteins in bound and unbound states, and validation of interaction interfaces of protein complexes. Besides that, through our web server, one can obtain necessary information for studying disorder-to-order and order-to-disorder transitions upon complex formation, and analyze structural differences between proteins in bound and unbound states. The database is available at http://antares.protres.ru/comsin/.