The workflow of the proposed method is as follows. Given an unbound query protein and a template complex, the method generates a synthetic protein-DNA complex structure for PWM prediction by structure alignment: the query protein is superimposed onto the template structure (the ‘Superimposed Complex’), which is achieved by applying the rotation matrix reported by the structure alignment program. PWM prediction is then performed on the superimposed protein-DNA complex with an all-atom model, a knowledge-based potential function that considers atomic contacts. See the ‘Methods’ section for the details of a) constructing the superimposed complex from the given query and template structures and b) the employed all-atom model.
The workflow of the proposed method.
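The superposition step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the alignment program reports a 3×3 rotation matrix and a translation vector that map query coordinates into the template frame (x' = R·x + t), and the function names `superimpose` and `build_synthetic_complex` are hypothetical.

```python
import numpy as np


def superimpose(query_xyz, rotation, translation):
    """Map query coordinates (N x 3) into the template frame: x' = R @ x + t."""
    query_xyz = np.asarray(query_xyz, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Sanity check: a proper rotation is orthonormal with determinant +1.
    if not np.allclose(rotation @ rotation.T, np.eye(3), atol=1e-6):
        raise ValueError("rotation matrix is not orthonormal")
    if not np.isclose(np.linalg.det(rotation), 1.0, atol=1e-6):
        raise ValueError("rotation matrix has determinant != +1")
    # Row-vector form of R @ x + t for every atom.
    return query_xyz @ rotation.T + translation


def build_synthetic_complex(query_xyz, template_dna_xyz, rotation, translation):
    """Pair the moved query protein with the (unchanged) template DNA.

    Assumption: the template protein is simply discarded and its DNA kept.
    """
    return {
        "protein": superimpose(query_xyz, rotation, translation),
        "dna": np.asarray(template_dna_xyz, dtype=float),
    }
```

Only the protein coordinates are transformed; the template DNA stays in its original frame, so the query ends up where the template protein was.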
To evaluate the performance of the proposed framework, we first considered the 20 annotated PWMs and the corresponding native protein-DNA complexes from the study of Morozov et al. The structure with discontinuous dsDNA (1IHF) was excluded, as in the study of Xu et al. Since the proposed method requires an unbound structure of the query protein and a native complex of one of its homologues as the template, each of the 19 potential test cases had to pass a further criterion: it must have an unbound structure in PDB that yields at least one qualified alignment to a DNA-bound structure of another protein.
For each of the 19 proteins, we first checked whether it has an unbound structure that can be used as a query in the proposed method. Only 12 of them have unbound structures in the 30 July 2011 release of PDB. Each unbound structure was then compared to the protein chains of all the protein-DNA complexes in PDB, using PSI-BLAST to measure sequence similarity and TM-align to measure structure similarity. If the sequence similarity was significant (e-value<0.001) and the structure alignment score (TM-score) was greater than 0.5, the complex was added to the set of potential template complexes. In addition, a template structure had to satisfy the following criteria: a) it is an X-ray structure with resolution better than 3.0 Å, b) the DNA molecule has ≥6 paired bases and less than 30% non-paired bases, c) the protein chain has ≥5 contact residues (residues within 4.5 Å of the DNA molecule) and d) the protein chain has ≥40 residues. In this study, the query-template pair with the highest TM-score for each of the potential test cases was chosen for PWM prediction. In the end, six proteins were used as test cases, and the other 13 proteins, which did not satisfy the above criterion, were used for tuning the parameters of the all-atom model.
The validation set used in this study.
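The template selection rules stated above can be summarized in a short sketch. The `Candidate` record and its field names are hypothetical, and the values are assumed to have been extracted beforehand from the PSI-BLAST and TM-align outputs and the PDB entries. How the 30% non-paired fraction is normalized is also an assumption here (non-paired bases over all bases).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    pdb_id: str            # hypothetical identifier of the template chain
    e_value: float         # PSI-BLAST e-value against the query
    tm_score: float        # TM-align score against the query
    is_xray: bool
    resolution: float      # in Å; smaller is better
    paired_bases: int
    unpaired_bases: int
    contact_residues: int  # residues within 4.5 Å of the DNA
    chain_length: int


def is_qualified(c: Candidate) -> bool:
    """Apply the similarity thresholds and the structural criteria a) to d)."""
    total = c.paired_bases + c.unpaired_bases
    # Assumption: the non-paired fraction is taken over all bases.
    unpaired_fraction = c.unpaired_bases / total if total else 1.0
    return (
        c.e_value < 0.001
        and c.tm_score > 0.5
        and c.is_xray
        and c.resolution < 3.0
        and c.paired_bases >= 6
        and unpaired_fraction < 0.30
        and c.contact_residues >= 5
        and c.chain_length >= 40
    )


def select_template(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the qualified candidate with the highest TM-score, or None."""
    qualified = [c for c in candidates if is_qualified(c)]
    return max(qualified, key=lambda c: c.tm_score) if qualified else None
```

A query for which `select_template` returns `None` corresponds to a protein that cannot serve as a test case under these rules.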
In addition to the test cases collected from the study of Morozov et al., this study attempted to enlarge the test data by collecting more annotated PWMs from the TRANSFAC database. The public version of TRANSFAC contains 398 annotated PWMs for 133 UniProt entry names. However, because of the limited overlap between the proteins with annotated PWMs and the proteins with both unbound structures and available templates, only one more test case (NFKB1_HUMAN) was added.
Evaluating PWM prediction using unbound protein structures
The detailed predictions on the seven test proteins using their unbound structures (denoted as ‘Unbound’) are reported in comparison with the annotated PWMs (denoted as ‘Annotated’) and the PWMs predicted from their native complexes (denoted as ‘Native’). The PDB entries involved are listed separately. In this study, the Ψ-score used in previous work was employed to evaluate the performance of the proposed method. The Ψ-score is the average of the Kullback-Leibler divergences across all positions, and was adopted to evaluate the consistency between the predicted and annotated weight scores for all base types. It is defined in terms of the predicted and annotated weight scores for base type i at position j, with the per-position divergences averaged over L, the length of the binding site in base pairs. A smaller Ψ-score implies a higher degree of consistency between two PWMs. To measure the significance of a Ψ-score, 100,000 dummy PWMs with the same length as the predicted PWM were randomly generated to estimate the null distribution of Ψ-scores to the annotated PWM, and from it the p-value of the Ψ-score of the predicted PWM.
Predictions by the proposed method on the seven test cases.
The PDB entries used in this study.
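The Ψ-score and its significance estimate can be sketched as below. Several choices are assumptions rather than details given in the text: the divergence direction (predicted relative to annotated), the small pseudocount `eps` that avoids log(0), the way dummy PWMs are drawn (each column uniformly from the probability simplex), and that the two PWMs have already been aligned to the same length. The function names are hypothetical.

```python
import numpy as np


def psi_score(predicted, annotated, eps=1e-9):
    """Average per-position Kullback-Leibler divergence of two L x 4 PWMs.

    Assumption: the divergence is KL(predicted || annotated).
    """
    p = np.asarray(predicted, dtype=float) + eps
    q = np.asarray(annotated, dtype=float) + eps
    p /= p.sum(axis=1, keepdims=True)  # renormalize each position
    q /= q.sum(axis=1, keepdims=True)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))


def random_pwm(length, rng):
    """Draw a dummy PWM; assumption: columns are Dirichlet(1, 1, 1, 1)."""
    return rng.dirichlet(np.ones(4), size=length)


def psi_p_value(predicted, annotated, n_random=100_000, seed=0):
    """Fraction of dummy PWMs scoring at least as well as the prediction."""
    rng = np.random.default_rng(seed)
    observed = psi_score(predicted, annotated)
    length = np.asarray(annotated).shape[0]
    null = np.array(
        [psi_score(random_pwm(length, rng), annotated) for _ in range(n_random)]
    )
    # Smaller Ψ is better, so count dummy PWMs with Ψ <= observed.
    return float(np.mean(null <= observed))
```

Because a smaller Ψ-score means better agreement, the p-value is the lower tail of the null distribution.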
The proposed framework achieved an average Ψ-score of 0.38, which was worse than that based on the native complexes (0.26). However, the difference is not significant (the p-value of the paired Wilcoxon signed-rank test is 0.078). We also compared the proposed method with a naïve approach that uses the all-atom model to predict PWMs directly from the native complexes of the query structures' homologues. That is, the naïve method uses the unbound query structure to search for homologous bound structures but does not replace the protein in the homologous structure with the query structure. This approach is denoted as ‘Naïve’, and the homologous bound structure used for prediction in each case was the corresponding template structure. The average Ψ-score of the naïve approach is 0.54, and the p-value of the paired Wilcoxon signed-rank test between the proposed method and the naïve approach is 0.016.
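For readers who want to see how a paired Wilcoxon signed-rank p-value arises with only seven pairs, the following sketch computes the exact two-sided p-value by enumerating all sign patterns. It assumes no zero differences and no tied absolute differences, and the function name is hypothetical. The text does not state which variant or software was used, so this is an illustration rather than a reproduction.

```python
from itertools import product


def exact_wilcoxon_two_sided(differences):
    """Exact two-sided p-value of the Wilcoxon signed-rank test.

    Assumes no zero differences and no ties among absolute differences.
    """
    d = list(differences)
    n = len(d)
    if any(x == 0 for x in d) or len({abs(x) for x in d}) != n:
        raise ValueError("zeros or ties are not handled in this sketch")
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda k: abs(d[k]))
    ranks = [0] * n
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis all 2^n sign patterns are equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** n
```

With seven pairs the exact test can only take a few values: 2/128 ≈ 0.016 when all seven differences share a sign, and 10/128 ≈ 0.078 when the smaller rank sum is 3. These coincide with the p-values quoted above, which would be consistent with an exact test, although that is an inference and not something the text confirms.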
The predicted PWMs are usually narrower than the annotated ones. This is because the proposed method can only infer the target DNA sequences that the query protein can physically contact in the superimposed complexes. Protein-DNA interactions sometimes require multiple protein chains to participate. Since the unbound query structure is only one of them, the predicted PWM might be shorter than i) that based on native complexes, which contain the complete set of protein chains, and ii) the annotated PWMs derived from experiments or conserved promoter sequences.
We also compared the predictions on the six test cases from the study of Morozov et al. with those obtained by applying different potential functions to native complexes. The comparison shows that predictions using native complexes generally outperform those using synthetic complexes constructed from the unbound structures and the selected templates. Both sets of results reveal the potential of conducting PWM prediction for DNA-binding proteins based on unbound structures, although the accuracy degrades when synthetic complexes are used instead of native complexes. A reasonable speculation is that the performance difference is due to structural variations between the native complexes and the synthetic complexes generated by structure alignment followed by superposition. The next subsection considers three types of structural variation that presumably influence the prediction accuracy and provides further analyses of each. The first is the variation in binding position or orientation caused by structure alignment; that is, the complexes generated by structure alignment might deviate from crystallized complexes. The second is the structural variation due to sequence difference; that is, the binding position or orientation might differ between two protein sequences even though their structures are similar. The third is the conformational change of proteins from the unbound to the bound form.
Predictions using unbound structures compared with those using native complexes.
Evaluating robustness of the proposed method against structural variations
For the first structural variation, which arises from the alignment, we wanted to know whether the proposed method yields stable predictions when the protein structure in a native complex is replaced, using structure alignment, by a protein structure from another native complex of the same protein. That is, the query protein, which is itself a bound structure, is superimposed onto another complex of the same protein. This design aims to eliminate the influence of the other two structural variations. For this purpose, we grouped the protein-DNA complexes in PDB by the UniProt entry names of the protein chains. Protein chains in complexes with multiple protein chains were excluded. In the end, we obtained 38 PDB chains and 74 query-template pairs over eight entry names, where each entry name has 4–6 PDB chains. In the results of this analysis, all the Ψ-scores are quite small. These results indicate that the proposed method is robust to the structural variations among native complexes of the same protein determined in different experiments, as well as to the variations due to structure alignment.
Performance on identical protein using different native complexes.
To investigate the second structural variation, which is due to sequence difference, we prepared the second synthetic complex (U), where the template is a complex of the query protein itself rather than a complex of a different protein, for each query in the validation set. Using this set achieved an average Ψ-score of 0.40, which is close to that of using a different protein (0.38). The p-value of the paired Wilcoxon signed-rank test on the Ψ-scores of these two sets (μ and U) is 1. That is, no apparent improvement was observed when this type of structural variation was eliminated from the prediction framework. This suggests that the all-atom model with the proposed framework can tolerate the structural differences between different proteins that share similar structures.
The three synthetic complexes employed in the analysis of structural variations.
Predictions using different complexes.
To investigate the third structural variation, the conformational change between unbound and bound forms, we prepared the third synthetic complex (B) by replacing the query of the second synthetic complex with a bound structure for each query in the validation set. Using this set achieved a Ψ-score of 0.33. This performance was better than that using unbound queries and close to that using native complexes. The improvement after eliminating this type of structural variation indicates that the conformational change is the structural variation most critical to prediction accuracy. These results reveal that the proposed framework is more sensitive to the structural changes between unbound and bound conformations than to those between two homologous structures. Hence, to construct PWMs directly from an unbound structure with higher accuracy, the first priority is to overcome the unbound-to-bound conformational change.
We provide more details about the structural changes upon DNA binding of the seven test cases, based on the same query (unbound) and template (bound) structures as the second synthetic complex (U). Two special structural transitions discussed in a recent study, transitions of secondary structures (SSE) and disorder-to-order (D2O) transitions, were examined in particular, in addition to the root-mean-square deviations (RMSDs) between a pair of structures. In this table, we observed that structural variations are not necessarily accompanied by structural transitions. For example, the structures used for MYB_MOUSE have the largest RMSD (2.88 Å) but have neither SSE nor D2O transitions, whereas the structures used for NDT80_YEAST have 25 D2O transitions but a small RMSD (0.72 Å).
Structural transitions upon DNA-binding.
Comparison with predictions based on complexes generated by docking
The above experiments were designed to evaluate the quality of the synthetic complexes under the proposed framework. This section, in contrast, compares the prediction performance of the synthetic complexes obtained by the proposed framework with that of complexes obtained by protein-DNA docking. We adopted the ZDOCK package (version 2.3.1) to perform protein-DNA docking. The protein structure was prepared from the query structures, and the DNA was prepared from the bound DNA structures of the templates. In the proposed framework, a template protein-DNA complex is employed to facilitate the generation of synthetic complexes; in other words, the DNA-binding residues of the protein are learned from an existing protein-DNA complex. For a fair comparison, the same information was exploited to rank the models generated by ZDOCK. We assigned the highest score to the docked complex that preserves the largest set of the expected contact residues. Complexes preserving the same number of contact residues kept the order suggested by ZDOCK. Based on this scoring strategy, the top 20 of the 2000 ZDOCK predictions (2000 was set according to the ZDOCK manuscript) were used to perform PWM prediction. Finally, the predicted PWM with the best Ψ-score to the annotated PWM was reported. Note that using the Ψ-score to select the PWM favors ZDOCK; this process was adopted because we observed that the highest-scored complexes resulted in extremely poor PWMs, which were difficult to align to the annotated ones, in all tests (data not shown).
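The re-ranking rule for docked models described above can be sketched as follows. The input representation is hypothetical: each model is a pair of an identifier and the set of protein residues contacting the DNA in that model, listed in the order produced by ZDOCK, and the expected contact residues are those derived from the template.

```python
def rerank_docking_models(models, expected_contacts, top_n=20):
    """Re-rank docked models by how many expected contact residues they keep.

    models: list of (model_id, contact_residues) in the docking program's
            original rank order (hypothetical representation).
    Ties keep the original order because Python's sort is stable.
    """
    expected = set(expected_contacts)
    ranked = sorted(models, key=lambda m: -len(expected & set(m[1])))
    return [model_id for model_id, _ in ranked[:top_n]]
```

A usage example with made-up residue numbers:

```python
models = [("m1", {10, 11}), ("m2", {10, 11, 12}), ("m3", {11, 12}), ("m4", set())]
print(rerank_docking_models(models, expected_contacts={10, 11, 12}, top_n=3))
# → ['m2', 'm1', 'm3']
```

Here `m1` and `m3` both preserve two expected residues, so `m1` stays ahead of `m3` as in the original ranking.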
We compared using the proposed framework (denoted as ‘Alignment’) with using protein-DNA docking to generate the protein-DNA complex for PWM prediction. Using the docked complexes achieved an average Ψ-score of 0.40, worse than that of the proposed method (0.38). We observed that the PWMs generated by the proposed method and by docking have their own advantages at different positions, even though the same queries and templates were used. For example, for the five central positions (‘TGTGA’), which are more conserved than the flanking positions in the annotated PWM of CRP_ECOLI, the docking PWM missed only the fourth position. In contrast, our PWM correctly predicted the fourth position but missed the first two positions. For the test case NDT80_YEAST, the docking PWM correctly predicted six positions (2–3 and 5–8) in the left part of the annotated PWM, while our PWM correctly predicted six positions (6–10 and 12) in the right part. For TRPR_ECOLI, the docking PWM has no overlap with our PWM, but both are generally correct, since the interaction actually involves two identical protein chains. In summary, both the docking PWMs and our PWMs made good predictions on some test cases, though at different positions. Regarding efficiency, ZDOCK takes more than an hour for the seven test cases, much longer than the proposed method based on structure alignment (less than ten seconds).
Comparison with predictions of using docking to construct synthetic complexes.
The complementarity between the docking predictions and ours might be due to the structural variations, mainly from unbound to bound, discussed in the previous subsection. The query structures must undergo some conformational change so that they can fit the DNA molecules well. However, both the proposed framework and the adopted docking strategy treat the query structures as rigid bodies. It might happen that one end of the binding site of the query structure fits the DNA perfectly while the other end is ‘seesawed’ out of its best position.
Discussion and concluding remarks
It was discussed in the study of Dan et al. that conformational changes are commonly observed in DNA-binding proteins. To understand how common conformational changes are in protein-DNA interactions and how large they usually are, we further collected the available structure pairs of unbound and bound states for DNA-binding proteins from the PDB database. Since a protein may have multiple unbound-bound structure pairs, we adopted a strict criterion: a protein has transitions if at least one of its unbound-bound structure pairs has transitions. The definition of transitions between a structure pair is identical to that in the work of Dan et al. (the DSSP program was used to assign secondary structures, and only segments in which the same transition was consistent for at least five consecutive residues were considered). The results show that 40.2% of the 132 proteins underwent SSE transitions (changes in secondary structure) and 53.8% underwent D2O (disorder-to-order) transitions. These high ratios concur with the points of Dan et al. On the other hand, the RMSD values were not that large: all structure pairs had RMSDs of less than 4 Å (data not shown). If the criterion ‘RMSD≤2 Å’, in general a rigorous threshold, is taken to indicate small structural variation, 93.2% of the proteins have at least one structure pair with small structural variation. For the seven test cases, the ratios of proteins that underwent SSE (0.0%) and D2O (14.3%, one of the seven test cases) transitions were much lower than those of the overall distribution (40.2% SSE and 53.8% D2O transitions). The major difference between the test cases and the analysis in this section is that, for the test cases, the structure pair was selected by the structure alignment score. This suggests that, in practice, using the best structure alignment score helps to find structure pairs with few transitions for PWM prediction. If the structure pair with the best RMSD is chosen to investigate the conformational changes of a protein upon binding DNA, the ratios of proteins that underwent SSE and D2O transitions drop to 13.8% and 39.4%, respectively. These results suggest that the proposed method will benefit the study of a large number of DNA-binding proteins with only unbound structures in the PDB database.
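The transition criterion used in this analysis can be illustrated with a simplified sketch. It is not the procedure of Dan et al.: it assumes that the unbound and bound secondary-structure strings are already residue-aligned and of equal length, that a reduced alphabet such as H (helix), E (strand) and C (coil) is used, and that a segment counts only if the same change persists for at least five consecutive residues. The function names are hypothetical.

```python
from itertools import groupby


def sse_transitions(unbound_ss, bound_ss, min_run=5):
    """Find segments where the same secondary-structure change persists.

    Returns (start, end, from_state, to_state) tuples with `end` exclusive.
    """
    if len(unbound_ss) != len(bound_ss):
        raise ValueError("secondary-structure strings must be aligned")
    segments = []
    position = 0
    # Group consecutive residues that share the same (unbound, bound) states.
    for (before, after), run in groupby(zip(unbound_ss, bound_ss)):
        length = sum(1 for _ in run)
        if before != after and length >= min_run:
            segments.append((position, position + length, before, after))
        position += length
    return segments


def protein_has_transition(structure_pairs, min_run=5):
    """Strict criterion: one pair with a transition is enough."""
    return any(sse_transitions(u, b, min_run) for u, b in structure_pairs)
```

For example, `sse_transitions("CCCCCCCCCC", "CCHHHHHCCC")` reports one coil-to-helix segment covering residues 2 to 6, whereas a change lasting only four residues would be ignored.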
To shift the problem of structure-based PWM prediction from native complexes to unbound protein structures, the most challenging issue might be constructing a reliable synthetic protein-DNA complex to which physics- or knowledge-based scoring functions can be applied for prediction. Regarding this issue, this study concludes that structure alignment can serve as an option when complexes containing bound structures similar to the query protein exist. Although we currently use the template with the highest structure similarity to generate the synthetic complex, we observed in many cases that templates with a low structure similarity also have the potential to produce satisfactory results, as the example below illustrates.
An example where the template has a low structure similarity to the query.
Two concluding remarks are provided here. First, the DNA sequence in the selected template is probably not the native DNA sequence to which the query protein binds. Thus, the ability of the adopted potential function to handle mutations of the DNA sequence embedded in the synthetic complex is critical to the success of the proposed framework. Regarding this issue, we concluded that the selected atomic knowledge-based potential function is generally able to predict the most favorable base type without being affected by the original sequence present in the synthetic complex. Three examples illustrate this observation. Second, an important issue in the development of structure-based methods is their applicability. In the PDB release of July 30, 2011, there are 114 DNA-binding proteins that do not have native complexes but do have unbound structures with potential templates available from homologues. An unbound structure and a potential template form a pair if the sequence alignment has e-value<0.001 and the structure alignment has TM-score>0.5. Currently the public version of the TRANSFAC database contains 398 annotated PWMs for 133 proteins, most of which were determined by sequence-based methods. However, the overlap between the 114 DNA-binding proteins, which are the targets of this study, and the 133 proteins with known PWMs is only 16. This small overlap is consistent with the fact that the currently curated PWMs were mainly contributed by sequence-based methods. It also reveals the distinctness and potential of structure-based methods, since the abundance of available structure information has not yet been widely exploited to enhance our understanding of the interactions between DNA-binding proteins and their binding sites.
Demonstration of base substitution.
Accurate construction of binding sequences for DNA-binding proteins is an important step in studying protein-DNA interactions. This study proposes a novel prediction framework and shows the possibility of predicting the target DNA sequences of DNA-binding proteins directly from their unbound forms. Several factors that might affect the prediction power of the proposed method are examined and discussed. The experiments conducted in this study encourage more effort on structure alignment-based approaches and highlight the challenges of PWM prediction using unbound protein structures for future studies.