In the current work, we have demonstrated how a number of structural features can be employed to determine whether a protein of known structure and unknown function is a DNA-binding protein. These structural features are the similarity to one of a small number of DNA-binding motifs (HTH, HhH or HLH), the solvent accessibility of the motif and the electrostatic potential in the region of the motif. The relative importance of the similarity, the accessibility and the electrostatic potential varies depending on the motif. It is also important to note that the level of sequence similarity varies enormously between the different types of motif, so the optimal type of search (structural or sequence) employed to find such proteins might also vary.
A concern with using structural templates is that it has become clear that many DNA-binding proteins exhibit intrinsically disordered regions, which only become ordered upon binding to DNA (44). One well-known example of this is the leucine zipper protein GCN4 (45). In the case of the three motifs used here, there exist examples of the motifs in complexed and uncomplexed form, and both have been used here, indicating that binding-induced ordering is unlikely to be an issue for these motifs. Furthermore, Stawiski et al. have also demonstrated that a structural approach can distinguish complexed and uncomplexed DNA-binding proteins.
The final results are summarized in the table. In the case of the HTH motif, with 133 examples in the PDB (equivalent to 91 non-identical proteins), the cut-offs for the superposition rmsd and ASA of the motif are complemented by the electrostatic potential, reducing the number of false positives from 33 to 7 and identifying 71 of the non-identical true DNA-binding proteins.
Summary of the results obtained for each structural motif
In the case of the HhH motif, with 161 examples in the PDB (23 non-identical proteins), the combination of rmsd and the electrostatic potential resulted in 14 false positives and identified 21 of the non-identical true DNA-binding proteins. The ASA did not resolve the true and false data sets reliably, and was discarded. This is not surprising, given that only a small fraction of the motif makes contact with the DNA. The EMS removes approximately half of the false positives.
The analysis of the HLH motif was of limited value, as all the known structures are part of the same D-HMM family. Nonetheless, the use of the rmsd from a single structural template of reduced length gives quite good resolution, eliminating all false positives and identifying 13 out of a possible 15 true non-identical DNA-binding proteins with an HLH motif.
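The true positive rates implied by the counts above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers quoted in the text; the names `reported_counts` and `sensitivity` are illustrative and are not part of the original method.

```python
# Counts quoted in the text for each motif:
# (true DNA-binding proteins identified, non-identical DNA-binding proteins available)
reported_counts = {
    "HTH": (71, 91),
    "HhH": (21, 23),
    "HLH": (13, 15),
}


def sensitivity(found, total):
    """True positive rate: fraction of known DNA-binding proteins recovered."""
    return found / total


rates = {motif: sensitivity(found, total)
         for motif, (found, total) in reported_counts.items()}

for motif, rate in rates.items():
    print(f"{motif}: {rate:.2f}")
# prints:
# HTH: 0.78
# HhH: 0.91
# HLH: 0.87
```

The HTH value reproduces the true positive rate of 0.78 quoted in the comparison with the neural network approach.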
The true positive rates we have obtained are slightly smaller than those obtained by Stawiski et al., using a neural network based on 12 different parameters (including electrostatics, but not using structural templates), trained on a somewhat smaller data set. In particular, their true positive rate (sensitivity) for DNA-binding proteins with an HTH motif is ~0.81, compared to our true positive rate of 0.78. However, our results have been achieved using only 3 types of parameter as opposed to 12. Indeed, as we have scanned as many non-DNA-binding proteins as possible, the accuracy and specificity of our method for any of the motifs is ~1, compared to the total accuracy and specificity of 0.92 and 0.94, respectively, using the neural network approach.
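The reason the specificity approaches 1 when a very large negative set is scanned can be made explicit with the standard definitions. The sketch below uses the HTH counts quoted earlier (71 true positives out of 91, 7 false positives); the symbol N_neg, the number of non-DNA-binding proteins scanned, is introduced here for illustration, and its value is not given in this section.

```latex
\begin{align}
\text{sensitivity} &= \frac{TP}{TP + FN}, \\
\text{specificity} &= \frac{TN}{TN + FP} = 1 - \frac{FP}{N_{\text{neg}}},
  \qquad N_{\text{neg}} = TN + FP, \\
\text{accuracy} &= \frac{TP + TN}{TP + FN + TN + FP}.
\end{align}
% HTH motif, counts from the text: TP = 71, FN = 91 - 71 = 20, FP = 7
\begin{align}
\text{sensitivity}_{\text{HTH}} &= \frac{71}{91} \approx 0.78, \\
\text{specificity}_{\text{HTH}} &= 1 - \frac{7}{N_{\text{neg}}} \to 1
  \quad (N_{\text{neg}} \gg 7).
\end{align}
```

Because the true negatives dominate every other term when N_neg is large, the accuracy tends to 1 in the same limit.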
It is a concern that when a large number of parameters are used in a machine learning context on a comparatively small data set, the resulting discriminator will be over-fitted, even if cross-validation has been employed. We have demonstrated that in the case of the HTH motif, three carefully chosen parameters can give results similar to those from 12 parameters, and as a result the former approach is likely to be more robust than the latter. It also presents us with a clear physical picture of the nature of DNA–protein binding, namely an appropriate spatial configuration for the protein and a positive electrostatic potential in the binding region. This is much harder to elucidate from a neural network approach based on such a large number of parameters.
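The three-parameter picture described above amounts to a simple conjunction of cut-offs. The following is a minimal sketch of such a filter, not the implementation used in this work: the cut-off values, the class `MotifHit`, the function `predicts_dna_binding` and the example candidates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# All cut-off values are hypothetical placeholders, not the published ones.
RMSD_MAX = 1.5        # angstroms; maximum superposition rmsd to the template
ASA_MIN = 900.0       # square angstroms; minimum accessible surface area of the motif
POTENTIAL_MIN = 0.0   # arbitrary units; a positive potential favours DNA binding


@dataclass
class MotifHit:
    name: str
    rmsd: float        # rmsd of the candidate motif superposed on the template
    asa: float         # solvent-accessible surface area of the motif
    potential: float   # electrostatic score in the region of the motif


def predicts_dna_binding(hit: MotifHit, use_asa: bool = True) -> bool:
    """Apply the cut-offs in turn; every enabled criterion must pass."""
    fits_template = hit.rmsd <= RMSD_MAX
    # The ASA criterion can be switched off, as was done for the HhH motif.
    is_exposed = hit.asa >= ASA_MIN if use_asa else True
    is_positive = hit.potential > POTENTIAL_MIN
    return fits_template and is_exposed and is_positive


# Invented candidates, for illustration only.
hits = [
    MotifHit("candidate_a", rmsd=0.9, asa=1100.0, potential=2.3),
    MotifHit("candidate_b", rmsd=1.1, asa=1050.0, potential=-1.4),
    MotifHit("candidate_c", rmsd=1.2, asa=400.0, potential=1.8),
]
for hit in hits:
    print(hit.name, predicts_dna_binding(hit), predicts_dna_binding(hit, use_asa=False))
```

Each criterion maps onto one physical requirement, so a rejected candidate can be traced to a poor structural fit, a buried motif or an unfavourable electrostatic potential.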
The structural approach also gives us an insight into the evolutionary diversity of these motifs. In the case of the HTH motif, there are a large number of sequence families, whether defined using HMMs or a 35% sequence identity criterion. This may indicate examples of convergent evolution. As a result, structural approaches (in conjunction with the electrostatic potential), such as the one outlined here, are the optimal method for detecting new DNA-binding proteins with such a motif.
On the other hand, despite the fact that there are more examples of DNA-binding proteins with an HhH motif in the PDB than those with an HTH motif, there is a considerably smaller number of sequence families. This set of proteins can be identified using a single structural template from an initial set of six H superfamilies. Furthermore, the ‘HHH’ HMM of Pfam can identify proteins from five of the H superfamilies. This is not due to any misclassification of the domains, as this also occurs for version 2.5.1 of CATH and we see a similar diversity at the fold and superfamily level of the SCOP database (46). This implies a much smaller amount of evolutionary diversity. Finally, DNA-binding proteins with an HLH motif exhibit very little evolutionary diversity, as one HMM can identify all such proteins.
By approaching the detection of DNA-binding proteins in terms of different structural motifs, we can tease out the relative importance of the observables employed here, which may not be detected from studying all possible DNA-binding protein structures using one model. The above results suggest that future studies should integrate structural and sequence methods to identify new DNA-binding proteins. In the case of proteins with an HTH motif, the methods we have described above will be most useful. On the other hand, those with an HLH, and probably an HhH, motif will be best identified using HMMs.