Robust methods to detect DNA-binding proteins from structures of unknown function are important for structural biology. This paper describes a method for identifying such proteins that have (i) a solvent-accessible structural motif necessary for DNA binding and (ii) a positive electrostatic potential in the binding region. We focus on three structural motifs: helix–turn–helix (HTH), helix–hairpin–helix (HhH) and helix–loop–helix (HLH). We find that the combination of these variables detects 78% of proteins with an HTH motif, which is a substantial improvement over previous work based purely on structural templates and is comparable to more complex methods of identifying DNA-binding proteins. Similar true positive fractions are achieved for the HhH and HLH motifs. We see evidence of wide evolutionary diversity for DNA-binding proteins with an HTH motif, and much smaller diversity for those with an HhH or HLH motif.
One of the challenges of structural genomics is to elucidate the function of proteins of known sequence and unknown function. In this paper, we shall focus on the methods for identifying the fraction of proteins that bind to DNA. This is a non-trivial task as it has been estimated that 6–7% of all eukaryotic proteins bind DNA (1). Although there are a number of possible parameters that can be used to identify a DNA-binding protein, in this paper, we combine searches for a set of structural motifs and a positive electrostatic potential on the surface of a putative DNA-binding protein. This approach is only relevant if a three-dimensional (3D) structure is available.
It has been observed that many known DNA-binding proteins have one of a small number of distinct structural motifs that play a key role in binding DNA (2). We focus on three motifs: the helix–turn–helix (HTH) motif, the helix–hairpin–helix (HhH) motif and the helix–loop–helix (HLH) motif. The HTH motif has previously been considered in a preliminary analysis by Jones et al. (3) with some success, and these methods are extended here.
As suggested by their names, all three motifs start and terminate with helices (denoted as H1 and H2), connected by a short linking region of varying geometry (which does not form a helix or part of a sheet). Examples of each motif used to derive structural templates are shown in Figure 1.
DNA-binding proteins with an HhH structural motif are involved in non-sequence-specific DNA binding that occurs via the formation of hydrogen bonds between protein backbone nitrogens and DNA phosphate groups. These HhH motifs are observed in DNA repair enzymes and in DNA polymerases. Structurally, the motif forms a pair of anti-parallel α-helices connected by a hairpin-like loop. This loop is involved in interactions with the DNA (8–10) and usually contains a consensus GXG sequence pattern, where X is a hydrophobic residue. The two α-helices are packed at an acute angle of ~25–50° that dictates the characteristic pattern of hydrophobicity in the sequences (11).
DNA-binding proteins with the HLH structural motif are transcriptional regulatory proteins and are principally related to a wide array of developmental processes. In 1997, Atchley and Fitch (12) identified 242 HLH DNA-binding proteins in organisms ranging from Saccharomyces cerevisiae to Homo sapiens. These proteins have in common a highly conserved region that allows them to bind to DNA and to interact with each other (13). The motif is longer, in terms of residues, than the other two motifs. Many of these proteins interact to form homo- and hetero-dimers. The structural motif is composed of two long helix regions, with the N-terminal helix binding to the DNA, while the loop region allows the protein to dimerize.
Given the negative electrostatic potential that envelops the DNA, it has been noted that a DNA-binding protein will have a complementary positive electrostatic potential in its binding region. This was used initially to identify the DNA-binding nature of the Tubby protein (14). The calculation of an electrostatic potential and its use in the prediction of DNA-binding sites has previously been presented (15). Each accessible atom is assigned a score that is proportional to the surface integral of the potential over a region projected from the accessible surface, 7 Å from each atom.
Given that the geometry and electrostatic potential are essentially independent variables, it is plausible that a combination of the two should provide an improved method for identifying DNA-binding proteins, which is the focus of the current work. Initially, a set of structural templates is constructed for each of the two new motifs, based on the methods of Jones et al. (3). These structural templates are employed as the ‘first pass’ to scan all structures in the Protein Data Bank (PDB) (16) to identify the DNA-binding proteins, by calculating an optimal superposition of a template on a complete structure. This gives an initial set of hits containing correct matches and false positives, and leaves some false negatives. An accessible surface area (ASA) threshold and an electrostatic motif score (EMS) threshold are then applied to this initial set of true and false positives to improve the accuracy of the predictions. The success of these structural templates is then compared to that of sequence homology methods. It has been shown that the HTH templates were generic (identifying DNA-binding motifs across different homologous families). The generic nature of the HhH motif is investigated here. Finally, the current template method is compared with another, more complex approach for detecting DNA-binding proteins.
In this paper, the term ‘family’ is employed to describe clusters of protein chains that exhibit evidence, from a particular observable, of a common evolutionary ancestor. The term ‘set’ is used more generally for clusters of protein chains that exhibit a similarity under some measure, which may, or may not, be evidence for a common evolutionary ancestor. In particular, we assume that similarity to a particular structural template does not necessarily imply a common evolutionary ancestor. On the other hand, all other criteria (sequence, global structure comparison) used in this paper to cluster proteins are assumed to imply a common evolutionary ancestor. Typically, families defined using one method may be subfamilies of a larger family defined using another method. In addition, sets defined using similarity to a particular structural template will have the largest number of elements and all the other families will lie within them.
Furthermore, individual protein chains are often used to represent families or sets of structures. In order to avoid confusion, each family and set definition is described and, if necessary, a label is assigned to a representative structure derived from a family using this definition. These labels are then used consistently through the rest of the paper.
In the first instance, an ‘S sequence family’ is defined as a set of proteins that have a domain with a sequence identity that is >35%. In the CATH hierarchy of protein structures (17), this corresponds to the ‘S-level’ (the fifth integer in the CATH number). The families constructed using this definition will have the smallest number of members. A representative sequence of such a family is referred to as an SREP, while a representative of a structure from this family would be a T_SREP.
A ‘D-HMM sequence family’ is defined as a set of protein sequences whose E-values from a specific hidden Markov sequence model (HMM) (18), defined from Pfam (19) or SMART (20), are <10^-2. The ‘D’ indicates a defined HMM from Pfam or SMART, which are somewhat conservative in their range in comparison to other possible HMMs. S sequence families form subfamilies of D-HMM sequence families, as shown schematically in Figure 2a.
An ‘H superfamily’ is defined from structural data as a set of protein chains with a structural domain that is in the same non-homologous structural family, defined from the CATH database. This corresponds to the ‘H-level’ in the CATH hierarchy (the fourth integer in the CATH number). Typically, as shown in Figure 2b, D-HMM sequence families are subsets of H superfamilies, though this is not necessarily true (in particular, in the case of the HhH motif). Nonetheless, all S families form subsets of H superfamilies. A representative from this family used to form a structural template is referred to as T_HREP.
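The nesting of S sequence families inside H superfamilies can be illustrated with a short sketch. It assumes that each domain carries a dotted CATH number in which the fourth integer is the H-level and the fifth is the S-level, as stated above. The function name `group_by_level`, the domain labels and the CATH numbers are all hypothetical and are not taken from the data set.

```python
from collections import defaultdict

def group_by_level(cath_numbers, depth):
    """Group domain labels by the first `depth` integers of their CATH number."""
    groups = defaultdict(set)
    for domain, number in cath_numbers.items():
        key = ".".join(number.split(".")[:depth])
        groups[key].add(domain)
    return dict(groups)

# Hypothetical domains and CATH numbers, for illustration only.
domains = {
    "domA": "1.2.3.4.1",
    "domB": "1.2.3.4.1",
    "domC": "1.2.3.4.2",
    "domD": "1.2.3.7.1",
}

h_superfamilies = group_by_level(domains, 4)  # H-level: fourth integer
s_families = group_by_level(domains, 5)       # S-level: fifth integer

# Every S family lies inside exactly one H superfamily.
for s_key, members in s_families.items():
    h_key = ".".join(s_key.split(".")[:4])
    assert members <= h_superfamilies[h_key]
```

Because the S-level key extends the H-level key, the subset relation holds by construction, which is the sense in which all S families form subsets of H superfamilies.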
Finally, a ‘structural template set’ is a set of protein chains that have a sequentially continuous structural fragment that is similar to a particular structural template. As shown in Figure 2c, such sets can intersect with each other. Furthermore, all the previously mentioned families are subsets of these structural template sets.
The structural templates were derived as described previously (3). From a set of structures derived from the literature, HMMs from Pfam (19) and SMART (20) were obtained. These were then used to identify the equivalent D-HMM sequence families. If additional proteins, which have structures in the PDB, were identified and validated as true DNA-binding proteins with these motifs from the literature, they were added to the relevant motif set and the process repeated until no new structures were added.
Using the CATH database (17), the set of proteins for each motif was clustered into H superfamilies. As discussed previously, representative structures from each H superfamily were selected and denoted as T_HREPs. For each T_HREP, a 3D motif template was derived. The templates are sets of Cα positions for protein structure fragments (taken from the co-ordinates of a PDB file). The templates are sequentially continuous in terms of residue number and comprise all the residues from the start of H1 to the end of H2. The start and end points of each motif were identified from the literature and by visualizing the proteins using Rasmol (21). These templates were scanned against whole protein structures using the algorithm scan-rmsd, based on the Kabsch method (3).
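The scanning step can be sketched as a sliding-window superposition. The code below is a minimal illustration and not the scan-rmsd program itself: it assumes NumPy is available, treats the chain as an unbroken array of Cα coordinates, and uses the hypothetical names `kabsch_rmsd` and `scan_rmsd`. The coordinates are synthetic.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)            # centre both fragments on the origin
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                       # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    # If the best orthogonal fit is a reflection, flip the smallest
    # singular value so that only proper rotations are considered.
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1] = -S[-1]
    msd = (np.sum(P * P) + np.sum(Q * Q) - 2.0 * S.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))

def scan_rmsd(template, chain):
    """Slide the template along the chain; return (best_rmsd, start_index)."""
    n = len(template)
    best = (float("inf"), -1)
    for start in range(len(chain) - n + 1):
        r = kabsch_rmsd(template, chain[start:start + n])
        if r < best[0]:
            best = (r, start)
    return best

# Synthetic check: the template is a rotated and translated copy of
# residues 12..28 of a random chain, so the scan should recover start 12.
rng = np.random.default_rng(0)
chain = rng.normal(scale=10.0, size=(40, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
template = chain[12:29] @ rot.T + np.array([5.0, -3.0, 2.0])
best_rmsd, best_start = scan_rmsd(template, chain)
```

A real scan would also need to skip windows that span chain breaks, since the templates are defined to be sequentially continuous.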
By creating a histogram of the root-mean-square distance (rmsd) of the optimal superposition of template on complete protein over the set of DNA-binding proteins with the relevant motif (TRUE) and all other entries in the PDB (FALSE), we obtain a cut-off for the rmsd to discriminate between the sets. The cut-off for the rmsd is taken as the value at which the Matthews Φ coefficient is at its maximum (22).
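As a rough illustration of this cut-off selection, the sketch below evaluates the Φ coefficient at every observed rmsd value and keeps the one that maximizes it. It assumes a hit is counted when the rmsd is less than or equal to the cut-off. The function names are hypothetical and the rmsd values are made up, not taken from the TRUE and FALSE sets.

```python
import math

def matthews_phi(tp, fp, tn, fn):
    """Phi coefficient of a 2x2 confusion table (0.0 if undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_rmsd_cutoff(true_rmsds, false_rmsds):
    """Return (cutoff, phi) maximizing phi; a hit is rmsd <= cutoff."""
    best_cut, best_phi = None, -1.0
    for cut in sorted(set(true_rmsds) | set(false_rmsds)):
        tp = sum(r <= cut for r in true_rmsds)
        fn = len(true_rmsds) - tp
        fp = sum(r <= cut for r in false_rmsds)
        tn = len(false_rmsds) - fp
        phi = matthews_phi(tp, fp, tn, fn)
        if phi > best_phi:
            best_cut, best_phi = cut, phi
    return best_cut, best_phi

# Made-up rmsd values (in Å) for illustration only.
true_rmsds = [0.6, 0.9, 1.1, 1.3, 1.5, 2.4]
false_rmsds = [1.4, 1.9, 2.2, 2.6, 3.0, 3.3, 3.8, 4.1]
cutoff, phi = best_rmsd_cutoff(true_rmsds, false_rmsds)
```

The same routine applies to any scalar score with the inequality reversed where larger values indicate a true hit.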
A summary of the number of S sequence families, D-HMM sequence families, H superfamilies and structural template sets is given in Table 1.
Starting with a set of 120 HTH proteins from the literature, this procedure resulted in 86 non-identical HTH proteins clustered into seven H-super structural families.
The starting point for this motif was a list of 146 proteins from the PDB known to contain at least one HhH motif, which had been identified from the literature (9,23–28). The above procedure resulted in 23 non-identical HhH proteins that were initially clustered into six H-super structural families.
The starting point for this motif was a list of 9 proteins from the PDB known to contain at least one HLH motif, which had been identified from the literature (29–31). The above methods resulted in 15 non-identical HLH protein chains that clustered into a single H-super structural family.
The length of the HLH motif is variable (from 43 to 85 residues), as can be seen in Figure 3. As these proteins cluster into a single H-super structural family, the resulting T_HREP template must be as short as the shortest motif. Because the structure with the best resolution (PDB code 1hlo, chain B) (32) does not have the shortest motif (PDB code 1an4, chain A) (36), a choice must be made in truncating this motif. The length of each helix region, counted from the loop, was set to the length of the corresponding helix in the shortest motif. As can be seen in Figure 3, 23 residues of helix H1 were included from the N-terminal edge of the loop and 10 residues of helix H2 were included from the C-terminal edge of the loop. This template shall be referred to as the reduced template.
To help in the discrimination of motifs that bind DNA and those that do not, the ASAs of the matched motifs were calculated using the program NACCESS (37). From the analysis of HTH motifs, it was known that DNA-binding motifs had to have a minimal accessibility in order for them to interact with the DNA (3).
The automatic identification of DNA-binding proteins using a positive electrostatic potential on the surface of the binding region has been employed previously (15,38). However, none of these methods combined a measure of electrostatic potential with a structural template. The electrostatic potential is computed for those proteins satisfying the criteria of a sufficiently small rmsd from one of the structural templates and a sufficiently large accessibility. As outlined in Jones et al. (15), the electrostatic score ΔQ_i is defined from the potential for each surface accessible atom (labelled i) of the protein. The EMS is defined as
\[ \mathrm{EMS} = \frac{1}{N_M} \sum_{i \in M} \Delta Q_i \]
where M is the set of surface accessible atoms that have been identified as being part of the motif and N_M is the number of atoms in M.
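A minimal sketch of this score follows. It reads the definition as the mean of the per-atom electrostatic scores over the atoms in M, which is an assumption based on the roles given to M and N_M in the text. The function name, the atom identifiers and the score values are hypothetical.

```python
def electrostatic_motif_score(delta_q, motif_atoms):
    """Mean electrostatic score over the surface-accessible motif atoms.

    delta_q     : dict mapping atom id -> score, surface-accessible atoms only
    motif_atoms : iterable of atom ids belonging to the matched motif
    """
    # M is the set of motif atoms that are also surface accessible.
    scores = [delta_q[i] for i in motif_atoms if i in delta_q]
    if not scores:
        raise ValueError("no surface-accessible motif atoms")
    return sum(scores) / len(scores)   # assumed form: sum over M divided by N_M

# Hypothetical values: atom 4 is buried (no score), atom 7 is outside the motif.
delta_q = {1: 0.2, 2: -0.1, 3: 0.3, 7: 0.5}
ems = electrostatic_motif_score(delta_q, motif_atoms=[1, 2, 3, 4])
```

Normalizing by the number of atoms keeps the score comparable between motifs of different length.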
The Matthews Φ coefficient was used to find the best EMS threshold for each relevant motif.
A structural template library of HTH motifs from seven T_HREPs has previously been defined in (3) (forming seven structural template families). These seven structural templates (each extended by two residues at the start and the end of the motif) were used to scan a non-redundant data set of proteins in the PDB and a set of 86 non-redundant HTH structures known to bind DNA. From the resulting rmsd distribution, a threshold value of 1.6 Å was selected that resulted in 61 false positives. An ASA threshold was selected at 990 Å², which reduced the false positive set of proteins to 38. Using these cut-offs, there were 10 false negatives. In this work, an analysis of the false positive structures resulted in the identification of three ‘new’ DNA-binding HTH motifs in DNA polymerase I structures (PDB code 1taq0) (39), methyltransferase (PDB code 1mgtA) (40) and histone acetyltransferase (PDB code 1fy7A) (41). Since this analysis, a further two false positive structures were identified as known HTH motifs, namely histone-like protein HU (PDB code 1b8z) (42) and sporulation response regulator Spo0A (PDB code 1fc3) (43). This gives a total of 91 non-identical proteins with a DNA-binding HTH motif. The application of an rmsd threshold of 1.6 Å and an ASA threshold of 990 Å² then identified 81 non-identical proteins with a DNA-binding HTH motif (TRUE_HTH) and left a false positive set of 33 protein structures (FALSE_HTH).
The EMS was calculated for proteins in the TRUE_HTH and FALSE_HTH data sets and a histogram of these values is shown in Figure 4. From this figure, it can be seen that the true and false sets can be resolved reasonably well. If the threshold is taken to be 0.05, the number of false negatives increases to 20 and the number of false positives decreases to 7. The true positive fraction is 78%.
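The three HTH thresholds can be read as a simple cascade, sketched below. The threshold values are those quoted in the text. The direction and inclusiveness of each inequality (small rmsd, large ASA, large EMS) are assumptions drawn from the description, and the `MotifHit` records are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MotifHit:
    name: str
    rmsd: float   # Å, best superposition on any structural template
    asa: float    # Å^2, accessible surface area of the matched motif
    ems: float    # electrostatic motif score

RMSD_MAX = 1.6    # Å
ASA_MIN = 990.0   # Å^2
EMS_MIN = 0.05

def predicted_dna_binding(hit):
    """Apply the rmsd, ASA and EMS thresholds in turn."""
    return hit.rmsd <= RMSD_MAX and hit.asa >= ASA_MIN and hit.ems >= EMS_MIN

# Hypothetical hits, for illustration only.
hits = [
    MotifHit("hypothetical_1", rmsd=1.2, asa=1100.0, ems=0.12),   # passes all
    MotifHit("hypothetical_2", rmsd=1.4, asa=1050.0, ems=-0.30),  # fails EMS
    MotifHit("hypothetical_3", rmsd=1.5, asa=700.0, ems=0.20),    # fails ASA
]
predicted = [h.name for h in hits if predicted_dna_binding(h)]

# True positive fraction from the counts in the text:
# 91 non-identical HTH proteins, 20 false negatives after all three cut-offs.
tp_fraction = (91 - 20) / 91
```

With 71 of the 91 proteins retained, the fraction rounds to the 78% quoted above.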
A structural library of HhH motifs from six T_HREPs was identified from the PDB. The six structural templates were used to scan 23 non-identical HhH proteins (TRUE_HhH) and a non-redundant data set of the remaining proteins in the PDB (FALSE_HhH). The scan against the TRUE_HhH set revealed that one template (PDB code 1ci4, chain A, residues 20–36) (27) had an rmsd of <1.4 Å with all other known HhH proteins (primarily because it is the shortest template, being 17 residues long). As a result, the DNA-binding proteins with an HhH motif form a single structural template family, and this template was used as the single representative template to scan all other proteins for HhH motifs.
The single template was used to scan the TRUE_HhH (1ci4 was eliminated from the set for consistency) and FALSE_HhH sets. A histogram of the rmsd values calculated using the template for the non-identical chains of these data sets is shown in Figure 5. The optimum threshold would be 1.2 Å. However, this would require the introduction of an additional structural template (i.e. two structural template families would be required to cover the TRUE_HhH set). Given the small size of the TRUE_HhH set, there is little justification for increasing the number of templates. A cut-off of 1.4 Å was used instead. With this cut-off, there are no false negatives but there are 29 false positives.
The ASA was computed for the non-identical HhH chains using an rmsd threshold of 1.4 Å. The range for the remaining true positives was from 404 to 1390 Å² (with a mean of 875 Å²) and from 541 to 1374 Å² (with a mean of 926 Å²) for the false positives. The distributions showed that these data sets could not be distinguished in any meaningful way and the use of an ASA threshold for this motif was excluded.
The EMS was calculated for those proteins in the TRUE_HhH and FALSE_HhH data sets that satisfied the rmsd threshold. A histogram of these values is shown in Figure 6. By employing a threshold of −0.2, 15 of the false positives can be eliminated. Two false negatives are also introduced. Hence, the total number of true positives is 19 (86% of the total number of non-identical chains) and there are 14 false positives.
The limited number of HLH proteins in the PDB meant that there was a single T_HREP identified for this motif (PDB code 1hlo, chain B, residues 17–59) (32). A reduced structural template was constructed for this single representative as described previously and was used to scan the remaining 14 HLH non-identical protein chains (TRUE_HLH) and 11121 non-identical remaining proteins in the PDB (FALSE_HLH), excluding the known DNA-binding HLH proteins. The histogram for the non-identical protein chains of both sets for the scan is shown in Figure 7. For a threshold rmsd of 3.0 Å, there are no false positives but there are 2 false negatives. The very large rmsd for the 2 false negatives is due to a high variability in the loop region, and increasing the cut-off would introduce an unacceptable number of false positives. As a result, 12 of the 14 HLH DNA-binding proteins are identified with a true positive fraction of 86%.
Computing the ASA and EMS on such a reduced template has no physical meaning, because the full motifs contain residues that contact the DNA but would be excluded from the calculation. As a result, the ASA and EMS were not computed.
It is important to compare these structurally based methods with the sequence-based approach using HMMs. We previously found that the HTH structural templates were generic across homologous families when compared to the sequence-based HMMs that in general only identified members of their own sequence families. This comparative analysis between structure and sequence-based methods is conducted here for the HhH motif.
Our analysis, combined with CATH, suggests that all the HhH motifs occur in a single structural family. In Pfam and SMART, there are four HMMs and hence four D-HMM sequence families. In CATH, there are 10 S sequence families (consisting of clusters with <35% sequence identity between any pair of sequences in different clusters). Each of these 10 S sequence families is represented by an SREP sequence.
The E-value for each SREP sequence of all the S sequence families with an HhH motif was computed, using SAM-T99, for all the HMMs used to identify the HhH structural templates. A successful HMM hit was recorded when an HMM gave an E-value < 0.01 for a particular SREP. Pairs of SREP sequences were identified when the same HMM hit both of them (an HMM cross hit), as shown in Figure 8a. Likewise, the rmsd for all of the T_SREPs was computed using the structural templates derived from each T_SREP. Pairs of T_SREPs were identified (an rmsd cross hit) when there was a successful rmsd hit of one of the structural templates on the other (a cut-off rmsd of 1.4 Å was employed), as shown in Figure 8b. The final numbers of such cross hits are summarized in Table 1.
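The two kinds of cross hit can be sketched as follows. The cut-offs are those given in the text. The data layout, the function names and all E-values and rmsd values are hypothetical, and treating the rmsd cut-off as inclusive is an assumption.

```python
from itertools import combinations

E_CUTOFF = 0.01     # E-value threshold for a successful HMM hit
RMSD_CUTOFF = 1.4   # Å, threshold for a successful rmsd hit

def hmm_cross_hits(evalues):
    """evalues[hmm][srep] -> E-value. Return SREP pairs hit by the same HMM."""
    pairs = set()
    for per_srep in evalues.values():
        hit = sorted(s for s, e in per_srep.items() if e < E_CUTOFF)
        pairs.update(combinations(hit, 2))
    return pairs

def rmsd_cross_hits(rmsd):
    """rmsd[(template, target)] -> Å. A pair counts if either direction hits."""
    pairs = set()
    for (a, b), r in rmsd.items():
        if a != b and r <= RMSD_CUTOFF:
            pairs.add(tuple(sorted((a, b))))
    return pairs

# Hypothetical representatives A, B, C with made-up scores.
evalues = {
    "hmm1": {"A": 1e-5, "B": 3e-4, "C": 0.5},
    "hmm2": {"A": 2.0, "B": 1.0, "C": 1e-8},
}
rmsd = {
    ("A", "B"): 1.1, ("B", "A"): 1.1,
    ("A", "C"): 1.3, ("C", "A"): 1.6,
    ("B", "C"): 2.0, ("C", "B"): 2.1,
}
hmm_pairs = hmm_cross_hits(evalues)
rmsd_pairs = rmsd_cross_hits(rmsd)
```

In this toy case, representative C is reached only through a structural template, which mirrors the situation where a template identifies a chain that no shared HMM does.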
As can be seen, almost all the templates can be identified with one HMM, namely the ‘HHH’ HMM from Pfam. However, one protein chain, PDB code 1ci4 chain A, is not identified by any HMM from outside its own S sequence family (the protein chain 1jx4, chain A is also quite marginal, as the E-value is in fact slightly >0.01, but we assume it is a cross hit nonetheless). On the other hand, in Figure 8b, it is observed that all the protein chains can be identified by the structural template approach, including the above structures. Two of the structural templates can identify all the other structural templates.
In the current work, it has been demonstrated how a number of structural features can be employed to determine whether a protein of known structure, and unknown function, is a DNA-binding protein. These structural features are the similarity to one of a small number of DNA-binding motifs (HTH, HhH or HLH), the solvent accessibility of the motif and the electrostatic potential in the region of the motif. The relative importance of the similarity, the accessibility and the electrostatic potential varies depending on the motif. It is also important to note that the level of sequence similarity varies enormously between the different types of motif and the optimal type of search (structural or sequence) employed to find such proteins might also vary.
A concern with using structural templates is that it has become clear that many DNA-binding proteins exhibit intrinsically disordered regions, which only become ordered upon binding to DNA (44). One well-known example of this is the leucine zipper protein GCN4 (45). In the case of the three motifs used here, there exist examples of the motifs in complexed and uncomplexed form, and both have been used here, indicating that this is unlikely to occur for these motifs. Furthermore, Stawiski et al. (38) have also demonstrated that a structural approach can distinguish complexed and uncomplexed DNA-binding proteins.
The final results are summarized in Table 2. In the case of the HTH motif, with 133 examples in the PDB (equivalent to 91 non-identical proteins), the cut-offs for the superposition rmsd and ASA of the motif are complemented by the electrostatic potential, reducing the number of false positives from 33 to 7, and identifying 71 non-identical true proteins.
In the case of the HhH motif, with 161 examples in the PDB (23 non-identical proteins), the combination of rmsd and the electrostatic potential resulted in 14 false positives and identified 21 of the non-identical true proteins. The ASA did not resolve the true and false data sets reliably, and was discarded. This is not surprising, given that only a small fraction of the motif makes contact with the DNA. The EMS removes approximately half of the false positives.
The analysis of the HLH motif was of limited value as all the known structures are part of the same D-HMM family. Nonetheless, the use of the rmsd from a single structural template of reduced length gives quite good resolution, eliminating all false positives and identifying 13 out of a possible 15 true non-identical DNA-binding proteins with an HLH motif.
The true positive rates we have obtained are slightly smaller than those obtained by Stawiski et al. (38), using a neural network based on 12 different parameters (including electrostatics, but not using structural templates), trained on a somewhat smaller data set. In particular, their true positive rate (sensitivity) for DNA-binding proteins with an HTH motif is ~0.81, compared to our true positive rate of 0.78. However, our results have been achieved using only 3 types of parameter as opposed to 12. Indeed, as we have scanned as many non-DNA-binding proteins as possible, the accuracy and specificity of our method for any of the motifs is ~1, compared to the total accuracy and specificity of 0.92 and 0.94, respectively, using the neural network approach.
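The reason the accuracy and specificity approach 1 can be seen from a short worked calculation. The sketch below uses the HLH scan counts quoted earlier (12 of 14 true chains found, no false positives among 11121 other non-identical chains). The function name `rates` is hypothetical and the standard definitions of sensitivity, specificity and accuracy are assumed.

```python
def rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# HLH scan: 12 true positives, 2 false negatives, 0 false positives,
# 11121 true negatives.
sens, spec, acc = rates(tp=12, fn=2, fp=0, tn=11121)
```

Because the true negatives vastly outnumber every other count, accuracy and specificity are dominated by them, so the true positive rate is the more informative figure when comparing methods.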
It is a concern that when a large number of parameters are used in a machine learning context on a comparatively small data set, the resulting discriminator will be over-constrained, even if cross-validation has been employed. We have demonstrated that in the case of the HTH motif, three carefully chosen parameters can give similar results to 12 parameters, and as a result the former approach is likely to be more robust than the latter. It also presents us with a clear physical picture of the nature of DNA–protein binding, namely an appropriate spatial configuration for the protein and a positive electrostatic potential in the binding region. This is much harder to elucidate from a neural network approach based on such a large number of parameters.
The structural approach also gives us an insight into the evolutionary diversity of these motifs. In the case of the HTH motif, there are a large number of sequence families defined using HMMs or a 35% sequence identity criterion. This may indicate examples of convergent evolution. As a result, structural approaches (in conjunction with the electrostatic potential), such as the one outlined here, are the optimal method for detecting new DNA-binding proteins with such a motif.
On the other hand, despite the fact that there are more examples of DNA-binding proteins with an HhH motif in the PDB than those with an HTH motif, there are considerably fewer sequence families. This set of proteins can be identified using a single structural template from an initial set of six H superfamilies. Furthermore, the ‘HHH’ HMM of Pfam can identify proteins from five of the H superfamilies. This is not due to any misclassification of the domains as this also occurs for version 2.5.1 of CATH and we see a similar diversity at the fold and superfamily level of the SCOP database (46). This implies a much smaller amount of evolutionary diversity. Finally, DNA-binding proteins with an HLH motif exhibit very little evolutionary diversity, as one HMM can identify all such proteins.
By approaching the detection of DNA-binding proteins in terms of different structural motifs, we can tease out the relative importance of the observables employed here, which may not be detected from studying all possible DNA-binding protein structures using one model. The above results suggest that future studies should integrate structural and sequence methods to identify new DNA-binding proteins. In the case of proteins with an HTH motif, the methods we have described above will be most useful. On the other hand, those with an HLH, and probably an HhH, motif will be best identified using HMMs.
M.G. was supported by a Fundacion Universitaria San Pablo CEU (Spain) fellowship. S.J. was initially supported by a US Department of Energy grant (DE-FG02-96ER62166) and H.S. was supported by a UK MRC/PPARC training fellowship.