Proteins with insignificant sequence and overall structure similarity may still share locally conserved contiguous structural segments; i.e. structural/3D motifs. Most methods for finding 3D motifs require a known motif to search for other similar structures or functionally/structurally crucial residues. Here, without requiring a query motif or essential residues, a fully automated method for discovering 3D motifs of various sizes across protein families with different folds based on a 16-letter structural alphabet is presented. It was applied to structurally non-redundant proteins bound to DNA, RNA, obligate/non-obligate proteins as well as free DNA-binding proteins (DBPs) and proteins with known structures but unknown function. Its usefulness was illustrated by analyzing the 3D motifs found in DBPs. A non-specific motif was found with a ‘corner’ architecture that confers a stable scaffold and enables diverse interactions, making it suitable for binding not only DNA but also RNA and proteins. Furthermore, DNA-specific motifs present ‘only’ in DBPs were discovered. The motifs found can provide useful guidelines in detecting binding sites and computational protein redesign.
Sequence motifs can help to quickly relate a novel protein sequence to a known protein family (1) and to identify its plausible function. They usually include conserved essential residues involved in catalysis, in ligand binding or in maintaining a specific conformation. Hence, they can be detected by searching homologous protein sequences for the occurrence of invariant or highly conserved residues. Proteins with a sequence motif comprising functionally/structurally crucial residues share not only sequence similarity, but also structural similarity (2,3). In such cases, a structural motif, comprising conserved 3D conformations of amino acid residues that are crucial for the protein’s fold, stability and/or function, can be associated with the sequence motif.
For proteins that possess insignificant sequence and overall structural similarity, structural/3D motifs as opposed to sequence motifs may be present. 3D motifs have been used to suggest the function of proteins whose structures are known on the basis that similarity in the local structure implies similarity in biological function (4,5); e.g. the helix-turn-helix (HTH) motif has been used to predict proteins with DNA-binding function (6,7). In general, 3D motifs can be constructed manually or automatically using various methods. They have been constructed using conservation of sequence and structural features and compiled in the MegaMotifBase (8). Most methods; e.g. SPASM (9), HTHquery (10), PAR-3D (11), Superimpose (12) and RASMOT-3D PRO (13), require experimental data such as a known 3D motif to search for other similar structures or known active-site or binding-site residues (14).
Our key aim is to present a strategy for automatically discovering 3D motifs ‘across’ protein families based on a 16-letter structural alphabet (15) without requiring a known template motif or essential residues. A 3D motif is defined herein as a locally conserved contiguous structural segment recurring in ≥3 non-redundant proteins sharing <30% sequence identity (Figures 1 and 2). Another aim is to illustrate the usefulness of this strategy by applying it to discover 3D motifs in DNA-binding proteins (DBPs). DBPs were chosen for analysis because the HTH motif, which has been found in 16 DBP families, can be used for method validation and comparison with the motifs found. To evaluate the specificity of the 3D motifs found in DBPs, their occurrence frequencies in non-redundant non-DBPs were computed. Our strategy for finding 3D motifs can yield two types of functional motifs: (i) non-specific 3D motifs found in both DBPs and non-DBPs and (ii) DNA-specific motifs found ‘only’ in DBPs. The 3D motifs found can provide useful guidelines in detecting DNA-binding sites in proteins and redesigning DBPs to improve their DNA-binding affinity/specificity.
Four datasets were created by searching the Protein Data Bank (PDB) (16) and the PPI-Pred server (17) for ≤3-Å X-ray structures of proteins bound to dsDNA/RNA and obligate/non-obligate protein including antigens (18). These DNA/RNA/obligate protein/non-obligate protein-binding chains were then grouped according to their CATH (19) codes, and the complex structure with the best resolution in each group was chosen. This yielded 76 DNA-binding, 72 RNA-binding, 88 obligate protein-binding and 77 non-obligate protein-binding non-redundant proteins (20) (Supplementary Table S1).
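The selection step described above (one representative per CATH group, chosen by best resolution) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the tuple layout and the chain identifiers in the usage note are hypothetical, and ties in resolution are resolved by keeping the first chain seen, which is an assumption.

```python
def pick_representatives(chains, max_resolution=3.0):
    """Keep one chain per CATH code: the one with the best (lowest) resolution.

    chains: iterable of (chain_id, cath_code, resolution_in_angstrom).
    Chains with a resolution worse than max_resolution are discarded.
    """
    best = {}  # cath_code -> (chain_id, resolution)
    for chain_id, cath_code, resolution in chains:
        if resolution > max_resolution:
            continue
        current = best.get(cath_code)
        # Strict '<' keeps the first chain seen when resolutions tie (assumption).
        if current is None or resolution < current[1]:
            best[cath_code] = (chain_id, resolution)
    return {code: chain_id for code, (chain_id, _res) in best.items()}
```

For example, given three hypothetical chains `("x1-A", "1.10.10.10", 2.8)`, `("x2-A", "1.10.10.10", 2.1)` and `("x3-B", "3.30.70.20", 3.4)`, the function keeps only `x2-A` for the first CATH code and drops the third chain because its resolution exceeds 3 Å.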
For each complex structure, the HBPLUS (21) program was used to compute all possible protein−DNA/RNA/protein van der Waals (vdW) contacts and H bonds, which are defined, respectively, by a distance of 4.0 and 3.5 Å between a donor atom and an acceptor atom. An amino acid residue was assigned as binding if its atoms are in vdW contact or are H-bonded directly/indirectly via water molecules with DNA/RNA/protein atoms, and its solvent accessible surface area (SASA) in the free protein is non-zero.
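A simplified version of the binding-residue assignment can be sketched as below. It is not a substitute for HBPLUS: it only tests direct contacts against the 4.0 Å distance quoted above, treats that distance as an upper bound (an assumption), and ignores donor/acceptor geometry and water-mediated H bonds. The function name and argument layout are hypothetical.

```python
import math

VDW_CUTOFF = 4.0  # Å; vdW contact distance quoted in the text, used here as an upper bound

def is_binding_residue(residue_atoms, partner_atoms, sasa_free):
    """Return True if the residue directly contacts the partner and is solvent accessible.

    residue_atoms, partner_atoms: lists of (x, y, z) coordinates in Å.
    sasa_free: SASA of the residue in the free protein, in Å^2.
    Water-mediated H bonds are NOT considered in this sketch.
    """
    if sasa_free <= 0.0:
        return False  # buried residues are never assigned as binding
    return any(
        math.dist(a, b) <= VDW_CUTOFF
        for a in residue_atoms
        for b in partner_atoms
    )
```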
Each protein structure was encoded into its 1D structural sequence according to the structural alphabet (15), which was derived as follows: the backbone of each protein from a non-redundant protein structure database was represented by consecutive 5-residue segments, each described by a vector of 8 backbone dihedral angles. The dissimilarity between two vectors V1 and V2 of dihedral angles was measured by the root-mean-square deviation (RMSD) of the dihedral angle values, defined as:
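Written out from the verbal definition alone, and assuming that the eight angles are compared component-wise with equal weight, the angular RMSD (RMSDa) takes the form below. The symbol θ_k, denoting the k-th dihedral angle of a vector, is notation introduced here for illustration; angular differences are presumably taken on the circle (modulo 360°).

```latex
\mathrm{RMSDa}(V_1, V_2) \;=\; \sqrt{\frac{1}{8}\sum_{k=1}^{8}\bigl[\theta_k(V_1) - \theta_k(V_2)\bigr]^{2}}
```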
Using an unsupervised cluster analyzer based on the above RMSDa of the segments, 16 protein blocks/letters were identified and illustrated in previous work (15). These 16 letters comprise the structural alphabet.
The 3D protein structures were converted into strings of structural letters using the PDB reader program (15), as follows. For a given l-residue protein, l − 4 letter assignments were obtained by scanning the sequence using a 5-residue sliding window. The structure of each 5-residue segment was compared with that of each of the 16 letters and the letter that had the closest structure (as measured by the RMSDa) to the 5-residue segment was assigned to the middle residue of that segment (14). The four terminal residues of each protein (two at each terminus), which cannot be treated as center residues of the 5-residue segments, were assigned the letter Z.
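The encoding step can be sketched as follows. This is an illustration under stated assumptions, not the PDB reader program itself: the ordering of the eight angles within a window (ψ of the first residue, φ/ψ of the three middle residues, φ of the last) and the wrapping of angle differences into [−180°, 180°) are assumptions, and the prototype vectors must be supplied by the caller. The two prototypes in the usage note are toy values, not the published letter definitions.

```python
import math

def angle_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0

def rmsda(v1, v2):
    """Root-mean-square deviation between two equal-length dihedral-angle vectors."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(v1, v2)) / len(v1))

def encode(phi_psi, prototypes):
    """Convert per-residue (phi, psi) pairs into a structural letter string.

    phi_psi: list of (phi, psi) tuples in degrees, one per residue.
    prototypes: dict mapping each letter to its 8-angle prototype vector.
    The two residues at each terminus receive 'Z'.
    """
    n = len(phi_psi)
    letters = []
    for i in range(n):
        if i < 2 or i > n - 3:
            letters.append("Z")  # cannot be the center of a 5-residue window
            continue
        # Assumed angle order for the window centered on residue i.
        window = [
            phi_psi[i - 2][1],
            phi_psi[i - 1][0], phi_psi[i - 1][1],
            phi_psi[i][0], phi_psi[i][1],
            phi_psi[i + 1][0], phi_psi[i + 1][1],
            phi_psi[i + 2][0],
        ]
        letters.append(min(prototypes, key=lambda k: rmsda(window, prototypes[k])))
    return "".join(letters)
```

With toy prototypes `{"m": [-47, -57] * 4, "d": [135, -120] * 4}`, seven residues with helix-like angles `(-57, -47)` are encoded as `ZZmmmZZ`: an l-residue chain yields l − 4 letters plus four terminal Z characters.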
To discover locally conserved structural segments in the protein, the structural letter sequence of each protein was scanned using an l-letter sliding window yielding various l-mer structural patterns. A recurring l-mer structural pattern was defined when the latter was found in ≥3 non-redundant proteins (14); e.g. the cdehja pattern was found in three DBPs with different overall structures (Figure 1), and is thus a potential 3D motif.
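The pattern search described above amounts to collecting, for every l-mer, the set of proteins in which it occurs. A minimal sketch is given below; the function name and the protein identifiers in the usage note are hypothetical, and skipping windows that contain the terminal letter Z is an assumption rather than a documented rule.

```python
from collections import defaultdict

def recurring_patterns(sequences, l, min_proteins=3):
    """Find l-mer structural patterns present in at least min_proteins proteins.

    sequences: dict mapping protein id -> structural letter string.
    Returns a dict mapping each recurring pattern -> set of protein ids.
    """
    owners = defaultdict(set)
    for protein_id, seq in sequences.items():
        for start in range(len(seq) - l + 1):
            pattern = seq[start:start + l]
            if "Z" in pattern:
                continue  # assumption: terminal placeholders never belong to a pattern
            owners[pattern].add(protein_id)
    return {p: ids for p, ids in owners.items() if len(ids) >= min_proteins}
```

A pattern is counted once per protein even if it occurs several times in the same chain, which matches the requirement that it recur in ≥3 different non-redundant proteins. Patterns returned here are only candidates; the structural check with MultiProt described next decides which ones are 3D motifs.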
To verify that the backbone structures pertaining to each recurring structural pattern are indeed conserved, they were compared using the MultiProt program (22). MultiProt derives multiple structural alignments from simultaneous superposition of the input protein structures and detects common geometric cores. Each input molecule was treated as a pivot in turn, and the RMSD of the geometric core Cα atoms of a non-pivot input molecule from those of the pivot molecule was computed. The resulting RMSD values were then averaged. Since a continuous l-mer structural pattern encompasses l + 4 residues (the l center residues and two residues each at the N- and C-terminal sides), the structures of the respective l + 4 residue segments of a given recurring structural pattern were compared using MultiProt. As MultiProt does not require all input molecules to participate in the alignment, a 3D motif was defined when the recurring l-mer structural pattern has a geometric core composed of ≥l residues common to ‘all’ the input structures; e.g. although the 6-mer structural patterns, cdehja and ddehjl, are each found in three DBPs, only cdehja was considered to be a 3D motif, as the backbone structures are truly conserved (Figure 2).
The occurrence frequencies of a given 3D motif in non-redundant proteins that bind DNA, RNA, obligate proteins and non-obligate proteins were computed. A DNA-binding motif is defined as an l-mer motif containing ≥l/6 DNA-binding residues whose frequency in DBPs relative to that in all non-redundant proteins is ≥1.5. If the DNA-binding motif is absent in non-redundant proteins that bind RNA, obligate proteins and non-obligate proteins, it was considered to be DNA-specific.
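The classification rule can be sketched as below. The text does not spell out how "frequency" is normalized, so this sketch assumes occurrences per protein in each dataset, with "all non-redundant proteins" taken as the union of the four datasets; the function name, dictionary keys and the counts in the usage note are hypothetical.

```python
def classify_motif(l, n_binding, counts, sizes, ratio_cutoff=1.5):
    """Classify an l-mer motif as DNA-specific, DNA-binding or neither.

    l: motif length in letters.
    n_binding: number of DNA-binding residues in the motif.
    counts: dict dataset name -> occurrences of the motif (must include 'DNA').
    sizes: dict dataset name -> number of proteins in that dataset.
    Frequency is assumed to mean occurrences per protein.
    """
    freq_dna = counts["DNA"] / sizes["DNA"]
    freq_all = sum(counts.values()) / sum(sizes.values())
    enriched = freq_all > 0 and freq_dna / freq_all >= ratio_cutoff
    if not (n_binding >= l / 6 and enriched):
        return "not DNA-binding"
    # DNA-specific: the motif never occurs in the non-DNA datasets.
    absent_elsewhere = all(c == 0 for name, c in counts.items() if name != "DNA")
    return "DNA-specific" if absent_elsewhere else "DNA-binding"
```

With the dataset sizes reported above (76, 72, 88 and 77 proteins), a hypothetical 21-mer with four DNA-binding residues that occurs three times in DBPs and nowhere else would be labeled DNA-specific, whereas a single additional occurrence in an RNA-binding protein would downgrade it to DNA-binding.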
To illustrate our motif discovery strategy, each DNA-bound protein structure was encoded into its 1D structural sequence and scanned using an l-letter sliding window (l = 6, 7, …, 27). The l-mer structural patterns found in ≥3 non-redundant DBPs with conserved backbone segments (Figures 1 and 2), which in turn dictate the rotameric state of the respective side chains, are regarded as 3D motifs. Since each letter represents a 5-residue segment, an l-mer motif comprises l + 4 residues: the central residues denoting each letter are labeled P1, P2, …, Pl, whereas the two residues N-terminal to P1 are labeled P−2 and P−1, and the two residues C-terminal to Pl are labeled P+1 and P+2. To discover 3D motifs that may be biologically important but may not be specific to DBPs, we analyzed the ‘most common’ 6-mer motif that persists with increasing motif size. This yielded a novel minimalist functional scaffold, as described below.
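The position labels map onto residue numbers in a simple way, sketched below. The function name is hypothetical, and the sketch assumes contiguous residue numbering along the motif (no insertion codes or chain breaks).

```python
def label_positions(p1_residue, l):
    """Map the labels P-2 ... P+2 of an l-mer motif to residue numbers.

    p1_residue: residue number of P1, the center residue of the first letter.
    l: motif length in letters; the motif spans l + 4 residues.
    """
    labels = {"P-2": p1_residue - 2, "P-1": p1_residue - 1}
    for k in range(1, l + 1):
        labels[f"P{k}"] = p1_residue + k - 1
    labels["P+1"] = p1_residue + l
    labels["P+2"] = p1_residue + l + 1
    return labels
```

For a 6-mer motif with P1 at residue 34, the labels run from P−2 = 32 to P+2 = 41, that is, 10 residues in total.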
The most common 6-mer 3D motif is afklmm, which is found 75 times in 40 of the 76 non-redundant DBPs (Table 1). It remains conserved as the motif size increases from 6 to 26; Supplementary Table S2 lists all l-mer motifs (l = 6, 7, …, 26) containing the afklmm segment. This afklmm motif comprises part of a turn connected to a helix (Figure 3A, left panel) and appears at a corner, hence we will refer to it as the ‘corner’ motif. It can be characterized by the following three features (Figure 3B): (i) a helix starting at P4 characterized by two conserved H bonds between the P3 and P4 amide N atoms and the P+1 and P+2 carbonyl O atoms, respectively; (ii) a solvent exposed surface formed by the P2, P3 and P4 side chains; and (iii) H bonding or vdW interactions between the P1 side chain, which points toward the protein interior, and the P5 and/or P6 side chains, resulting in a mean P1 + P5 + P6 SASA (12 Å²) that is generally smaller than the mean P2 + P3 + P4 SASA (71 Å²). In cases where P1 lacks a side chain to interact with P5 or P6 (e.g. P1 is Gly in the Flp recombinase structure, 1flo-A), its role is played by its neighbor P2 whose side chain interacts with P5. In many cases, a third conserved H bond is formed between the P2 amide N and the P6 carbonyl O, in addition to the P+1→P3 and P+2→P4 backbone−backbone H bonds. Despite these conserved structural features, the afklmm motif exhibits little sequence conservation except that P−1 is often Gly, whereas P6 is generally an aliphatic residue, Val, Ile, Ala or Leu (Figure 3C).
To determine if the ‘corner’ motif is part of the HTH motif found in several DBPs, our motif discovery strategy was applied to the HTH group of 16 DBP families reported in previous work (23). These proteins were grouped according to their CATH codes, and the best resolution complex structure in each group was chosen, analogous to our dataset construction. The 11 non-redundant HTH DBPs are listed in Table 2 along with the HTH motif and the corresponding structural letter sequence for each protein. The afklmm motif is found to be part of the HTH motif identified in these proteins, except the heat shock transcription factor, 3hts. The largest common structure that is shared by 9 of the 11 HTH DBPs is mmmmnopafklmmmm or m(4)nopafklm(4) composed of 19 amino acid residues. It is also found in other regions of four proteins (1d3u-B:1116−1134, 1d3u-B:1212−1231, 1ddn-A:83−101, 1gdt-A:150−168, 1lmb-3:22−40 and 1lmb-3:62−80).
The α−α corner is structurally defined by two consecutive, crosswise-packed helices connected by ≥2 residues (31). To determine its relationship with the HTH or ‘corner’ motif, our motif discovery strategy was applied to the proteins reported to contain α–α corners by Efimov (31). Since the latter listed only the protein names and the α−α corner amino acid sequences, the PDB was searched for those proteins containing the reported α−α corner amino acid sequences and structure. This resulted in nine proteins (Supplementary Table S3). In two proteins (1run-A, 1lmb-3), the 4 α−α corner amino acid sequences overlap with the respective HTH amino acid sequences in Table 2. For the other proteins (1d1l-A, 1crn-A, 1rqu-A, 1grt-A, 2mb5-A, 1eca-A, 1ibe-A/B), the α−α corner structural sequences all encompass m(4)nopafklm(3), which is characteristic of the HTH motif.
To determine if the ‘corner’ motif plays a functional role, afklmm segments that contain ≥1 DNA-binding residues were deemed to be functional. The results indicate that the ‘corner’ motif is generally involved in binding dsDNA: 49 of the 75 afklmm segments found in 32 out of 40 non-redundant DBPs possess ≥1 DNA-binding residues. Another nine afklmm segments (highlighted in italics in Table 1) in nine proteins (1ais-B, 1d02-A, 1diz-A, 1dnk-A, 1e3o-C, 1ig9-A, 1mus-A, 1tau-A and 1x9n-A) might have ≥1 atoms close to a DNA atom if the protein had been complexed with a longer DNA, as illustrated in Figure 4A. Six of these nine proteins (except 1d02-A, 1mus-A and 1tau-A) already possess afklmm segments with ≥1 DNA-binding residues. The ‘corner’ motif does not appear to recognize specific DNA sequences, and binds in both the DNA major and minor grooves.
However, 17 afklmm segments appear to be located far from the DNA−protein interface even though the P2, P3 and P4 side chains are solvent exposed. One plausible reason is that the residues in the afklmm segment may be involved in binding protein or RNA rather than DNA. Indeed, 4 of the 17 afklmm segments in oligomeric DBPs (1d02-A, 1p71-A, 1u8r-A, 1xo0-A) contain ≥1 atoms within 5 Å of an atom in another protein chain (Table 1 and Figure 4B). This indicates that the ‘corner’ motif may play a role in protein dimerization or oligomerization. The remaining 13 afklmm segments not involved in binding dsDNA or oligomerization await more new structures to verify if they participate in binding.
In the afklmm motif, the side chain of the center residue of the letter a in afklmm (P1) always points toward the protein interior in order to interact with the P5 and P6 side chains, which are part of a helix. Since alternative interactions could stabilize this ‘corner’ architecture, those proteins lacking the afklmm motif may still employ similar ‘corner’ architecture to recognize DNA. For example, the dfklmm segment of POU domain class 2 transcription factor 1 (1e3o-C), consisting of amino acid 40−49, has a ‘corner’ architecture like the afklmm motif (see Figure 3A, right panel). Both P1 (Phe) and P2 (Ser) interact with P5 (Thr) through vdW contacts and H bonds, respectively. This alternate ‘corner’ architecture also binds DNA via the P1−P5 residues.
To discover novel motifs characteristic of DBPs, DNA-binding motifs, as defined in the ‘Materials and Methods’ section, were identified and listed in Supplementary Table S4. A total of 76 DNA-binding motifs were found ‘only’ in DBPs. These fall into two groups: 70 DNA-specific motifs with l = 10−25 were present in ‘non-HTH’ DBPs (1orn-A, 1rrq-A, 2bcq-A), while six with l = 16−21 were found in HTH DBPs (1jt0-A, 1r8d-A and 1tro-A). The longest motif in each of these two groups of DBPs was chosen as the representative DNA-specific motif. The 29-residue cfbfklm(4)ghiafklm(8) motif found in the non-HTH DBPs is structurally defined by (i) a 4−5-residue helix from P5 and a second helix from P16 containing ≥11 residues; (ii) conserved P5↔P23 and P9↔P19 vdW contacts between the two helices; and (iii) conserved backbone−backbone P8→P11 and P11→P14 H bonds in the region connecting the two helices (Figure 5A). On the other hand, the 25-residue cfklm(4)nopafklm(6) motif found in the HTH DBPs is structurally defined by (i) an 8-residue helix from P3 and a second helix from P14 containing ≥7 residues, and (ii) conserved P3↔P17 and P7↔P14 vdW contacts between the two helices (Figure 5B). Notably, this motif encompasses the m(4)nopafklm(4) motif found in nine HTH protein families, validating it as a DNA-specific motif. Interestingly, although the afklmmm motif is not DNA-specific, it is common to both cfbfklm(4)ghiafklm(8) and cfklm(4)nopafklm(6) motifs.
Thirty-six non-redundant DBPs have 3D structures solved with and without DNA and the Cα RMSDs of the DNA-bound protein structures from the respective free structures range from 0.4 to 5.9 Å. To assess how protein flexibility and conformational changes upon DNA binding affect the non-specific ‘corner’ motif and the two representative DNA-specific motifs found in the DNA-bound protein structures, the respective free structures were encoded into 1D structural sequences. The ‘corner’ motif or its alternate form was found in both the DNA-bound and free protein structures (Supplementary Table S5). The DNA-specific cfklm(4)nopafklm(6) motif found in the DNA-bound protein structures (1jt0-A, 1r8d-A) is nearly conserved in the respective free HTH proteins: the corresponding structural sequence is cfklm(4)nopafklm(4)no in 1rkw-A, and Zfklm(4)nopafklm(6) in 1jbg-A. The DNA-specific cfbfklm(4)ghiafklm(8) motif was ‘not’ found in any of the DBPs with both bound and free structures. Thus, conformational changes upon binding DNA, unless huge, do not seem to affect the ‘corner’ or representative DNA-specific motifs.
We have presented a general and efficient strategy for discovering 3D motifs across protein families systematically on a large scale. The method requires as input the protein 3D structure, which is converted to a 1D structural letter sequence using a 16-letter structural alphabet. It yields as output a set of 3D motifs of various sizes shared by proteins with insignificant sequence or overall structural similarities. A non-specific motif (the ‘corner’ motif) was discovered in 40 non-redundant DBPs by analyzing the most common 6-mer motif that is conserved with increasing motif size. Furthermore, two representative DNA-specific motifs were found by choosing the largest DNA-binding motifs present ‘only’ in DBPs. One of these two DNA-specific motifs contains the HTH motif as a substructure, validating our strategy for discovering DNA-specific 3D motifs.
The new method of discovering 3D motifs across protein families with different folds complements previous motif discovery methods. A key advantage of our motif search strategy is that it does not require a query structure of a known motif for comparison against the PDB structures, or homologous sequences to identify conserved residues, or experimentally known functionally/structurally crucial residues. However, it is limited to detecting motifs composed of successive residues along the primary sequence. Thus, it complements previous methods (see ‘Introduction’ section), which require a known motif template or essential residues to create 3D templates, but yield 3D motifs composed of spatially interacting residues. A second advantage of our motif search strategy is that it can identify 3D motifs that are smaller than those defined by previous methods such as PROMOTIF (32), which detects motifs comprising ~20−200 residues. For example, the HTH motif consists of ~20−30 residues and is found in 11 protein families (Table 2), whereas the ‘corner’ motif consists of only 10 residues and is found in 40 protein families (Table 1). A third advantage of our motif search strategy is that it provides a less ambiguous structural definition for 3D motifs by using two similarity measures, RMSDa [Equation (1)] and Cα RMSD (Figures 2, 3 and 5). For example, it provided a common structural m(4)nopafklm(4) sequence for the α–α corner connected by two residues and the HTH motif, except for the HTH motifs in type-2 restriction enzyme Fok I and the heat shock factor protein (Table 2).
The afklmm motif discovered herein (Figure 3A, left panel) has an architecture that confers a stable scaffold and enables diverse interactions, making it suitable for binding. Its ‘corner’ architecture enables the P1, P5 and P6 side chains to interact, in addition to the P+1→P3 and P+2→P4 backbone−backbone H bonds of the helix, thus stabilizing the scaffold. The ‘corner’ architecture also exposes the P2, P3 and P4 side chains, allowing for a wider variety of spatial arrangements than an architecture encompassing these side chains in a cavity or flat surface. This feature could help proteins employ the same architecture using different side chains to bind to different DNA targets; e.g. the Staphylococcus aureus multidrug-binding protein QacR (1jt0-A) employs 34SSKGN38 in the afklmm motif (amino acid 32−41) to bind 14Ade-Cyt-Cyt-Gua17 in the 1jt0-E DNA chain as well as 21Gua, 22Ade and 24Cyt in the 1jt0-F DNA chain, but the Catabolite gene activator (2cgp-A) employs a different set of residues 178CSRET182 with the same conserved backbone (amino acid 176−185) to recognize two different DNA triplets, 503Gua-Thy-Cyt505 and 539Thy-Gua-Thy541.
The ‘corner’ architecture provides the following potential applications. It can provide a useful scaffold for computational redesign of DBPs for improved DNA-binding affinity and altered binding specificity: residues at P2, P3 and P4 could be mutated without perturbing the scaffold (Figure 3C). Subsequently, the designed mutants can be computationally screened using free energy calculations (33,34) to predict if they exhibit enhanced DNA-binding affinity and altered binding specificity.
It can also be used in conjunction with DNA-binding residue prediction methods to suggest DNA-binding sites in proteins; e.g. the N-terminal fragment of topoisomerase I (1mw8-X) contains two afklmm segments, shown in magenta in Figure 6A, comprising residues 297−306 and 381−390, neither of which contacts the single-stranded DNA in the X-ray structure (35). To evaluate if these two ‘corner’ motifs can nevertheless bind DNA, DNA-binding residues were predicted using a method based on detecting a cluster of evolutionarily conserved surface residues that are electrostatically stabilized upon mutation to negatively charged Asp/Glu, as described in previous work (36). This yielded two distinct DNA-binding sites (labeled S1 and S2) for topoisomerase I (Figure 6B): the DNA-binding residues in the S1 site (in red) are A282, I285, T288, L289, Q291, S294, T295, M305, D323, L393, Q397, A480, K484 and E487, while those in the S2 site (in yellow) are K302, M305, M306, R321, G492, R493, P494, S495, T496, A498, S499, I500, I501 and S502 (residues in bold comprise the afklmm segment). Notably, the S2 site contains R321, S495, T496 and S499, which are within H-bonding/vdW contact of the single-stranded DNA, and the R321 backbone N is only 3.9 Å away from the catalytic tyrosine Y319 Cδ2. Thus, the 297−306 ‘corner’ motif in conjunction with the predicted DNA-binding residues suggests that the S2 site is likely to be the DNA-binding site in topoisomerase I.
The two DNA-specific motifs in Figure 5 may help to annotate proteins with known structures but unknown function. There are 2146 proteins in the Structural Genomics database with ‘unknown function’ in the title. For each of these proteins, the chain A structure was encoded into its 1D structural sequence and scanned using a 21- or 25-letter sliding window. None of the proteins with unknown function contain the 25-letter cfbfklm(4)ghiafklm(8) motif, but six contain the 21-letter cfklm(4)nopafklm(6) motif, of which only three possess the characteristic structural features depicted in Figure 5B (Table 3); viz., 2nx4 (amino acid 28−52), 2ia0 (amino acid 21−45) and 2g7u (amino acid 27−51). These three proteins are also predicted to be DBPs with HTH motifs according to HTHquery (10), suggesting that they are likely to bind DNA.
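Scanning a structural sequence for a motif written in the compact notation used here, where m(4) stands for mmmm, can be sketched as follows. The function names are hypothetical, and the sketch looks for exact string matches only; the structural features in Figure 5 still have to be checked separately for each hit.

```python
import re

def expand(notation):
    """Expand run-length motif notation, e.g. 'm(4)' -> 'mmmm'."""
    return re.sub(
        r"([a-pZ])\((\d+)\)",
        lambda match: match.group(1) * int(match.group(2)),
        notation,
    )

def find_motif(structural_sequence, notation):
    """Return the 0-based start index of every exact occurrence of the motif.

    Overlapping occurrences are reported as well.
    """
    motif = expand(notation)
    hits = []
    start = structural_sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = structural_sequence.find(motif, start + 1)
    return hits
```

Expanding cfklm(4)nopafklm(6) gives 21 letters and cfbfklm(4)ghiafklm(8) gives 25 letters, which correspond to the 25- and 29-residue segments quoted above once the two flanking residues on each side are added.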
Supplementary Data are available at NAR Online.
Funding for open access charge: National Science Council, Taiwan (grant NSC 95-2113-M-001-038-MY5).
Conflict of interest statement. None declared.
We thank Dr Hanna Yuan and Ms Lauren Wang for help in preparing the figures.