We considered three data sets from closely related proteins and three sets from heterogeneous proteins and will discuss results for the two classes in sequence. Among related proteins, we applied the PSIM computation to three different levels of protein structural diversity. The most closely related protein structure set was derived from BindMOAD, containing all protein-ligand complexes with EC number 126.96.36.199 (mitogen activated protein kinases). This comprised 45 structures of three different kinases (p38α, JNK3, and ERK2). We then considered a larger and more diverse set of human protein kinases based on the work of Kinnings et al., composed of 316 structures of 64 different protein kinases. Last, we considered a set of 267 protein structures, all bound to ATP, but spanning highly divergent protein families (e.g. ATP-binding domains of ABC transporters and DNA/RNA polymerases).
The set of 45 MAP kinase structures included 3 gene products: 29 MAPK14 (p38α), 8 MAPK1 (ERK2), and 8 MAPK10 (JNK3). We computed an all-by-all comparison of these structures using PSIM, and shows the resulting tree of similarity. The different kinase families were segregated well, with distinctions also made between different species and mutant proteins. The most similar pair (1WBT and 1WBS) contained wild type human p38α bound to nearly identical ligands, differing only by a carbon/nitrogen swap in a heterocycle.
Following the nodes away from the root of the tree, we continue to see a substructure of the first two ligands bound to 1WBV, with excursions from the 1WBT ligand envelope occurring as we move up the tree. The ligand of 1OVE makes two separate protrusions from the binding envelope of the 1WBT ligand. Pocket conformations of 1WBS, 1WBV, and 1OVE are shown relative to 1WBT at left in . The very different (and rigid) ligand of 1OVE binds to a different DFG configuration of the kinase altogether (the inactive conformational form). However, the structure is still correctly grouped with the p38α exemplars, and it is correctly aligned with respect to the common hinge-binding elements shared by all of the ligands.
To formalize the relationship between the taxonomy represented by the tree in and ligand structures, we computed similarities for different groups of p38 ligands based on the tree structure. The set of protein conformational variants rooted at 1WBT and including those proteins up to depth 4 (7 structures total) defined a group of ligands that were bound to highly similar pockets. The set of p38 variants at depth 10 or greater (12 structures, at the same level as or above 1OVE in ) defined a group of ligands bound to dissimilar pockets from those near 1WBT. Pairwise 3D ligand similarities among the pocket group near 1WBT were higher than those between that group and those distant from 1WBT (ROC area 0.77, p
0.01 by resampling). Considering the ranked list of ligand similarities, pairs at the top of the list were over 20-fold more likely to come from proteins belonging to the group near 1WBT than to the cross-pairs that included a ligand from near the root and one distant from it. This will be further quantified below in comparison to other methods.
The resulting alignments have two desirable properties. First, similarity between protein pocket variants of the same flexible enzyme is related to the similarity of ligands that bind the pocket variants. Second, alignments that optimize local surface similarity preserve the geometry of parts of the protein surface that remain congruent when other parts of the protein binding pockets change significantly. The second property stems from the definition of the similarity metric as one that rewards similarity as opposed to penalizing differences.
Kinase Ligand Binding Profiles: Kinnings Data Set
One of the main goals of the structural analysis of proteins is the ability to differentiate proteins based upon their ligand binding preferences. The 183 structure/26 protein human kinase data set presents a particularly appealing target for differentiating ligand binding because of the difficulty of designing inhibitors with high specificity and the value of designing inhibitors with such specificity for cancer research. The work done in Kinnings et al.10
utilized enrichment testing for judging the success of their similarity metric as a methodology for determining the binding preference of proteins.
Given a query ligand Y, the idea was to identify proteins that would also bind Y by comparing their structures to the structure of a protein bound to Y. Enrichment was calculated based on the number of structures in the top 5 percent of the ranked list whose corresponding protein was inhibited the target ligand with a Ki of less than 10 µM. The enrichment score was defined as ((Ah/Th)/(A/T)): Ah was the number of structures which bind the target ligand in the top 5%; Th, the total number of structures in the top 5%; A, the total number structures which bind the target ligand; and T, the total number of structures. P-values for enrichment scores were computed using a hyper geometric distribution. Computing enrichment in this fashion, PSIM yielded nearly identical results to those of Kinnings ().
Table 1 Comparison of PSIM to results from Kinnings et al. The table shows enrichment for structures whose corresponding enzyme was known to bind the cognate ligand of the query PDB structure. The results were nearly identical (minor differences in the numbers (more ...)
However, it is important to understand that the results for a number of cases are dominated by the presence of multiple alternative protein structures for the cognate protein of the query ligand. The test considered all protein structures as being separate individuals, so even those structures that represented alternate conformations of the same protein were considered as being distinct in the enrichment testing. So, given a structure Z of ligand Y bound to protein X, an alternate conformation of protein X corresponding to structure Z’ would positively influence the enrichment associated with ligand Y. Enrichment scores computed in this fashion can be somewhat misleading. For example the target structure for the ligand SB203580 is 1A9U, which is a crystal structure of p38α bound to SB203580. The top 5% of structures similar to 1A9U are also crystal structures of p38α, entirely dominating the enrichment computation (p38α has 34 representative structures in the data set).
This bias is not apparent in all of the ligands tested. For example, results for Tarceva showed a significant enrichment factor that was obtained by the high ranking of structures from several different proteins. The target structure for the enrichment of Tarceva was 1M17 (EGFR bound to Tarceva). The top 5% most similar structures contained structures of the proteins SRC and LCK. LCK has one of the most similar sequences to EGFR of the proteins in the set, but it was ranked below SRC, which has much less sequence similarity.
We made an alternative analysis that removed the bias of multiple structural exemplars by defining the similarity of two proteins as the maximum similarity obtained using the PSIM method computed over all pairs of structures of the two proteins. We employed Receiver Operator Characteristics (ROC) analysis to determine whether pairs of proteins which both bind the same ligand had significantly higher similarity than protein pairs where one binds a particular ligand and the other does not. Resampling was used to determine the significance of the ROC areas. In 6/10 cases, the ROC area exhibited significant separation (p < .05), showing higher similarity among protein pairs that shared ligands. The ROC plot for Tarceva is shown in (ROC area 0.77, p < .002). The alignment of EGFR with SRC gives a sense of the high local surface similarity that drives the high score (green arrows) along with significant differences that reflect the relatively low sequence similarity of the two proteins (red arrows).
The combination of sequence divergence coupled with surface epitope preservation is a common theme in biology. As sequence divergence increases, such 3D structural motifs are detectable with a similarity metric that focuses on surfaces and ignores sequence information.
Motif Discovery: ATP-Binding Data Set
The detection of motifs in proteins was originally and is still primarily done through the identification of sequence-based motifs or the computation of sequence similarity to known protein domains. These motifs are then used for the annotation and classification of unknown proteins. While these methods are extremely effective for long sequence motifs and proteins that share significant evolutionary history, they have limited ability to detect short discontinuous motifs and protein similarities based on convergent structures rather than amino acid homology from shared ancestry. Utilizing a three-dimensional structure-based method allows for the discovery of motifs that are discontinuous and may reduce false attribution when sequences are similar but a small change has significantly altered the structure.
ATP is an attractive ligand for considering distantly related proteins because of the long history of the molecule with respect to the evolution of life. Protein motifs that bind ATP have been projected to be among the very first to have evolved.13
This timeframe allowed for the development of many parallel ways of utilizing the molecule and also significant time for convergent pathways to produce congruent structures. Such common structural motifs can be discovered using a method analogous to the discovery of protein families with sequence similarity. Instead of grouping proteins together based upon common sequence and extracting a motif, proteins can be clustered based upon common binding site structure to reveal common structural motifs.
The 267 ATP-bound structures from the PDB were a particularly diverse set, including 36 Archaeal, 125 Bacterial, 94 Eukaryotic, 10 Viral, and 2 synthetic proteins, with 95% of the pairwise sequence similarities being less than 20%. Using PSIM, clusters were created from the ATP data set by first forming a fully connected graph with protein structures as nodes and PSIM similarity values as the weights on edges. The edges were then filtered to only retain highly significant (p < 0.001) edges based on a similarity threshold derived from a set of ~100,000 randomly selected unrelated protein pair similarities. The resulting clusters where then annotated based upon known protein functions. depicts the overall cluster and highlights subclusters of particular interest.
One of the prominent clusters contained many of the structures from the work discussed previously. Kinases have been well annotated both for functional sequences and structural motifs. Sequence motifs have been defined for various regions of the binding domain and have previously been used to classify unknown proteins as kinases. Without the aid of this prior information, based purely on local binding site surface similarity, PSIM separated the kinases as a distinct cluster (, lower right). The cluster also included the Pseudo-Kinase STRADα, which contains a known kinase sub-domain but lacks some required catalytic residues, suggesting that the methodology is clustering based on binding modality and not catalytic activity. When the kinases are put into a common alignment induced from their surface similarity, a pronounced pocket is visible () showing the consensus structure used by these kinases to bind ATP. The most conserved portions of this surface are located in the nucleotide-binding region while greater variability exists near the phosphate tail. The similarity in all of the kinase binding sites stems from the common binding mode used by these proteins in their interactions with ATP.
However, different protein families make use of different “tricks” in binding ATP. highlights the binding modes of ATP within three different clusters from the global clustering. For kinases (lower right), disposition of the nucleotide head is largely conserved, mirroring similarities in binding the hinge region of kinases (so-called since it serves as a “hinge” between two protein domains, as seen in ). The phosphate tail exhibits greater variability, apparently reflecting the differing specificity of kinases with respect to transfer of phosphate to substrates. By contrast, the GKST loop proteins are strongly conserved in the phosphate tail region with great variability in the nucleotide portion, and the TRNA Synthetases show a more closely conserved ATP binding mode overall that is different from both. Note that since PSIM rewards surface congruence (as opposed to minimizing overall deviations), those regions of the binding sites that are most similar can be recognized without being disrupted by the more variable regions.
Although many of the known sequence features for kinases are short and discontinuous,14
a proper structural alignment should overlay the common residues making sequence based motif discovery more tractable. Such an alignment also guarantees that the residues not only share similar placement in sequence space but also share a similar relative structural location and enhances the chance that function is also shared. Extraction of sequence-based motifs from sequence alignments derived from the PSIM aligned structures produced many well-known features of kinases. Sequence-based motifs were analyzed using UCSF Chimera15
on the PSIM kinase pocket alignment. Motifs representing the DFG activation loop and the invariant lysine implicated in catalytic rate were found as well as variations on the well-known hinge-binding motifs16
(). These small motifs would be nearly undetectable in a sequence based approach applied to as diverse a group of proteins as our ATP set. Motifs such as GXGXXG would have been particularly difficult due to the lack of specificity but a structure shaped by this motif in these protein contexts provides ample specificity to be discovered with the assistance of a 3D method.
One of the oldest motifs known (both historically and evolutionarily) is the P-loop-containing triphosphate hydrolase fold,13
which has been noted for its conserved GKST motif. The loop is also known as the Walker-A motif. This short and weakly defined sequence motif (GXXXXGK[TS]) would be difficult to extract from a set of proteins as diverse as the ATP set based on sequences alone. We observed, however, that the corresponding structures yield an extremely well defined binding motif for ATP. The largest cluster ( and , green cluster) that PSIM generated corresponded to this motif and exhibited a remarkably closely conserved tail conformation for ATP (). The PSIM clustering correctly grouped proteins with widely varying evolutionary history. shows how the GKST structural loop binds the phosphate tail of ATP in conjunction with a magnesium ion. All of the phosphate tails exhibit a nearly identical orientation with the nucleotide head free to be handled differently depending on the protein involved.
The remarkable structural conservation comes despite very significant sequence variation. shows the difference between the sequence alignment derived from the PSIM structural alignment and that derived by ClustalW17
purely based on sequence. GKST stands out as a significantly conserved motif but we also gain insight into some of the more drastic changes that can occur within this motif. In some cases, ClustalW was able to correctly identify the sequence motif, but in many cases, the motif was too weak. This is most likely due to the shortness of the GKST motif and the presence of many other similar regions in the proteins in question. The example of PDB structure 3FKQ from E. Rectale is particularly striking, since it is the only protein lacking the highly conserved lysine within the entire set. In this case the lysine was replaced with a threonine but ATP still binds, and does so in a manner nearly identical to those proteins bearing the canonical motif. ClustalW yields an unrelated sequence for 3FKQ in the multiple alignment, but the pocket similarity is sufficiently strong (p < 0.001) that PSIM was able to both place the protein in the correct cluster as well as identify the correct alignment correspondence with the other proteins in the family.
Heterogeneous Protein Data Sets
Among heterogeneous proteins, we applied PSIM to three different data sets (see Methods and Data for details), each constructed by different research groups reporting new protein pocket similarity approaches, and with each addressing slightly different considerations. The Kahraman Set considered 100 protein structures, comprising multiple subsets of sequence-dissimilar proteins that each bound the same ligand (e.g. ATP, AMP, PO4, and NAD), with significant diversity present in both ligand size and binding pocket volume. The Hoffman Set was designed specifically to mimic the Kahraman Set, but to limit the effects of diverse ligand sizes and pocket volumes on the binding pocket comparisons. The Yeturu Set included multiple binding sites for each of four ligands, but the alternate binding sites were highly similar, including, for example, symmetry-related sites within a single protein structure (e.g. PDB code 1EGH, with 6 symmetric phosphoglycolic acid binding sites).
Results from application of a spherical harmonic shape method (the “cleft interact” version) due to Kahraman et al.6
and from an atom-cloud comparison method (sup-CK) due to Hoffman et al.7
measured the ability of a method to rank pockets that bound a particular ligand as being similar to a protein pocket that also bound that ligand. Results were reported using standard ROC area analysis, with the average of areas resulting from 100 different rankings (one each for each protein in the data set). The spherical harmonic approach yielded a mean ROC area of 0.77 (no standard deviation was given).6
The sup-CK method,7
using parameters tuned for the Kahraman Set, yielded 0.86 ± 0.14. The PSIM approach yielded 0.79 ± 0.19. None of these methods exhibited differentiable performance from one another, and none performed better than volume alone, which yielded 0.88 ± 0.14.7
For these proteins, average pocket volume was strongly correlated with ligand size, with PO4 having the smallest binding sites (445 ± 118 Å3
) and FAD having the largest (2099 ± 224 Å3
Of the three methods, the PSIM approach was least strongly related to volume, with site comparisons being focused on observations of local chemical surface similarities and the binding pocket scope being limited in this computation to the portion of the binding pockets proximal to their cognate ligands.
Hoffman HD Set
Given the confusing influence of volume on interpretation of results, the sup-CK authors produced a homogeneous data set: ten binding sites for each of ten ligands, but where both the ligand and pocket sizes had far less variance than in the Kahraman Set.7
On this set, volume alone yielded a mean ROC area of 0.65 ± 0.15, sequence similarity 0.58 ± 0.09, and sup-CK and sup-CKL each yielding tuned
performance of 0.71 ± 0.19 and 0.75 ± 0.16, respectively.7
PSIM yielded 0.76 ± 0.15 using precisely the same parameters as used for the Kahraman Set. While the sup-CK method showed a substantial performance decrease when volume was factored out of the comparisons, the PSIM approach yielded nearly identical performance on both sets.
Quite a different question was addressed by one of the data sets in the study by Yeturu et al.8
in which the PocketMatch algorithm was introduced. For this set, four different ligands (citrate, indinavir, phosphoglycolic acid, and methotrexate) in 26 total crystal structures ( from the paper) yielded 51 binding sites, many of which were multiple minor variants within a single crystal structure. Given that the PSIM approach can make fine distinctions among local pocket differences, sensitivity to natural local variation may have posed some difficulty. However, as shown in , PSIM showed perfect segregation of the binding sites based on ligand type (using exactly the parameters as used for the Kahraman and Hoffman Sets).
The methotrexate and indinavir cases generally represented very minor variations on nearly identical binding sites. In those cases, similarity scores were nearly all above 9.0. For phosphoglycolic acid, 6/11 sites were from a symmetric arrangement within a single structure (1EGH) that yielded pairwise similarities over 9.9. The citrate examples fell into three categories: surface-bound pairs of ligands bound to serine proteases (the bulk of cases), a trio of symmetrically bound citrate ligands bound to a macrophage inhibitory factor protein, and a pair bound to unrelated sites of a ribonuclease. Despite the heterogeneity of the last set, all segregated into a single subtree. The clustering tree reported by Yeturu et al.8
showed similar results, but grouped the trio of 1GCZ citrate sites within the phosphoglycolic acid subtree and apart from the remainder of the citrate binding sites.
Relationship Between Protein Pocket and Ligand Similarity
A natural expectation is that there should exist some degree of concordance between the similarity of protein pockets and the similarities exhibited by the ligands that bind them. There are two quite different cases within the data sets examined here. One is the case where many synthetic ligands have been designed to competitively bind a single protein’s active site (e.g. the 29 different ligands of p38α from ). The other is the case where nature has evolved multiple strategies for binding the same naturally occurring cognate ligands, which is the situation in the Kahraman Set. shows both relationships, with ligand similarity computed using the Surflex-Sim approach.2
The plot at left in shows ligand similarity versus protein pocket similarity for all pairs of p38α comparisons. The relationship was statistically significant (p
0.01 by Kendall’s Tau, estimated by permutation analysis), though clearly stronger at the extremes of ligand similarity and dissimilarity. Recalling , pairs of ligands such as those in 1WBT and 1WBS had very high similarity (9.4) with correspondingly high pocket similarity (9.3), whereas 1WBT compared with 1OVE yielded much lower ligand and protein similarity (4.9 and 6.9, respectively).
The plot at right in shows ligand similarity versus protein pocket similarity for the Kahraman data set, where the protein similarity values resulted from average all pairs of comparisons for the cognate proteins of each ligand type and the ligand similarity values were again computed using Surflex-Sim. With the exception of the glucose/phosphate binding site comparison, the relationship between ligand and protein similarity was nearly linear. Overall, the correlation was 0.46 by Kendall’s Tau (p
0.01, by permutation). Further, the relationship between PSIM values and ligand similarity was much stronger than between PSIM values and volume similarities (Kendal’s Tau of 0.30). Even when restricting the set of comparisons to protein pairs where volume differences no longer correlated with protein pocket similarities, the statistically significant relationship between pocket and ligand similarity remained.
The relationship between ligand similarity and pocket configuration within the p38α variants was subtle. To test whether the PSIM approach was unique in its ability to identify this effect, we assessed whether the PocketMatch8
methods yielded correlations with ligand similarity for the p38α subset. shows the plots of ligand and protein similarity for the two methods. Neither algorithm was able to yield correlations between local pocket similarity and the similarities of bound ligands. When restricted to ligand similarities less than 7.0, PSIM yielded a statistically significant correlation (Tau = 0.10, p < 0.01), but both PocketMatch and SOIPPA yielded slightly negative correlations (the former with p < 0.05).
The two cases explored here do not fully examine the issues around the relationship between protein pocket similarity and ligand similarity. The p38α set considered a relatively flexible protein pocket bound to a series of competitive ligands with a range of underlying scaffolding. Clearly, in the case of a very rigid protein, it would be more difficult to discern conformational effects and relate them to ligand structural patterns. There are many examples of proteins that undergo little change on binding structurally divergent ligands. The Kahraman Set was very different in character, with examples of multiple diverse proteins each bound to the same ligand. In this case, the degree of concordance between average pocket similarity and ligand similarity was striking, though the issue of binding site volume differences limits the generality of the observation. It bears mention as well that different approaches to the computation of ligand similarity would also affect the results. Similarity approaches that measure topological structural similarity between ligands, for example, might yield no relationship between pocket and ligand similarity. As with proteins sharing little sequence similarity but exhibiting similar binding pockets, small molecules may share very little topological similarity but exhibit very similar surface shape and polarity.
Relationship to Other Approaches
There are a number of protein structural comparison algorithms, broadly characterized by whether they are global or local, backbone-based or include sidechain information, and the degree to which they make comparisons based purely on shape or also based on polarity. The Surflex-PSIM approach is local, accounts for all protein atoms, and considers the detailed comparison of both shape and protein surface polarity. Other methodologies have approached protein similarity in a variety of ways such as geometric hashing, shape alignment, and fingerprinting. These various methods can be local or global and some produce alignments while others do not.
Direct comparisons were made here with five notable recent examples of pocket comparison algorithms: the SiteBase algorithm of Kinnings et al.,10
the spherical harmonics approach of Kahraman et al.,6
the sup-CK method of Hoffman et al.,7
the PocketMatch method of Yeturu et al.,8
the SOIPPA method of Xie et al.18
Each of these algorithms can compute local similarities, and each are capable of producing alignments between protein binding sites. In each case, comparisons were made either by analyzing the performance of PSIM on sets used by the authors of other methods (the Kinnings, Kahraman, Hoffman, and Yeturu sets), or by analyzing the performance of other methods on a set introduced in this study (PocketMatch and SOIPPA on the p38α set). In the former comparisons, PSIM performed as well as the best reported methods, within the ability of statistics to distinguish among them. In the latter comparison, while limited in scope, PSIM pocket similarities exhibited a unique relationship to bound ligand similarities.
Many more approaches for protein structure comparison have been developed, notably the Dali work of the Sander group19,20
and the combinatorial extension method from the Bourne group21–23
These methods have been developed primarily for the study and characterization of the space of protein structures and the relationship of global structure to function, which is a different focus than the proposed work, which seeks to address local comparisons between protein surfaces. Closer to this concept are methods for 3D protein motif identification (e.g. SeqFEATURE24
These approaches identify local structural features within proteins (e.g. the catalytic triad of serine proteases) that establish a functional motif for proteins with related function. As with the global alignment methods, these have been developed primarily for characterization and annotation of protein function, but not to distinguish, for example, the differences in ligand specificity between two different serine proteases. A very different family of approaches also geared toward protein function annotation describes pockets based on fingerprint vectors, which do not yield alignments. A recent example called FuzCav was reported by Weill and Rognan.26
The approach constructs a cavity descriptor to characterize a protein binding site based upon the counts of 4833 different pharmacophoric features. The approach is very fast, requiring only comparison of pre-computed vectors, and it was effective in distinguishing different classes of proteins. Our approach is geared specifically toward detailed pocket surface comparison based on joint alignment, and we view it as being complementary.
Because of PSIM’s focus on surface shape and charge it can isolate small changes between protein structures that might go unnoticed with other methodologies. It is also the only method that shares its underlying formalism with a mature method for computing small molecule molecular similarity.1,2