In multiple reports, a group of docking methods (Glide, GOLD, and Surflex-Dock) performed close to equivalently with respect to docking accuracy. The absolute performance varied based on the benchmark. The percentage of top-scoring correct poses (≤ 2.0Å rmsd) in the cognate docking problem ranged from 50–60% in a 100-complex benchmark from Vertex (18; 10). The percentage of correct poses within the top 20 returned (but not necessarily top-ranked) ranged from 75–85%. On an independently run benchmark of 100 complexes from Rognan’s group comparing 8 docking methods (19), the comparable numbers for the three methods were about 55% and 75–80%. On a benchmark constructed with very careful attention to quality of crystal structures (resolution, density covering the ligands, etc.) from Hartshorn et al. (25), GOLD performed at 71–87% correct for top-scoring poses, depending upon the precise conditions (binding site definitions, initial ligand geometry, search depth, etc.). In the much more relevant cross-docking situation, performance for all methods is quite a bit lower, but with the same methods performing well. Warren et al. (12) studied eight targets using several docking methods, with additional methods tested subsequent to the original publication (37). Comparing the average rank-order of performance across the eight targets, among Dock4, Dockit, FRED, FlexX, Flo, GOLD, Glide, Ligfit, MOE, and Surflex-Dock, the top three (in order) for top-ranked pose were Surflex-Dock, GOLD, and Glide, and for best pose were GOLD, Surflex-Dock, and Flo (with Glide coming in fourth). However, the absolute performance was significantly worse than in cognate docking.
So, in cases where we can guarantee that a protein structure is near-optimal for the particular ligand being docked (e.g. as in the Hartshorn study), we observe very good performance: nearly 80% correct for top-scoring poses. As the quality of structures becomes more variable, even in the cognate docking case, the performance is reduced to about 55% for multiple methods (e.g. on the Vertex data set). As we move to the operationally important cross-docking case, that of docking a novel ligand into a protein whose structure was determined with a different ligand, we see a further significant reduction in prediction accuracy. illustrates this point on the Astex85 docking set and the CrossDock211 set. In the cognate docking case, without any optimization of docking protocol, Surflex-Dock achieved 76% correct for top scoring poses at the 2.0Å threshold, with over 95% of the dockings having a correct solution within the top 20 poses returned. However, in the cross-docking case, top scoring pose accuracy decreased to 25% and best pose success dropped to 60%.
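The two success measures used throughout this discussion (top-scoring pose and best pose among the top 20 returned, both at the 2.0Å rmsd threshold) can be made concrete with a short sketch. The function name and the example rmsd values below are hypothetical and serve only to illustrate the bookkeeping; they are not data from any benchmark discussed here.

```python
from typing import List, Sequence, Tuple


def success_rates(
    rmsds_by_ligand: Sequence[Sequence[float]],
    threshold: float = 2.0,
    top_n: int = 20,
) -> Tuple[float, float]:
    """Return (top-scoring success, best-of-top-N success) as fractions.

    Each inner sequence holds the rmsd (in Angstroms) of every returned pose
    for one ligand, ordered from best score to worst score.
    A ligand with no returned poses counts as a failure for both measures.
    """
    top_hits = 0
    best_hits = 0
    for rmsds in rmsds_by_ligand:
        if not rmsds:
            continue  # no poses returned: failure on both measures
        if rmsds[0] <= threshold:
            top_hits += 1
        if min(rmsds[:top_n]) <= threshold:
            best_hits += 1
    n = len(rmsds_by_ligand)
    return top_hits / n, best_hits / n


# Hypothetical rmsd lists for four ligands (illustrative values only).
example: List[List[float]] = [
    [1.2, 3.0],  # top pose correct
    [4.5, 1.8],  # correct pose present but not top-ranked
    [5.0, 6.0],  # no correct pose returned
    [0.9],       # top pose correct
]
top_rate, best_rate = success_rates(example)
```

By construction the best-pose rate can never be lower than the top-scoring rate, which is why the gap between the two numbers is a measure of scoring failure rather than search failure.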
Note, however, that the comparison of cross-docking on the CrossDock211 set to cognate docking on the Astex85 set represents a hardest-case to easiest-case comparison, since the Astex85 set was cleanly curated to include particularly high-quality structures by multiple criteria, apart from being a cognate docking test. Performance of Surflex-Dock was evaluated on the cognate protein/ligand structures from the CrossDock211 set as well. Success for top-scoring pose was 65% and for best pose of top 20 was 90%, which was lower than that observed with the Astex85 cognate docking (76% and 95%, respectively), but not statistically significantly so (by Fisher’s exact test at p = 0.05). In a similar comparison, Verdonk et al. (38) compared cognate docking on their Astex85 set with cross-docking of novel ligands into 65 of the 85 protein structures. They observed cognate docking performance (top-scoring poses ≤ 2.0Å rmsd) of 80% for the 65 cognate cases, with a reduction to 61% for the cross-docking performance. This reduction in performance, while significant, was much less than observed here for Surflex-Dock on the CrossDock211 set. Apart from their set containing different targets and different ligands, they also included only those structures that contained the same set of binding site atoms present in the cognate structures and where the novel ligands were bound to protein forms that closely matched the reference structures in terms of protonation states, tautomers, and side-chain flips. Sutherland et al. (39) published cross-docking results for CDocker and Fred on the set used here, with success rates for top-scoring pose prediction ranging from 16–26%, paralleling what was observed here for single-structure cross-docking. Both groups considered the improvements possible by making use of multiple structures, as will be done here in what follows.
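The kind of significance check mentioned above (Fisher’s exact test on two success proportions) can be sketched as follows. This is a minimal illustration and not the analysis actually performed: the function name is hypothetical, and the counts in the usage line (65 of 85 and 137 of 211) are assumptions obtained by applying the reported percentages to the nominal set sizes, which may not match the number of cognate cases actually evaluated.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two benchmarks; columns are (success, failure).
    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of x successes in row 1 given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# ASSUMED counts: 76% of 85 (about 65) versus 65% of 211 (about 137).
p_assumed = fisher_exact_two_sided(65, 85 - 65, 137, 211 - 137)
```

Because the counts are reconstructed from rounded percentages, the resulting p-value should be read only as an indication of how such a comparison is set up.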
As discussed above, there are marked differences in docking accuracy as we vary the degree to which we can expect the protein conformation to be “correct” for the purpose of accurately identifying the binding mode of a ligand. Proteins vary in their degrees of binding pocket flexibility, and some protein conformations can provide an inhospitable geometry for docking particular ligands. In the operational application of docking, we are never docking a ligand into the structure of a protein whose geometry is known, a priori, to be optimally complementary for the bound ligand. shows the degree of conformational variation for PDE4b and CDK2 among five different experimentally determined complex structures. PDE4b is comparatively rigid, but as we saw in , even small motions can influence docking preferences. CDK2 is clearly much more flexible, creating a more significant challenge in the cross docking scenario. Estrogen receptor (not shown) forms a middle ground, with relatively little variation among agonist-bound forms or antagonist-bound forms, but the differences between the agonist and antagonist forms are large.
Effects of Multiple Structures and Fragment Knowledge
shows the effect of moving from a single protein structure to five per target and of making use of placed fragments from the cognate ligands of the five protein structures to help guide docking. There is a statistically significant improvement through the use of multiple protein structures under the same docking protocol as used for single structures (green curves versus red curves in the plots). Success rates at the 2Å rmsd threshold improved to 45% from 27% for top scoring pose and to 82% from 60% for best pose by rmsd of the top twenty returned. Note, however, that median docking times increased by five-fold, since the procedure itself is essentially a sequential docking to each structure, with minimal additional efficiencies.
By making use of sub-fragments whose bound geometry is known, the process of docking is faster, and the space of solutions that share common features with known ligands is searched at a greater relative depth. Using this approach, top-scoring pose success (at 2.0Å rmsd) increased to 50% from 45%. This is not significant in a statistical sense at a single threshold (e.g. by using a test of difference of proportions), but the overall shift in the distribution of rmsd values is marginally significant. Interestingly, the distributions of rmsd values for best poses were essentially unchanged under the two conditions. The primary effect of the use of fragment knowledge was deeper search within the space of a priori favorable poses, which resulted in the slight improvement in top-scoring pose identification.
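A test of difference of proportions of the kind referred to above can be sketched with a pooled two-proportion z-test. The function name is hypothetical, and the success counts are assumptions reconstructed by applying the reported percentages (27%, 45%, and 50%) to the 211 test ligands. The sketch also treats the two conditions as independent samples, whereas the same ligands were docked under each protocol; a paired test would require per-ligand outcomes that are not given here.

```python
from math import erfc, sqrt
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    # Two-sided tail probability of a standard normal variate.
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value


N = 211  # test ligands in the CrossDock211 set
# ASSUMED counts reconstructed from rounded percentages.
single_structure = 57   # about 27% of 211
multi_structure = 95    # about 45% of 211
multi_plus_frags = 106  # about 50% of 211

z_multi, p_multi = two_proportion_z_test(single_structure, N, multi_structure, N)
z_frag, p_frag = two_proportion_z_test(multi_structure, N, multi_plus_frags, N)
```

Under these assumed counts, the move from one to five protein structures falls well below the 0.05 level, whereas the further gain from fragment hints does not, which agrees qualitatively with the statements in the text.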
Docking using the constraint of multiple fragments is relatively fast, and it eliminates the need for docking from multiple initial ligand conformations (which is done in the standard docking protocol for geometric accuracy). In the multi-structure protocol yielding the best performance in (the blue curves), the overall docking speed was just 1.7-fold longer than the single-protein method. The median time to dock each ligand was just four minutes, with ligand flexibility ranging widely, but with typical ligands having roughly seven rotatable bonds.
The performance levels shown here represent a lower-bound in the sense that, while the docking protocol was designed to mimic that of an actual modeler making use of knowledge of multiple structures and well-understood interactions, the choice of protein structures was arbitrary, and the choice of fragment hints was made using no deep knowledge of the systems. For example, in the case of thrombin, a reasonable modeler would ensure that all common P1 binding elements would be represented among the fragment hints to be used by a docking system. Here, however, neither the very common amidine nor the more recent non-basic chlorophenyl P1 pocket binding elements were among the fragments used in docking, whereas they were very common in the test ligands (e.g. the ligands of 1KTS and 1WAY). In addition, while a number of ligands of thrombin were present that require chelation of a metal ion such as Zn2+ (e.g. the test ligand from 1C1W), none of the five example protein structures contained the required chelation moiety. In a real-world modeling exercise, when designing around a common binding element such as the P1 element in thrombin or any known metal chelation moiety, one would include preferred binding modes for those ligand components.
Verdonk et al. (38) showed a modest increase in top-scoring pose prediction (to 67% from 61%) in a multiple-structure approach on their highly curated cross-docking data set. Sutherland et al. (39) showed a more substantial improvement on the data set used here, from 16–26% success for single-structure cross-docking to 36–46% success for multiple structures, depending on the method of arbitration used to choose among the multiple dockings. Results for Surflex-Dock were of similar magnitude in terms of relative improvement (from 27% to 50%). Note that the protocol used here with Surflex-Dock made use of just five alternate protein conformations per target, chosen a priori, whereas that used by Sutherland et al. used an all-by-all cross-docking.
Effects of Protein Pocket Adaptation and Pose Families
The results thus far have deviated from most widely used docking methods and protocols by using multiple protein structures, but these structures have been treated as completely fixed. Further, top scoring poses have been treated as the singular solution to the docking computation. It is well understood that protein/ligand complexes are not accurately portrayed as the singular snapshot one often sees in a high-resolution crystal structure. Even in the case where a single joint configuration dominates others by having substantially lower free energy than significantly different configurations, the complex exists as an ensemble of configurations over short time scales where the coordinate changes may be small but are nonetheless real.
shows the docking of the CDK2 ligand from 1HO8 into the five protein conformations. At left is the ligand and protomol for 1OIU (one of the five structures used), with a particular subfragment of the cognate ligand shown in thicker sticks. That particular fragment was responsible for helping to guide the docking of the test ligand, which is shown in the middle panel in two poses (atom color), along with the fragment (blue), and the two alternative bound poses of the ligand from the experimentally determined structure (green). In this depiction, the effects of pocket adaptation are shown (red protein structure at right) along with the effects of identifying pose families. For the results of docking without pocket adaptation, the top-scoring pose was 2.0Å rmsd distant from the farther of the two experimentally determined ligand poses. Pose families derived from the initial docking failed to group the two alternate solutions together. The closest poses to the correct pair of experimental ones were too far apart given the original protein coordinates. Rescoring the final pose set with full atomic adaptation within the binding pocket yielded significant movement, especially in the position of a key carboxylate. Generation of pose families from the rescored pockets identified a single pose family as being highly probable, with contributions from modifications of three parent protein structures. This top pose family contained conformations less than 1.0Å rmsd from each of the experimentally determined alternatives.
shows all of the poses from the top scoring pose family resulting from pocket adaptation. They exhibit reasonable movement in light of the known variation in the “tail” of the ligand in question. Note, however, that the protocol using full atomic movement identified the correct family, but the protocol that was restricted to protons only identified the incorrect family as most probable (the second most probable contained one of the two correct alternatives). The plot at right in shows the improvement in docking accuracy obtained by making use of top-scoring pose families instead of single top-scoring poses. Baseline performance (no pose families, just the top scoring pose) is shown in red, with some improvement seen in computing pose families without any pocket adaptation (purple line) that is due mostly to the difference in reporting method. For pose families, the minimum rmsd to experimental is computed, so there is a bias toward nominally better results, especially at the lower end of the curve.
All three methods of pose family generation (no rescoring, rescoring with proton movement in the protein pocket, and rescoring with all atom movement) yielded very similar performance at the 2.0Å threshold: approximately 55%. This level of performance approaches that seen in cognate docking on “hard” cognate docking benchmarks (see earlier discussion), and the characterization of results resembles a sensible physical interpretation of protein/ligand binding.
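The reporting difference noted above (a pose family is scored by the minimum rmsd over its members, whereas a single pose is scored by its own rmsd) can be illustrated with a small sketch. The function names and rmsd values are hypothetical; each list is assumed to hold the members of a ligand's top-scoring pose family with the single top-scoring pose first.

```python
from typing import List, Sequence


def top_pose_success(families: Sequence[Sequence[float]], threshold: float = 2.0) -> float:
    """Fraction of ligands whose single top-scoring pose is within threshold."""
    return sum(fam[0] <= threshold for fam in families) / len(families)


def top_family_success(families: Sequence[Sequence[float]], threshold: float = 2.0) -> float:
    """Fraction of ligands whose top pose family has ANY member within threshold."""
    return sum(min(fam) <= threshold for fam in families) / len(families)


# Hypothetical rmsd values (Angstroms) for the top pose family of four ligands.
# The first entry of each list is the single top-scoring pose.
top_families: List[List[float]] = [
    [1.1, 1.4, 0.9],  # correct by either measure
    [2.4, 1.7, 3.1],  # family counts as correct, top pose does not
    [3.8, 4.2],       # incorrect by either measure
    [1.9, 2.6],       # correct by either measure
]
single_rate = top_pose_success(top_families)
family_rate = top_family_success(top_families)
```

Since the minimum over a family can only be less than or equal to the rmsd of its first member, the family-based rate is never lower than the single-pose rate, which is the bias toward nominally better results described in the text.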
Pose Family Agreement
As illustrated by the example from and , the different scoring methods can yield different results, but their overall performance is close to equivalent. Since the scoring methods are computing only partially related terms, orthogonal agreement might suggest higher confidence. shows the relationship between top scoring pose family agreement among the three methods and prediction accuracy. Pose family agreement was calculated between each of the pocket adaptation top families and the top family from the original docking, with the mean deviation characterizing overall agreement. There is a striking relationship between nominal agreement and the accuracy of the top scoring baseline pose family. In over half of the 211 test cases, the three methods had highly similar top scoring pose families, and within that subset, the proportion of correct predictions was 80%. In an operational sense, this is a helpful feature, since it allows for confidence to be based upon the stability of the original top scoring pose family to protein pocket adaptation. This level of success is comparable to that seen with carefully selected and curated cognate docking sets (e.g. as in , with the Astex85 set).
In the remaining minority set of cases, the top scoring baseline pose family was correct just 25% of the time. However, the correct choice could be found 50% of the time by looking at all three of the top scoring families. Success rates of 50% approach those observed with cognate docking on difficult benchmarks, but the comparable rates for those studies come from consideration of a single top scoring pose instead of the poses from a trio of families.
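The relationship between the agreement-based subsets and overall accuracy is a simple weighted average, sketched below. The function names are hypothetical. The 80% and 25% figures are those quoted above; the overall figure of roughly 55% for top pose families is taken from the earlier discussion, and combining the three to back out the agreeing fraction is an illustrative consistency check rather than a reported result.

```python
def overall_accuracy(agree_fraction: float, acc_agree: float, acc_disagree: float) -> float:
    """Overall accuracy as a mixture of the agreeing and disagreeing subsets."""
    return agree_fraction * acc_agree + (1.0 - agree_fraction) * acc_disagree


def implied_agree_fraction(overall: float, acc_agree: float, acc_disagree: float) -> float:
    """Solve the mixture equation for the fraction of cases in agreement."""
    return (overall - acc_disagree) / (acc_agree - acc_disagree)


# Accuracy of the baseline top pose family within each subset (from the text).
ACC_AGREE = 0.80
ACC_DISAGREE = 0.25
# Approximate overall top-pose-family accuracy at 2.0 Angstroms (from the text).
OVERALL = 0.55

fraction = implied_agree_fraction(OVERALL, ACC_AGREE, ACC_DISAGREE)
check = overall_accuracy(fraction, ACC_AGREE, ACC_DISAGREE)
```

The implied agreeing fraction comes out slightly above one half, in line with the statement that the three methods agreed in over half of the 211 test cases.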
Inter-Target Variation
The tremendous variation in docking system performance with target choice has been well documented (e.g. 22; 12). Table 1 shows the performance of the multi-structure pocket-adaptation protocol for Surflex-Dock over the set of eight targets. Average success rates for multi-structure docking with no rescoring or pose family computation ranged from 40% for thrombin to 79% for estrogen receptor, with the mean being 61%. Weaker performance for thrombin was primarily due to extreme ligand flexibility in many cases, along with the previously mentioned issues of P1 pocket element variability and the presence of ligands that require metal chelation not present in the protein structures used for docking. CDK2 represents a genuinely difficult case, since the protein motions captured with the five protein conformational snapshots clearly do not encompass finer motions that are important (see ).
| Table 1. Performance of Surflex-Dock on a target-specific basis using either the single top pose returned from a multi-protein-conformation docking (including use of fragment-based hints), using the top pose family under different rescoring protocols, or using ... |
Rescoring with protein pocket adaptation had large effects on individual target performance, but due to small numbers of ligands per target, these were not statistically significant. Interestingly, the largest differences in performance between the two rescoring approaches were between the proteins representing the two poles of relative flexibility, with full pocket adaptation performing better on CDK2 and proton-only adaptation performing better on PDE4b/5a. The aggregate mean performance differences were not significant. However, consideration of two pose families (either the original top family and the top family from full pocket adaptation or the former plus that from proton-only pocket adaptation) yielded highly significant performance improvements over performance without any pocket adaptation. In terms of the practical impact on modeling, a requirement to employ judgment given two or three solution sets (and only in the cases where they disagree) does not seem overly burdensome. Note that consideration of the two most probable pose families from the baseline docking (without pocket adaptation) yielded performance levels that were not statistically significantly different from those shown in Table 1 for two pose families obtained using pocket adaptation (Orig+Heavy and Orig+Protons). However, pocket adaptation allows the computation of pose family agreement (discussed above) since the protocols employ scoring variations. Also, pocket adaptation can yield significant changes to protein-ligand interactions and pose family composition (see and ).
shows an example from PDE4b, where the top-scoring pose family from proton-only adaptation was correct. In this case, the uncertainty in the placement of the chlorophenyl seems warranted in light of the partial density from the crystal structure in that part of the ligand. Fairly subtle proton movements (highlighted in the figure) were important in proper recognition of the correct pose. In this case, the ligand to be docked shares some commonality with the known cognate ligand structures (see ), but a number of reasonable “flips” are easily confusable, since the core heterocycle is functionalized differently, both in position and content, compared with the nearest known analog. For PDE4b, the proton-only approach appears more reliable, probably due to the a priori fact of relative protein rigidity. The combined force field within Surflex-Dock in the pocket adaptation protocol is not reliable in this case when moving heavy atoms, adding more noise than signal to the scores.
In the case of estrogen receptor, all three methods worked quite well, with a high level of agreement and with a combined performance of 95% correct pose prediction. shows a typical example for this target, where an antagonist (from 1UOM, shown also in ) was the subject of docking. This ligand represents the type of synthetic variation one would encounter in lead optimization exercises, where the antagonist “arm” is among the structures known, but the core structure that binds in the agonist pocket is quite different from the known ligand structures. The all-atom pocket adaptation approach is robust enough to “rescue” the correct pose of the antagonist when bound to an agonist-form of the receptor. However, as can be seen in (right panel), the pocket adaptation, while making room for the ligand, does not even come close to adapting the pocket to the form seen when binding antagonists. The approach taken here will be most successful in cases where the large protein motions are well-represented among a small set of experimentally determined structures. The only ligand that represented a failure was that from 1ZKY, which binds the agonist binding site but has a complex bicyclic structure. In that case, the top pose family from the protons-only rescoring was still within 3.0Å rmsd, which was the closest solution among all of the dockings returned.
shows the docking of the ligand from 1FPC into the five alternate thrombin structures (see for 2D structures). The original docking contained the correct solution, but it was ranked a full 2.0 units of pKd lower than the incorrect solution shown in the left panel. Rescoring using either pocket adaptation method yielded the correct family as top-ranking (middle panel). While the movement of TRP-86 is helpful to accommodate the larger substituent (compared with the cognate ligands), it is likely that inclusion of non-covalent ligand self-interaction, which is part of the pocket-adaptation rescoring procedure, is beneficial for the entire class of thrombin inhibitors with this typical three-part construction.
Relationship to Other Approaches
The work reported here represents a contribution to real-world docking primarily in four ways. First, the approach is computationally tractable, with typical per-ligand computation times of about 30 minutes. With multi-CPU clusters being common, ligand sets under consideration in lead-optimization exercises can be thoroughly studied with these methods. Second, the benchmark used here contains a small number of pharmaceutically relevant targets, each represented by a small number of conformational snapshots, but the testing was done with a large number of ligands of highly variable structure in many cases. Further, the benchmark itself was not constructed by a methods developer to demonstrate performance of a particular method; rather it was constructed by an independent active modeler in order to measure real-world behavior. Third, the approach offers a way to systematically make use of modeling knowledge in the form of ligand fragments and their key interactions, but to do so in a way that does not lead to undue bias in constraining the prediction space. Fourth, the workflow yields physically intuitive results: related pose families under a small number of scoring conditions that allow for significant protein flexibility including both sidechain and backbone movements. These results represent very significant practical improvements over single-structure non-cognate docking. Prediction performance for the single top-scoring pose family averaged 64% (over baseline multi-structure docking, heavy-atom pocket optimization, and proton-only pocket optimization). When top pose families agreed, 80% correct prediction was observed. Overall, consideration of the best prediction from the two top pose families from two scoring methods yielded correct predictions 75% of the time, averaged across the targets tested. Cognate docking on the same set yielded 65% success, so the results for cross-docking with this multi-pronged approach are competitive.
This work was positively influenced by much of the work that has been previously reported addressing protein flexibility, particularly including that from the groups of Abagyan, Gilson, Friesner, Goodsell, McCammon, Moitessier, and Shoichet (30; 31; 32; 33; 34; 35; 36; 40). The foregoing work has generally focused on elegant studies of single targets or all-by-all cross-dockings with small total numbers of ligands (generally twenty-five or fewer). The present work has made use of a very large testing set of realistic construction (211 test ligands for eight total targets, with five starting protein conformations per target). It is difficult to make sensible comparisons in terms of performance levels since the studies are so different, but the results shown here are transparently relatable to real-world use scenarios, and performance levels approach those seen in multiple studies on “hard” benchmarks for cognate docking. Among the prior reported methods in which true protein flexibility has been explored, processing times spanned multiple hours for single ligands, compared with the 30-minute timings typical in this study (for an initial multi-structure docking, rescoring with all-atom protein pocket adaptation, and pose family generation for the baseline and rescored poses).