The field of small molecule docking was initiated by the pioneering work of Kuntz and Blaney on rigid ligands in the 1980s [1]. The first practical and fully automatic methods began to appear in the 1990s, with AutoDock [2, 3], GOLD [4, 5], Hammerhead [6, 7, 8], and FlexX [9, 10]. The earliest efforts typically demonstrated successful re-docking of ligands into their cognate protein binding sites, usually with just a handful of examples, frequently including cases such as trypsin/benzamidine (3PTB), streptavidin/biotin (1STP), and DHFR/methotrexate (4DFR). With the publication of the 1997 GOLD validation paper [5], reporting pose prediction performance on 100 complexes, the scale of validation experiments for ligand pose prediction changed permanently. Publication of the independent benchmarking of docking algorithms by Rognan’s group in 2000 added virtual screening assessment (on thymidine kinase and estrogen receptor) to the types of formal assessments commonly made of docking algorithms [11]. Development of the Surflex-Dock approach (first described in 2003 [12]), the descendant of the Hammerhead system, benefited from cognate-docking benchmarks for pose prediction assessment (81 complexes derived from validation of GOLD [5]) and from benchmarks for virtual screening assessment (2 target systems, known positive ligands, and a decoy set from Rognan’s group [11]).
The early years of the new millennium saw the introduction and popularization of additional docking algorithms, with independent benchmarking becoming increasingly prevalent. Studies from Perola et al. [13] and Warren et al. [14] were particularly influential. During this same period, larger and more diverse virtual screening benchmarks were developed, notably the set of 29 screening target systems for testing scoring function optimization by Pham and Jain [15] and the 40 screening targets forming the DUD set by Huang, Shoichet, and Irwin [16]. With respect to measuring pose prediction, the importance of high-quality structures was gaining prominence, highlighted by the publication in 2007 of the Astex Diverse set of 85 protein-ligand complexes [17]. At the same time, the limitations of cognate ligand re-docking were beginning to be recognized, for example by Sutherland et al. [18] and by Verdonk et al. [19], who each developed benchmarks for assessment of non-cognate pose prediction.
A special symposium on evaluation of molecular modeling methods took place at the Fall 2007 National ACS meeting, with special attention paid to the issues governing proper assessment of docking algorithms. The meeting yielded several papers, published in a special issue of this Journal, introduced with an editorial by the symposium co-organizers Nicholls and Jain [20]. While consensus among the broader community has been elusive, several issues of central importance were identified relating to benchmark construction and statistical methodology. In the area of virtual screening evaluation, some agreement was reached as to sensible statistical methods for measuring enrichment, but approaches to decoy set design remained controversial. Two types of decoy set were under discussion: “designed” decoy sets chosen to mimic properties of a set of known actives for a particular target, and “agnostic” decoy sets chosen to mimic properties of a typical small molecule screening library. In the area of pose prediction assessment, serious problems with cognate docking benchmarks were highlighted involving “memory effects” that develop when optimizing a protein’s pocket structure in the presence of the ligand to be docked as a test [21].
This paper is part of a collection devoted to a follow-up symposium, held in Spring 2011 and co-organized by the authors of the lead editorial in this special issue of the Journal of Computer-Aided Molecular Design [22]. Participants were asked to present comparable data and analyses on pose prediction, using the Astex Diverse set of 85 protein-ligand complexes, and on screening utility, using the DUD set of 40 protein targets along with known positive ligands and designed decoy sets for each target. Both sets involved multiple aspects of manual re-curation, especially as to the protein structures themselves.
Performance of Surflex-Dock on the re-prepared Astex85 set was not statistically significantly different from our previous application to the originally released data set [23], with success rates for single top-scoring poses within 2.0 Å RMSD ranging from 66–80% depending on input coordinate variations and run conditions, and success rates for best of 20 top-scoring poses of approximately 95%. Performance of Surflex-Dock on the re-prepared DUD40 set yielded a mean ROC area of 0.72 (stdev. 0.15) and mean 1% ROC enrichment of 19 (stdev. 14.5). This was not statistically significantly different from the results in the independently published report of Cross et al. [24], which compared several docking methods. They concluded that GLIDE and Surflex-Dock were capable of superior performance in both pose prediction and virtual screening relative to the other methods tested: DOCK, FlexX, ICM, and PhDock. Use of SP mode for GLIDE and enabling ring flexibility for Surflex-Dock produced the best overall results in that study.
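For readers less familiar with the two screening metrics quoted above, the following is a minimal sketch of how they can be computed from docking scores for one target. It is not the code used in this study. It assumes that higher scores indicate better predicted binding, that ROC area is estimated by the Mann-Whitney statistic, and that ROC enrichment at a given false positive rate is the ratio of true positive rate to false positive rate at the corresponding score threshold. The function and parameter names are hypothetical.

```python
def roc_area(active_scores, decoy_scores):
    """Mann-Whitney estimate of ROC area.

    Fraction of (active, decoy) pairs in which the active outscores the
    decoy, with ties counted as half. Assumes higher score = better.
    """
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


def roc_enrichment(active_scores, decoy_scores, fpr=0.01):
    """Ratio of true positive rate to false positive rate near a target FPR.

    The threshold is the score of the decoy ranked at the requested
    fraction of the decoy list (at least one decoy is always admitted).
    The realized FPR is recomputed so that tied decoys are handled.
    """
    ranked = sorted(decoy_scores, reverse=True)
    n_allowed = max(1, int(fpr * len(ranked)))
    threshold = ranked[n_allowed - 1]
    tpr = sum(1 for a in active_scores if a >= threshold) / len(active_scores)
    realized_fpr = sum(1 for d in decoy_scores if d >= threshold) / len(decoy_scores)
    return tpr / realized_fpr
```

Under this definition the largest possible 1% ROC enrichment is 100, reached when every active outscores the top 1% of decoys. Per-target values computed this way would then be averaged across the 40 DUD targets to obtain summary figures of the kind reported above.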
In addition to the baseline benchmarking that provided a comparative platform for the symposium, we addressed four additional questions, two related to pose prediction and two related to virtual screening: 1) to what extent are subtle changes in protein preparation capable of yielding large improvements in nominal pose prediction performance? 2) is it possible to make use of protein pocket adaptation during the docking process to produce high quality pose prediction results? 3) is a multi-pronged strategy for virtual screening, which combines docking, 2D similarity, and 3D similarity, more robust and reliable than any one method alone? 4) is it possible to make use of multiple protein conformational alternatives to improve virtual screening performance without requiring ad hoc scoring adjustments?
We observed gains in pose prediction success rates of nearly 20 percentage points by making very small changes to protein structures (typically 0.3 Å RMSD within the protein pocket) prior to docking by joint optimization of protein and cognate ligand. However, we also showed that very high success rates could be obtained using a practical procedure that adapted protein pockets during the docking process and produced pose families based on clustering and a Boltzmann weighting scheme. With respect to virtual screening, we showed that using the combination of docking and similarity approaches produced robust performance, with early enrichment of 15-fold or greater 75% of the time and overall ROC area of 0.80 or greater 60% of the time. Use of multiple alternative protein conformations was also shown to have a significant positive impact in two target systems where data were available to make direct comparisons.
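The pose prediction success rates discussed in this section can be illustrated with a short sketch. It is an illustration only, not the evaluation code used here. It assumes that docked and crystallographic ligand coordinates share the protein's coordinate frame (so no superposition is performed), that atoms are listed in matching order, and that no correction is made for symmetry-equivalent atoms. All names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Heavy-atom RMSD between two poses given as lists of (x, y, z).

    Assumes identical atom ordering and a shared coordinate frame;
    symmetry-equivalent atoms are NOT handled in this sketch.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def success_rates(rmsds_per_complex, cutoff=2.0, top_n=20):
    """Return (top-scoring success rate, best-of-top_n success rate).

    rmsds_per_complex holds one list per complex, containing the RMSD of
    each returned pose to the crystallographic pose, best score first.
    """
    n = len(rmsds_per_complex)
    top1 = sum(1 for r in rmsds_per_complex if r[0] <= cutoff)
    best = sum(1 for r in rmsds_per_complex if min(r[:top_n]) <= cutoff)
    return top1 / n, best / n
```

A difference between two such rates, computed on the same complexes under two protein preparation protocols, is what is meant by a gain expressed in percentage points.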