|Home | About | Journals | Submit | Contact Us | Français|
The success of ligand docking calculations typically depends on the quality of the receptor structure. Given improvements in protein structure prediction approaches, approximate protein models now can be routinely obtained for the majority of gene products in a given proteome. Structure-based virtual screening of large combinatorial libraries of lead candidates against theoretically modeled receptor structures requires fast and reliable docking techniques capable of dealing with structural inaccuracies in protein models. Here, we present Q-DockLHM, a method for low-resolution refinement of binding poses provided by FINDSITELHM, a ligand homology modeling approach. We compare its performance to that of classical ligand docking approaches in ligand docking against a representative set of experimental (both holo and apo) as well as theoretically modeled receptor structures. Docking benchmarks reveal that unlike all-atom docking, Q-DockLHM exhibits the desired tolerance to the receptor’s structure deformation. Our results suggest that the use of an evolution-based approach to ligand homology modeling followed by fast low-resolution refinement is capable of achieving satisfactory performance in ligand-binding pose prediction with promising applicability to proteome-scale applications.
Considerable effort has been directed towards developing fast and effective ligand docking algorithms applicable to drug discovery and design 1-. The goal of virtual screening techniques is to limit the size of the screening library to compounds most likely to display the desired biological activity. Among commonly used virtual screening approaches are docking-based techniques that prioritize the testing compounds by predicting the binding mode for a query compound 5-8; this is followed by the prediction of binding affinity 9-11. To achieve satisfactory performance, most ligand docking approaches typically require high-resolution structural information on a potential target receptor, preferably in the ligand-bound conformational state 12. Hence, many studies describe successful self-docking benchmarks 6,13-15. Nevertheless, many proteins exhibit significant motion upon ligand binding 16-18 and even small motions of side chains or loops can have a detrimental effect on docking accuracy.
It has been demonstrated for trypsin, HIV-1 protease and thrombin that almost 90% of initial docking accuracy is lost if the mean protein structural rearrangement is greater than 1.5 Å 19. A notable drop off in the docking accuracy (from 76% to 49%) was reported in cross-docking experiments where ligands were docked to the crystal structures derived from complexes other than their own 20. Another study carried out for 8 protein targets shows considerable deterioration in CDocker’s 21 performance from 39% for native receptor structures to 26% for non-native conformations 22. Further examples of ligand docking applications using non-native receptor crystal structures include the comprehensive benchmarks of the GOLD docking program 15 against the Astex Non-native Set 23, testing the FITTED docking protocol on a set of 33 complexes for 5 drug targets 24 and the assessment of the ICM’s 5 performance for a set of 4 protein kinases 25. In contrast, there are considerably fewer studies that focus on the development and benchmarking docking of methodologies suitable for theoretical receptor structures, particularly those modeled using remote protein homology 26-28.
Despite the continuous growth of the PDB 29, for many important drug targets, high quality crystal structures are still unavailable. In the absence of experimentally determined structures, homology modeling provides receptor models with an accuracy related to the level of sequence identity to the template protein 30-32. Protein models built on template structures with more than 50% sequence identity tend to have ~1 Å root-mean-square deviation (RMSD) of their backbone atoms from the corresponding experimental structures. Medium-accuracy models (with a RMSD of 1.5 Å from native for 90% of the backbone structure) require template structures with a sequence identity of 30-50% to the target. The accuracy of protein models drops considerably for targets sharing less than 30-35% sequence identity to their templates. Notwithstanding the progress of structural genomics projects 33-35, for most proteins, no structural templates with high sequence identity are available; therefore their theoretical models, even when they have the correct global topology, have significant structural inaccuracies in ligand binding sites. Such structural errors interfere with ligand docking and cause critical deterioration in the ability to accurately reproduce binding poses.
For example, docking experiments using deformed trypsin structures with a Cα root-mean-square-deviation, RMSD, from native varying from 1 to 3 Å as targets for docking known trypsin inhibitors revealed that native protein-ligand contacts are rapidly lost with receptor structure deformation 36. Furthermore, as demonstrated for 10 enzyme systems, the performance of the docking calculation is affected by the particular representation of the receptor and decreases from experimental to theoretically modeled structures 12. Finally, large-scale benchmarks carried out for a representative set of protein-ligand complexes revealed that the relatively high accuracy attained in self-docking frequently becomes no better than simple random ligand placement if weakly homologous protein models are used as the target receptors instead of crystal structures 37. This observation clearly reflects the reduced ability to properly accommodate ligand molecules in the estimated one half of targets with weakly homologous protein models that have a RMSD from the native binding site >2 Å 38. Such structural distortions of the binding sites are significantly larger than the differences between the apo and holo structural forms of most proteins.39,40
In that regard, efficient docking methods capable of dealing with structural deviations from ligand-bound receptor conformations frequently observed in ligand-free forms and routinely present in theoretically modeled receptor structures are highly desired. Low-resolution models 26-28,41 and evolution-based approaches 37 have been shown to efficiently tackle this problem. The latter work by detecting remote functional relationships in proteins in order to identify many essential features associated with ligand binding 37,38. These insights can be profitably exploited to develop CPU-inexpensive algorithms for ligand comparative modeling. For example, ligand binding sites can be effectively detected in protein models whose global (binding pocket) Cα RMSD is up to 8-10Å (2-3Å) 38. Furthermore, a pocket-specific potential of mean force can be derived from weakly homologous structure templates to facilitate ligand docking. This potential is often more specific for modeling protein-ligand interactions than generic knowledge-based potentials derived from complexes found in the PDB 27. The analysis of ligand binding modes in evolutionarily distant proteins identified by threading shows that the ligands that bind to the common binding site often contain a set of strongly conserved anchor functional groups as well as variable regions that impart specificity to the family members 42; these observations stimulated the development of FINDSITELHM, a ligand homology modeling approach. Remarkably high structural conservation of the anchor functional groups across weakly related proteins was recently used to perform rapid ligand docking by homology modeling with encouraging results 37. Here, we describe Q-DockLHM, the extension of Q-Dock 27 that performs constrained low-resolution ligand docking simulations in order to refine the binding poses provided by FINDSITELHM. In ligand docking against a representative set of experimental as well as theoretically modeled receptor structures, we compare its performance to that of classical ligand docking approaches 8,43.
High quality protein—ligand complex X-ray structures were taken from the CCDC/Astex set 44. We only include proteins for which at least 5 ligand-bound, weakly homologous threading templates can be identified by protein threading and where the binding pocket can be predicted by FINDSITE 38 within 7 Å from the bound ligand; this results in 204 protein targets of the 305 originally included in the CCDC/Astex set. In addition to the ligand-bound crystal structures used as target receptors to dock ligands (Set #0), we compiled three sets of protein models with different structural accuracy: high, moderate and low (Set #1, Set #2 and Set #3, respectively). Protein models were built by our protein structure assembly/refinement protocol, TASSER 45, from multiple distantly related (<35% sequence identity to the target) template structures identified by PROSPECTOR_3 42. For each target protein, up to 10 models have been generated; three were selected and assigned to a particular set based on the TM-score 46 to the target crystal structure to retain the average TM-scores for Set #1, Set #2 and Set #3 of 0.85, 0.80 and 0.75, respectively.
Out of 204 targets included in the dataset, 54 are dimers with ligand binding pockets formed by both monomer chains. The quaternary structures of these targets were generated using the TASSER models of the monomers and 3D-Dock 47 and assigned to a particular set using similar criteria as for the single chain targets. The average Cα RMSD calculated for individual protein chains is 2.41, 3.44 and 4.56 Å, for Set #1, Set #2 and Set #3, respectively. Furthermore, we selected 135 proteins for which ligand-free structural forms are available in the PDB 29 to conduct docking simulations using receptor apo crystal structures. The average global and local deviations from the ligand-bound crystal structure for the protein models (Sets #1-3) and ligand-free structures are summarized in Table I. The dataset is found at http://cssb2.biology.gatech.edu/skolnick/files/QDOCKLHM.
In addition to the pocket-specific contact potential derived from weakly related template structures 27, Q-DockLHM employs position-specific anchor restraints imposed on the anchor binding mode predicted by FINDSITELHM 37. The consensus anchor-binding pose is derived from weakly homologous (<35% sequence identity to the target) ligand-bound template structures identified by threading. First, upon the global superposition of the threading templates onto the target’s (experimental or predicted) structure using TM-align 48, the template-bound ligands that occupy a top-ranked, predicted binding site are clustered using a SIMCOMP chemical similarity cutoff of 0.7. SIMCOMP is a chemical compound-matching algorithm that provides atom equivalences 49. Subsequently, each cluster of ligand molecules is used to detect the anchor substructure. The equivalent atom pairs provided by SIMCOMP are projected onto ligand functional groups 27, and the anchor substructure is defined as the maximum set of conserved functional groups present in at least 90% of the ligands from a single cluster. Having identified the anchor substructure, the average pairwise RMSD for anchor functional groups is calculated for template-bound ligands upon the global superposition of the template structures. The consensus anchor binding mode and its structural conservation is then incorporated into the Q-DockLHM’s force field as follows:
where the RMSD is calculated for a given ligand pose vs. the consensus anchor binding mode and aRMSD is the average pairwise RMSD of the anchor substructure calculated over the template-bound ligands (anchor structural conservation). Ligand conformations whose anchor functional groups deviate too far from their consensus positions are penalized, and the lowest energy poses are typically localized around the anchor consensus binding mode within the distance proportional to its structural conservation. As shown below, the restraints imposed on the consensus anchor-binding mode improve the sampling of native-like conformations.
In the first step, binding pockets were predicted in the target proteins by FINDSITE 38. This structure/evolution-based approach identifies ligand-bound template structures from a set of distantly homologous proteins detected by the PROSPECTOR_3 threading approach 42 and superimposes them onto the target’s (experimental or predicted) structure using the TM-align structure alignment algorithm 48. Binding pockets are identified by the spatial clustering of the center of mass of template-bound ligands and ranked by the number of binding ligands. Here, we used the best of top five predicted pockets. We note that only ligand binding sites whose centers were predicted within 7 Å from the bound ligand were used to dock ligands. Ligand poses provided by FINDSITELHM 37 were used as the initial conformations for molecular docking/refinement by Q-DockLHM, Q-Dock 27, AutoDock3 8, LIGIN 43 and AMMOS 50. The protocols followed are detailed below.
FINDSITELHM is a fast ligand homology modeling approach that docks flexible ligands by a simple superpositioning procedure 37. It uses a collection of template-bound ligands extracted from binding sites predicted by FINDSITE and clusters them using the SIMCOMP chemical similarity score 49. Subsequently, an “anchor” substructure is identified in each cluster, as defined above. FINDSITELHM superimposes the target ligand onto the consensus binding pose, the anchor conformation averaged over the seed compounds (the largest set of compounds that have their anchor substructures within a 4 Å RMSD from each other) of the identified anchor substructure. If none of the identified anchor substructures is covered by the target ligand, it is randomly placed in the predicted pocket. Ligand flexibility is accounted for by the superposition of multiple conformations of the target ligand. The conformation that can be superposed onto the reference coordinates with the lowest RMSD to the predicted anchor pose is selected as the final model.
Q-DockLHM is a direct extension of Q-Dock 27 (see below) that additionally includes harmonic RMSD restraints imposed on the predicted anchor-binding pose (defined in Eq. 1). Since the sampling of the lowest-energy conformations is generally restricted to the space around the consensus anchor pose, the simulation time was reduced from Q-dock by using 12 replicas, 50 attempts at replica exchange and 50 MC steeps between replica swaps. The lowest-energy conformation was selected as the final docking result.
Ligand poses provided by FINDSITELHM as well as low-resolution models generated by Q-DockLHM and transformed into the all-atom representation were optionally refined by molecular mechanics optimization using AMMOS 50. AMMOS employs the AMMP molecular simulation package 51 to carry out automatic refinement of the complexes. We used the sp4 force field in all simulations. Using the crystal structures of the receptors, only ligand atoms were permitted to move (AMMOS Case 5), whereas for protein models, protein atoms within a 12 Å sphere around the ligand are allowed to be flexible (AMMOS Case 4).
We followed the Replica Exchange Monte Carlo docking protocol 27 that allows the sampling of ligand conformations within a 7 Å radius sphere imposed on the predicted pocket center with the number of replicas reduced to 12, 50 attempts at replica exchange and 50 MC steeps between replica swaps. The final docking conformation corresponds to the lowest-energy pose.
In flexible ligand docking simulations, we used AutoDock 3 8, which is the most frequently used docking software 2. A grid spacing of 0.375 Å was used, with the box dimensions depending on the target ligand size, such that the ligand’s geometric center was not allowed to move more than 7 Å away from the predicted binding pocket center. Each docking simulation consisted of 100 runs of a genetic algorithm (GA) using the default GA parameters. The lowest-energy conformation was taken as the final docking result.
LIGIN is an all-atom docking approach that uses molecular shape complementarity and atomic chemical properties to predict the optimal binding pose of a ligand inside the receptor binding pocket 43. We adopted the idea of ligand docking using conformational ensembles 7,52,53 to mimic the ligand flexibility in LIGIN. For a given target, we used exactly the same ensemble of multiple ligand conformations as in Q-Dock/Q-DockLHM simulations and FINDSITELHM, and docked each of them into the predicted binding site using LIGIN. The docking procedure was repeated 1000 times for each ligand conformer. The final binding mode corresponds to that of maximal complementarity found in the complete set of ligand conformers. Atom types were assigned using LPC 54; no receptor residues were permitted to have steric overlap with the ligand.
The overall quality of the predicted protein-ligand complexes is assessed in terms of the contact distances between protein and ligand atoms. Interatomic contacts are identified by LPC 54. The distribution of the contact distances (atom types are neglected) were analyzed for the complexes modeled by each ligand docking method and compared to that observed in the crystal structure.
Root-mean-square-deviation (RMSD) from native is one of the most commonly used measures to assess the accuracy of ligand binding pose prediction. The classical RMSD measure averages the binding pose prediction accuracy over all ligand heavy atoms. Since the position of a portion of a ligand can be significantly better predicted than the remaining part of the molecule, a simple RMSD evaluation can be very misleading 55. Therefore, we also report the fraction of ligand heavy atoms predicted within 1, 2 and 3 Å from their reference positions.
The fraction of native protein-ligand contacts recovered in the predicted complex structures complements the RMSD calculation in the evaluation of the docking accuracy. Specific protein-ligand contacts are calculated at the detailed level of protein/ligand heavy atoms (high-resolution contacts) as well as at the simplified level of protein residues and ligand functional groups 27 (low-resolution contacts). High-resolution contacts are extracted from the complex structures by LPC, which is based on the interatomic contact surface analysis 54. Low-resolution contacts are calculated using the limiting distances for the centers of mass of ligand functional groups and protein residues 27. In addition to the specific low-resolution contacts, we also consider consensus quasi-specific contacts that are conserved across the set of template protein-ligand complexes (present in at least 25% of the template complexes). Here, ligand functional groups of the same type are equivalent to each other, i.e. quasi-specific contacts are calculated for the protein residues and ligand functional group types.
For the protein crystal structures used as the target receptors to dock ligands, the RMSD as well as the fraction of correctly predicted contacts are calculated using the ligand pose in the experimental complex as the reference conformation. In theoretical protein models, the local geometry of the binding pocket frequently deviates from the experimental structure (see Table I). To assess the ligand binding mode prediction accuracy for protein models, we used ligand poses transferred from the crystal structures upon the superposition of the binding residues as reference ligand conformations. This also roughly estimates the upper bound for ligand docking accuracy against protein models. Due to the structural distortions of binding pockets, in many cases, the requirement of 0 Å RMSD or 100% of the native contacts simply cannot be satisfied. Ligands randomly placed into the receptors binding pockets within a distance of 7 Å from the predicted pocket center delineate the lower bound of docking accuracy.
In this study, we used binding pockets predicted by FINDSITE as the target sites to dock ligands. FINDSITE is a threading-based binding site prediction/protein functional inference/ligand screening algorithm that detects common ligand binding sites in a set of evolutionarily related proteins 38. The results of pocket prediction carried out for the CCDC/Astex set are shown in Figure 1. A remarkable feature of FINDSITE is its high insensitivity to the structural distortions in protein models. Considering a cutoff distance of 4 Å as a hit criterion, the success rates for Set #0 (crystal structures), Set #1, Set #2 and Set#3 (protein models) are 83.8%, 81.4%, 78.9% and 76.5%, respectively, with comparable ranking (Figure 1, inset). Here, we allow a binding pocket center to be predicted within a distance of 7 Å from the center of a bound ligand in the crystal structure and the docking protocols were adjusted to allow the sampling of native-like conformations.
The cumulative distribution of interatomic contact distances calculated for ligand and protein heavy atoms is presented in Figure 2. A significant fraction of close contacts and steric clashes between ligand and protein heavy atoms can be observed in the conformations predicted by FINDSITELHM. This is because the receptor structure is absent in FINDSITELHM that docks ligands by a simple superpositioning procedure using the ligand consensus binding mode derived from template structures 37. Ligand conformations modeled by Q-DockLHM typically contain less close contacts; however some are still present after the reconstruction of all-atom models from the low-resolution structures. High-resolution refinement using AMMOS applied to ligand poses reconstructed from Q-DockLHM‘s conformations removes most of the unphysical contacts, and similar to AutoDock3 and LIGIN, produces ligand-protein complexes that closely follow the crystal structures with respect to the interatomic distances. AMMOS was found to be more effective in removing close contacts when applied to ligand poses refined by Q-DockLHM than those provided by FINDSITELHM. This is especially important for subsequent ligand ranking studies.
Typically, the binding pose prediction accuracy of docking algorithms is assessed by ligand RMSD from the crystal structure. The median RMSD calculated for ligands docked by FINDSITELHM for the CCDC/Astex dataset of 204 proteins using receptor crystal structures is 5.10 Å (we note that for the subset of 47 proteins whose predicted binding pockets are within 2 Å from the native ligand’s center that provide a substantial anchor coverage (≥0.9) for the target ligand, the median RMSD for FINDSITELHM (further refined by AMMOS) is 3.14 Å (2.83 Å), which is in good agreement with our previous benchmarks 37). Using Q-Dock (Q-DockLHM), the median RMSD of the predicted ligand poses decreases to 4.42 Å (4.10 Å). Furthermore, we find that the subsequent high-resolution refinement by AMMOS improves the accuracy of the all-atom conformations reconstructed from low-resolution models provided by Q-DockLHM to a median RMSD of 4.02 Å. Here, AutoDock3 provides the most accurate poses with the median RMSD of 3.68 Å. The RMSD calculated for ligand conformations obtained from LIGIN is 5.14 Å. We note that random ligand placement gives much worse results; the median RMSD is 9.23 Å.
For modeled protein structures used as the target receptors, the results of docking simulations are assessed in terms of the fraction of ligand heavy atoms that have been predicted within 1, 2 and 3 Å from their reference positions as well as the fraction of correctly predicted specific contacts and compared to these obtained for receptor crystal structures. The reference ligand coordinates for receptor models are calculated by transferring ligands from the crystal structures into the modeled structures upon the local superposition of the binding residues. Figure 3 presents the fraction of ligand heavy atoms placed near their reference positions. For the crystal structures (Figure 3A), Autodock3, Q-DockLHM followed by AMMOS and LIGIN give comparably accurate results for the largest threshold; roughly half of the ligand atoms are predicted within a distance of 3 Å from their crystal positions. Clearly, AutoDock3 is the most accurate for the smallest threshold, where one-quarter of ligand heavy atoms are placed within 1 Å on average. For protein models, FINDSITELHM docks the ligands with the highest fraction of their heavy atoms predicted within 1 Å from reference coordinates. Using a distance threshold of 3 Å, in most cases Q-DockLHM gives the most accurate binding poses, but they are often close to FINDSITELHM. These results (and also the analysis of the overall quality of the complex structures) suggest that in many cases due to the structural distortions of the binding pockets in modeled protein structures (Table I), the approximately correct binding pose of a ligand as assessed by the RMSD or the portion of the ligand placed near it’s reference conformation violates the excluded volume causing a strong energy penalty. However, as we show below, the low-resolution refinement by Q-Dock and especially by Q-DockLHM recovers more specific protein-ligand contacts than FINDSITELHM despite the higher RMSD values.
It has been pointed out that comparing the performance of ligand docking programs is a non-trivial task 13,55. The evaluation is even more complicated when docking accuracy is assessed based on ligand binding poses predicted using inaccurate receptor models, where a simple RMSD value may not provide an adequate evaluation metric. For that reason, in addition to the traditional RMSD calculations and the fraction of ligand heavy atoms predicted within various distance thresholds from their reference positions, we also use the fraction of native protein-ligand contacts recovered in the predicted complex structures to assess docking accuracy. Since interaction-based assessment is sensitive to what is defined as an interaction, we consider two accuracy measures based on the fraction of correctly reproduced high- (calculated on interatomic distances between heavy atoms calculated using LPC 54) as well as low-resolution (based on the distance between the centers of mass of protein residues that optimize the overlap with the high resolution definition 27) protein-ligand interactions. We find that the fraction of recovered specific native contacts, both high- and low-resolution, correlates well with a ligand’s RMSD from the native pose (Figure 4). Low-resolution contacts reliably reproduce the real interatomic contacts in all-atom structures with an average Matthew’s correlation coefficient of 0.8 27, yet are less sensitive to small deviations in the ligand and protein coordinates compared to the high-resolution contacts (Figure 4, RMSD range of 0-3 Å).
The fraction of correctly predicted specific high- and low-resolution native contacts is presented in Figure Figure55 and and6,6, respectively. For the receptor crystal structures, the fraction of recovered specific native contacts is highest for AutoDock3 (Figures (Figures5A5A and and6A).6A). However, all-atom docking approaches were found to be considerably less accurate in the prediction of specific protein-ligand contacts than FINDSITELHM and particularly for the case of low-resolution docking when modeled protein structures were used as the target receptors (Figures 5B-D and 6B-D). Furthermore, low-resolution docking/refinement by Q-Dock/Q-DockLHM improves the accuracy of ligand binding pose prediction over FINDSITELHM in terms of the recovered specific high- as well as low-resolution protein-ligand contacts, which is essential for improved ligand ranking. Finally, for receptor crystal structures and high-resolution contacts, Figure 5A, all-atom refinement using AMMOS slightly improves the binding poses predicted by Q-DockLHM.
Benchmark simulations reported here demonstrate that ligand homology modeling by FINDSITELHM followed by anchor-constrained low-resolution refinement by Q-DockLHM outperforms other approaches in ligand binding pose prediction against modeled receptor structures. FINDSITELHM provides an approximately correct binding pose for the highly conserved portion of a ligand, termed the anchor, and evaluates its structural conservation across the set of evolutionarily related proteins. This information is subsequently utilized in low-resolution docking to refine the binding mode and, as it is evident from Figure 7, to recover significantly more specific protein-ligand contacts than a simple all-atom refinement, e.g. using AMMOS, with comparable CPU time (see below).
Here, we assess how the harmonic RMSD restraints imposed on the anchor portion of a ligand in Q-DockLHM simulations affect the predicted binding pose compared to Q-Dock. Table II shows the median RMSD from the crystal structure, the fraction of recovered specific high-resolution contacts, the Matthew’s correlation coefficient calculated for consensus (derived from template protein-ligand complexes) low-resolution contacts and the docked energy obtained from Q-Dock and Q-DockLHM starting from the binding pose provided by FINDSITELHM as well as from random ligand conformations. Using the binding modes predicted by FINDSITELHM as the initial poses, the improvement of Q-DockLHM over Q-Dock is rather minor for the receptor crystal structures. However, as assessed by the RMSD and the fraction of recovered high-resolution contacts, the improvement is statistically significant at the level of 0.05 for theoretically modeled receptor structures, particularly those that are the most distorted (Sets #2 and #3). Additionally, we carried out docking simulations starting from the random ligand conformations instead of those provided by FINDSITELHM. Here, the improvement of Q-DockLHM over Q-Dock is highly significant in all the cases (Table II, values given in parentheses). As one would expect, with an appropriately long simulation time, the results should not depend on the initial conformation. Nevertheless, virtual screening applications require the docking simulations to be limited to at most few minutes per single compound.
Moreover, significantly higher MCC values obtained for the consensus quasi-specific low-resolution contacts recovered by Q-DockLHM result in ligand conformations with lower docked energy (Table II), which is a critical factor for effective ligand ranking. We also find that the sampling of native-like conformations is improved in Q-DockLHM with respect to Q-Dock. Figure 8 shows the median root-mean-square displacement of a ligand geometric center from the ligand center in the crystal structure across all replicas used in Replica Exchange Monte Carlo simulations. For receptor crystal structures (Figure 8A), the median displacement of a ligand center is lower, particularly for middle temperature replicas, when the Q-DockLHM protocol is used. In case of protein models (Figure 8B-D), a difference between Q-Dock and Q-DockLHM is also seen for low temperature replicas. When docking time is of importance, the harmonic restraints imposed on the structurally conserved portion of a ligand rapidly direct toward the native-like conformations. Thus, they improve the sampling of native-like conformations and provide acceptably accurate binding poses.
The very high sensitivity to structural distortions in protein models revealed for AutoDock3 and LIGIN motivated us to examine the accuracy of ligand binding mode prediction for ligand-free crystal structures used as target receptors. Figure 9 presents the docking results obtained for 135 proteins selected from the CCDC/Astex dataset for which both ligand-bound (holo) and ligand-free (apo) structural forms are available in the PDB. Clearly, even small deviations from the holo structures (see Table I) cause a statistically significant deterioration in the ability to reproduce the ligand binding mode for all-atom docking approaches as assessed by the RMSD (Figure 9A) as well as the fraction of ligand heavy atoms that have been predicted within 1, 2 and 3Å from their experimental positions (Figure 9B). In contrast, ligand binding poses provided by low-resolution docking by Q-DockLHM using holo receptor structures are indistinguishable from these obtained using the apo structural forms. Furthermore, we find that all-atom refinement by AMMOS applied to the ligand binding poses predicted by Q-DockLHM improves the accuracy of the final models irrespective of the receptor ligand binding state.
Conformational search efficiency in ligand docking is of particular importance in structure-based virtual screening. The large number of small compounds (usually thousands to millions) subjected to docking simulations in a typical virtual screening experiment requires that a binding pose must be predicted in an acceptable amount of CPU time. Figure 10 shows docking times for the programs and protocols used in this study. FINDSITELHM is the least CPU-expensive procedure with a median docking time of 19 sec. Q-Dock and Q-DockLHM require ~7 min to dock a ligand on average. High-resolution refinement by AMMOS typically uses ~5 min of a CPU time. All-atom docking by AutoDock3 (LIGIN) requires ~60 (~30) min on average. We note that for AMMOS, AutoDock3 and LIGIN, the default sets of parameters were used in this study and the docking protocols have not been optimized with respect to accuracy and simulation time.
Ligand virtual screen is routinely used in drug discovery to accelerate the identification of lead candidates for pharmacologically important targets 56-58. In practice, virtual screening algorithms suffer from a number of limitations. Due to the inherent imperfections of the energy functions, the predicted binding affinity is often strongly correlated with the molecular weight of the ligand, independent of whether or not it really binds 36,59,60. In virtual screening, the high contribution of ligand molecular weight to the predicted binding energy typically favors the selection of large compounds 61. This problem can be addressed by using more specific potentials such as pocket-specific potentials derived from a set of evolutionarily related complex structures 27, targeted scoring functions 62 or normalized energy scores 61. Furthermore, the very high sensitivity to the structural distortions of ligand-binding sites makes most existing ligand docking approaches inapplicable to theoretically modeled receptor structures, particularly those modeled by remote protein homology 26-28.
The analysis of ligand binding mode conservation in evolutionarily related proteins demonstrates that their ligands typically bind in similar fashion. Indeed, across distantly related proteins, the average RMSD of the protein binding pockets is 2-3 Å 37,63. The remarkably strong conservation of the chemical as well as structural aspects of ligand binding across evolutionarily related binding pockets suggests that it should be possible to correctly predict protein-ligand interactions even as the binding sites become somewhat distorted. This would beneficially expand the pool of bona fide drug targets from the small fraction of a proteome for which high quality crystal structures are available to the majority proteins whose requisite quality structures can be modeled using state-of-the-art protein structure prediction approaches 30-32,45.
Here, we demonstrate that the high performance of all-atom docking approaches in self-docking experiments falls of dramatically if modeled protein structures are used as the target receptors to dock ligands. Even moderate structural distortions of the modeled binding pockets that in principle should be tolerated as explained above (experimentally their binding pockets bind similar ligands, yet their RMSD is 2-3 Å), drastically interfere with the ability of the all-atom docking approaches to identify correct docking geometries. This problem might be alleviated by introducing flexibility into the receptor protein 64-66. However, inclusion of explicit receptor flexibility greatly increases the dimension of the conformational space and the simulation time 67,68; thus it is inapplicable in virtual screening experiments that typically involve docking a large collection of drug candidates with a computational effort of minutes per single compound. Furthermore, it has been demonstrated that the protein flexibility in ligand docking against non-native receptor structures typically does not improve the binding pocket RMSD due to the rugged energy landscapes created by all-atom force fields 68. In addition, the generation of ligand conformers is a limiting factor for docking accuracy 68. A somewhat more plausible explanation for the decreasing docking accuracy with the structural distortion of the binding pockets is that many ligand docking algorithms work mainly by shape complementarity; therefore they are missing essential features of binding.
To address this significant problem, we recently developed FINDSITELHM, a novel evolution-based ligand docking approach by homology modeling 37. The underlying basis of FINDSITELHM is that evolution tends to conserve not only functionally important regions in the protein structure, but also conserves a subset of ligand features as well. In addition, the low-resolution description of molecular recognition was demonstrated to recapitulate essential features of ligand-protein and protein-protein complexes by simulating the average effects of conformational flexibility 27,28,41,69. Here, we show how the structural diversity of the binding site captured by protein threading can be exploited in low-resolution modeling of ligand protein-ligand interactions. In fact, very similar techniques are commonly used in template-based protein structure prediction, where the structural information in the form of protein-specific potentials, distance and contact restraints is derived from multiple templates identified by threading and combined with generic energy terms with the goal of building a protein model that is closer to native structure than that provided by the best template 70-73. These basic principles outlined for template-based protein structure prediction also hold for ligand docking. By analogy, ligand poses provided by FINDSITELHM can be considered as the averaged template structures, whose accuracy is subsequently improved in constrained low-resolution docking simulations using Q-DockLHM. Here, the lower resolution description is of importance in that it averages high-resolution structural details and dramatically improves the tolerance to receptor structure deformation 27,28,41,69. Overall, our results suggest that the use of an evolution-based approach followed by adequate low-resolution modeling is capable of achieving satisfactory performance in protein—ligand recognition with a great potential for proteome-scale applications.
This work was supported by Grant Nos. GM-37408 and GM-48835 of the Division of General Medical Sciences of the National Institutes of Health.