|Home | About | Journals | Submit | Contact Us | Français|
Poor performance of scoring functions is a well-known bottleneck in structure-based virtual screening, which is most frequently manifested in the scoring functions’ inability to discriminate between true ligands versus known non-binders (therefore designated as binding decoys). This deficiency leads to a large number of false positive hits resulting from virtual screening. We have hypothesized that filtering out or penalizing docking poses recognized as non-native (i.e., pose decoys) should improve the performance of virtual screening in terms of improved identification of true binders. Using several concepts from the field of cheminformatics, we have developed a novel approach to identifying pose decoys from an ensemble of poses generated by computational docking procedures. We demonstrate that the use of target-specific pose (-scoring) filter in combination with a physical force field-based scoring function (MedusaScore) leads to significant improvement of hit rates in virtual screening studies for 12 of the 13 benchmark sets from the clustered version of the Database of Useful Decoys (DUD). This new hybrid scoring function outperforms several conventional structure-based scoring functions, including XSCOREHMSCORE, ChemScore, PLP, and Chemgauss3, in six out of 13 data sets at early stage of VS (up 1% decoys of the screening database). We compare our hybrid method with several novel VS methods that were recently reported to have good performances on the same DUD data sets. We find that the retrieved ligands using our method are chemically more diverse in comparison with two ligand-based methods (FieldScreen and FLAPLBX). We also compare our method with FLAPRBLB, a high-performance VS method that also utilizes both the receptor and the cognate ligand structures. Interestingly, we find that the top ligands retrieved using our method are highly complementary to those retrieved using FLAPRBLB, hinting effective directions for best VS applications. We suggest that this integrative virtual screening approach combining cheminformatics and molecular mechanics methodologies may be applied to a broad variety of protein targets to improve the outcome of structure-based drug discovery studies.
In recent years, structure-based virtual screening (VS) has become an increasingly more popular strategy for computer-aided drug design.1, 2 Structure-based VS approaches explore available or synthetically feasible chemical databases to identify a relatively small number of high-scoring hits that can be validated experimentally. A successful structure-based VS method can be applied to large data sets of compounds, resulting in significant enrichment of true binders among the top ranking hits.
Employing rigorous scoring functions is essential to the success of structure-based VS campaigns since scoring functions play a critical role in both initial pose generation of compounds (docking) and further ranking of compounds (scoring). Scoring functions can be divided into several types.3 Force field-based scoring functions predict binding affinity by explicitly accounting for intermolecular interactions such as electrostatic, van der Waals, hydrogen bonding, and hydrophobic interactions. However, due to the static nature of the underlying molecular models many important effects influencing the binding free energy such as entropy, micro-environment dependent polarization, π-stacking, and solvent effects, are often not taken into account. On the other hand, knowledge-based scoring functions employ various statistical parameters derived from experimentally determined protein-ligand structures that reflect their total physical interactions taking the molecular environment into account.4, 5 Ideally, knowledge-based scoring functions may implicitly capture binding interactions that are difficult to model in force field-based scoring functions.
Despite the increasing popularity of structure-based VS, recent studies have shown that inaccuracy of scoring functions is the major bottleneck of structure-based VS.6 It has been demonstrated that scoring functions often fail to recognize pose decoys, i.e., ligand poses that are geometrically different from the native binding orientation of the ligand in an experimentally determined crystallographic structure of the protein-ligand complex, yet score better than the native pose. In addition, known non-binders may also score better than true binders; such non-binders are designated as binding decoys.6, 7 Obviously, the presence of both binding and geometrical pose decoys in an ensemble of compound poses resulting from computational docking studies will decrease the accuracy of structure based VS. Perhaps, in part for this reason, structure-based scoring functions are well-known for having target-dependent VS performances.6
Many studies have focused on the development of target-specific customized scoring functions8 by adding expert-knowledge constraints (e.g., the hinge constraint in kinase targets9), or using native pose(s) as references to perform cheminformatics-based similarity ranking (e.g., SIFt10, 11 and SIFt-variant methods12, 13) and to construct pose filters (e.g. pharmacophore filter14) to filter out undesired poses. These post-docking pose treatments can effectively improve the discrimination between true ligands and binding decoys in structure-based VS for the aimed target. Furthermore, several studies included pose decoys or poses of inactives in combination with native structures in order to help tuning the scoring functions against binding decoys, which consequently enhanced the accuracy of virtual screening.15-20
Herein, by combing the merits of concepts mentioned above, i.e., statistical and force-field based scoring functions we devise a target-specific knowledge based pose (-scoring) filter that is trained to distinguish native-like poses from pose decoys. The approach of our knowledge-based pose (-scoring) filter employs protocols that are routinely used in cheminformatics research, e.g., binary quantitative structure activity relationship (QSAR) modeling, with the caveat that we use unconventional descriptors of the protein/ligand interface for pose scoring as opposed to using standard chemical descriptors of compounds. Unique to our approach is that this classifier is developed using both native-like and pose decoys generated from only one unique protein-ligand x-ray complex. The poses used in training are generated by multiple rounds of docking of this single cognate ligand to its binding target leading to an ensemble of different and diverse poses, where each pose can be defined either as native-like or a decoy based on the value of the pose’s Root Mean Square Deviation (RMSD) from the known native pose. In addition, novel protein/ligand interfacial descriptors based on Delaunay tessellation approach and atomic properties derived from conceptual density functional theory (DFT) are applied to represent the poses in training the binary classifier. Unlike the majority of interaction fingerprints used in previous virtual screening studies, our protein/ligand interfacial descriptors represent not only isolated interactions between pairs of atoms, but also atom contact networks at the protein-ligand interface based on tessellation patterns.
Using this filter, we have developed a target-specific hybrid scoring function for structure-based virtual screening in an effort to combine the advantages of both knowledge-based pose scoring and force field-based hit scoring functions. For a VS campaign, multiple poses are generated initially for each compound using one of the standard docking approaches. Then, our filter is used to eliminate poses predicted as decoys; the remaining poses are predicted as native-like with a certain level of confidence assigned by the model. These remaining putative native-like poses are ranked with a hybrid score based on the combination of the pose confidence and the binding score assesses by the physical force field-based MedusaScore developed previously in our group.21
We test the performance of this novel, hybrid scoring function on several benchmark sets available from the Directory of Useful Decoys (DUD).22 DUD is a specially designed data set including multiple targets, their known ligands, and decoys, which are compounds that are physically similar to yet topologically distinct from the known ligands. The recently refined DUD data sets include only lead-like compounds and have the true ligands clustered, making it an ideal benchmark set for testing scaffold hopping capability of VS methods.
We use Fred (OpenEye Scientific Software)23 to dock all compounds to target structures and generate multiple poses for each compounds. We find that for most targets the combination of pose filter and MedusaScore leads to significant improvement in the enrichment of virtual screening hits, as compared with using the MedusaScore scoring function alone. Then we compare the VS performance of several established structure-based scoring functions (XSCOREHMSCORE24, FredChemScore25, FredPLP26, and FredChemgauss327) and several novel VS methods without docking (FieldScreen28, FLAPLBX29, and FLAPRBLB29) that were recently reported to achieve good performances on the same DUD data sets. We find that our structure-based hybrid scoring function outperforms other structure-based scoring functions for majority of the targets. Furthermore, the retrieved ligands are less similar to the cognate ligand in comparison with ligand-based approaches (FieldScreen and FLAPLBX), and are complementary to the ligands retrieved by another hybrid method (FLAPRBLB).
The data sets of true ligands and presumed binding decoys for each target in this study are collected from the publicly available Directory of Useful Decoys (DUD)22. The DUD data sets were designed to minimize the physical biases inherent in the benchmarking of virtual screening schemes against different biological targets. Each ligand was matched with 36 decoy molecules that resemble the true ligands in physical properties, such as molecular weight, LogP, number of hydrogen bonding groups, and number of rotatable bonds but are distinct from the ligand topologically. In total, the DUD database consists of 40 data sets and for each true ligand there are typically 36 decoy molecules. Further refinement of the DUD data sets is done recently by applying a lead-like filter (MW < 450, AlogP < 4.5) to both ligands and presumed binding decoys35 as well as the reduced graph cluster filter to ligands30. These two filters are intended to mimic the real-life virtual screening campaign and to reduce the analogue bias inflating enrichment in virtual screening. We employ all 13 data sets from the refined FUD set, each of which includes at least 15 ligand clusters, for our method validation. The detailed information about the data sets is shown in Table 1. Six of the 13 targets belong to the kinase family (CDK2, EGFR, p38, PDGFrb, Src, and VEGFr2), where the majority of known ligands occupy ATP binding region. The remaining targets include the class of metalloenzymes (ACE, PDE5), serine protease (FXa), and several other enzymes (AChE, COX-2, HIVRT, and InhA). In order to compare directly with other VS methods, we use the protein-ligand complexes provided in the original DUD for pose filter training. For VEGFr2 and PDGFrb targets, the complex structures provided in the DUD data sets are generated by docking ligands to apo protein structures.
For each target, we prepare the x-ray structure using utilities available within the Molprobity31 server to add and optimize hydrogen atoms while correcting potential misinterpretations of amino acid (asparagine, glutamine, or histidine) terminal flips. The crystallographic water molecules located inside the binding pocket are removed in order to avoid biases when generating poses of molecules but cofactors (e.g., NAD in 1p44 protein model) or metal atoms (e.g., Zinc in the 1o86 protein model) are preserved if they are important for enzyme to function or are involved in interactions with the cognate ligand. We add hydrogen atoms of each small molecules using MOE software (version 2007.09)52 under standard protocols.
We employ the Fred docking software (version 2.2.5) from OpenEye Scientific23 to generate an ensemble of poses for each compound. The ensemble is generated by enumerating rigid rotations and translations of each conformer within the binding site. The conformers of each compound are generated by Omega (version 2.2.1)23 based on default parameters and the binding site is defined by a 5 Å grid box centered on the cognate ligand. For kinase targets, it is well-known that hydrogen bond interactions with the protein hinge residues is necessary for both Type I and Type II kinase inhibitors.32 Thus, this constraint is applied during pose generation to improve docking accuracy.
We apply default parameters provided by Fred during docking except for the number of output poses. For pose filter construction, we retain up to 1000 top-scoring poses generated by docking a single cognate ligand in order to ascertain the conformational diversity of poses. For virtual screening, the top 30 poses (ranked by the Fred’s default scoring function, Chemgauss3) of each molecule are preserved for re-scoring by other scoring functions (e.g., MedusaScore).
“Binding decoys” are defined as ligands that do not bind to a specific target experimentally (non-binders) but score as high as (or better than) true ligands. Similarly, we use the terms “pose decoys” to describe the poses generated by docking the cognate ligand against the protein target but score better than native-like poses. In our study, native-like poses are defined as those generated by docking with binding mode(-s) similar to the native pose. The similarity between docking poses and the native pose is often measured using Root Mean Square Deviation (RMSD). For the purpose of pose-filter training, we define a RMSD threshold of 4Å to classify poses into native-like poses and pose decoys. The 4Å threshold is consistent with the observation that there is a gap on the distribution plot (MedusaScore vs. RMSD) of poses generated by re-docking the cognate ligand for most targets (Figure 1, Figure S1).
Earlier, we developed the so called ENTess chemical geometrical descriptors33 of the protein-ligand interface. These descriptors are obtained by using Pauling electronegativity (EN) as an atomic property and Delaunay Tessellation (Tess) to characterize the protein ligand interface as follows. When applied to protein-ligand complexes represented at the atomic resolution level, Delaunay tessellation partitions the protein ligand interface into an aggregate of space-filling, irregular tetrahedra, with both protein and ligand atoms as vertices. Each Delaunay quadruplet is characterized by its unique four-atom composition, which defines the descriptor type (certainly, the same four-body compositions may occur in different, or even the same, protein/ligand interfaces). Furthermore, for each quadruplet we calculate the sum of En values of the composing atom-vertices, which produces the descriptor value. In the previous study,33 we used the ENTess descriptors to build successful quantitative structure-binding affinity relationship (QSBR) models for 264 x-ray characterized protein-ligand complexes with known binding affinity; the modeling approach followed our standard model development and validation workflow.34
In this study, we have developed and employed novel descriptors that are methodologically similar to ENTess descriptors but are theoretically more rigorous.35 These new descriptors employ pairwise atomic potentials for the protein-ligand complexes (PL) based on maximal charge transfer (MCT)36 in place of Pauling electronegativities, called here PL/MCT-tess. The values of PL/MCT-tess descriptors are calculated from the following equation (see also Figure 2):
where PL/MCT-tessm is the potential of the m-th tetrahedron type (i.e., individual descriptor type); n is the number of occurrences of this tetrahedron type in a given pose; p is the vertex index of a protein atom, l is the vertex index of a ligand atom, and dpl is the distance between a pair of protein and ligand atoms found in the same Delaunay tetrahedron. Note that Delaunay tetrahedra at the protein-ligand interface can be classified based on the relative content of protein and ligand atoms, i.e., three protein and one ligand atoms, two from each, or one protein and three ligand atoms; this explains the tetrahedral type counts in the second and third sum in Equation 1.
The MCT characterizes the maximal electron flow between the donor and acceptor atoms at the protein-ligand interface. It is derived from the conceptual DFT37, 38, which provides a theoretical basis for calculating the PL/MCT-tess descriptors. The MCT is calculated as follows, assuming that the total energy of the system is perturbed by the charge transfer up to the second order:
where ΔE and ΔN represent energy change and charge transfer, respectively. When the total energy is minimized with respect to the charge transfer, dΔE/dΔN = 0, we have
where μ and η are the chemical potential (negative of electronegativity) and the chemical hardness respectively, defined by μ = (E/N)ν and η = (2E/2N)ν with ν representing the external potential formed by the framework of atomic nuclei.
As described above, we classify poses generated by docking the cognate ligand against the protein target into native-like and decoy poses based on a certain RMSD threshold. The problem of separating native-like poses vs. pose decoys for a molecule can be treated as a binary classification problem where each pose is characterized by the descriptors of the protein-ligand interface (i.e., PL/MCT-tess descriptors in this study). This representation treats the problem of discriminating native-like from decoy poses as a conventional binary classification problem faced in many conventional cheminformatics investigations that employ binary QSAR modeling.
To train this knowledge-based pose scoring function for each target, we retain up to 1000 poses generated by docking a single cognate ligand against the target (Figure 3). For the VEGFr2 and PDGFrb targets, where the native pose is unavailable (only apo structure and a model structure are available, respectively), the pose with the lowest MedusaScore is considered as a native pose. This is a reasonable assumption since MedusaScore performed well in an earlier benchmarking exercise for the native pose prediction.21 We classify the poses based on the 4 Å threshold as either native-like (RMSD ≤ 4Å) or decoys (RMSD > 4Å) except for the PDE5 pose set where the gap is observed at 3 Å and therefore the 3 Å threshold is used. For the poses from docking the cognate ligand of 1ckp (CDK2), we do not observe a characteristic distribution (as, for example, in Figure 1). Therefore, we regenerate the poses using MedusaDock39 instead of Fred.
For each pose, we generate PL/MCT-tess descriptors to characterize its interactions with the target protein. The degree of similarity of each pose to the native pose is quantified by the Euclidean distance in the PL/MCT-tess descriptor space. Therefore, the pose distribution of each target’s modeling set can be characterized by three parameters: the Euclidean distance to the native pose in the PL/MCT-tess descriptor space (x-axis), the RMSD value (y-axis), and the MedusaScore (color bar). It is desirable that poses with lower RMSD value correspond to smaller distances to the native pose in the PL/MCT-tess descriptor space (e.g., Figure 1).
If a binary data set including native-like (class 1) and decoy (class 2) poses is balanced (the ratio between the two classes is less than two), we randomly exclude 20% poses as the external test set and construct models based on the remaining 80% poses (training set). In the case of imbalanced distribution, we downsize the major class by retaining only those poses that are similar to poses in the minor class, where the degree of similarity is assessed by Euclidean distance in the PL/MCT-tess descriptor space and then use the same model validation procedure. For example, the ACE target has 48 native-like poses and 952 decoy poses; after down-sampling, only 49 decoy poses most similar (in terms of Euclidean distance) to the native-like poses are retained for model building and validation (Table 2 and Figure S1). We note that this approach to down-sampling where the most similar instances of the opposite class are retained naturally makes the classification problem more difficult and therefore it increases the statistical significance of the resulting classification models.
We employ the Support Vector Machines (SVM) software implemented in the open-source LibSVM40 package to build binary classification models (i.e., pose filters). We use all models with eligible CV accuracy for predicting poses in the external test set and evaluate model external predictive accuracy by its ability to classify native-like versus decoy poses in the external set. The validated models (i.e., pose filter) are used further to score any pose generated in virtual screening as either native like or decoy, expecting that binding decoys will be classified as non-native poses. For each pose, we calculate a FilterScore, which is the fraction of models that predict it as native-like. This process of deriving pose filter is followed for each target in the refined DUD dataset.
It should be emphasized again that only one cognate ligand is used for each target to develop a pose scoring function. However, due to the generic nature of PL/MCT-tess chemical descriptors the respective pose filters can be applied to score all poses of all diverse ligands used in target specific docking and VS studies.
MedusaScore21 is a physical force field-based scoring function that describes the major physical interactions between proteins and ligands, including van der Waals interaction, salt bridge, hydrogen bonding and solvation. MedusaScore is an extension of the Medusa force field,41 which was developed originally to describe physical interactions within proteins. The original parameters of the Medusa force field were trained on 34 high-resolution protein crystal structures with diverse sequences. Thus, by default MedusaScore is expected to be transferable and applicable to virtual screening of a variety of chemical compounds. Notably there were no protein-ligand data used in the development of MedusaScore, but it still exhibits remarkable accuracy in both docking pose discrimination and binding affinity prediction.21 During the pose rescoring by MedusaScore, we turn off van der Waals repulsion because this term has been shown to be sensitive to small deviation in ligand poses.21 It is safe to remove the term in this case because all steric clashes have already been considered during the generation of docking poses in eth refined DUD dataset.
In order to combine the FilterScore and the MedusaScore, which are of different nature and scales, we utilize normalized Z-scores based on statistical distributions of respective scores for the ensemble of poses for each ligand. We start by applying the pose filter to all poses generated by docking, and discard poses that are predicted as decoys by all eligible models (i.e., FilterScore = 0). Based on the mean (μ) and standard deviation (σ) of each respective scoring function, the Z-score for each pose is calculated from the respective raw scores (X) using Equation 4.
If the filter is constructed based on the entire sampling space of poses from docking the cognate ligand (for balanced datasets), we apply the same weight for both FilterScore and MedusaScore, and the Z-score for each pose is derived as:
We add a minus sign for ZFilterScore so that lower Z-score will correspond to better ranked pose, consistent with the MedusaScore convention.
If the filter is constructed based on the poses after the down-sampling procedure, we employ a modified scoring strategy based on the concept of the applicability domain.42 We predict poses within the applicability domain using Equation 5 and predict poses out of the applicability domain by adjusting the weight of FilterScore by DistScore (Equation 6).
The ZDistScore is the Z-score of each pose based on the its Euclidean distance to the native pose in the PL/MCT-tess descriptor space; the mean and the standard deviation are derived from the distribution of PL/MCT-tess Euclidean distances to the native pose of all docking poses. Assuming a normal distribution of inter-pose distances similar to that for poses from docking of the cognate ligand, we regard docking poses to be within the applicability domain of the modeling set when ZDistScore < -1. This threshold is defined by inspecting the distribution of distances between poses in the PL/MCT-tess descriptor space of the modeling sets for five targets (ACE, CDK2, COX-2, HIVRT, and VEGFr2 in Figure S1). The final score for each compound is based on the pose with the lowest sum of Z-scores among all poses retained for that compound.
To examine the overall performance of our method for a target data set in virtual screening, we plot the Receiver Operator Characteristic (ROC) curve. We calculate the Area Under the Curve (AUC) value for each ROC curve to estimate the average performance of our method throughout the ranked list. On the other hand, to quantify the performance of each method at the early stage of virtual screening for a target data set, we employ the ROC Enrichment (ROCE) value. Unlike the conventional enrichment factor (EF) metric, ROCE values are independent of the ratio of binding decoys to ligands in a target data set, making them ideal metrics for comparing different methods.43 The ROCE value is defined as the ratio of true positive rates to the false positive rates, for a given percentage of binding decoys (i.e., the slope at each point on the ROC plot). We report ROCE values at 0.5%, 1%, 2%, 5% as suggested43-45 and employed in previous publications28, 29, 46. The meaning of ROCE value at 1% represents the fold enrichment over random performance. In order to emphasize the retrieval of diverse scaffolds, the above metrics (ROCE and AUC) are modified by applying an arithmetic weight to each ligand (awROCE and awAUC)47, which is inversely proportional to the size of the cluster it belongs to.
We estimate the uncertainty of awROCE/awAUC values using the statistical bootstrapping procedure.28 For each ranked list, we randomly exclude 20% of data points and recalculate the awROCE values. This is repeated 10,000 times and the standard deviation of awROCE values is used to estimate the error of awROCE. Due to the nature of pose filter, many true negatives (presumed binding decoys) and some false negatives (true ligands) are eliminated in several data sets (e.g., ACE, p38, and etc). For these data sets, we calculate the awROCE values based on the reduced list, resulting in a larger estimated error at the low percentages.
Several popular structure-based scoring functions, which are reported to have good docking pose discrimination and binding affinity prediction21, 48, 49, are selected to compare against our hybrid scoring function. It is intriguing to test the performance of these scoring functions since it has been suggested that scoring functions should be tailored for virtual screening.19, 50 In total, we have tested five scoring functions including MedusaScore, HMSCORE, Chemgauss3, ChemScore25, and PLP26. HMSCORE is part of the XSCORE51 scoring utility. Chemgauss3, ChemScore, and PLP are scoring functions implemented in Fred. Moreover, we also compare our approach with some methods that have been applied to the same data set including FieldScreen28 and FLAP (both LBX and RBLB protocols)29. FieldScreen28 and FLAPLBX29 are two novel ligand-based virtual screening approaches using grid points derived from the cognate ligand as query; FLAPRBLB approach29 utilize grid points generated from protein target bound to the cognate ligand. It should be noted that the binding decoys in DUD are designed to be physically similar to, yet topologically distinct from the true ligands. Any ligand-based approaches applied to this data set might generate overly optimistic results.
We generate the MACCS structural keys for each compound using MOE software (version 2007.09)52 under standard protocols. We calculate the Tanimoto coefficient (Tc) as the similarity metric between the cognate ligand and compounds in the screening library.
The number of poses used in the construction of the pose filter and the number of models used in predicting external test set or in further virtual screening are shown in Table 2 along with the model statistics for each target. We also present the distribution of poses of modeling set for each target in Figure S1. Depending on the target, the distributions of the native-like poses and pose decoys are either balanced (AChE, EGFR, FXa, InhA, p38, PDE5, PDGFrb, and Src), or shifted towards pose decoys (ACE, CDK2, COX-2, HIVRT, and VEGFr2). The details of modeling techniques to address the imbalanced classes have been described in the Methods. The results show that the overall accuracy for both the training set and the external test set exceeds 90% for all data sets except ACE, HIVRT, and p38. We predict the docking poses generated from each data set using the models which have CV accuracy greater than 90% except for the HIVRT data set which has no models with CV accuracy above 90%. In the latter case, a threshold of 80% is applied.
Initially, we used our standard modeling workflow53, employing five-fold external cross-validation, to build the pose-filter. We found that models of each fold can achieve high and similar CV accuracy (0.92-0.98), and the VS performance using the models from each individual fold is the same (data not shown). This indicates that such data (poses from one protein-ligand complex) are easy to classify, and the 5-fold external CV modeling procedure does not bring extra reliability to resulting models. Therefore, in this study we adopt a simplified workflow to construct pose-filter models using one random external split (see Methods), and the validated models from this split are used in virtual screening.
It should be emphasized again that for each target-specific filter, we use only one cognate ligand to generate multiple docking poses for further model building. Nevertheless, the filter is applicable to diverse compounds during VS due to the generality of the chemical descriptors we use to characterize the protein-ligand interface. As demonstrated below, these single-ligand based pose filters can significantly improve the accuracy of virtual screening and true hit selection in combination with the MedusaScore force field.
We compare the VS performance of MedusaScore and MedusaScore plus pose filter. We apply the protocols to all the 13 targets in the DUD clustered data set. We measure the VS performance of the two scoring protocols using the awROC curves (Figure S2). More specifically, we use awAUC values to measure the overall ligand retrieval of the protocols, and use awROCE values at 1% to measure the ligand retrieval at the early stage of VS (Figure 4a).
We find that VS performance is remarkably improved over the benchmark set by applying the MedusaScore plus pose filter (i.e., the hybrid scoring function). For all the 13 targets from the refined DUD set, the awAUC values using the hybrid scoring function are consistently higher than using MedusaScore alone. The improvements are least significant for EGFR and VEGFr2 targets, where the awAUC value is improved only by about 0.02 in both cases. This is probably due to the fact that using MedusaScore alone already results in high awAUC values for these two targets (0.83 and 0.65, respectively). For the other targets, the average degree of awAUC improvement is 0.15, and we find the highest improvements are for AChE and FXa.
When comparing the awROCE values at 1%, we find the hybrid scoring function is better than using MedusaScore alone for all targets except Src (Figure 4b). The improvement of awROCE at 1% is most significant for target PDE5 and PDGFrb. For PDE5, we are not able to retrieve any active ligand using MedusaScore alone (awROCE@1% = 0), but the value is improved to approximately 26.5 fold over the random at 1% after combining MedusaScore with the pose filter. The pose filter also improves the ligand retrieval for PDGFrb (awROCE@1% = 43.18), even though the original awROCE value is already high (23.49) using MedusaScore alone. In addition, for the two targets (EGFR and VEGFr2) where the least improvement of awAUC is observed, the awROCE values at 1% are also improved significantly.
Therefore, by combining MedusaScore with pose filter, we not only improve the overall VS performance (as measured by awAUC), but also improve the early enrichment (as measured by awROCE values at 1%). The improvement seems to be more pronounced at the early stage, which is a desirable feature because in practice, only a small fraction of VS hits will be experimentally tested.
We also compare the VS performance of our hybrid scoring function with four popular pose scoring functions, including XSCOREHMSCORE, FredChemScore, FredPLP, and FredChemgauss3. We apply those scoring function to the same docking poses and compare their VS performance at the early screening stage (Figure 5).
We find that our hybrid scoring function outperforms others for most of the targets. At a false positive rate of 0.5%, the hybrid scoring function has the highest enrichment for seven out of the 13 targets. In addition, the awROCE values for those targets vary from 21.66 to 86.46. In contrast, other scoring functions have the best performance at no more than 3 targets, with awROCE values varying from 12.07 to 43. We find a similar trend at the 1% level. In this case, our hybrid scoring function has the highest enrichment for six targets, with awROCE values varying from 22.88 to 43.18, while other scoring functions perform best for at most three targets, with awROCE values in the range of 9.67 to 26.56. This comparison demonstrates that our hybrid scoring function has better and more consistent VS performance than conventional scoring functions.
The hybrid scoring function has the worst performance for Src. We will analyze the possible reasons of Src failure in the Discussion part. For this target, using MedusaScore alone gives reasonably good enrichment factor of 24.77, close to that from using ChemScore (25.97). With the aexception of Src, the hybrid scoring function tends to have the best performance on targets where using MedusaScore alone also gives fairly good enrichment.
We select a few recently developed VS methods, for which the benchmark results have been reported on the same DUD Cluster data set. One of the methods available for comparison is FieldScreen28, which is a ligand-based scoring VS method that utilizes molecular fields derived from the cognate ligand as query. Excellent VS performance has also been reported using FLAP29 molecular field-derived pharmacophores. For FLAP, we compare with two different VS protocols: FLAPLBX, similar to FieldScreen, which uses ligand-based molecular field, and FLAPRBLB, which uses both receptor and co-crystallized ligand structure to derive the pharmacophore query. These methods represent the state-of-art VS methods that have been fully tested using the entire DUD clustered set.
The awROC curves of scoring methods for each target are shown in Figure 6 and the awROCE values at each stage are tabulated in Table S6-S10. We find that the VS performance of each scoring method is target-dependent. Our method has the best retrieval for HIVRT, p38, and PDGFrb. RBLB has clearly the best performance for PDE5, Src, and VEGFr2. On the other hand, ligand-based VS methods unquestionably outperform other structure-based methods for COX-2. A close examination of the COX-2 data set reveals that around 47% of true ligands belong to the same cluster as the cognate ligand used as query. To further investigate the chemical similarity of retrieved ligands to the cognate ligand from different scoring approaches, we compare the average Tanimoto coefficient (Tc) values of true ligands from top 20 ranking lists (Table 3). Not surprisingly, we find that ligands retrieved by ligand-based VS methods are chemically more similar to the cognate ligands, as measured by the average Tc values. And the average Tc value of the retrieved COX-2 ligands is 0.88 (e.g., FieldScreen), much higher than the average of those for other 12 targets (0.66). The high degree similarity of COX-2 ligands to the query should result in better performance of any ligand-based VS methods such as FieldScreen and FLAPLBX methods in this case.
We further compare the early enrichment for our hybrid scoring function and FLAPRBLB approach because these two methods seem to have the best VS performance at the early stage (in the 0.5% to 5% range). In addition, both methods take advantage of the 3D structures of the receptor and co-crystallized ligands, albeit using different approaches for VS. We want to identify if the different approaches might retrieve different ligands. In fact, we find the two methods seem to be complementary to each other. Among the top 20 hits retrieved by the two methods, we find little overlap of the ligand types (Figure 7). For example, FLAPRBLB approach is able to retrieve only one cluster for target p38 and PDGFrb, and two clusters for target ACE. In contrast, the MedusaScore with filter approach can retrieve 4, 5, and 7 clusters, for these three targets ACE, p38 and PDGFrb, respectively. Interestingly, the additionally retrieved ligand clusters do not overlap with those obtained using FLAPRBLB approach. This is also the case for target VEGFr2, where MedusaScore with filter approach retrieved additional five clusters with no overlap with ligands retrieved by FLAPRBLB method. For other targets such as AChE, CDK2, EGFR, HIVRT, and InhA, only a small fraction of the newly retrieved clusters overlaps with those from FLAPRBLB approach. Hence, although both methods used receptor and cognate ligand structures for VS, the resulting performance of FLAPRBLB approach and our approach seem quite complementary for different targets. Combining the two methods shall result in most diverse ligands among the top hits for VS application.
The atom types in PL/MCT-tess descriptors are defined based on their unique chemical names. This implementation makes PL/MCT-tess descriptors fairly sensitive to special cases (e.g., when tri-fluoro functional group is present in the ligand). However, using poses with such unique types of interactions to construct pose filter makes it too specific. For example, in the Src data set, the cognate ligand used to construct pose filter is ANP, which has a long phospho-aminophosphoric chain uncommon to any lead-like ligands. Unsurprisingly, the pose filter predicts almost everything as pose decoys and the hybrid scoring function deteriorates the VS performance of MedusaScore against the Src data set. When we employ another cognate ligand obtained from 1yol protein-ligand complex to construct pose filter, the hybrid scoring function has slightly improved VS performance of MedusaScore against the Src data set (awROCE@1% = 27.6 vs. 25.5; awAUC = 0.66 vs.0.62).
Similarly, the hybrid scoring approach only marginally improves the VS performance of MedusaScore against the COX-2 data set, where the pose filter is constructed based on the cognate ligand with a tri-fluoro functional group. For this case, combining MedusaScore with pure DistScore can easier filter out ligands having the distinctive features of the query compound than using MedusaScore plus pose filter approach (awROCE@1% = 8.7 vs. 4.1; awAUC = 0.67 vs. 0.39).
Theoretically, the target-specific pose scoring filter can be used in combination with any other structure-based scoring function since the definition of pose decoys is based on the RMSD threshold, independent from the scoring function’s output. This hybrid scoring approach can improve the VS performance by eliminating binding decoys recognized by pose filter or by increasing weight for the ligands favored by the pose filter. If the ligands favored by the pose filter have relatively poorer scores predicted by the parent scoring function and the high-scoring binding decoys are not completely eliminated, then combining the pose filter and the parent scoring function gives limited improvement. For example, in the CDK2 data set, the Cluster #1, #2, #3, #7, and #8 are favored by both the pose filter and Chemgauss3 but are relatively disfavored by MedusaScore, resulting in better performance when combining pose filter with Chemgauss3 (awROCE@1% = 26.0 vs. 14.4; awAUC = 0.84 vs. 0.71). Another example is the FXa data set, where combining pose filter with Chemgauss3 has better VS performance (awROCE@1% = 15.4 vs. 4.8; awAUC = 0.80 vs. 0.72). However, docking programs/scoring functions are well-known for having inconsistent VS performances across diverse targets.6 Therefore, from the practical point of view, it is more important to improve scoring function performance consistently rather than to achieve ideal results for a few targets. The proposed pose filter is designed with this view in mind.
We find that the 4 Å-threshold seems optimal considering the distribution of RMSD values of poses and the pose filter performance in virtual screening. Lowering the threshold results in fewer native-like poses included, which occupy a smaller portion of the descriptor space; this ultimately leads to a smaller applicability domain of the pose filter. As a result, using this pose filter in VS leads to poorer performance compared with using the pose filter built based on the 4 Å-threshold. The PDE5 data set is an exception, where a clear gap around 3 Å RMSD can be observed on the pose distribution plot. In virtual screening against the PDE5 data set, the performance of the hybrid scoring approach with 3 Å-threshold filter is better than with the filter based on the 4 Å-threshold (awROCE@1% = 26.5 vs 17.2; awAUC = 0.75 vs. 0.72). Moreover, it should be interesting to include the output of a scoring function into the definition of native-like poses and pose decoys (e.g., to train filter only on those native-like poses and pose decoys that are ranked high by the given scoring function), thus, building filters specifically adjusted for each scoring function.
As shown in Figure 1, poses with lower RMSD value typically have smaller distances to the native pose in the PL/MCT-tess descriptor space. It can be assumed that, for a given molecule, its likelihood to be a true ligand is directly related to how close its pose is to the native pose, which can be reflected by the DistScore. We have applied DistScore in combination with FilterScore to virtual screening for the data sets, where down-sampling during filter construction is necessary. Therefore, for the proper comparison, we also perform virtual screening against all data sets using MedusaScore in combination with DistScore alone (Figure S2). We find that using pose filter to eliminate/penalize pose decoys in virtual screening can consistently improve MedusaScore VS performance and the MedusaScore plus pose filter approach has the best performance for all data sets except for the outliers mentioned above.
We have developed a novel knowledge-based pose (-scoring) filter using concepts frequently employed in cheminformatics research such as chemical descriptors of the protein-ligand interface and machine learning techniques for deriving binary pose classification (native-like vs. pose decoys) models. We have combined this novel pose filtering procedure with a recently developed physical force field-based scoring function (MedusaScore) to score the docking poses in virtual screening applications. We have validated this hybrid scoring function using the refined subsets (13 targets) from the DUD database. The refined DUD sets consist of only lead-like compounds and ligands are clustered based on the reduced graph algorithm, making them suitable for testing scaffold hopping capability of VS methods. The validation results demonstrated that our method can consistently improve the VS performance of MedusaScore provided that the protein-ligand complex is suitable for filter training. Comparing with other established structure-based scoring functions, including XSCOREHMSCORE, FredChemScore, FredPLP, and FredChemgauss3, the hybrid scoring function outperforms other methods in six out of 13 data sets at early stage of VS (1% decoys been screened). Moreover, we find that ligands retrieved by the hybrid scoring function are chemically more diverse than those by other two ligand-based VS methods (FieldScreen and FLAPLBX) using the same DUD data sets. Interestingly, we have observed that our method is complementary to FLAPRBLB, which is a high-performance VS method that also utilizes both the receptor and the cognate ligand structures.
In summary, we have demonstrated that the hybrid cheminformatics/molecular mechanics based scoring function affords good enrichments in VS experiments and allows for effective scaffold hopping, suggesting that it could be applied to virtual screening against novel pharmaceutically relevant protein targets to identify promising leads. An interesting and unique feature of our approach is that the pose classifier is formally trained to recognize geometrical decoys of a single ligand; yet, it correctly recognizes (and eliminates) most of the binding decoys of multiple test ligands because they are predicted as geometrical decoys. In particular, this method is suitable for protein targets when only limited ligand binding data is available. A single x-ray protein-ligand complex or, as we have demonstrated for PDGFrb target, a homology based protein model with a known binder is sufficient for constructing a successful target-specific pose filter. Additional improvements can be sought for both the pose (-scoring) filter (e.g., using more than one ligand for training, employing alternative atomic properties or potentials for ENTess-like scoring functions, or incorporating other features, e.g., shape parameters, characterizing Delaunay tetrahedron generated at the protein-ligand interface), as well as approaches for the integration of knowledge-based and physical force field-based scoring functions.
We thank Dr. Alexander Golbraikh for helpful discussions and constructive criticism in the course of this research project. We also thank Dr. Alexander Sedykh for proofreading the manuscript and thoughtful discussions. We are grateful to Chemical Computing Group and OpenEye Scientific for software grants. Finally, we acknowledge access to the computing facilities at the ITS Research Computing Division of the University of North Carolina at Chapel Hill. The studies reported in this paper were supported in part by the NIH research grants GM066940 and its ARRA supplement GM066940-06S1 (AT), the University Cancer Research Fund Innovation Award (AT), the NIH research grant R01GM080742 (ND), and the NC TraCS Institute Pilot Grant 2KR10813 (XW).
Supporting Information Available: the distribution of poses generated from re-docking process for all 13 targets; the awROC curves for comparison of (a) MedusaScore, (b) MedusaScore+DistScore, (c) MedusaScore+Pose filter; and the awROCE values in Figure 4 and Figure 5. This information is available free of charge via the Internet at http://pubs.acs.org/.