|Home | About | Journals | Submit | Contact Us | Français|
Docking scoring functions are notoriously weak predictors of binding affinity. They typically assign a common set of weights to the individual energy terms that contribute to the overall energy score, however, these weights should be gene family-dependent. In addition, they incorrectly assume that individual interactions contribute towards the total binding affinity in an additive manner. In reality, noncovalent interactions often depend on one another in a nonlinear manner. In this paper we show how the use of support vector machines (SVMs), trained by associating sets of individual energy terms retrieved from molecular docking with the known binding affinity of each compound from high-throughput screening experiments, can be used to improve the correlation between known binding affinities and those predicted by the docking program eHiTS. We construct two prediction models; a regression model trained using IC50 values from BindingDB, and a classification model trained using active and decoy compounds from the Directory of Useful Decoys (DUD). Moreover, to address the issue of overrepresentation of negative data in high-throughput screening data sets, we have designed a multiple-planar SVM training procedure for the classification model. The increased performance that both SVMs give when compared with the original eHiTS scoring function highlights the potential for using nonlinear methods when deriving overall energy scores from their individual components. We apply the above methodology to train a new scoring function for direct inhibitors of M.tuberculosis (M.tb) InhA. By combining ligand binding site comparison with the new scoring function, we propose that phosphodiesterase inhibitors can potentially be repurposed to target M.tb InhA. Our methodology may be applied to other gene families for which target structures and activity data are available, as demonstrated in the work presented here.
Molecular docking aims to evaluate the feasible binding geometries of a putative ligand with a target of known 3D structure. Typically, docking algorithms consist of both a search algorithm for the exploration of different ligand (and sometimes protein) conformations, and a scoring function for the calculation of ligand binding affinities. Ideally, the scoring function should be able to identify a solution with the correct ligand binding mode from alternative solutions, and ultimately be able to rank a set of ligands according to experimental binding affinity. In principle, the binding affinity should be calculated based on the first principle of thermodynamics. The most powerful approach is the absolute binding free energy (ABFE) approach 1-6, which uses extensive conformational sampling from molecular dynamics simulation, fully detailed atomic force fields, and a separate simulation of the solvation of the ligand, protein and associated complex. However, ABFE is too computationally expensive to be applied to screen millions of compounds. Furthermore, in spite of its cost, the prediction from ABFE is not always accurate 7. Tremendous efforts have been made to develop physical-based or knowledge-based docking scoring functions to efficiently predict binding affinity. However, docking scoring functions remain notoriously weak predictors of binding affinity. Indeed, following an evaluation of 10 docking programs and 37 scoring functions, Warren et al. 8 concluded that scoring functions are in need of significant improvements for predicting binding affinity. The principal reason for failure is the inability of the scoring function to reliably rank optimal ‘native-like’ ligand conformations above ‘non-native’ orientations 9. Thus, although in most cases the correct binding mode can be retrieved during the conformational search, assigning the lowest energy score to the correct binding pose has proved to be more challenging. This inevitably leads to poor correlation with experimentally determined binding affinities. In general, the prediction of binding affinity is a challenging task since it is not only the result of collective weak noncovalent interactions, but it also includes the ability of the ligand to access the binding site, the desolvation free energy of the ligand and the binding site, and entropy and enthalpy changes in the ligand, protein, and solvent 10. A realistic goal for docking scoring functions may be to discriminate active and inactive compounds and to rapidly filter out likely inactives in high-throughput screening campaigns.
Almost all existing docking scoring functions, including physical-based force fields, involve the fitting of data from experiments and calculations based on quantum mechanics. Docking scoring functions typically assign a common set of weights to the individual energy terms that contribute to the overall energy score, however, the weights assigned to the individual energy terms are, in reality, gene family-dependent. In order to achieve more accurate results, the docking score must be trained to derive a unique set of weights for each individual family. Furthermore, many docking algorithms incorrectly assume that individual interactions contribute towards the total binding affinity in an additive manner, and therefore derive their predictions of binding affinity from a linear combination of individual energy terms. However, this method is not theoretically sound, since it fails to consider the cooperative effects of noncovalent interactions. Such cooperativity has only been considered recently. These studies all highlight the need for docking scoring functions that are based on nonlinear models 10-13. Thus, the development of target-dependent docking scoring functions based on nonlinear models would significantly enhance their predictive ability, facilitating the rational design of ligands that bind to specific sites on target proteins 11.
The ability of machine learning methods to produce effective scoring functions has been demonstrated in a number of recent studies. For instance, both Deng et al. 14 and Ballester and Mitchell 15 characterized protein-ligand complexes using feature vectors comprising the number of occurrences of specific protein-ligand atom type pairs interacting within predefined distance thresholds. While Deng et al. trained Kernel a PLS (partial least squares) to predict protein-ligand binding affinity, Ballester and Mitchell trained a random forest classifier for the same purpose. The algorithm by Ballester and Mitchell (‘RF-Score’) was shown to outperform 16 existing scoring functions when using a benchmark consisting of 195 diverse protein-ligand complexes. In 2005, Springer et al. 16 presented PostDOCK, a post-processing filter to distinguish true binding protein-ligand complexes from docking artefacts generated by the popular docking program DOCK 4.0.1 17. PostDOCK uses biochemical descriptors to characterize the protein-ligand interaction, including nonbonded van der Waals and electrostatic interaction force field terms from the DOCK score, terms for hydrogen bonding, metal bonding, lipophilicity, and a series of buried solvent accessible surface area descriptors. The authors used a random forest classifier to separate the binding and the nonbinding ligands from a test set of 44 structurally diverse protein targets, and showed that PostDOCK was able to outperform both the DOCK and ChemScore 18 scoring functions. More recently, Sato et al. 19 developed Pharm-IF, a novel atom-based, pharmacophore-based interaction fingerprint. The fingerprint is calculated from the distances of pairs of ligand pharmacophore features that interact with protein atoms. Hydrogen bonds, ionic interactions and hydrophobic interactions are considered. Four machine learning algorithms (naïve Bayesian classifier, random forest, support vector machine and artificial neural network) were used to build prediction models based on Pharm-IF, and each was shown to outperform the GLIDE 20 docking algorithm across all five targets investigated. Durrant and McCammon 21 also recently demonstrated that neural networks can be used to successfully characterize the binding affinities of protein-ligand complexes. Their neural networks, which were trained by characterizing 4,141 protein-ligand complexes across 194 dimensions, were used to rescore the docked poses of a selected set of ligands. The authors demonstrated that their networks were able distinguish between well-docked and poorly docked ligands, and even between true ligands and decoy compounds when both were well-docked.
Classification support vector machines (SVMs) separate positive and negative data in multidimensional space by drawing an optimal hyperplane with a maximum distance between the two classes. As such, SVMs can potentially be applied in developing nonlinear models of docking scoring functions using high-throughput screening (HTS) data. Unfortunately, data imbalance, which thwarts SVMs, is a common feature amongst a wide range of experimental data, including the data here. It occurs when one of the classes in the dataset, commonly the class of interest, is represented by a significantly smaller number of samples than the other class(es). For instance, of the tens of thousands of compounds typically investigated in HTS, it is likely that only a small fraction will show activity against the target of interest. As a result, HTS data are commonly imbalanced, with the vast majority of compounds belonging to the negative class. The use of such imbalanced data in the training of classification SVMs tends to push the optimal hyperplane towards the positive class, resulting in a tendency of the model to classify observations into the negative class, and therefore have high precision but low recall 22.
Sampling strategies, such as oversampling the minority class, or undersampling the majority class are popular choices when addressing the issue of data imbalance 23. In a recent study, Tang et al. 24 proposed a granular sampling strategy for data rebalancing, which undersamples the majority class, while minimizing information loss. Their method treats all positive samples as important due to their rarity, and discards uninformative (i.e., non-SV) negative samples. Indeed, since only SVs are important for SVM model classification, the removal of non-SV samples has no significant effect on model performance. In this way, the granular sampling method identifies samples that are adjacent to the border between the two classes, and uses them to draw the separating hyperplane. Those samples that are far away from the hyperplane are discarded during model building. Li et al. 22 recently applied this granular sampling method to facilitate the analysis of highly imbalanced luciferase inhibition bioassay data. While the data imbalance associated with HTS data has been noted in a number of other studies 25-27, to our knowledge, there have been no methods reported to tackle this problem effectively.
To address the issue of data imbalance in SVM training during docking, we have developed a multiple planar SVM scheme to balance precision and recall. The multiple planar SVM randomly partitions the overrepresented negative cases into n subsets. n SVM models are then trained using each of the partitioned negative subsets, plus a common set of positive cases. The prediction is based on the combined score of the n SVM models (Figure 1). Compared to the granular sampling method, the multiple planar method is more computationally efficient. Indeed, the main drawback of the granular sampling method is that it is not possible to extract all informative negative samples using a single SVM. For instance, in the aforementioned work of Li et al. 22, 67 sampling rounds, each involving the modelling of a new SVM, were required before the results converged and the sampling process was considered complete. While the multiple planar method also requires the training of multiple SVM models, each of these models is a small fraction of the size of the overall dataset.
In this paper we show how the use of a SVM, trained by associating sets of individual energy terms retrieved from molecular docking with the known binding affinity of each compound in the training set, can be used to improve the correlation between known binding affinities and those predicted by a popular docking program for a series of known inhibitors. The docking program eHiTS 28 was selected for this study due to its ability to output a large number of individual energy terms contributing to the overall energy score. Although we only used eHiTS in this study, our methodology can be applied to any other docking software provided that it outputs individual energy terms. More generally, it is applicable to all docking algorithms if 3D structural fingerprints are used as feature vectors.
We have applied this methodology to the development of an enoyl acyl carrier protein reductase (InhA)-specific docking scoring model. InhA is the target of the first-line anti-tubercular drug isoniazid. The therapeutic effect of isoniazid relies on it forming a covalent conjugate with an NAD co-factor. Since the extensive use of isoniazid has prompted the emergence of drug-resistant M.tb strains, the development of direct inhibitors of InhA is highly desirable 29. We are particularly interested in repurposing existing pharmaceuticals to directly inhibit InhA. Indeed, in a recent study, we have shown that the safe pharmaceutical Comtan, a catechol-O-methyltransferase (COMT) inhibitor, which is used in the treatment of Parkinson's disease, is able to directly inhibit InhA 30. In this report, based on the SVM model and ligand binding site similarity, we propose that phosphodiesterase inhibitors could potentially be direct inhibitors of InhA.
Firstly, we developed a regression support vector machine (SVM), which was calibrated using IC50 data from 80 InhA inhibitors in BindingDB 31 (see Methods section). The individual energy terms from the eHiTS docking software 28 were used as features to train the model. The initial correlation between the eHiTS-predicted binding affinities (which were derived from a linear combination of energy terms) and the experimentally determined IC50 values of 80 InhA inhibitors from BindingDB was very poor (Spearman's rank correlation coefficient ρ = 0.117).
The eHiTS output provides a total of 20 different energy terms that contribute to the overall energy score (H_bond, Lipophil, pi_stack, solvent, steric, strain, Coulomb, entropy, other, Rcharge, Rshape, RlogD, Rcover, Lcharge, LlogD, Lcover, metal, lig_int, family and depth). In order to determine the optimal combination of energy terms (i.e., features) for the regression model, different combinations were tested. All 20 energy terms were extracted from the output resulting from the docking of 80 different InhA inhibitors into InhA. As stated above, these inhibitors were selected because the IC50 values of their binding to InhA had been experimentally determined, and could therefore be associated with their respective feature vectors during SVM training.
The terms metal and lig_int were discarded since they were of zero value for all inhibitors. The terms family and depth were also discarded since they relate to the protein structure only, and since this was constant, they would simply add noise to the data. Of the remaining 16 energy terms, nine were regarded as compulsory. These included terms that are directly involved in the molecular interaction, such as hydrogen bonding (H_bond), hydrophobic (Lipophil), pi-pi stacking (pi_stack), solvation (solvent), van der Waals (steric and strain), electrostatics (Coulomb), and entropy (entropy). The seven remaining terms were regarded as optional (Rcharge, Rshape, RlogD, Rcover, Lcharge, LlogD and Lcover). Each combination tested consisted of the nine compulsory energy terms plus a different combination of the seven optional energy terms, resulting in a total of 128 different combinations.
Five-fold cross-validation was then used to compare the relative accuracies of the 128 different combinations of features. For each combination, there were five iterations where, each time, four subsets of equal size were used as training data, and the remaining subset was used as test data. Spearman's rank correlation coefficient (ρ) was calculated for each iteration, and the mean correlation coefficient across all five iterations was assigned to each combination. The combination of energy terms that resulted in the highest mean correlation coefficient was chosen for the model. The best combination of features consisted of the nine compulsory energy terms, plus Lcover and Rcharge.
The best combination of features resulted in a significant improvement in the correlation coefficient (ρ = 0.607), when compared with that achieved by the eHiTS scoring function (ρ = 0.117). To test if the model was over-fitted, y-randomization 32 was carried out. The highest correlation coefficient obtained when carrying out 100 y-randomization trials (see Methods section) was 0.384 (mean = 0.079), which is significantly lower than ρ = 0.607 (Figure 2). This suggests that the regression model would be able to more accurately predict the binding affinities of a series of compounds with InhA than the eHiTS scoring function would alone. Indeed, unlike the eHiTS scoring function, the SVM does not assume the addition of the individual energy terms. Instead, for each molecule, it uses an experimentally determined IC50 value to weight these energy terms according to their likely contribution to the overall binding affinity. In this way, the cooperative effects of noncovalent interactions may be taken into account.
Huang et al. recently constructed the Directory of Useful Decoys (DUD) 33, in which the decoys were specifically selected to be physicochemically similar yet topologically dissimilar to the active ligands, thereby providing a more realistic scenario for the validation of ligand-based virtual screening. The DUD provides a useful resource for training SVM classification models, because the feature vector of each molecule can be labelled as either an active (+1) or a decoy (-1). However, since it contains actives and decoys in a ratio of approximately 1:36, there is a strong bias towards negative examples in the training data. In order to overcome this problem, a multiple-planar classification model was developed. After randomly dividing the actives and decoys into five subsets for the five-fold cross-validation, each of the five subsets of decoys was randomly partitioned into a further n subsets. In order to determine the optimal value of n, five-fold cross-validation was carried out for various values of n from five to 36. Five-fold cross-validation was carried out in the same way as for the regression model, except that for each iteration, n models were trained using a different subset of decoys.
In a classification SVM, two classes are separated by the value zero, and a positive score indicates a positive class, whereas a negative score indicates a negative class. The absolute value of the score is comparable to the distance from the hyperplane. Since each of the n models predicted a score for each of the molecules in the test set, a weighted voting method was used where the sum of the n scores was assigned to each molecule. If the sum of the n scores was greater than zero, the molecule was predicted to be an active, whereas if the sum of the n scores was less than zero, the molecule was predicted to be a decoy. The F-score was then calculated (see the Methods section) for each of the five iterations, and the mean F-score was assigned to that particular value of n. Figure 3 shows that the mean F-score increases rapidly as n is increased to 15, where it hits a peak of 0.36, before it starts decreasing steadily until n reaches 36. Therefore, the number of partitions in the decoy set was set to 15 for the classification model.
The performance of the eHiTS scoring function and that of the classification SVM at discriminating actives from decoys is shown in the ROC curve in Figure 4. Although eHiTs provides some initial early enrichment with respect to random, with an enrichment factor of 4.9 at the 1% level, it actually performs worse than random throughout the entire remainder of the dataset. The classification model, on the other hand, performs well across the entire dataset, with enrichment factors of 6.0, 8.8, and 6.5 at the 2%, 5% and 10% levels, respectively. It is true that only 0.01-1% compounds are selected from a huge compound database in the drug discovery process. However, in our test case, only hundreds of compounds are screened. Therefore, 1% of the test data set is not statistically significant, as suggested by Sato et al. 19. Furthermore, in order to directly compare the performance of the classification SVM with that of the regression SVM, the regression SVM was used to rank the DUD InhA data set according to predicted logIC50 values. The resulting ROC curve (Figure 4) shows that while the regression model significantly outperforms the eHiTS scoring function, it is unable to match the performance of the classification model. It is likely that the higher accuracy of the classification model, when compared to that of the regression model, is due to the inclusion of negative data in the training set. Indeed, since the regression model was developed purely on the basis of positive data, the energy terms for improper interactions would have received little weight during parameter estimation. Therefore, by incorporating negative training data, the classification model allows for the rigorous estimation of all scoring function parameters 34.
A feature selection procedure was carried out (see Methods section) in order to compare the relative contributions of the nine compulsory energy terms to the predictions made by both the regression and classification models. The resulting Spearman's rank correlation coefficients (ρ) and F-scores are shown in Table 1 for the regression and classification models, respectively. There is considerable overlap between the relative contributions of the nine different energy terms for the regression and classification models. The hydrophobic, pi-pi stack, solvation and van der Waals steric energy terms were determined to make strong contributions towards the overall binding affinity prediction in both models. Indeed, their exclusion from the optimized models resulted in a significant decrease in performance when compared with that of the optimized models (labelled ‘none removed’ in Table 1). Conversely, the removal of the electrostatics, entropy, other and van der Waals strain energy terms from the optimized models had a much smaller effect on performance, therefore suggesting that these energy terms made less important contributions to the predictions made by both models.
Interestingly, removal of the hydrogen bonding energy term had a much larger effect on the performance of the regression model than on that of the classification model. Indeed, its removal resulted in a significant decrease from ρ = 0.637 to ρ = 0.110 for the regression model, yet a relatively smaller decrease from F = 0.232 to F = 0.166 for the classification model. Many existing direct InhA inhibitors form a hydrogen bond with the Tyr158 residue of InhA, and since the regression model was trained using only direct InhA inhibitors, this may explain the importance of the hydrogen bonding energy term to the regression model. In addition to direct InhA inhibitors, the classification model included a large number of decoys in the training data. Therefore, the energy terms contributing towards improper interactions would have received a greater weight in the classification model than in the regression model, thus explaining the reduced importance of the hydrogen bonding energy term in the classification model.
Overall, these results suggest that the eHiTS scoring function gave too little weighting to the important hydrophobic, pi-pi stack, solvation and van der Waals steric energy terms, yet too much weighting to the less important electrostatics, entropy, other and van der Waals strain energy terms. The results also suggest that the regression model gave too much weighting to the hydrogen bonding energy term, probably due to the fact that it was developed purely on the basis of positive data. This difference in the weighting of the hydrogen bonding energy term may account for the reduced performance of the regression model, relative to the classification model, as observed in Figure 4.
InhA is the primary molecular target of isoniazid, which has been used as a frontline anti-tubercular agent for the past 40 years 35, 36. As a prodrug, the activity of isoniazid is dependent on activation by KatG, a catalase/peroxidase enzyme. Unfortunately, it is this activation requirement that allowed M.tb to acquire resistance to the drug. Indeed, mutations in the katG gene account for around half of all isoniazid-resistant clinical isolates. Direct InhA inhibitors that avoid this activation requirement are not susceptible to this resistance mechanism. Triclosan and the diazoborines are well-known InhA inhibitors that do not require activation, but they are not suitable for human treatment due to their respective poor solubility and toxicity30, 37. Therefore, novel, direct inhibitors of InhA are urgently required.
To identify potential lead compounds from existing drugs that are able to directly inhibit InhA, we used our binding site similarity comparison software SMAP 38-40 to search InhA against 962 protein structures that are co-crystallized with 274 FDA approved drugs 41 in the RCSB Protein Data Bank (PDB) 42. With SMAP P-values as low as 4.36e-8 (Table 2), some of the strongest connections to InhA were with the phosphodiesterase type 5 (PDE5) inhibitors, which are used in the treatment of erectile dysfunction. Although only PDE5 is co-crystallized with approved drugs in the PDB, a number of inhibitors have been developed to target other members in the PDE gene family. Therefore, 303 PDE inhibitors from BindingDB were docked into the InhA binding site (including the NAD co-factor as a part of the receptor) to identify the most potent inhibitors.
Three drugs vardenafil, tadalafil and sildenafil all showed highly favourable eHiTS energy scores of as low as -11.74 pKi (Table 2), which ranks them in the top three among 303 PDE inhibitors. When using the SVM regression model they only ranked at 83/303, 258/303 and 299/303, respectively. An enzyme kinetics assay was carried out in order to test the activity of vardenafil against InhA. No inhibition was observed up to 200 □M. This implies that the SVM model is able to improve the ranking accuracy of the original eHiTs scoring function. However, the fact that vardenafil has a predicted IC50 of 1,545 nM (1.545 □M) when it has been experimentally determined to be a non-binder, suggests that the absolute IC50 values predicted by the SVM may not be entirely accurate. Even so, the best predicted IC50 values among the 303 PDE inhibitors are more than two-fold lower than that of vardenafil, and so they may still be potential inhibitors of InhA (Table 3). Interestingly, all top 10 ranked compounds are PDE4 inhibitors. When the 303 PDE inhibitors from BindingDB were ranked using the classification model (Table 4), there was a considerable overlap between the top ranked compounds and those with the best predicted IC50 values from the regression model. In particular, there are 6 PDE4 inhibitors that rank within the top 10 compounds when using either model; L-869299, L-869298, triazolothiadiazine 8, phthalazinone 22, V11294 analog 5u and V11294 analog 5t. Experimental testing is required to verify whether or not these PDE4 inhibitors are active against InhA. Note that with an original eHiTs energy score of -9.83 pKi, triazolothiadiazine 8 would not have been ranked particularly highly when using the eHiTs scoring function.
The binding pose of L-869298 within the InhA substrate binding site, as predicted by eHiTS, is shown in Figure 5. It overlaps well with the crystallographically-determined binding pose of the direct, high affinity InhA inhibitor (3S)-1-cyclohexyl-N-(3,5-dichlorophenyl)-5-oxopyrrolidine-3-carboxamide (PDB code: 2h7m, PDB ligand code: 641, IC50: 390nM). Note that information about the cocrystallized conformations of known inhibitors was not used in any part of the docking process. Further investigation of the predicted binding pose of L-869298 revealed two potential hydrogen bonds, one between a fluorine atom (F13) of L-869298 and the amide nitrogen of a leucine residue (Leu197) in InhA, and the other between the same fluorine atom and the amide nitrogen of an alanine residue (Ala198) in InhA. Hydrophobic interactions were also observed between L-869298 and the following InhA residues; Gly96, Met103, Tyr158, Met161, Thr196, Leu197, Ala198, and Met199.
Following the PDE5 inhibitors, the next most significant connection to InhA predicted by SMAP was with the sex hormone estradiol, which binds to estrogen receptors (ERs) α and β. Furthermore, at a SMAP P-value threshold of 1.0e-5, InhA was also connected to the selective ER modulator raloxifene. These findings suggest that compounds that bind to ERs may also have the potential to directly inhibit InhA. There are a total of 223 compounds in BindingDB that have been confirmed experimentally to bind with any of ERα, ERβ, estrogen-related receptor (ERR) α or ERRγ. Both the regression and the classification model of InhA were used to rank these compounds according to the probability of their activity against InhA (see the Supporting Information, Tables S1 and S2). As with the PDE inhibitors, there was a strong agreement between the two models, therefore providing further support for the results. Interestingly, the most highly ranked compounds were analogs of 4-hydroxytamoxifen, the active metabolite of tamoxifen, which were predicted by the regression model to fall within the nanomolar range. Again, these compounds would be ideal candidates for experimental testing against InhA.
Since the DUD provides sets of actives and decoys for a number of different targets, multiple-planar classification models were developed for 12 additional targets, in order to test the robustness of the methodology. For target selection criteria, please refer to the Methods section. Figure 6 shows ROC curves comparing the performance of the different classification models (purple lines) with that of the eHiTS scoring function (green lines) for the 12 selected targets. The classification models can be seen to outperform the eHiTS scoring function across every single target. In all cases except for HIV reverse transcriptase (hivrt), the classification models provide significant early enrichment. For hivrt, the eHiTS scoring function performs worse than random (black line), while the classification model barely performs above random across the entire dataset. The eHiTS scoring function, on the other hand, can be seen to perform close to, or worse than, random across the majority of targets. The eHiTS scoring function gives its best performance for cyclooxygenase-2 (cox2), but even then, it is significantly outperformed by the classification model, which also gives its best performance for this target. It is interesting to note that hivrt is the target with the least number of actives (15) in the dataset, while cox2 is the target with the greatest number of actives in the dataset (43). The classification model also performs extremely well for epidermal growth factor receptor (egfr), which is the target with the second greatest number of actives (40), however, the eHiTS scoring function performs worse than random for this target. This may be because larger training sets cover more extensive spaces of both positive and negative examples.
We have trained two different SVMs for an important anti-tubercular target, InhA, by associating sets of eHiTS-predicted individual energy terms with the known binding affinity of each compound in the training set. In this way, the docking scoring function was trained to derive a unique set of weights specific to InhA. Since the methodology does not require cocrystallized structural information, it can take full advantage of the vast amounts of data generated by HTS experiments. The abundance of data from HTS means that it is now feasible to iteratively train a scoring function for individual gene families. In fact, we have extended the classification SVM to 12 different targets, and shown that it is able to outperform the eHiTS scoring function across every target. Although there are a limited number of targets for which there is sufficient experimental training data at present, many of these targets are important validated drug targets. Furthermore, the potential of our method can only improve as the availability of experimental data increases in the future.
Unfortunately, a major hurdle in the use of HTS data to train classification SVMs is the low ratio of active to inactive compounds. Such imbalanced data results in a strong bias towards the prediction of negative cases, therefore resulting in a model with high precision but low recall. We have presented a novel and computationally efficient multiple-planar approach to overcome this previously acknowledged but frequently neglected issue 22. Our multiple-planar approach may be applied to train classification SVMs using any set of imbalanced data. Imbalanced data is not only an issue for bioinformatics, but also for applications as wide-ranging as e-business, information security and national security 23.
Furthermore, by using SVMs to model the individual energy terms in a nonlinear fashion, it may be possible to implicitly simulate their cooperative behaviour. Indeed, noncovalent interactions are frequently either positively or negatively cooperative in nature: They exhibit positive cooperativity when the binding energy derived from them acting together is greater than would be derived from the sum of them acting separately, and exhibit negative cooperativity when the opposite situation is true 13. Unfortunately, docking scoring functions, which are primarily designed for high throughput analyses, tend to ignore this fundamental concept of cooperativity in favour of the simpler and more convenient additivity approach 11. A number of recent studies have highlighted the need for docking scoring functions that incorporate descriptors that model cooperative effects 10-13. In particular, Muley et al. 10 suggested that the potential error in binding affinity predictions resulting from ignoring the cooperativity between a hydrogen bond and a hydrophobic side chain of 100 Å2 contact surface could be in the region of one order of magnitude. It is noted that cooperativity in protein-ligand interactions appears at the atomic level. Unfortunately, the use of energy terms derived from docking software does not fully capture the atomic details of protein-ligand interactions. Therefore, the incorporation of atomic interaction features into the computational framework presented in this paper may provide a better model of cooperativity in the future.
Although our method has only been applied to eHiTS thus far, the promising results obtained from this study indicate that other docking scoring functions or interaction fingerprints would benefit from such treatment. Indeed, the fact that scoring functions are notoriously weak predictors of binding affinity can be partly attributed to their neglect to derive a unique set of weights specific to each gene family, and also to the additive approach that they employ to derive overall energy scores from their individual contributions. We hope that our endeavours in this paper will promote a much needed review of the theory behind current docking scoring functions, therefore paving the way for their long overdue improvement.
We were able to increase the accuracy of virtual screening of direct InhA inhibitors by using SVMs, in which the features of the scoring matrix were energy terms retrieved from molecular docking using eHiTS. We constructed two prediction models for InhA; a regression model, which was trained using InhA experimental IC50 values from BindingDB, and a classification model, which was trained using the InhA DUD dataset. The two models gave a good consensus. Our results based on y-randomization indicate that the improved ranking performance is not due to over-fitting, and that our model may capture new information that is missed by the eHiTs docking scoring function, which is considered to be the state-of-the-art, but incorrectly assumes that individual interactions contribute towards the total binding affinity in an additive manner. The purpose of our new model is to capture the non-linear cooperative features of noncovalent interactions. Although there is still room for improvement as far as our methodology is concerned, i.e., the fact that the features in our model do not explicitly capture the atomic details of protein-ligand interactions, we hope that our work may inspire future endeavors in developing new scoring functions that correctly model the cooperative effects of protein-ligand interactions. Furthermore, our multiple-planar SVM training procedure for the classification model successfully addressed the issue of overrepresentation of negative data in the high-throughput screening data set. Using our SVMs we were able to produce a prioritized list of PDE inhibitors for experimental screening against InhA. In addition, our SVMs were used to prioritize a set of compounds known to bind estrogen receptors and estrogen-related receptors. In principle, our methodology could be directly applied to other targets of interest, as demonstrated in the work presented here, and also to other docking programs.
Since the SMAP software had aligned each drug binding site with the InhA binding site, the aligned coordinates of each drug molecule were used to define the search space for docking each drug into the M.tb protein (i.e., the aligned drug molecule was used as the clip file with a default search space of 10 Å3). As recommended by the manual, the eHiTS accuracy level was increased to 6 (default = 3), in order to improve the accuracy of the predictions. Following all docking, the binding pose with the highest estimated binding affinity was selected for further investigation. Note that the NAD cofactor was added as the last residue in the InhA structure prior to docking.
In order to normalize a docking score, 500 Lipinski-compliant43 molecules with the same number of carbon atoms, the same number of nitrogen atoms and the same number of oxygen atoms as the drug of interest were randomly selected from subset 13 of the ZINC database 44, and docked into the target of interest using the same protocol. The means and standard deviations of the three sets of 500 resulting energy scores were then used to derive a Z-score for the energy score of each of the three drugs according to the following formula:
The P-value was then derived from the Z-score using a normal distribution with a mean and standard deviation of the 500 energy scores.
BindingDB 31 (http://www.bindingdb.org/) provides a rich repository of experimentally-measured data for protein-ligand interactions, including IC50 values for 98 different direct InhA inhibitors (October 8, 2009). The IC50 values of 21 of these inhibitors are not exact (i.e., >100,000 nM) and so were excluded. In addition, IC50 data for four other direct InhA inhibitors 45 that are present in the RCSB PDB but not BindingDB were included, resulting in IC50 values for 81 different direct InhA inhibitors. For the 14 of these inhibitors that were cocrystallized with InhA in the RCSB PDB, FATCAT 46 (http://fatcat.burnham.org/) was used to structurally align the ligand-bound chains of the InhA structures onto 1BVR, a structure of InhA bound with a fatty acid substrate (PDB ligand code: THT). The aligned inhibitor molecules were then extracted and docked into 1BVR using eHiTS 28, using the THT substrate to define the docking search space. For each inhibitor, the energy terms of the binding pose with the smallest RMSD to the cocrystallized ligand were retained. The 67 remaining inhibitor molecules were downloaded from BindingDB in sdf format, and also docked into 1BVR using the fatty acid substrate to define the docking search space. For each of these inhibitors, the energy terms of the binding pose with the greatest binding affinity (i.e., lowest energy score) were retained. Note: One inhibitor, 14PP, failed to dock. Spearman's rank correlation coefficient was then calculated between the actual IC50s and the eHiTS predicted energy scores of the 80 InhA inhibitors.
SVM-light 47 was used to train a regression model where each of the 80 InhA inhibitors was labelled with the logarithm of its IC50 value and was assigned a feature vector comprising the eHiTS energy terms. The optimal combination of features was determined using five-fold cross-validation (see Results section for further details). In order to determine the optimal kernel function for this combination of features, three other kernel functions (polynomial, radial basis function and sigmoid tanh), besides the default linear kernel, were tested and mean correlation coefficients were calculated. However, the linear kernel gave the best results and so was chosen for the model. All five subsets were then merged to create the regression model using the chosen combination of energy terms and the linear kernel function.
All 303 PDE inhibitors present in BindingDB (on October 16, 2009) were downloaded in sdf format, and docked into InhA using eHiTS. Again, the 1BVR structure and the THT substrate were used to define the docking search space. Note that one inhibitor, CI-930, failed to dock. The chosen combination of energy terms was extracted from the eHiTS output for each inhibitor. SVM-light was then used to predict the log IC50 value of each PDE inhibitor with InhA, based on the regression model. The 303 PDE inhibitors were then ranked based on their predicted log IC50 values when binding to InhA. This process was repeated using 223 molecules from BindingDB that bind estrogen receptors and estrogen-related receptors.
All 85 active molecules and 3,035 decoy molecules for InhA were downloaded from the DUD 33 (http://dud.docking.org/). These molecules were docked into the 1BVR structure of InhA, using the THT substrate as a clip file. Note that 11 decoy molecules failed to dock, leaving a total of 3,024 decoys. SVM-light was used to train a classification model where each molecule was labelled as either an active (+1) or a decoy (-1) and was assigned a feature vector comprised of eHiTS energy terms. The classification model was trained using the combination of features and kernel function that were selected previously for the regression model. The optimal number of partitions for the decoys was determined using five-fold cross-validation (see the Results section for further details).
In order to compare the performance of eHiTS (using the default energy scores) with that of the classification model, receiver operator characteristic (ROC) curves were plotted. ROC curve analysis is considered to be the best approach for characterizing the performance of virtual screening methods 48. However, since the DUD is subject to analog bias amongst the actives, this can lead to falsely high enrichment rates. Indeed, in its raw state, there are many targets for which nearly all actives are trivial analogs of a central structure 49. In a recent study, Good and Oprea 50 filtered the DUD actives for lead-likeness (molecular weight (MW) < 450 and logP < 4.5) before clustering them according to chemotype using a reduced graph representation. In this way, they not only removed large molecules with unsuitable physicochemical properties, but they also addressed the problem of analog bias. For InhA, this procedure resulted in the removal of 28 actives, leaving 57 actives, which were then clustered into 23 clusters (http://dud.docking.org/clusters/, accessed November 11, 2009). In order to prevent analog bias in the ROC curves, the filtered and clustered set of InhA actives was used in combination with an arithmetic weighting procedure where each active molecule was given a weighting that was equal to the inverse of the size of the cluster to which it belonged 51, 52. For instance, if five active molecules with similar chemotypes had been assigned to the same cluster, each active within that cluster would be given a weighting of one-fifth. Exactly the same procedure was used to evaluate the ability of the regression model to correctly rank the actives and decoys in the InhA DUD dataset.
In the same way as before, SVM-light was used to predict whether each of the 303 PDE inhibitors would be able to bind InhA, based on the classification model. The 303 PDE inhibitors were then ranked based on the absolute values of their scores. This process was also repeated using 223 molecules from BindingDB that bind estrogen receptors and estrogen-related receptors.
y-Randomization is a popular model validation tool whereby the performance of the original model is compared to that of models built from randomly shuffled responses, based on the original descriptor pool and the original model building procedure. In this way, it is possible to protect against the risk of chance correlation 32. In order to implement a y-randomization procedure, the y values (i.e., logIC50 values) were randomly re-ordered for all 80 inhibitors in the training set for the regression SVM. The training procedure was then carried out in exactly the same way as before, i.e., five-fold cross-validation for 128 combinations to determine the optimal combination of features. When calculating the correlation coefficient for each of the five subsets in each of the 128 combinations, the ranks of the new y-randomized logIC50 values were compared with the ranks of the predicted logIC50 values from the y-randomized model. This process was repeated 100 times, and each time the highest mean correlation coefficient from five-fold cross-validation, achieved across the 128 combinations of features, was reported.
F-score = (2 × precision × recall) / (precision + recall) Where; precision equals the number of correct positive results divided by the total number of predicted positive results, and recall equals the number of correct positive results divided by the total number of results that should have been predicted as positive.
A feature selection procedure was carried out in order to compare the relative contributions of the nine compulsory energy terms to the predictions made by both the regression and classification models. In each case, a model was created using the optimal combination of features from the entire set of training data, and this model was then used to make predictions about the entire set of training data. Note that for the classification SVM, the multiple-planar method was employed with the optimal number of partitions. 10 different models were created for each of the regression and classification SVMs; one where all 11 predetermined features were included in the model, and nine where a different compulsory energy term had been removed from the model. All 10 models were used to make predictions about the same set of training data, and Spearman's rank correlation coefficients and F-scores were calculated for the regression and classification SVMs, respectively.
The DUD contains actives and decoys for 40 different targets. As mentioned previously, Good and Oprea 50 filtered and clustered each set of actives in order to remove large molecules with unsuitable physicochemical properties, and also to address the problem of analog bias. To restrict our analysis to targets with a sufficient number of chemotypes, only those targets possessing at least 15 different clusters of actives were selected (as in the work by Cheeseright et al. 53 and Kinnings and Jackson 52). The sets of active and decoy molecules corresponding to each of the 12 selected target proteins (ace, ache, cdk2, cox2, egfr, fxa, hivrt, p38, pde5, pdgfrb, src and vegfr2) were downloaded directly from the DUD (http://dud.docking.org/r2/; DUD release 2, 22/10/2006) in mol2 file format. For each target, the protein structure used in the DUD benchmarking process 33 was selected, and the corresponding ligand was used to define the docking search space. All actives and decoys were docked into their relevant target structures, and the energy terms of the binding pose with the greatest binding affinity (i.e., lowest energy score) were retained for each molecule. For each target, the optimal combination of energy terms and the optimal number of partitions of decoys was determined for the classification model, using the methods described previously for the InhA classification model. Again, in order to prevent analog bias in the ROC curves, an arithmetic weighting procedure was used, where each active molecule was given a weighting that was equal to the inverse of the size of the cluster to which it belonged.
We appreciate the constructive suggestions of the anonymous reviewers and the editor in improving the manuscript. We are grateful for financial support from the National Institutes of Health grant GM078596. S.L. Kinnings was funded by a BBSRC studentship.