|Home | About | Journals | Submit | Contact Us | Français|
One of the unaddressed challenges in drug discovery is that drug potency determined in vitro is not a reliable indicator of drug activity in vivo. Accumulated evidence suggests that in vivo activity is more strongly correlated with the binding/unbinding kinetics than the equilibrium thermodynamics of protein–ligand interactions (PLIs). However, existing experimental and computational techniques are insufficient in studying the molecular details of kinetics processes of PLIs on a large scale. Consequently, we not only have limited mechanistic understanding of the kinetic processes but also lack a practical platform for high-throughput screening and optimization of drug leads on the basis of their kinetic properties. For the first time, we address this unmet need by integrating coarse-grained normal mode analysis with multitarget machine learning (MTML). To test our method, HIV-1 protease is used as a model system. We find that computational models based on the residue normal mode directionality displacement of PLIs can not only recapitulate the results from all-atom molecular dynamics simulations but also predict protein–ligand binding/unbinding kinetics accurately. When this is combined with energetic features, the accuracy of combined kon and koff prediction reaches 74.35%. Furthermore, our integrated model provides us with new insights into the molecular determinants of the kinetics of PLIs. We propose that the coherent coupling of conformational dynamics and thermodynamic interactions between the receptor and the ligand may play a critical role in determining the kinetic rate constants of PLIs. In conclusion, we demonstrate that residue normal mode directionality displacement can serve as a kinetic fingerprint to capture long-time-scale conformational dynamics of the binding/unbinding kinetics. When this is coupled with MTML, it is possible to screen and optimize compounds on the basis of their binding/unbinding kinetics in a high-throughput fashion. The further development of such computational tools will bridge one of the critical missing links between in vitro compound screening and in vivo drug activity.
Target-based and cell-based screenings are the two major approaches in the early stage of drug discovery. In both of these technologies, one of the unaddressed fundamental challenges is that drug potency measured in vitro may not be a reliable indicator of drug efficacy and toxicity in vivo. In compound screening and lead optimization, equilibrium thermodynamic constants such as half-maximal inhibitory concentration (IC50) or dissociation constant (Kd) have been used as the measures of drug potency for years. As molecules in the human body are in a nonequilibrium condition, the activity of a drug depends on not only how strongly it interacts with the protein but also how easily it hits the target and how long it resides in the target. An increasing body of evidence suggests that drug activity in vivo is not defined by equilibrium conditions measured in vitro but rather depends on the residence time (τ = 1/koff) of the receptor–ligand complex in vivo in a number of cases.1 The longer residence time increases the efficacy of the drug. For example, geldenamycin has a low affinity for heat shock protein (Hsp90) in vitro, with IC50 ~ 1 μM, in contrast to its nanomolar effects in vivo.1,2 Copeland et al.3 analyzed the results of experiments on mutation-based resistance to inhibitors of HIV-1 protease and concluded that the essential factor for sustained drug efficacy in vivo is the residence time and not the binding affinity. Pan et al.4 reported that the residence time is highly correlated with the functional efficacy of a series of agonists of the A2A adenosine receptor (r2 = 0.95) but that there is little correlation with binding affinity (r2 = 0.15). Furthermore, Dahl and Akerud5 disclosed that on-target side effects could be reduced by reducing the drug residence time. Thus, a drug with optimal efficacy and toxicity profiles should have a balanced kon and koff. Since IC50 and Kd depend on the measurement of the combined effect of kon and koff, they are actually insufficient to explain the impact of binding/unbinding kinetics on drug action, as the same value of Kd can come from an infinite number of combinations of kon and koff. Additionally, since Kd is dependent on the free energy difference between the bound and unbound states but is independent of the transition state of protein–ligand interaction (PLI), it is inadequate to explain the binding/unbinding process of PLI.4,6
Experimental techniques for the study of PLI kinetics are not only expensive and time-consuming but also insufficient to provide detailed molecular characterization of the PLI kinetics process.7–9 Computational modeling plays an increasing role in elucidating the binding/unbinding process of PLIs. Molecular dynamics (MD) simulations have been reported to be capable of capturing the binding process from beginning to end in full atomic detail.10 Unfortunately, the power of MD simulations is limited by the fact that protein–ligand binding events take place on time scales ranging from microseconds up to hours and days. Thus, MD simulations are infeasible for the majority of binding processes. Therefore, metadynamics and other conformational sampling techniques have been developed to improve sampling in MD simulations of systems where ergodicity is hindered by the form of the system’s energy landscape.11–13 For example, Gervasio et al.11 successfully applied a metadynamics method to the docking of ligands on flexible receptors in water solution. The method is able not only to find the docked geometry and to predict the binding affinity (ΔGbinding) but also to explore the entire docking process from the solution to the docking cavity, including barriers and intermediate minima. Even though these advances are remarkable, existing techniques for MD simulations to study the whole binding/unbinding process of a PLI on a large scale are not yet feasible. In addition, the choice of collative variables in the metadynamics simulation is not a trivial task.
With the increasing availability of protein binding kinetics data,14,15 data-driven modeling provides an alternative and efficient solution for studying the PLI kinetics. Several predictive models for kinetic constants of protein–protein interactions (PPIs) have been developed.16,17 However, the molecular attributes in these models cover only static structural characteristics such as the percentage of residues in α-helixes, the buried surface area of the protein, the proportions of charged residues and polar atoms at the interface, and the energetic features such as hydrogen-bonding potential and the interfacial electrostatic interaction energy between interfacial residues. These features may not sufficiently capture long-timescale conformational dynamics of the PLI kinetic processes. In addition, existing methods predict kon and koff independently. As a matter of fact, however, they could be dependent in nature. To our knowledge, few methods are available for large-scale modeling of the binding/unbinding kinetics of PLIs with explicit conformational dynamics features while predicting kon and koff simultaneously.
To tackle the above problems, we integrate energetic and conformational dynamics features derived from efficient molecular modeling with a state-of-the-art multitarget machine learning (MTML) approach. In this study, ligand-bound HIV-1 proteases are used as an example to build models. In addition to energetic features derived from pairwise decomposition of the residue interaction energy18,19 and environment-dependent electrostatic potential energy,20 relative movement of ligand–residue (RMLR) and relative movement of residue–residue (RMRR), which represent the dynamics impact of ligand binding on the amino acid residues, are derived from coarse-grained normal mode analysis (NMA) and used to train MTML models. RMLR is the dot product of the ligand displacement vector after normalization and the residue displacement vector, and RMRR is the dot product of two displacement vectors of a residue upon ligand binding. Our multifacet statistical analysis consistently shows that conformational dynamics features such as RMLR are as important as energetic features in recapitulating the results from all-atom MD simulations as well as predicting kon and koff accurately. On the basis of these findings, we propose that coherent coupling of conformational dynamics and thermodynamic interactions between the protein and the ligand may play a critical role in determining the kinetic rate constants of PLIs. Furthermore, we demonstrate that the relative movement of normal modes of amino acids upon ligand binding can serve as a kinetic fingerprint to capture long-time-scale conformational dynamics features of the protein binding/unbinding process. In combination with MTML, we can use the kinetic fingerprint to predict combined kon and koff with an accuracy of 74.35% and to screen compounds with in vivo drug activity with an area under the receiver operating characteristic curve of 0.8668. Thus, it is possible for us to screen and optimize compounds on the basis of the binding/unbinding kinetics of PLIs in a high-throughput fashion. The further development of such computational tools will bridge one of the critical missing links between in vitro drug potency and in vivo drug efficacy and safety, thereby accelerating the drug discovery process.
Figure 1 depicts the workflow of the computational procedure in this study, which includes four phases: Phase 1 concerns building the structure of the three-dimensional (3D) ligand-bound HIV-1 protease complex. Phase 2 addresses the identification of ligand-binding-site residues. Phase 3 targets the construction of the five principal data sets. Phase 4 is the MTML computation.
In brief, when no cocrystallized structures existed, the 3D conformation of the ligand was docked in the HIV protease using the protein–ligand docking software eHiTS.21 NMA was performed for both apo and ligand-bound structures (from the Protein Data Bank (PDB) complex structure or protein–ligand docking) without further energy minimization using iMod.22 RMLR and RMRR, which represent the conformational dynamics impact of ligand binding on the binding-site residues, were derived from NMA analysis. RMLR is the dot product of the ligand displacement vector after normalization and the residue displacement vector, and RMRR is the dot product of the displacement vectors of a residue for the apo and ligand-bound structures. Both RMLR and RMRR use the first 10 lowest-frequency modes.22
Pairwise decomposition of the residue interaction energy was performed to compare the relative significance of conformational dynamics features with energetic features in determining the protein binding/unbinding kinetics. To compute the pairwise decomposition of the residue interaction energy, 39 ligand–HIV-1 models were first minimized by MD simulations. The MD simulations were carried out in the canonical NVT ensemble using Nanoscale Molecular Dynamics (NAMD)19 program with the CHARMM27 force field for HIV-1 protein and the CHARMM general force field for the ligands. All of the models proceeded through a minimization process of 4 ps, an equilibrium process of 150 ps, and a production period of 50 ps. The trajectories in the production phase were stable (Figure S1 in the Supporting Information). During the simulations, the temperature and time step were set at 310 K and 2 fs, respectively, and the benchmarked NAMD generalized Born implicit solvent (GBIS) method reported by Tanner et al.23 was adopted to reduce the simulation time. The capability of the NAMD GBIS model was illustrated by simulation of the ratcheting Escherichia coli ribosome involving ~250 000 atoms. The conformation in the final production phase was used for the calculation of the residue-decomposed pairwise interaction energy (PIE) as well as its two components, the van der Waals energy (VDWE) and the electrostatic energy (EE) between the ligand and amino acid residues. VDWE and EE were calculated using CHARMM force fields and an environment-dependent electrostatic potential energy.20
Energetic features (PIE, VDWE, and EE) and conformational dynamics features (RMRR and RMLR), individually or combined, along with experimentally determined kon and koff data24 (Table S1 in the Supporting Information) were used to train MTML models. We evenly discretized the log10(kon) and log10(koff) values of the 39 HIV-1 protease inhibitors into four different binary classes, with the labels (0,0), (0,1), (1,0), and (1,1) on the two-dimensional space of log10(kon) and log10(koff), using the criteria of log10(koff) = −2 and log10(kon) = 5.6 (Figure 2 and Table S2), where class (0,1) was enriched with five FDA-approved drugs.
To evaluate whether the proposed conformational dynamics features can capture essential information on the protein binding/unbinding kinetics, we first evaluated their accuracies in predicting the kinetic rate constants kon and koff simultaneously. The accuracy was calculated as follows:
where Ai is the prediction accuracy for each case and N is the total number of cases. If both kon and koff are predicted correctly, Ai = 100%. If one of them is correct, Ai = 50%. If neither is correct, Ai = 0.0%.
We also evaluated the performance of the method in high-throughput screening of molecules with in vivo drug activity based on kon and koff using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). As all of the FDA-approved drugs fell into the class (0,1), the cases in this class were considered as positives. All of the other cases in the remaining three classes were considered as negatives.
More details on the computational workflow are given in Materials and Methods in the Supporting Information.
In this study, we use HIV-1 protease complex structures to investigate the conformational dynamics and develop a predictive model of the ligand binding/unbinding kinetics. HIV-1 protease is an excellent model system for our purpose. First, 39 HIV-1 protease inhibitors have experimentally determined kon and koff values under the same conditions,24,25 providing a reasonable number of high-quality data points for data-driven modeling. Second, abundant data on HIV-1 protease inhibitor resistance mutations (PIRMs) are available. PIRMs are mutations that facilitate resistance of HIV to protease inhibitors.26 They can be used to validate the predictive model. Third, both unbound and complexed structures of HIV-1 protease have been released in the PDB.27 The apo and holo conformations are the basis for our analysis. Table S3 shows the resolution and the status of apo/holo conformation for the 10 HIV-1 complexes used in the experiment.
The ultimate goal of this work is to develop a new method for high-throughput screening and optimization of chemical compounds with in vivo drug activity on the basis of their kinetic properties. Therefore, we format the problem as a multitarget classification problem to predict whether a chemical is drug-like. In this regard, we group FDA-approved drugs in a single class. In addition, the machine learning algorithms are in general sensitive to imbalances in the training data. In order to avoid the bias of machine learning algorithms in handling an imbalanced small training set, we evenly discretize the log10(koff) and log10(kon) values of 39 HIV-1 protease inhibitors into four different binary classes, with the labels (0,0), (0,1), (1,0), and (1,1) on the two-dimensional space of log10(kon) and log10(koff), using the criteria of log10(koff) = −2 and log10(kon) = 5.6 (Figure 2 and Table S2), where class (0,1) is enriched with five FDA-approved drugs.
Ten inhibitors have solved HIV-1 complex structures in the PDB with resolution varying from 1.7 to 2.29 Å (Table S3). For the remaining 29 inhibitors whose complex structures have not been experimentally determined, the protein–ligand docking software eHiTS21 is applied to predict their binding poses. The receptor is chosen from one of the ligand-bound HIV-1 complexes with a cocrystallized ligand structure similar to the docked ligand structure, as determined by the superimposed ligand substructures. Whenever possible, the common fragment of the cocrystallized ligand and the docked ligand is used as a constraint to select the final binding pose of the docked ligand, such that the root-mean-square deviation (RMSD) of the superimposed common fragments is minimal. An example is shown in Figure S2.
Binding-site amino acid residues that are involved in the HIV-1 protease inhibitor interactions are determined using the change in solvent-accessible surface area (SASA) upon ligand binding (see section SI 1.2 in the Supporting Information). As depicted in Figure 3, there are 22 amino acids on each of the two chains of the HIV-1 dimer, giving a total of 44 amino acids. The 22 amino acids include one nonpolar alanine (A28), one basic polar arginine (R8), two nonpolar leucines (L10 and L23), four nonpolar glycines (G27, G48, G49, and G52), three acidic polar aspartic acids (D25, D29, and D30), five nonpolar isoleucines (I47, I50, I54, I76, and I84), two nonpolar valines (V32 and V82), one basic polar lysine (K45), one nonpolar phenylalanine (F53), one polar threonine (T80), and one nonpolar proline (P81).
NMA is a powerful computational method to identify and characterize the slowest molecular deformational motions with large amplitude, which are widely involved in biological functions of macromolecules but are inaccessible by other methods. Protein binding/unbinding events often occur on long-time-scales ranging from milliseconds to days, far beyond the current capability of MD simulations. Coarse-grained NMA may allow us to extract important information on the dynamics of protein–ligand binding/unbinding processes. Since the presence of solvent damping dramatically slows the large-amplitude motions of biomolecules, the time scales of molecular motions in reality are much longer than what can be estimated from the eigenvalues of NMA that are calculated in vacuum. In other words, solvent damping causes a time-scale discrepancy between NMA and real molecular motions. However, the study conducted by Ma revealed that the presence of solvent has a minor impact on eigenvectors, which are determined by the potential surface only.28,29 Thus, the information provided by the eigenvectors for the directionality of conformational transitions can be used to study dynamic processes on the time scale of real situations.
In this study, NMA was conducted using iMod. The directionality of normal modes of the residues in the binding site is used to characterize the conformational dynamics features of binding and unbinding events. Specifically, two data sets, one for RMLR (DS-RMLR) and one for RMRR (DS-RMRR), were derived from NMA analysis. As an example, Figure 4A depicts the superposition of the 44 residue eigenvectors of the aligned apo structure (PDB code 3IXO) and the A045-bound structure for the first normal mode. It illustrates the shift of the eigenvectors of the 44 residues of HIV-1 protease upon ligand binding. Figure 4B illustrates the relative displacements of the 44 ligand–residue pairs in the DMP-bound HIV-1 complex. Each of these two data sets comprises 1716 values for 39 ligands and 44 amino acids. Figure 5 depicts the RMLR and RMRR values for L10 (leucine) and D25 (aspartic acid) of chain A induced by the 39 ligands.
PIE and its two constituent components EE and VDWE between the ligand and the binding-site residue of HIV-1 protease are calculated from CHARMM force fields and an environment-dependent electrostatic potential energy. The values of PIE, EE, and VDWE, which characterize various energetic aspects of the ligand–residue interaction, are used to build three data sets: DS-PIE, DS-EE, and DS-VDWE. Each of these three data sets comprises 1716 values for 39 ligands and 44 amino acids. Figure 6A,B depicts the PIE, EE and VDWE values for D25 and L10 in chain A, respectively.
We use the energetic and conformational dynamics attributes derived from the residue-decomposed binding energies and NMA to train an MTML model for the classification prediction of kinetic rate constants. In total there are five principal training data sets: DS-PIE, DS-EE, DS-VDWE, DS-RMLR, and DS-RMRR. Each of them comprises 39 cases, with each case comprising 44 attributes.
MTML is defined as follows: Given a set of learning examples D of the form (X, Y), where X = (x1, x2, …, xk) is a vector of k training attributes and Y = (y1, y2, …, yt) is a vector of t target attributes, learn a model that, given a new unlabeled example X, can predict the values of all of the target attributes Y simultaneously. When yi is categorical, the problem is known as classification. In this study, the yi are binarized values of kon and koff as shown in Figure 2.
Random forest predictive clustering (RF-Clus) is applied for the task of MTML. RF-Clus outperforms other MTML algorithms in the benchmark studies.30 In addition, it can handle high-dimensional features, e.g., in the situation where the number of attributes is much larger than the number of cases, and can select the importance of attributes (amino acid residues) that contribute to the accuracy of kon and koff prediction. The model was run on the iteration numbers of 100, 200, 250, and 500 in the leave-one-out (LOO) cross-validation experiment. Except for the value of FTest, which was set at 0.1, default values were chosen for all of the other parameters, including PruneSet = None, MinimalWeight = 1.0, EnsembleMethod = RForest, and VotingType = Majority.
Table 1 shows the selected residues in descending order of score of importance. Residues with frequencies of occurrence greater than 25% in the LOO cross-validation experiment of the binary-target random forest classification algorithm with iteration number = 500 were chosen. Consequently, 16, 15, 13, and 14 residues were selected from DS-RMLR, DS-RMRR, DS-EE, and DS-PIE, respectively. These identified key residues consist of three motifs: an N-terminal motif (R8 and L10), a charged motif (L23, D25, G27, A28, D29, D30, and V32), and a motif corresponding to the flap region (residues 43–58), as shown in Figure 7. Both the N-terminal motif and the charged motif are common to DS-PIE, DS-RMLR, DS-RMRR, and DS-EE. The flap region is identified only by DS-RMLR and DS-RMRR.
All-atom MD simulations have shown that the conformational dynamics of the flap region plays a key role in the ligand binding process of HIV-1 protease.31–33 Consistent with this observation, the residues in the flap region are identified as key kinetic features by DS-RMLR and DS-RMRR. The recapitulation of the findings from the all-atom MD simulations supports the conclusion that the directionality displacement of normal modes upon ligand binding can serve as a kinetic fingerprint to capture the long-time-scale conformational dynamics of protein binding/unbinding events. The charged motif in the active site participates in substrate peptide recognition. Specifically, D25 and D29 form hydrogen bonds with the substrate peptide. Additionally, R8 and D30 can interact with polar side chains or distal main-chain groups in longer substrate peptides. Moreover, the mutations of L10, L23, and V32 lead to drug resistance of HIV protease inhibitors.
We examine the impacts of the energetic and conformational dynamics features on the accuracy of kon/koff prediction by building different MTML models. First, we use energetic features (the DS-EE, DS-PIE, and DS-VDWE data sets) to build the MTML model. Second, in order to evaluate whether the normal mode directionality features can be used to predict kon and koff, we apply DS-RMRR, DS-RMLR, and DS-RMLR + DS-RMRR, comprising 88 training attributes in the feature vectors to build the MTML model. Third, we integrate the properties of conformational dynamics and energetics by adding the RMLR features to DS-EE (data set DS-EE + DS-RMLR) to build the MTML model.
For the models trained by DS-EE, DS-PIE, DS-VDWE, DS-RMLR, DS-RMRR, DS-RMLR + DS-RMRR, and DS-EE + DS-RMLR, the highest prediction accuracies and standard deviations (SDs) of the combined four-class log10(kon)/log10(koff) are 71.79% (SD = 34.01%), 66.66% (SD = 36.87%), 47.43% (SD = 36.18%), 69.23% (SD = 37.37%), 57.69% (SD = 35.42%), 70.51% (SD = 31.68%), and 74.35% (SD = 27.79%), respectively (Figure 8). Moreover, among the three models trained by the energetic features, the prediction accuracies of the combined four-class log10(kon)/log10(koff) given by the DS-EE and DS-PIE models are significantly higher than that given by a random guess (50%) by 21.79% (t test p value = 2.0 × 10−5) and 16.66% (t test p value = 6.0 × 10−4), respectively, whereas the accuracy given by the DS-VDWE model is lower than that for the random guess by 2.57% (t test p value >0.05). These results suggest that in the case of HIV-1 protease, the EE features are more accurate in predicting kon and koff than the features of VDWE and PIE. This implies that EE plays a more important role than VDWE but may be not the only factor in driving the protein–ligand binding/unbinding kinetics, as it cannot capture conformational dynamics information, as shown by the feature selection results.
For all three models trained by the normal mode directionality features, the prediction accuracy of the combined four-class log10(kon)/log10(koff) is higher than that for the random guess. Although the accuracy given by the DS-RMRR model is only slightly higher than the random guess value by 7.69%, the accuracies given by the DS-RMLR and DS-RMLR + DS-RMRR models are significantly higher than for the random guess by 19.23% (t test p value = 3.0 × 10−4) and 10.51% (t test p value = 2.5 × 10−3), respectively. These results suggest that the normal mode directionality can capture information on the ligand binding and unbinding processes, further supporting the conclusion that they can be used as a kinetic fingerprint. Comparison of the prediction accuracies of the combined four-class log10(kon)/log10(koff) given by the DS-EE and DS-EE + DS-RMLR models shows that integrating the conformational dynamics features with the energetic features increases the accuracy by 2.56%, from 71.79% to 74.35% (t test p value = 0.02) (Figure 8). This implies that the electrostatic interactions and conformational dynamics are jointly responsible for the binding/unbinding kinetics of HIV protease. The overall prediction accuracy comes from three situations: correct prediction of both kon and koff (100%), one of them (50%), or neither of them (0%). Figure 9 shows the breakdown of these three situations. The numbers of 100% correctly predicted cases are the same for DS-EE (κ = 0.38), DS-RMLR (κ = 0.37), and DS-EE+DS-RMLR (κ = 0.34). The performance gain of DS-EE + DS-RMLR mainly comes from an increase in 50% accuracy.
For the multitarget classification model with the highest accuracy, which is based on DS-EE + DS-RMLR, Table 2 shows the four-class confusion matrix. Class (0,0) has the highest classification error. This may be due to the fact that the log10(koff)/log10(kon) values of several drugs in class (0,0) are close to those in classes (0,1) and (1,0). Figure 10 shows ROC curves for prediction of in vivo active compounds that fall into class (0,1), i.e., log10(koff) < −2.0 and log10(kon) > 5.6. The predictions are ranked by the number of votes for kon minus the number of votes for koff from the random forest model. The five FDA approved drugs saquinavir, indinavir, nelfinavir, amprenavir, and ritonavir are ranked at top 1, 2, 3, 8, and 10, respectively. The corresponding AUC value is 0.8668. It is noted that conventional affinity-based virtual screening may not distinguish compounds in class (0,1) from those in classes (0,0) and (1,1), as they may have similar Kd values. On the other hand, predicting kon or koff alone is not sufficient, as a compound with in vivo drug activity should have a relatively low koff and medium kon in the case of HIV protease inhibitors.24
The performance of the proposed predictive model may depend on how to discretize the log10(kon) and log10(koff) values. We performed LOO cross-validation for another discretization using the cutoffs to log10(koff) = −1.5 and log10(kon) = 5.0 (model of −1.5/5.0 cutoffs). The set with the −1.5/5.0 cutoffs includes 4, 18, 3, and 14 ligands in the four imbalanced binary classes (0,0), (0,1), (1,0), and (1,1), respectively (Table S4). As expected, the imbalanced class distribution slightly deteriorates the performance of the multitarget classifier. The accuracy of the combined four-class log10(kon)/log10(koff) prediction drops to 71.79%, which is lower than the accuracy of 74.35% based on the −2.0/5.6 cutoffs with DS-EE + DS-RMLR. However, patterns similar to those obtained using the −2.0/5.6 cutoffs are observed: the prediction accuracy of DS-EE + DS-RMLR is the best (70.15% for DS-EE only and 69.23% for DS-RMLR only), and most of residues in the flap region are identified only by DS-RMLR. Thus, the directionality displacement of normal modes upon ligand binding/unbinding is a robust fingerprint to capture long-time-scale conformational dynamics. From the machine learning point of view, the unbalanced data can be handled using under/oversampling. This strategy could be used to improve the model’s performance.
In spite of the recognized importance of protein–ligand binding and unbinding kinetics in drug discovery, few efficient computational tools are available to screen and optimize chemical compounds with in vivo drug activity on the basis of binding and unbinding kinetics in a high-throughput fashion. With the increasing availability of experimentally determined kon/koff data,16,17 a data-driven approach is an appropriate choice for the development of a high-throughput predictive model of ligand binding and unbinding kinetics.34 Moreover, recently developed multitarget classification algorithms such as RF-Clus can be adopted to train machine learning models that can predict coupled kon/koff simultaneously. To train a high-quality machine learning model with minimum impacts of overfitting and false correlation, it is important to select relevant features; in this case, these are the molecular determinants of the ligand binding and unbinding kinetics. For the first time, we have shown that the changes in the directionality of normal modes that are efficiently derived from coarse-grained NMA can capture the long-time-scale conformational dynamics of the ligand binding and unbinding kinetics. Thus, they can serve as a kinetic fingerprint to identify the molecular determinants of the binding/unbinding kinetics and enhance the performance of the machine learning model. It is expected that the residue normal mode directionality replacement fingerprint can be applied to other areas in structure-based drug design, such as protein–ligand docking.
It is known that electrostatic interactions between a charged drug and a charged receptor impact the kinetic rate constants.35,36 Specifically, kon is sensitive to long-range electrostatic interactions, and koff tends to be influenced more by short-range interactions such as hydrogen bonds, salt bridges, and van der Waals contacts.37 The majority of the residues involved in the HIV protease–ligand interaction are hydrophobic; the exceptions are the catalytic D25 and D29, which are able to form hydrogen bonds with the main-chain groups of substrate peptides, and R8, D30, and K45 which can interact with polar side chains or distal main-chain groups in longer substrate peptides. The MTML model identifies these charged residues. In addition, the MTML model can achieve higher prediction accuracy using the electrostatic energy than the van der Waals interaction. These results indicate that electrostatic interactions play a more important role than van der Waals interactions in determining the binding/unbinding kinetics of HIV protease.
Interestingly, in addition to the electrostatic interactions, the directionality of ligand-binding-site residue movement also has strong correlations with the kinetic constants. Consistent with the all-atom MD simulation results, the MTML model trained with the relative directionality of normal modes between residue and ligand recapitulates the role of the flap region in the binding kinetics of HIV protease, which cannot be fully recovered by the energetic features. Moreover, not only are similar residues selected by DS-RMLR and DS-EE, but also, the best-performing MTML model is obtained from the combination of DS-RMLR and DS-EE. On the basis of these observations, we propose that coherent movement between the ligand and the receptor may play a critical role in determining the ligand binding and unbinding kinetics. As shown in Figure 11, even if two protein–ligand complexes have the same non-covalent interactions with equal intensities, they may have different kinetic constants because of the different relative movements between the ligand atom and the receptor atom. This is not surprising, as the non-covalent interactions, especially hydrogen bonding, depend on the relative directionality of atomic pairs. The relative movement may change the directionality of the interaction, thus weakening (even breaking) or strengthening the interaction. Thus, the coherent conformational dynamics coupling could be one of the key structural determinants of protein binding/unbinding kinetics. This has not been observed before.
Although this proof-of-concept study demonstrates the potential of integrating coarse-grained normal mode analysis with multitarget machine learning in understanding the molecular determinants and developing a high-throughput predictive model of ligand binding and unbinding kinetics, there is plenty of space to improve the methodology.
There are three different models of the conformational ensemble of a protein–ligand complex. They are the model of the induced-fit mechanism, which is adopted by HIV-1 protein–ligand complexes,1,38 the model of the selected-fit mechanism,39,40 and the model of the three-step mechanism.41,42 The mechanism of the model determines the on-rate and off-rate equations. For example, the induced-fit on rate is limited by the diffusional rate of encounter complex formation of the proteins in their unbound conformational ensemble, but the off rate is dependent on the equilibrium between the ground-state complex and the excited-state complex.38 Since all of the training data sets in this study cover only the characteristics of the ground-state HIV-1 complex, the ignorance of the characteristics of the excited-state HIV-1 complex could induce deficiencies in the predictive model.
Since the solvation effect causes a time-scale discrepancy between real molecular motion and NMA, which is calculated in vacuum, it is expected that NMA coupled with an implicit or explicit solvation model may provide more information on the conformational dynamics of the ligand binding process. As water plays a critical role in ligand binding, explicit incorporation of water molecules in the binding site may improve the accuracy of NMA. The global and local geometries of the binding pocket could be another important feature.43,44 In the current study, the ligand was treated as a single rigid body. As a matter of fact, the flexibility of the ligand may have an impact on the kinetic rate constants. A survey of over 2000 drugs binding to G-protein-coupled receptors, protein kinases, and other enzymes found that higher-molecular-weight drugs tend to have lower off rates and that a greater number of rotatable bonds tend to have longer residence times.4,45 This suggests that the performance of the MTML model could be further improved by incorporating ligand properties. We group the kon/koff value into four classes and use the classification model to predict the class and to select features. In practice, it could be more useful to predict the real values of kon and koff simultaneously. This would require a multitarget regression model, which is an active area of research in machine learning.
For the first time, we have demonstrated that residue normal mode directionality displacement derived from coarse-grained NMA of protein–ligand complexes can capture the long-timescale conformational dynamics of protein–ligand binding/unbinding events. In certain molecular systems, the kinetic rate constants kon and koff may be determined by coherent coupling of the conformational dynamics and thermodynamic interactions between the receptor and the ligand. Thus, the residue normal mode directionality displacement may serve as a kinetic fingerprint to study the kinetics of protein–ligand interactions. Combining the kinetic fingerprint with state-of-the-art MTML enables us to screen chemical compounds with in vivo drug activity accurately and efficiently. The further development of such computational tools will bridge one of the critical missing links between in vitro compound screening and in vivo drug activity.
We sincerely thank the reviewers for their constructive criticisms and suggestions. This research was supported by the National Library of Medicine of the National Institutes of Health (Grant R01LM011986), the National Science Foundation (Grants CNS-0958379, CNS-0855217, and ACI 1126113), and the City University of New York High Performance Computing Center at the College of Staten Island.
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00632.
Materials and Methods; association and dissociation rate constants (kon, koff) for the interactions between 39 inhibitors and HIV-1 protease (Table S1); results of discretization given by the model of −2/5.6 cutoffs (Table S2); PDB crystal structures used in the experiment (Table S3); results of discretization given by the model of −1.5/5 cutoffs (Table S4); PDB structures and corresponding HIV-1 protease inhibitors used for eHiTS docking (Table S5); equilibrium trajectory and the RMSD of the backbone atoms of HIV-1 bound with A045 during the 204 ps MD simulation (Figure S1); docking of ligands into HIV-1 protease (Figure S2); SASA TCL script with a probe radius of 1.4 Å for the A045–1AJV complex (Figure S3); average dielectric constants of different types of amino acids (Figure S4); NAMDEnergy graphical user interface (Figure S5) (PDF)
The authors declare no competing financial interest.