Docking results for the original ligand in the crystal structure complex
Although numeric applications have demonstrated the success of GLIDE in docking simulations for reproducing experimental binding structures, it remains a question whether it works for the particular biological system of one s interest. In order to examine the reliability of the docking method, a series of docking simulations were carried out to dock the original DPN ligand back to the ligand binding site of the cathepsin B protein complex structure (PDB code: 1GMY). The structure data obtained from the PDB entry contains three sets of coordinates corresponding to chain A, B and C respectively. The first and the first two sets of coordinates for the protein and DPN inhibitor were extracted and used in the monomer and dimer docking simulations, respectively. Water molecules were excluded in the docking simulations.
A docking grid was generated with a geometric center at the center of the ligand. It enclosed the whole binding site region, and covered an area about 5 Å larger than the dimension of the original ligand on each side. The DPN ligand was removed prior to the grid calculation, it was then docked back into the binding pocket based on newly-generated docking grid. The same grid were used in later docking simulations for all other inhibitors. The docked results were clustered and ranked by Gscore. The 20 binding conformations were outputted for each compound. The Gscore, Emodel, and binding energy calculated through the docking simulation for the set of compounds are listed in
Table S1 of the supporting material along with experimental bioactivity, molecule formula, name, docking scores and some other information. The structure superposition between the conformation from the docking simulation and that of the crystallographic structure complex is shown in . The root-mean-square deviations (RMSDs) of all heavy atoms of DPN between the docked poses and the crystallographic coordinates range from 0.4 Å to 1.1 Å (when six carbons of the methyl-phenyl ring are not distinguished). The biggest variance is observed at the nitrile group and the methyl-benzyl group which flips over 180 degree compared to the original conformation. Apart from that, the docked pose is overall very close to the experimental structure. These results show that the docking method is able to reproduce the experimental binding structure of cathepsin B- inhibitor complex.
Docking of other ligands from crystal complexes into the cathepsin B
The docking protocol employed in this study is further evaluated by taking the advantage of the availability of abundant cathepsin B crystal structures in complex with various ligands. 11 additional ligands taken from the crystal structures of cathepsin B complexes (PDB codes: 2DC6, 2DC7, 2DC8, 2DC9, 2DCA, 2DCB, 2DCC, 2DCD)
22 were docked into the inhibitor binding site of 1GMY, the crystal structure of cathepsin B in complex with DPN(see the previous section). The availability of these experimental binding structures gives a unique opportunity to evaluate the docking protocol in the prediction of Cathepsin B-inhibitor binding structures. Prior to the comparison, all other complex structures were superimposed to the cathepsin B-DPN complex where protein alpha carbon atoms were optimally overlapped. The docking results () show that the docked poses of these compounds are close to their corresponding bound structures in crystallographic complexes. The RMSDs of heavy atoms between the docked and the crystallographic conformation for these ligands range from 0.5 Å to 1.2 Å. These results again indicate that the docking simulation method is able to reproduce precisely the bound structure of experimental complex for ligands from different chemical structure classes.
Docking of ligands from crystal complexes into the dimer of cathepsin B
In the experimental structure from the crystallographic study, NDP binds in between a protein dimer, where part of the binding sites of NDP consist of residues from the first protein chain, while the rest binding sites are contributed by the second protein chain. To evaluate the effect of the second protein on the docking results, further docking experiments have been conducted for the inhibitors in the known crystal structure complexes, where the dimer of the cathepsin B crystal structure was used in all simulations. The docking results reveal that these compounds bind well in the binding pocket of the dimer. By comparing the docking results of the same compound in monomer and dimer, it is noticed that docked poses overlay largely. The dominating cluster of the docked poses obtained from the dimer calculation for each ligand is similar to that from the monomer calculation. In addition, these docked poses were compared with the crystal structure by superposing the alpha carbon atoms of the proteins; again the docked poses are similar to the corresponding crystal structures and the docking simulation based on the dimer of cathepsin B protein structures reproduced the experimental binding structures for these inhibitors.
As shown in , the compound is properly located in the binding site interacting with the primary (in red) protein as well as with the second (in green) protein. Over ¾ part of the inhibitor directly interacts with residues from the primary protein, while less than ¼ part of the inhibitor directly contacts with the second protein. The primary protein has made major contacts with the inhibitor, meanwhile the second one also makes contributions to the binding by closing up the solvent-exposed cleft to form a hydrophobic-liked pocket in favor of a hydrophobic substituent in an inhibitor, which may benefit the overall ligand binding affinity of the inhibitor by decreasing solvent-exposed surface area.
Binding modes of the 37 active inhibitors in cathepsin B
The 37 active compounds with confirmed inhibition activity identified from PubChem BioAssay database(AID:820) were docked into the cathepsin B protein structure to explore the binding modes and the interactions between the enzyme and these small molecules. The PubChem chemical structure identifier (CID), biological activity (IC
50), IUPA name, molecular formula, and molecular weight of these compounds are listed in and
Table S1 in supporting materials. Some other information can be obtained from the PubChem website based on the CID and AID. The IC
50 was calculated based on a dose-response experiment where 75 compounds were further verified for their biological activity for inhibiting the catalytic activity of the human liver cathepsin B. The assay was conducted by measuring the release of the fluorophore aminomethyl coumarin (AMC) from the hydrolysis of an AMC-labeled dipeptide. These 75 compounds were initially found to exhibit inhibitory activity to cathepsin B in a high throughput primary screening assay (AID: 453) out of 6332 tested compounds.
Both the monomer and dimer of the cathepsin B protein structure were used in the docking of the 37 compounds. The same docking protocol introduced in the previous experiments was used in these docking simulations. The best docked pose was picked for each compounds based on docking score (Gscore) and visual examination from the dominating cluster out of each respective docking simulation. Docked poses were visually checked to evaluate the interaction modes ( and ). All compounds can be docked into the binding sites and the overlap of these poses provides a general picture of the binding mode, from which it can be seen that the core of the binding site defined by original inhibitors in crystal complexes are occupied with much higher density by the compound atoms. A cluster of ligand is shown in , where the docked poses are overlapped. These ligands occupy in the roughly same binding pocket. The overlap () of the 5 most active compounds shows that they were docked in a very similar pattern in which part of the molecules occupy the binding sites where the original DNP ligand is located. The second protein in the dimer () shows direct interactions with all of these inhibitors and have contributed to the binding. When smaller molecules, such as CID 653297 and CID 647501, bind to the active site of the protein dimer, they have direct interactions with residues Gly27, Cys29, Gly198, Gly73, and Gly74 of the primary protein as well as with resides Gly197, Gly198, and Phe174 of the second protein (). For bigger molecules, such as 77B and CID 3243128, when binding in the active site (), additional interactions were observed between these compounds and residues His199 and Trp211 of the second protein.
The docking scores of these compounds are listed in
Table S1 in supporting materials. The Gscore, which is used in GLIDE to evaluate the docking results, ranges from −3.44 to −7.04. The
logIC
50 of the 37 compounds also fall in a similar range between −4.35 to 7.34, for these compounds. However, with close examination by plotting docking scores and binding energy vs. the experimental activity data (
logIC
50), no apparent correlation was observed. Also there is not apparent correlation observed between Emodel, or the interaction energy (E
vdW + E
coul) from docking calculation and the experimental
logIC
50.
Binding affinity calculations
A number of methods and approaches have been suggested and applied to many applications for the calculation of free energy of binding (FEB). MMPBSA approach, which has approach involves calculation of several energy terms, including been applied in many studies, was elected in the work. This van de Waals(vdW) energy, electrostatic energy between the ligand and receptor, and the solvent energy upon ligand s movement from solvent to binding site. The vdW and electrostatic energies were calculated using the GLIDE program based on OPLS force field, and the solvation energy for binding was calculated using Poisson Boltzmann method of the ZAP program. The docked complex structures were used for the calculation of the free energy upon ligand-receptor binding. The complex structures were subject to a coarse minimization to optimize the binding structures prior to energy calculations. During the relaxation and the successive calculations, computations for three complexes failed due to the non-compatibility between ligands and the force field parameter. The computational results for the rest 34 ligand-receptor complexes were used in the following calculations.
Two sets of calculations were carried out based on docked structures obtained respectively from the monomer and dimer based ligand/cathepsin B docking simulations. The following energy calculations were applied to both sets of docked results. The calculated energies are listed in . The sum of these four energy terms does not have apparent correlation with the logIC50, which indicates that the MM-PBSA approach itself does not work for this system. Thus, the LR-MM-PBSA approach, a combination of MMPBSA and linear response approach, was resorted as a further step to calculate the binding affinity for this set of inhibitors. The optimized results are listed in –.
| Table 4The statistics information for the models built based on cathepsin B protein monomer and dimer. |
The correlation coefficients between the experimental logIC50 and the calculated energies for the monomer and dimer based docking models are 0.555 and 0.791 (), respectively. The calculated results show that the binding affinity calculations from the dimer based docking models produced better results comparing to those from the monomer based docking models (see supporting materials). The best models were chosen from dimer results and will be detailed in following sections. The calculated binding affinity energies for the 34 active compounds based on protein-ligand complex dimer are listed in . The van de Waals energy between the ligands and receptor ranges from −29.00 kcal/mol to 57.28 kcal/mol, while the electrostatic energy of the binding ranges from −5.64 kcal/mol to 52.54 kcal/mol. In addition, a Liaison score (LiaScore) was calculated. An examination revealed that there is no apparent correlation between this score and logIC50.
The predicted results from all compound model and the model excluding 4 outliers are listed in . The “All Compounds” column in is for the model based on all compounds and the “4 outliers” column is for the model excluding 4 outliers as recognized in the “All Compounds” model. The predicted activity and the error (the difference between the calculated value and the observed value) for each compound in the respective models are listed in the table. The correlation coefficient between the energies and the observed logIC50 for the ”All compounds” model is 0.791 (). Leave-one-out (LOO) validation method was used to validate the model. The validated activity of each compound was calculated based on a model built excluding this inhibitor to eliminate a compound having influence on its own prediction in order to minimize the over fitting phenomenon. The plot of the linear correlation between the predicted activity and observed activity is shown in . It is seen that the data dots fall no far away from the perfect fitting line which indicates that the predicted activity from validation has very good correlation with the observed activity. The validation coefficient is 0.695 (), which indicates that an acceptable correlation has been reached for this set of compounds.
| Table 2The predicted bioactivities vs. the observed activities based on the best model. |
By examining the error of the prediction, it is noticed that two compounds have prediction errors larger than 0.8 unit. When the two were treated as outliers, the model built by excluding the two compounds show significantly improved performance with correlation coefficient of 0.885 and cross validation coefficient of 0.839(). If using a threshold of prediction error larger than half unit (0.5), four compounds were identified as outliers When these four compounds were excluded from the building set, a regression model was reached with much better performance with a correlation coefficient of 0.919 and cross validation coefficient of 0.887 ().
To further examine the predictability of the approach, three validation models were constructed by randomly dividing all compounds into training set and test set. A model was constructed based on the compounds in the training set and then the activity is predicted for each of the compounds in the test set. The three regression models are designated as Valid1, Valid2, and Valid3, with different division of compounds in the training and test sets. The predicted activity for the training compounds and the testing compounds are listed in , the statistical results for the three models are listed in . As seen in , Valid1, Valid2, and Valid3 have 19, 21, and 19 compounds in training set and 15, 13, and 15 compounds in test set, respectively. The predicted results show that all compounds, except one in Valid1 model and two in Valid2 and Valid3, have predicted activity less than one unit (1.0). The plot for the correlation between the predicted and observed activities from the Valid1 is depicted in . Apparent correlation has been observed in the figure. This is a satisfactory result considering the range (~3) of the observed logIC50 values. All the three independent validation models demonstrate consistent performance and show that they are robust models for predicting biological activities.
| Table 3The predictability of the models based on different training sets measured by the predicted bioactivities compared with the observed activity. |