|Home | About | Journals | Submit | Contact Us | Français|
Cathepsin B, a ubiquitous lysosomal cysteine protease, is involved in many biological processes related to several human diseases. Inhibitors targeting the enzyme have been investigated as possible diseases treatments. A set of 37 compounds were recently found active in a high throughput screening assay to inhibit the catalytic activity of Cathepsin B, with chemical structures and biological test results available to the public in the PubChem BioAssay Database (AID 820). In the present study, we compare these experimental activities to the results of theoretical predictions from binding affinity calculation with a LR-MM-PNSA approach based on docked complexes. Strong correlations (r2 = 0.919 and q2 = 0.887 for the best) are observed between the theoretical predictions and experimental biological activity. The models are cross-validated by four independent predictive experiments with randomly split compounds into training and test sets. Our results also show that the results based on protein dimer show better correlations with experimental activity when compared to results based on monomer in the in silico calculations.
Cathepsins such as cathepsin B, D, H, K, L, and S are responsible for intracellular proteolysis in mammalian cells and are involved in biological processes related to variant diseases and disorders.1–4 Cathepsin B, a member of the papain superfamily,3 has been found involved in biological processes related to important human diseases, including neurodegenerative disorder, cardiovascular disease, cancer, inflammation, rheumatoid arthritis, and Alzheimer s disease.5–12 As one of the most abundant proteases found in mammalian cells, the enzyme was found to facilitate cell migration, thus to promote tumor metastasis and angiogenesis by mediating the dissolution of extracellular barriers.13–15 Researchers also found that the proteolysis function of the enzyme is essential to the entry and replication of several viruses such as Ebola and SARS.16 For these reasons cathepsin B has been selected as an important molecular target in drug development,17–19 and some inhibitors have been reported to show effect in the treatment of rheumatoid arthritis in animal experiments.5–12
Biology studies have shown that most cathepsin inhibitors can bind to the target protein irreversibly, while some bind reversibly. The irreversible inhibitors include dipeptidyl nitriles20, vinyl sulfones, expoxysuccinates, acyloxymethyl ketones, hydrazides, bis-α-amidoketones and fluoromethyl ketones.21 The reversible inhibitors include some α-ketoamide and aldehyde compounds.21 Experimental structure biology studies reveal that cathepsin inhibitors react with the catalytic thiol group and form a covalent bond with the enzyme.20,22 The structure complexes of cathepsin B and several inhibitors are available in the public domain. For example, the inhibitory mechanism of the dipeptidyl nitriles to the enzymatic catalysis of cathepsin B protein has been explored by structural biology study, which provided valuable information to the inhibition of active compounds to the enzyme.23–25
An alternative way to study the binding structure of a newly identified inhibitor with its receptor is by docking simulations. This methodology has been successfully used in many applications, including the study for the cathepsin L inhibitors.26 On the other hand, the general application of binding affinity calculation still remains a challenging task after many years of development. MM-PBSA, a method for computing binding affinity based on molecular mechanics (force field) and electrostatic calculation (Poisson Boltzmann (PB) or Generalized Born (GB)), was proposed and has been successfully applied to several studies.27–30 However it does not work universally. LR-MM-PBSA was proposed to address the problem of poor correlation between calculated and observed binding activity with the realization that there is no need to pursue the calculation of absolute free energy of binding for a set of compounds in drug development. In such situation a relative binding affinity is enough to rank a given set of compounds to facilitate molecular design.27–33 Both approaches will be compared in the current work.
The NIH Molecular Libraries project was aimed to discover small chemical as molecular probe, and to make such research tool available to the scientific community for the study of cellular biological function. A number of national wide screening centers were funded to perform industrial level HTS screening experiments. The resulted biological testing results from this US government initiated effort are made available through PubChem (http://pubchem.ncbi.nlm.nih.gov), a database containing data of chemical structures and the associated biological activity information. PubChem BioAssay database now contains about 30 million biological test results for over 200 protein targets. Such information is made freely accessible to world wide research community. PubChem also links the chemical structures and biological test results to PubMed articles and protein 3D information. Altogether these information provides unprecedent challenges as well as opportunities for researchers to explore the molecular mechanism between enzymes and their small molecular inhibitors.
With the HTS screening efforts of the NIH Molecular Libraries project, several series of compounds have been identified which demonstrate various inhibitory potency against cathepsin B and other cathepsins. The availability of the molecular structures and their biological activities in the PubChem database provides a unique opportunity for researchers to conduct structure-activity study in an effort to identify the key structure features responsible for the observed biological activity, and to explore the biological mechanism of these inhibitors. Such study will doubtlessly help the discovery of new lead compounds and facilitate the development of new inhibitors for therapeutic purposes to treat diseases related cathepsin malfunctions. Several works were reported to use computational approaches, such as docking and virtual screening, to study the binding mode of cathepsin L26 and cathepsin S26 inhibitors to their receptors. Given the importance of cathepsin B in disease treatment and the availability of highly selective inhibitors, we performed docking and binding affinity studies for a set of cathepsin B inhibitors in an effort to understand the binding and interaction mechanisms of the inhibitors in the protein, and to establish a model for binding affinity evaluation to facilitate the development of novo cathepsin B inhibitors.
The small molecule dataset of Cathepsin B inhibitors and their inhibition activity data were taken from the PubChem database. The biological activity of these compounds was measured based on a dose response confirmatory screening experiment (PubChem BioAssay ID: 820), where the inhibitory activity for a total of 75 compounds was tested by the Center for Molecular Discovery of University of Pennsylvania, and 37 compounds were confirmed with inhibitory activity against the catalytic activity of the human liver cathepsin B. Detailed information for the assay protocol and materials is available on the PubChem website.
The 2D structure of this set of small molecules was obtained from PubChem chemical library. 3D structures were generated and optimized using the Molecular Operating Environment (MOE) program (Chemical Computing Group, Montreal, Canada) and OPLS force field.34–37 The structure and biological activity of the 37 active compounds are listed in Table 1.
The three dimensional complex structure of cathepsin B with Dipeptidyl Nitrile (DPN) was obtained from the Protein Data Bank (PDB entry: 1GMY).20 To prepare for the docking experiment, the original protein coordinates were extracted from the PDB file and were submitted to Glide following the protocol for adding hydrogen atoms and assigning partial charges using the OPLS-AA force field. 34–37 A coarse energy minimization with a small number of steps was performed for the complex structure to relax amino residue side chains and the added hydrogen atoms. The ligand was used to determine the location of a docking grid box and was then removed prior to grid generation in next step.
The ligand compounds were docked into the ligand binding site of the prepared protein receptor using the GLIDE program (FirstDiscovery suite)38,39 and the OPLS-AA force field34–37. Monte Carlo sampling method were used with GLIDE. Small molecules are treated flexibly by GLIDE docking through an extensive conformational search while the receptor is held fixed.
Compounds with less than 200 atoms and less than 35 rotatable bonds were docked into the binding site by GLIDE. Conformational flexibility of the ligand was considered by generation of rotamers from explicit rotation of single-bond torsion angles. All rotamers for a given ligand were docked and oriented in the site, and grid-refinement was executed for 5000 poses. The top 400 refined poses per each ligand were subjected to 100 steps of conjugate gradient energy minimization. Poses of a given compound were not saved unless they differed from those obtained previously by more than 0.5 Å rms deviations in heavy atom coordinates, or involve atomic displacements by greater than 1 Å.
The LR-MM-PBSA approach, which has been applied to the study of HIV-1 RT non-nucleoside inhibitors and heparanase inhibitors by the author31, was used to calculate the FEB based on the docked cathepsin inhibitor structure complex. This approach is based on linear response (LR) and a FEB calculation using the molecular mechanics, Poisson-Boltzmann, and solvent accessible surface area (MM-PBSA) method27–32. With this approach, the binding affinity was estimated by combining empirical MM-PBSA energy calculation with linear response optimization of coefficient against known activity.
The protein-ligand complex obtained from docking calculations was partially minimized by allowing the ligand and the side chains of the protein receptor within 7Å of the ligand atoms to move while all other atoms were held fixed. The electrostatic energy (Eele) and van der Waals (vdW) energy were calculated using molecular mechanics (MM) based on the OPLS-AA force field. The polar part of the solvation energy was calculated using the Poisson-Boltzmann method and the non-polar part of the solvation energy was estimated by the change of the solvent accessible surface area (SASA) as determined by the ZAP program.40,41 After all energies were calculated, factor analysis (FA) and multiple regression analysis (MRA) were performed to derive a LR-like equation:
where a, b, c, and d are the weighting factors, ΔE is the energy difference between ligand/receptor complex and the free protein plus the ligand:
Although numeric applications have demonstrated the success of GLIDE in docking simulations for reproducing experimental binding structures, it remains a question whether it works for the particular biological system of one s interest. In order to examine the reliability of the docking method, a series of docking simulations were carried out to dock the original DPN ligand back to the ligand binding site of the cathepsin B protein complex structure (PDB code: 1GMY). The structure data obtained from the PDB entry contains three sets of coordinates corresponding to chain A, B and C respectively. The first and the first two sets of coordinates for the protein and DPN inhibitor were extracted and used in the monomer and dimer docking simulations, respectively. Water molecules were excluded in the docking simulations.
A docking grid was generated with a geometric center at the center of the ligand. It enclosed the whole binding site region, and covered an area about 5 Å larger than the dimension of the original ligand on each side. The DPN ligand was removed prior to the grid calculation, it was then docked back into the binding pocket based on newly-generated docking grid. The same grid were used in later docking simulations for all other inhibitors. The docked results were clustered and ranked by Gscore. The 20 binding conformations were outputted for each compound. The Gscore, Emodel, and binding energy calculated through the docking simulation for the set of compounds are listed in Table S1 of the supporting material along with experimental bioactivity, molecule formula, name, docking scores and some other information. The structure superposition between the conformation from the docking simulation and that of the crystallographic structure complex is shown in Figure 1. The root-mean-square deviations (RMSDs) of all heavy atoms of DPN between the docked poses and the crystallographic coordinates range from 0.4 Å to 1.1 Å (when six carbons of the methyl-phenyl ring are not distinguished). The biggest variance is observed at the nitrile group and the methyl-benzyl group which flips over 180 degree compared to the original conformation. Apart from that, the docked pose is overall very close to the experimental structure. These results show that the docking method is able to reproduce the experimental binding structure of cathepsin B- inhibitor complex.
The docking protocol employed in this study is further evaluated by taking the advantage of the availability of abundant cathepsin B crystal structures in complex with various ligands. 11 additional ligands taken from the crystal structures of cathepsin B complexes (PDB codes: 2DC6, 2DC7, 2DC8, 2DC9, 2DCA, 2DCB, 2DCC, 2DCD)22 were docked into the inhibitor binding site of 1GMY, the crystal structure of cathepsin B in complex with DPN(see the previous section). The availability of these experimental binding structures gives a unique opportunity to evaluate the docking protocol in the prediction of Cathepsin B-inhibitor binding structures. Prior to the comparison, all other complex structures were superimposed to the cathepsin B-DPN complex where protein alpha carbon atoms were optimally overlapped. The docking results (Figure 2) show that the docked poses of these compounds are close to their corresponding bound structures in crystallographic complexes. The RMSDs of heavy atoms between the docked and the crystallographic conformation for these ligands range from 0.5 Å to 1.2 Å. These results again indicate that the docking simulation method is able to reproduce precisely the bound structure of experimental complex for ligands from different chemical structure classes.
In the experimental structure from the crystallographic study, NDP binds in between a protein dimer, where part of the binding sites of NDP consist of residues from the first protein chain, while the rest binding sites are contributed by the second protein chain. To evaluate the effect of the second protein on the docking results, further docking experiments have been conducted for the inhibitors in the known crystal structure complexes, where the dimer of the cathepsin B crystal structure was used in all simulations. The docking results reveal that these compounds bind well in the binding pocket of the dimer. By comparing the docking results of the same compound in monomer and dimer, it is noticed that docked poses overlay largely. The dominating cluster of the docked poses obtained from the dimer calculation for each ligand is similar to that from the monomer calculation. In addition, these docked poses were compared with the crystal structure by superposing the alpha carbon atoms of the proteins; again the docked poses are similar to the corresponding crystal structures and the docking simulation based on the dimer of cathepsin B protein structures reproduced the experimental binding structures for these inhibitors.
As shown in Figure 4A, the compound is properly located in the binding site interacting with the primary (in red) protein as well as with the second (in green) protein. Over ¾ part of the inhibitor directly interacts with residues from the primary protein, while less than ¼ part of the inhibitor directly contacts with the second protein. The primary protein has made major contacts with the inhibitor, meanwhile the second one also makes contributions to the binding by closing up the solvent-exposed cleft to form a hydrophobic-liked pocket in favor of a hydrophobic substituent in an inhibitor, which may benefit the overall ligand binding affinity of the inhibitor by decreasing solvent-exposed surface area.
The 37 active compounds with confirmed inhibition activity identified from PubChem BioAssay database(AID:820) were docked into the cathepsin B protein structure to explore the binding modes and the interactions between the enzyme and these small molecules. The PubChem chemical structure identifier (CID), biological activity (IC50), IUPA name, molecular formula, and molecular weight of these compounds are listed in Table 1 and Table S1 in supporting materials. Some other information can be obtained from the PubChem website based on the CID and AID. The IC50 was calculated based on a dose-response experiment where 75 compounds were further verified for their biological activity for inhibiting the catalytic activity of the human liver cathepsin B. The assay was conducted by measuring the release of the fluorophore aminomethyl coumarin (AMC) from the hydrolysis of an AMC-labeled dipeptide. These 75 compounds were initially found to exhibit inhibitory activity to cathepsin B in a high throughput primary screening assay (AID: 453) out of 6332 tested compounds.
Both the monomer and dimer of the cathepsin B protein structure were used in the docking of the 37 compounds. The same docking protocol introduced in the previous experiments was used in these docking simulations. The best docked pose was picked for each compounds based on docking score (Gscore) and visual examination from the dominating cluster out of each respective docking simulation. Docked poses were visually checked to evaluate the interaction modes (Figure 3 and and4).4). All compounds can be docked into the binding sites and the overlap of these poses provides a general picture of the binding mode, from which it can be seen that the core of the binding site defined by original inhibitors in crystal complexes are occupied with much higher density by the compound atoms. A cluster of ligand is shown in Figure 3, where the docked poses are overlapped. These ligands occupy in the roughly same binding pocket. The overlap (Figure 3) of the 5 most active compounds shows that they were docked in a very similar pattern in which part of the molecules occupy the binding sites where the original DNP ligand is located. The second protein in the dimer (Figure 4A) shows direct interactions with all of these inhibitors and have contributed to the binding. When smaller molecules, such as CID 653297 and CID 647501, bind to the active site of the protein dimer, they have direct interactions with residues Gly27, Cys29, Gly198, Gly73, and Gly74 of the primary protein as well as with resides Gly197, Gly198, and Phe174 of the second protein (Figure 4C). For bigger molecules, such as 77B and CID 3243128, when binding in the active site (Figure 4B), additional interactions were observed between these compounds and residues His199 and Trp211 of the second protein.
The docking scores of these compounds are listed in Table S1 in supporting materials. The Gscore, which is used in GLIDE to evaluate the docking results, ranges from −3.44 to −7.04. The logIC50 of the 37 compounds also fall in a similar range between −4.35 to 7.34, for these compounds. However, with close examination by plotting docking scores and binding energy vs. the experimental activity data (logIC50), no apparent correlation was observed. Also there is not apparent correlation observed between Emodel, or the interaction energy (EvdW + Ecoul) from docking calculation and the experimental logIC50.
A number of methods and approaches have been suggested and applied to many applications for the calculation of free energy of binding (FEB). MMPBSA approach, which has approach involves calculation of several energy terms, including been applied in many studies, was elected in the work. This van de Waals(vdW) energy, electrostatic energy between the ligand and receptor, and the solvent energy upon ligand s movement from solvent to binding site. The vdW and electrostatic energies were calculated using the GLIDE program based on OPLS force field, and the solvation energy for binding was calculated using Poisson Boltzmann method of the ZAP program. The docked complex structures were used for the calculation of the free energy upon ligand-receptor binding. The complex structures were subject to a coarse minimization to optimize the binding structures prior to energy calculations. During the relaxation and the successive calculations, computations for three complexes failed due to the non-compatibility between ligands and the force field parameter. The computational results for the rest 34 ligand-receptor complexes were used in the following calculations.
Two sets of calculations were carried out based on docked structures obtained respectively from the monomer and dimer based ligand/cathepsin B docking simulations. The following energy calculations were applied to both sets of docked results. The calculated energies are listed in Table 1. The sum of these four energy terms does not have apparent correlation with the logIC50, which indicates that the MM-PBSA approach itself does not work for this system. Thus, the LR-MM-PBSA approach, a combination of MMPBSA and linear response approach, was resorted as a further step to calculate the binding affinity for this set of inhibitors. The optimized results are listed in Table 1–4.
The correlation coefficients between the experimental logIC50 and the calculated energies for the monomer and dimer based docking models are 0.555 and 0.791 (Table 4), respectively. The calculated results show that the binding affinity calculations from the dimer based docking models produced better results comparing to those from the monomer based docking models (see supporting materials). The best models were chosen from dimer results and will be detailed in following sections. The calculated binding affinity energies for the 34 active compounds based on protein-ligand complex dimer are listed in Table 1. The van de Waals energy between the ligands and receptor ranges from −29.00 kcal/mol to 57.28 kcal/mol, while the electrostatic energy of the binding ranges from −5.64 kcal/mol to 52.54 kcal/mol. In addition, a Liaison score (LiaScore) was calculated. An examination revealed that there is no apparent correlation between this score and logIC50.
The predicted results from all compound model and the model excluding 4 outliers are listed in Table 2. The “All Compounds” column in Table 2 is for the model based on all compounds and the “4 outliers” column is for the model excluding 4 outliers as recognized in the “All Compounds” model. The predicted activity and the error (the difference between the calculated value and the observed value) for each compound in the respective models are listed in the table. The correlation coefficient between the energies and the observed logIC50 for the ”All compounds” model is 0.791 (Table 4). Leave-one-out (LOO) validation method was used to validate the model. The validated activity of each compound was calculated based on a model built excluding this inhibitor to eliminate a compound having influence on its own prediction in order to minimize the over fitting phenomenon. The plot of the linear correlation between the predicted activity and observed activity is shown in Figure 5A. It is seen that the data dots fall no far away from the perfect fitting line which indicates that the predicted activity from validation has very good correlation with the observed activity. The validation coefficient is 0.695 (Table 4), which indicates that an acceptable correlation has been reached for this set of compounds.
By examining the error of the prediction, it is noticed that two compounds have prediction errors larger than 0.8 unit. When the two were treated as outliers, the model built by excluding the two compounds show significantly improved performance with correlation coefficient of 0.885 and cross validation coefficient of 0.839(Table 4). If using a threshold of prediction error larger than half unit (0.5), four compounds were identified as outliers When these four compounds were excluded from the building set, a regression model was reached with much better performance with a correlation coefficient of 0.919 and cross validation coefficient of 0.887 (Table 4).
To further examine the predictability of the approach, three validation models were constructed by randomly dividing all compounds into training set and test set. A model was constructed based on the compounds in the training set and then the activity is predicted for each of the compounds in the test set. The three regression models are designated as Valid1, Valid2, and Valid3, with different division of compounds in the training and test sets. The predicted activity for the training compounds and the testing compounds are listed in Table 3, the statistical results for the three models are listed in Table 4. As seen in Table 3, Valid1, Valid2, and Valid3 have 19, 21, and 19 compounds in training set and 15, 13, and 15 compounds in test set, respectively. The predicted results show that all compounds, except one in Valid1 model and two in Valid2 and Valid3, have predicted activity less than one unit (1.0). The plot for the correlation between the predicted and observed activities from the Valid1 is depicted in Figure 5B. Apparent correlation has been observed in the figure. This is a satisfactory result considering the range (~3) of the observed logIC50 values. All the three independent validation models demonstrate consistent performance and show that they are robust models for predicting biological activities.
Biological activity of a set of active cathepsin B inhibitors discovered by one HTS experiment (PuBChem AID:820) has been predicted by the binding affinity calculation through LR-MMPBSA approach based on docked complexes. The binding structure and interaction modes have been examined and the effect of protein dimerization on the binding has been explored. LR-MMPBSA approach has then been successfully applied for binding affinity calculations and the calculated binding affinity has shown strong correlation with the experimental activity (logIC50).
All the active compounds under this study can bind into the active site of cathepsin B protein monomer and dimer based on the docking simulations, where similar binding patterns were observed. Based on the docking results, it is shown that the second protein in a dimer has direct interactions with a bound inhibitor. The protein closes up partial solvent accessible open space observed in monomer complex. The close-up turns solvent-accessible area into partial hydrophobic binding area and would benefit ligands which have less hydrophilic groups bound to the part of the active site. Enhanced regression results were obtained from dimer-based calculations, which suggests that the protein dimer could be a better model for structure-based cathepsin B inhibitor design in the work, though the role of molecular interaction in the biological system between cathepsin B molecules need to be further investigated.
Satisfactory correlation coefficients have been obtained between the calculated binding affinity through the LR-MM-PBSA approach and the observed biological activity (logIC50). The predictability of the model is evaluated by the LOO method, cross-validation, and three independent training-test experiments. The predicted activities for the compounds in these validation experiments exhibit good correlations with the observed activity with a low rate of errors. The excellent performance of the models demonstrates its robustness, predictability, and feasibility when applied to the evaluation of cathepsin B inhibitors. Together with previous studies, the application shows that this method is a promising approach in structure-based drug development by meeting the need to rank and prioritize new compounds in silico design based on their binding affinity to the protein receptor.
This research was supported by the Intramural Research Program of the NIH, NLM. The authors thank the developers of Pymol software for sharing the program to prepare the molecular figures used in the paper.